DocumentCode :
1784901
Title :
Generating features for named entity recognition by learning prototypes in semantic space: The case of de-identifying health records
Author :
Henriksson, Aron ; Dalianis, Hercules ; Kowalski, Stewart
Author_Institution :
Dept. of Comput. & Syst. Sci., Stockholm Univ., Stockholm, Sweden
fYear :
2014
fDate :
2-5 Nov. 2014
Firstpage :
450
Lastpage :
457
Abstract :
Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ-scores, giving some degree of control to trade off precision and recall. 
Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
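The feature-generation step described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: the prototype is taken here to be the centroid of the annotated words' vectors, the threshold is found by exhaustive search over observed distances, and all function names (`learn_prototype`, `tune_threshold`, `semantic_feature`) are hypothetical. The word vectors would come from a distributional model such as random indexing, which is not reproduced here.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def learn_prototype(vectors):
    """Assumption: the class prototype is the centroid of its annotated word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def f_beta(precision, recall, beta):
    """Standard F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def tune_threshold(distances, labels, beta=1.0):
    """Pick the distance threshold that maximizes F-beta on annotated data.

    distances: cosine distance of each word to the class prototype
    labels:    True if the word is annotated as belonging to the class
    """
    best_threshold, best_score = None, -1.0
    for candidate in sorted(set(distances)):
        predicted = [d <= candidate for d in distances]
        tp = sum(p and y for p, y in zip(predicted, labels))
        fp = sum(p and not y for p, y in zip(predicted, labels))
        fn = sum((not p) and y for p, y in zip(predicted, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best_score:
            best_threshold, best_score = candidate, score
    return best_threshold, best_score


def semantic_feature(word_vector, prototype, threshold):
    """Binary feature: is the word within the tuned distance of the prototype?"""
    return cosine_distance(word_vector, prototype) <= threshold
```

In this sketch, raising `beta` when calling `tune_threshold` would tend to select a looser threshold (higher recall), which mirrors the precision/recall trade-off the abstract mentions. The resulting boolean would be one feature per named entity class, passed to the sequence labeler alongside the orthographic and syntactic features.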
Keywords :
indexing; learning (artificial intelligence); medical information systems; semantic networks; PHI; annotated resources; conditional random field; cosine distance; deidentifying health record; distributional semantics; health-record deidentification; learning algorithm; learning prototype; named entity class; named entity recognition task; orthographic feature; protected health information; prototype vector; prototypical representation; random indexing; semantic features; semantic space; supervised machine learning; syntactic feature; unlabeled data; vector representation; Context; Indexes; Medical services; Prototypes; Semantics; Syntactics; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
Conference_Location :
Belfast
Type :
conf
DOI :
10.1109/BIBM.2014.6999199
Filename :
6999199