Generating features for named entity recognition by learning prototypes in semantic space: The case of de-identifying health records

Author

Henriksson, Aron ; Dalianis, Hercules ; Kowalski, Stewart

Author_Institution

Dept. of Comput. & Syst. Sci., Stockholm Univ., Stockholm, Sweden

fYear

2014

fDate

2-5 Nov. 2014

Firstpage

450

Lastpage

457

Abstract

Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize F_β-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record deidentification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F₁-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various F_β-scores, giving some degree of control to trade off precision and recall. 
Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
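
The feature-generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how a prototype is computed, so the centroid of the annotated words' vectors is assumed here, and the two-dimensional word vectors are made-up toy values standing in for a random indexing space. All function names are hypothetical. The conditional random fields model that consumes the features is not shown.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); defined as 1.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)

def learn_prototype(vectors):
    """ASSUMPTION: the class prototype is the centroid of its annotated word vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def f_beta(precision, recall, beta):
    """Standard F_beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def pick_threshold(distances, labels, beta=1.0):
    """Try each observed distance as the threshold and keep the best F_beta."""
    best_t, best_f = None, -1.0
    for t in sorted(set(distances)):
        predicted = [d <= t for d in distances]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best_f:
            best_t, best_f = t, score
    return best_t, best_f

def binary_feature(word_vector, prototype, threshold):
    """True if the word lies within the threshold distance of the class prototype."""
    return cosine_distance(word_vector, prototype) <= threshold

# Toy data (invented): two words annotated as the entity class, two that are not.
in_class = [[1.0, 0.1], [0.9, 0.2]]
out_of_class = [[0.0, 1.0], [0.1, 0.9]]

prototype = learn_prototype(in_class)
words = in_class + out_of_class
labels = [True, True, False, False]
distances = [cosine_distance(w, prototype) for w in words]
threshold, score = pick_threshold(distances, labels, beta=1.0)
features = [binary_feature(w, prototype, threshold) for w in words]
```

Raising `beta` above 1 in `pick_threshold` weights recall more heavily and tends to select a looser threshold, while a value below 1 favors precision; this is one way to read the precision/recall trade-off mentioned in the abstract.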

Keywords

indexing; learning (artificial intelligence); medical information systems; semantic networks; PHI; annotated resources; conditional random field; cosine distance; deidentifying health record; distributional semantics; health-record deidentification; learning algorithm; learning prototype; named entity class; named entity recognition task; orthographic feature; protected health information; prototype vector; prototypical representation; random indexing; semantic features; semantic space; supervised machine learning; syntactic feature; unlabeled data; vector representation; Context; Indexes; Medical services; Prototypes; Semantics; Syntactics; Vectors;

fLanguage

English

Publisher

IEEE

Conference_Titel

Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on

Conference_Location

Belfast

Type

conf

DOI

10.1109/BIBM.2014.6999199

Filename

6999199

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1784901