Protein named entity classification with probabilistic features derived from GENIA corpus and MEDLINE

Author

Sumathipala, Sagara ; Yamada, Koichi ; Unehara, Muneyuki

Author_Institution

Grad. Sch. of Eng., Nagaoka Univ. of Technol., Nagaoka, Japan

fYear

2014

Firstpage

1257

Lastpage

1261

Abstract

Biomedical named entity recognition (BNER) is one of the most essential and earliest tasks of biomedical information retrieval, which also covers tasks such as discovering relations between biomedical entities and identifying molecular pathways. Although named entity recognition performs well on ordinary text, it remains challenging in the molecular biology domain because of the complex nature of biomedical nomenclature, the variety of spelling forms, and other factors. Even when biomedical entities are found successfully in biological text, classifying them into relevant biomedical classes such as genes, proteins, diseases, and drug names remains a further challenge and an open question. This paper presents a new method for classifying biomedical named entities into protein and non-protein classes. Our approach employs Random Forest, a machine learning algorithm, with a new combination of features: orthographic, keyword, and morphological features, together with a probabilistic feature called Proteinhood and a Protein-Score feature based on MEDLINE abstracts cited in PubMed. The latter two features are the main contributions of the paper. A series of experiments is conducted to compare the proposed approach with other state-of-the-art approaches. In experiments on the GENIA corpus, our protein named entity classifier achieves the highest values of precision (93.8%), recall (83.8%), and F-measure (88.5%) for protein named entity identification. The study also shows the effect of the new Proteinhood and Protein-Score features and of adjusting the parameters of the Random Forest algorithm.
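The three reported scores can be cross-checked against each other. The sketch below assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state the weighting, so `beta = 1` is an assumption, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the abstract for protein named entity identification.
reported_precision = 0.938
reported_recall = 0.838

# Assumption: balanced F1 (beta = 1), which the abstract does not specify.
f1 = f_measure(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.885"
```

Under that assumption the result agrees with the reported F-measure of 88.5%.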

Keywords

classification; information retrieval; learning (artificial intelligence); medical computing; text analysis; BNER; GENIA corpus; MEDLINE; Pubmed; biological text; biomedical classes; biomedical information retrieval; biomedical named entity recognition; biomedical nomenclature; machine learning algorithm; molecular biology domain; molecular pathways; nonprotein classes; probabilistic features; protein named entity classification; protein-score feature; proteinhood; random forest; Biomedical measurement; Protein engineering; Proteins; Radio frequency; Silicon; Training data; Biomedical named entity; Biomedical text mining; Computational molecular biology; Named entity recognition; Protein named entity

fLanguage

English

Publisher

IEEE

Conference_Title

2014 Joint 7th International Conference on Soft Computing and Intelligent Systems (SCIS) and 15th International Symposium on Advanced Intelligent Systems (ISIS)

Type

conf

DOI

10.1109/SCIS-ISIS.2014.7044640

Filename

7044640

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3563617