DocumentCode
3563617
Title
Protein named entity classification with probabilistic features derived from GENIA corpus and MEDLINE
Author
Sumathipala, Sagara ; Yamada, Koichi ; Unehara, Muneyuki
Author_Institution
Grad. Sch. of Eng., Nagaoka Univ. of Technol., Nagaoka, Japan
fYear
2014
Firstpage
1257
Lastpage
1261
Abstract
Biomedical named entity recognition (BNER) is one of the most essential initial tasks of biomedical information retrieval, preceding tasks such as discovering relations between biomedical entities and identifying molecular pathways. Although named entity recognition performs well on ordinary text, it remains challenging in the molecular biology domain because of the complex nature of biomedical nomenclature, the variety of spelling forms, and other factors. Even when biomedical entities in biological text are found successfully, classifying them into relevant biomedical classes such as genes, proteins, diseases, and drug names is a further challenge and an open question. This paper presents a new method to classify biomedical named entities into protein and non-protein classes. Our approach employs Random Forest, a machine learning algorithm, with a new combination of features: orthographic, keyword, and morphological features, together with a probabilistic feature called Proteinhood and a Protein-Score feature based on MEDLINE abstracts available through PubMed. The latter two features are the main contributions of the paper. A series of experiments is conducted to compare the proposed approach with other state-of-the-art approaches. In experiments on the GENIA corpus, our protein named entity classifier achieves its highest values of 93.8% precision, 83.8% recall, and 88.5% F-measure for protein named entity identification. The study also shows the effect of the new Proteinhood and Protein-Score features and of adjusting the parameters of the Random Forest algorithm.
Keywords
classification; information retrieval; learning (artificial intelligence); medical computing; text analysis; BNER; GENIA corpus; MEDLINE; Pubmed; biological text; biomedical classes; biomedical information retrieval; biomedical named entity recognition; biomedical nomenclature; machine learning algorithm; molecular biology domain; molecular pathways; nonprotein classes; probabilistic features; protein named entity classification; protein-score feature; proteinhood; random forest; Biomedical measurement; Protein engineering; Proteins; Radio frequency; Silicon; Training data; Biomedical named entity; Biomedical text mining; Computational molecular biology; Named entity recognition; Protein named entity;
fLanguage
English
Publisher
ieee
Conference_Titel
Soft Computing and Intelligent Systems (SCIS), 2014 Joint 7th International Conference on and Advanced Intelligent Systems (ISIS), 15th International Symposium on
Type
conf
DOI
10.1109/SCIS-ISIS.2014.7044640
Filename
7044640
Link To Document