Title :
Signature automation of UMLS concepts: An un-supervised named entity recognition framework for classification of DNA and RNA in biological text
Author :
Niazi, Muhammad Ashraf Khan ; Muzaffar, Abdul Wahab ; Latif, Muhammad ; Qamar, Usman
Author_Institution :
Nat. Univ. of Sci. & Technol., Islamabad, Pakistan
Abstract :
Named entity recognition, a task that represents atomicity as well as granularity, is a first step in any language processing system. The advent of typologically oriented literature or text, and its availability in the form of annotated and un-annotated corpora, has led to a continued research effort directed towards more optimized algorithms for identifying named entities in text. Recognition of named entities from annotated corpora has matured considerably over time, while recognition from un-annotated corpora remains a challenge for the research community. The challenge grows sharply when the corpora represent applied literature from the biological or biomedical domain. This paper presents an unsupervised named entity recognition framework that automates the creation of signature vectors for UMLS concepts. The idea is to provide a vectorised view of UMLS concepts, semantic types and semantic groups. A vectored representation of UMLS allows the framework to be applied generically. The proposed approach differs from other unsupervised frameworks that employ signature- and vector-based approaches in that it builds its vector space from UMLS rather than from the corpus. A dataset from GENIA was used to validate the framework. The resulting framework achieved an accuracy of 68.34%, considerably higher than the 27% achieved by METAMAP and the 53.8% achieved by CubNER on the same corpus.
Keywords :
DNA; RNA; biology computing; medical computing; natural language processing; pattern classification; text analysis; unsupervised learning; DNA classification; RNA classification; UMLS signature automation; UMLS vectored representation; annotated corpora; biological text; biomedical domain; language processing system; optimized algorithmic evolution; semantic groups; un-annotated corpora; unified medical language system; unsupervised named entity recognition framework; Biological system modeling; Context; Semantics; Unified modeling language; DNAs; Named Entity Recognition; Natural language Processing; RNAs; Seed Generation; Signature Vector; UMLS (Unified Medical Language System); Vector Space
Conference_Title :
Science and Information Conference (SAI), 2015
Conference_Location :
London
DOI :
10.1109/SAI.2015.7237223