DocumentCode :
2737368
Title :
Poster: Distinguishing scientific abbreviations and genes in bio-medical literature mining
Author :
Liu, Guozhen ; Zhang, Han ; Quellhorst, George
Author_Institution :
QIAGEN Corp., Frederick, MD, USA
fYear :
2011
fDate :
3-5 Feb. 2011
Firstpage :
253
Lastpage :
253
Abstract :
Summary form only given. The accumulation of biomedical literature makes it increasingly difficult for scientists to keep up with scientific advancements, requiring the development of text mining tools to collect and integrate data in a high-throughput fashion. A major challenge in biomedical text mining is how to recognize genes sensitively and accurately, and translate them to their official gene symbols. Gene symbols and their commonly used aliases and synonyms usually derive from an abbreviation of the gene's description. However, many gene symbols and aliases exactly match other abbreviations commonly used in scientific literature that do not refer to genes. A systematic study of the abbreviations used in biomedical literature should help improve the accuracy of gene recognition during text mining. A program was developed to extract all abbreviations in all free abstracts available in PubMed from the 1960s until November 2010. We identified 198013 published abbreviations. Among them, 8602 are identical to a human gene symbol, alias or synonym in a dictionary containing 86615 entries. Of these 8602 abbreviations matching a human gene, 1095 refer only to the gene, 3581 refer to both the gene and another scientific term, and 3926 refer only to another scientific term and not the gene. By checking the descriptions associated with an abbreviation, it can be determined whether the abbreviation refers to a gene or to another scientific term. Compared to the simple use of a human gene dictionary as a guide, the presented method should provide a better solution for gene recognition during text mining by reducing false positive rates.
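The description-checking idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors' program extracts abbreviations, so everything here is an assumption made for illustration: the "long form (ABBR)" regular expression, the first-letter heuristic for trimming the description, the function names `extract_abbreviations` and `classify`, and the two-entry `GENE_DICT`, which stands in for the 86615-entry dictionary.

```python
import re

# Hypothetical mini-dictionary: symbol or alias -> known gene descriptions (lowercase).
GENE_DICT = {
    "AR": {"androgen receptor"},
    "CAT": {"catalase"},
}

# Assumed pattern: up to six words followed by a parenthesised short form.
ABBR_RE = re.compile(r"\b((?:[A-Za-z][\w-]*\s+){1,6})\(([A-Z][A-Za-z0-9-]{1,9})\)")


def extract_abbreviations(text):
    """Return (abbreviation, description) pairs found in text."""
    pairs = []
    for match in ABBR_RE.finditer(text):
        words = match.group(1).split()
        abbr = match.group(2)
        # Look at no more preceding words than the abbreviation has characters.
        window = words[-len(abbr):]
        # Start the description at the first word sharing the abbreviation's initial.
        for i, word in enumerate(window):
            if word[0].lower() == abbr[0].lower():
                pairs.append((abbr, " ".join(window[i:])))
                break
    return pairs


def classify(abbr, description, gene_dict):
    """Decide whether this use of an abbreviation refers to a gene."""
    if abbr not in gene_dict:
        return "not a gene symbol"
    if description.lower() in gene_dict[abbr]:
        return "gene"
    return "other term"


if __name__ == "__main__":
    sentence = ("Levels of the androgen receptor (AR) were compared with "
                "computed axial tomography (CAT) findings.")
    for abbr, description in extract_abbreviations(sentence):
        print(abbr, "|", description, "|", classify(abbr, description, GENE_DICT))
```

In this sketch "CAT" matches a gene symbol in the dictionary, but its accompanying description ("computed axial tomography") does not match any gene description, so it is labeled as another scientific term rather than a gene. A dictionary-only lookup would have reported it as a false positive.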
Keywords :
biology computing; cellular biophysics; data mining; dictionaries; genetics; molecular biophysics; biomedical literature mining; dictionary; false positive rates; gene recognition; official gene symbols; scientific abbreviations; text mining; Dictionaries; Electronic mail; Humans; Natural language processing; Text mining; Text recognition; USA Councils; abbreviation; gene identification; natural language processing; text mining;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on
Conference_Location :
Orlando, FL
Print_ISBN :
978-1-61284-851-8
Type :
conf
DOI :
10.1109/ICCABS.2011.5729906
Filename :
5729906