Title :
Experiments in text-based mining and analysis of biological information from MEDLINE on functionally-related genes
Author :
Moon, Naureen ; Singh, Rahul
Author_Institution :
Dept. of Comput. Sci., San Francisco State Univ., CA, USA
Abstract :
Technological advancements such as microarrays have enabled biologists to generate unprecedented quantities of data about biological entities. This has led to the development of a large number of algorithms for processing and analyzing biological data. Challenges remain, however; for instance, genes that function cooperatively need not have similar expression patterns. This suggests the use of non-numerical sources of information to explore the underlying biology. We experimentally study various factors that are inherent in algorithmic methodologies for text analysis. The proposed method accesses MEDLINE dynamically to account for the latest research, and the available literature corresponding to the genes is analyzed to develop lists of keywords. Natural language processing (NLP) techniques such as stop-word filtering and stemming are then applied to the lists, and keyword frequencies are weighted using the term frequency-inverse document frequency (TFIDF) scheme. The results are input to a hierarchical clustering algorithm to derive groupings of genes by functionality. The process is repeated using z-score weighting and latent semantic analysis (LSA) to determine which approach yields the most accurate clustering. The study examines the importance of these steps and their influence on the overall efficacy of the system. We believe that the analysis conducted as part of this research is invaluable to the development and fine-tuning of text mining methodologies for biological literature.
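The pipeline described in the abstract (stop-word filtering, TFIDF weighting, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the gene names, keyword texts, stop-word list, cosine similarity measure, and average-linkage rule are all assumptions chosen for illustration, and the MEDLINE retrieval, stemming, z-score, and LSA steps are omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "to", "by"}

# Invented toy "literature" per gene, standing in for MEDLINE abstracts.
toy_docs = {
    "geneA": "repair of dna damage and the damage response",
    "geneB": "dna repair in the checkpoint",
    "geneC": "glucose metabolism and insulin signaling",
    "geneD": "transport of glucose by insulin",
}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Map each gene to a sparse {term: tf * idf} vector."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: number of genes whose text contains the term.
    df = Counter(term for c in counts.values() for term in c)
    return {
        name: {t: (f / sum(c.values())) * math.log(n_docs / df[t])
               for t, f in c.items()}
        for name, c in counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(vectors, k):
    """Average-linkage agglomerative clustering until k clusters remain."""
    clusters = [[name] for name in sorted(vectors)]

    def sim(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Merge the two most similar clusters.
        i, j = max(pairs, key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

groups = cluster(tfidf_vectors(toy_docs), k=2)
print(groups)
```

On this toy input the genes whose texts share keywords end up in the same group; swapping the weighting function for a z-score or adding an LSA projection would slot in between `tfidf_vectors` and `cluster`.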
Keywords :
biology computing; data analysis; data mining; genetics; medical information systems; natural languages; pattern clustering; scientific information systems; text analysis; word processing; LSA; MEDLINE; biological data analysis; biological data processing algorithm; biological entity; biological information analysis; biological literature; biology; functionally-related genes; gene analysis; gene expression pattern; gene groupings; hierarchical clustering; keyword frequency; latent semantic analysis; microarray; natural language processing; stemming; stop-word filtering; technological advancement; term frequency-inverse document frequency; text analysis; text mining; text-based mining; z-score weighting; Algorithm design and analysis; Clustering algorithms; Data analysis; Data mining; Filtering; Frequency; Information analysis; Information resources; Natural language processing; Text analysis;
Conference_Title :
Systems Engineering, 2005. ICSEng 2005. 18th International Conference on
Print_ISBN :
0-7695-2359-5
DOI :
10.1109/ICSENG.2005.41