Title :
Learning from positive and unlabeled documents for automated detection of alternative splicing sentences in medline abstracts
Author :
Chen, Yang ; Torii, Manabu ; Lu, Chang-Tien ; Liu, Hongfang
Author_Institution :
Dept. of Comput. Sci., Virginia Tech, Falls Church, VA, USA
Abstract :
Alternative splicing is considered a key factor underlying increased cellular and functional complexity in higher eukaryotes. With the advance of high-throughput genomics technologies, it has become critical to mine alternative splicing knowledge from the biological research literature. However, many papers have been published on DNA splicing and translation, and finding those specifically relevant to alternative splicing is time-consuming. Observing that documents reporting alternative splicing can be obtained from existing knowledge bases that record literature evidence, and that a large number of unlabeled documents are freely available, we investigated learning from positive and unlabeled data (LPU) for retrieving papers relevant to alternative splicing. The positive documents come from Literature Support for Alternative Transcripts (LSAT), and the unlabeled documents are obtained from Gene Reference Into Function (GeneRIF). We generated nine unlabeled datasets that differ in size or in the way documents were sampled, and compared the performance of document classifiers built using different unlabeled datasets and machine learning algorithms. The study shows that LPU is a viable strategy for building a document filtering system, although the performance of the trained classifiers is affected by the choice of the unlabeled dataset. The selection of both the machine learning algorithm and the unlabeled documents is therefore critical in constructing an effective LPU-based system.
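The LPU setting described in the abstract can be illustrated with a minimal sketch. It assumes the simplest common baseline, in which every unlabeled document is treated as a noisy negative and a multinomial naive Bayes classifier is trained on the result. The abstract does not say which algorithms the authors used, so this is not their method. The function names (`tokenize`, `train_pu_naive_bayes`, `log_odds`) and the toy sentences are hypothetical and are not drawn from LSAT or GeneRIF.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_pu_naive_bayes(positive_docs, unlabeled_docs):
    """Fit multinomial naive Bayes, treating unlabeled docs as noisy negatives.

    Assumption: this "unlabeled = negative" baseline stands in for whatever
    LPU algorithms the paper actually evaluated.
    """
    counts = {"pos": Counter(), "neg": Counter()}
    for doc in positive_docs:
        counts["pos"].update(tokenize(doc))
    for doc in unlabeled_docs:
        counts["neg"].update(tokenize(doc))
    n_docs = len(positive_docs) + len(unlabeled_docs)
    return {
        "counts": counts,
        "totals": {label: sum(c.values()) for label, c in counts.items()},
        "vocab": set(counts["pos"]) | set(counts["neg"]),
        "priors": {
            "pos": len(positive_docs) / n_docs,
            "neg": len(unlabeled_docs) / n_docs,
        },
    }


def log_odds(model, text):
    """Return log P(pos | text) - log P(neg | text); positive means 'relevant'."""
    vocab_size = len(model["vocab"])
    score = math.log(model["priors"]["pos"]) - math.log(model["priors"]["neg"])
    for token in tokenize(text):
        if token not in model["vocab"]:
            continue  # tokens never seen in training carry no evidence
        # Laplace (add-one) smoothing avoids zero probabilities.
        p_pos = (model["counts"]["pos"][token] + 1) / (model["totals"]["pos"] + vocab_size)
        p_neg = (model["counts"]["neg"][token] + 1) / (model["totals"]["neg"] + vocab_size)
        score += math.log(p_pos) - math.log(p_neg)
    return score


# Hypothetical toy data: two known positives and three unlabeled sentences,
# one of which is in fact about alternative splicing (a hidden positive).
positive_docs = [
    "Exon skipping produces an alternative splicing isoform of the gene.",
    "Two alternatively spliced transcripts differ in exon 5.",
]
unlabeled_docs = [
    "The protein is phosphorylated by a kinase in response to stress.",
    "Mutations in this gene are associated with breast cancer risk.",
    "An alternative splicing variant lacking exon 3 was detected.",
]

model = train_pu_naive_bayes(positive_docs, unlabeled_docs)
relevant_score = log_odds(model, "A novel isoform arises from exon skipping")
irrelevant_score = log_odds(model, "The kinase is associated with cancer risk")
```

On this toy data the first query receives a positive log-odds score and the second a negative one. The hidden positive in the unlabeled set pulls splicing terms such as "exon" toward the negative class, which is one way to see why the abstract reports that classifier performance depends on how the unlabeled set is chosen.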
Keywords :
document handling; genomics; medical computing; Gene Reference Into Function; LPU-based system; alternative splicing sentence; automated detection; biological research literature; document classifiers; document filtering system; eukaryotes; functional complexity; high-throughput genomics technology; machine learning algorithm; medline abstracts; unlabeled data set; unlabeled datasets; unlabeled documents; Classification algorithms; Data mining; Machine learning algorithms; Reliability; Splicing; Support vector machines; Training; Alternative Splicing; Document Retrieval; LPU;
Conference_Title :
Bioinformatics and Biomedicine Workshops (BIBMW), 2011 IEEE International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4577-1612-6
DOI :
10.1109/BIBMW.2011.6112425