Title :
Exploiting unlabeled data for improving accuracy of predictive data mining
Author :
Peng, Kang ; Vucetic, Slobodan ; Han, Bo ; Xie, Hongbo ; Obradovic, Zoran
Author_Institution :
Center for Inf. Sci. & Technol., Temple Univ., Philadelphia, PA, USA
Abstract :
Predictive data mining typically relies on labeled data without exploiting a much larger amount of available unlabeled data. We show that using unlabeled data can be beneficial in a range of important prediction problems and therefore should be an integral part of the learning process. Given an unlabeled dataset representative of the underlying distribution and a K-class labeled sample that might be biased, our approach is to learn K contrast classifiers each trained to discriminate a certain class of labeled data from the unlabeled population. We illustrate that contrast classifiers can be useful in one-class classification, outlier detection, density estimation, and learning from biased data. The advantages of the proposed approach are demonstrated by an extensive evaluation on synthetic data followed by real-life bioinformatics applications for (1) ranking PubMed articles by their relevance to protein disorder and (2) cost-effective enlargement of a disordered protein database.
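The contrast-classifier idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of logistic regression, the balanced subsampling, the orientation of the output (probability that a point comes from the unlabeled sample), and all names (`fit_logistic`, `predict_proba`, `fit_contrast_classifiers`) and synthetic data are assumptions made for illustration. Under balanced sampling, an ideal classifier of this kind would approximate u(x) / (u(x) + g_k(x)), where u is the unlabeled density and g_k the density of labeled class k, so high outputs flag regions that class k under-represents.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Plain gradient-descent logistic regression; the bias is the last weight."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Estimated probability that each row of X comes from the unlabeled sample."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def fit_contrast_classifiers(X_lab, y_lab, X_unl, seed=0):
    """Train one classifier per class: labeled class k (target 0) vs. unlabeled (target 1).

    Assumes integer class labels. Balanced subsampling is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    models = {}
    for k in np.unique(y_lab):
        X_k = X_lab[y_lab == k]
        n = min(len(X_k), len(X_unl))
        X_k_s = X_k[rng.choice(len(X_k), n, replace=False)]
        X_u_s = X_unl[rng.choice(len(X_unl), n, replace=False)]
        X = np.vstack([X_k_s, X_u_s])
        y = np.concatenate([np.zeros(n), np.ones(n)])
        models[int(k)] = fit_logistic(X, y)
    return models

# Hypothetical 1-D data: class 0 centered at 0, class 1 centered at 4,
# and an unlabeled population that mixes both.
rng = np.random.default_rng(0)
X_lab = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)]).reshape(-1, 1)
y_lab = np.concatenate([np.zeros(300, dtype=int), np.ones(300, dtype=int)])
X_unl = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)]).reshape(-1, 1)

models = fit_contrast_classifiers(X_lab, y_lab, X_unl)

# Score two probe points with each contrast classifier.
probe = np.array([[0.0], [4.0]])
s0 = predict_proba(models[0], probe)  # contrast against labeled class 0
s1 = predict_proba(models[1], probe)  # contrast against labeled class 1
```

In this toy setting, the class-0 contrast classifier should score the point near 4 higher than the point near 0, and the class-1 classifier the reverse, which is the kind of signal the abstract proposes to use for outlier detection and for correcting a biased labeled sample.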
Keywords :
data mining; learning (artificial intelligence); medical information systems; pattern classification; probability; very large databases; K contrast classifier learning; K-class labeled sample; PubMed article ranking; biased data; disordered protein database; one-class classification; outlier detection; prediction problem; predictive data mining accuracy improvement; real-life bioinformatics application; synthetic data; unlabeled data exploitation; Accuracy; Bioinformatics; Costs; Data mining; Databases; Information science; Labeling; Proteins; Sampling methods; Supervised learning;
Conference_Title :
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
Print_ISBN :
0-7695-1978-4
DOI :
10.1109/ICDM.2003.1250929