Title :
Learning to integrate unlabeled data in text classification
Author_Institution :
Univ. of San Diego, San Diego, CA, USA
Abstract :
This paper addresses text classification when labeled training samples are scarce but unlabeled data are readily available in large quantities. It proposes an efficient classification algorithm that incorporates a weighted k-means clustering scheme into an Expectation Maximization (EM) process. The algorithm aims to balance the predictive value of labeled and unlabeled training data and thereby improve classification accuracy. Because it is based on a fast clustering method, it can be applied to classify documents in large datasets. Preliminary experiments on several text classification collections show that the proper use of unlabeled data, as built into the proposed algorithm, can significantly improve classification accuracy.
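The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea: a hard-assignment EM loop around a weighted k-means step, with one centroid per class. It is not the authors' method. The function names, the `unl_weight` parameter (the relative weight of an unlabeled document), the squared Euclidean distance, and the rule that labeled documents keep their labels are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], x))


def weighted_centroid(points, weights, dim):
    """Weighted mean of the points (assumes the weights sum to a positive value)."""
    total = sum(weights)
    return [sum(w * p[d] for p, w in zip(points, weights)) / total
            for d in range(dim)]


def weighted_kmeans_em(X_lab, y_lab, X_unl, n_classes, unl_weight=0.5, n_iter=20):
    """Hypothetical sketch: one centroid per class, unlabeled points down-weighted.

    Assumes every class has at least one labeled example.
    """
    dim = len(X_lab[0])
    # Initialise each class centroid from the labeled examples only.
    centroids = []
    for c in range(n_classes):
        pts = [x for x, y in zip(X_lab, y_lab) if y == c]
        centroids.append(weighted_centroid(pts, [1.0] * len(pts), dim))

    y_unl = None
    for _ in range(n_iter):
        # E-step (hard): assign each unlabeled point to its nearest centroid.
        new_y = [nearest(centroids, x) for x in X_unl]
        if new_y == y_unl:
            break  # assignments are stable, so the centroids would not change
        y_unl = new_y
        # M-step: labeled points count with weight 1, unlabeled with unl_weight.
        for c in range(n_classes):
            pts = [x for x, y in zip(X_lab, y_lab) if y == c]
            wts = [1.0] * len(pts)
            for x, y in zip(X_unl, y_unl):
                if y == c:
                    pts.append(x)
                    wts.append(unl_weight)
            centroids[c] = weighted_centroid(pts, wts, dim)
    return centroids, y_unl


# Toy usage: two labeled points, four unlabeled points in two obvious groups.
X_lab = [[0.0, 0.0], [10.0, 10.0]]
y_lab = [0, 1]
X_unl = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]]
centroids, y_unl = weighted_kmeans_em(X_lab, y_lab, X_unl, n_classes=2)
predicted = nearest(centroids, [0.5, 0.5])
```

In this sketch, lowering `unl_weight` toward 0 makes the centroids depend almost entirely on the labeled documents, while raising it lets the unlabeled documents dominate. That is one simple way to trade off the influence of the two data sources.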
Keywords :
expectation-maximisation algorithm; pattern classification; pattern clustering; text analysis; classification algorithm; expectation maximization process; labeled training samples; text classification; weighted k-means clustering scheme; accuracy; classification; clustering; feature selection
Conference_Title :
Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-5537-9
DOI :
10.1109/ICCSIT.2010.5564473