DocumentCode :
3156648
Title :
Semi-supervised text classification from unlabeled documents using class associated words
Author :
Hong-qi Han ; Dong-Hua Zhu ; Xue-Feng Wang
Author_Institution :
Sch. of Manage. & Econ., Beijing Inst. of Technol., Beijing, China
fYear :
2009
fDate :
6-9 July 2009
Firstpage :
1255
Lastpage :
1260
Abstract :
Automatic classification of text documents is an important field in machine learning. Unsupervised text classification needs no training data but is often criticized for clustering blindly. Supervised text classification needs large quantities of labeled training data to achieve high accuracy. In practice, however, labeled samples are often difficult, expensive, or time-consuming to obtain. Meanwhile, unlabeled documents can be collected easily owing to the rapidly developing Internet. Class associated words are words that represent the subject of a class and provide prior knowledge for training a classifier. A learning algorithm based on the combination of Expectation-Maximization (EM) and a Naive Bayes classifier is introduced to classify documents using only unlabeled documents and class associated words. Experimental results show that it has good classification capability with high accuracy, especially for categories with small quantities of samples. In the algorithm, class associated words are used to set classification constraints during the learning process, restricting documents to their corresponding class labels and improving classification accuracy.
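The abstract gives no implementation details, so the following is only a minimal sketch of the general idea (EM wrapped around a multinomial Naive Bayes model, seeded by class associated words), not the authors' algorithm. The function names, the Laplace smoothing parameter alpha, the fixed iteration count, and the specific form of the constraint (a document containing class associated words keeps the class distribution implied by those words throughout training) are all assumptions made for illustration.

```python
import math
from collections import Counter


def seed_posteriors(docs, class_words):
    """Return P(class | doc) implied by class associated word hits, or None if no hits."""
    classes = sorted(class_words)
    seeds = []
    for doc in docs:
        hits = {c: sum(1 for w in doc if w in class_words[c]) for c in classes}
        total = sum(hits.values())
        seeds.append({c: hits[c] / total for c in classes} if total else None)
    return seeds


def m_step(docs, post, classes, vocab, alpha=1.0):
    """Estimate class priors and word probabilities from soft assignments."""
    prior, word_prob = {}, {}
    for c in classes:
        weight = sum(p[c] for p in post)
        prior[c] = (weight + alpha) / (len(docs) + alpha * len(classes))
        counts = Counter()
        for doc, p in zip(docs, post):
            for w in doc:
                counts[w] += p[c]  # fractional count weighted by P(c | doc)
        denom = sum(counts.values()) + alpha * len(vocab)
        word_prob[c] = {w: (counts[w] + alpha) / denom for w in vocab}
    return prior, word_prob


def e_step(doc, prior, word_prob, classes):
    """Compute P(class | doc) under the current Naive Bayes model, in log space."""
    logp = {
        c: math.log(prior[c]) + sum(math.log(word_prob[c][w]) for w in doc)
        for c in classes
    }
    top = max(logp.values())
    unnorm = {c: math.exp(logp[c] - top) for c in classes}
    z = sum(unnorm.values())
    return {c: unnorm[c] / z for c in classes}


def classify(docs, class_words, iterations=10):
    """Label tokenized documents using only unlabeled text and class associated words."""
    classes = sorted(class_words)
    vocab = sorted({w for doc in docs for w in doc})
    seeds = seed_posteriors(docs, class_words)
    uniform = {c: 1.0 / len(classes) for c in classes}
    post = [s if s is not None else dict(uniform) for s in seeds]
    for _ in range(iterations):
        prior, word_prob = m_step(docs, post, classes, vocab)
        # Assumed constraint: seeded documents keep their seed distribution.
        post = [
            s if s is not None else e_step(doc, prior, word_prob, classes)
            for doc, s in zip(docs, seeds)
        ]
    return [max(classes, key=lambda c: p[c]) for p in post]


docs = [
    ["goal", "match", "team"],
    ["match", "team", "coach"],
    ["stock", "market", "bank"],
    ["market", "bank", "loan"],
]
class_words = {"sport": {"goal"}, "finance": {"stock"}}
labels = classify(docs, class_words)
```

In this toy run, only the first and third documents contain a class associated word; the other two are labeled through the word statistics they share with the seeded documents.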
Keywords :
Bayes methods; classification; expectation-maximisation algorithm; learning (artificial intelligence); text analysis; automatic text document classification; class associated words; expectation-maximization; machine learning; naive Bayes classifier; semi-supervised text classification; unlabeled documents; unsupervised text classification; Energy management; Knowledge management; Machine learning; Machine learning algorithms; Power generation economics; Technology management; Testing; Text categorization; Training data; Water conservation; Expectation-Maximization; Naïve Bayes; class associated words; semi-supervised; text classification
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computers & Industrial Engineering, 2009. CIE 2009. International Conference on
Conference_Location :
Troyes
Print_ISBN :
978-1-4244-4135-8
Electronic_ISBN :
978-1-4244-4136-5
Type :
conf
DOI :
10.1109/ICCIE.2009.5223918
Filename :
5223918