Regional Information Center for Science and Technology - Semi-supervised text classification from unlabeled documents using class associated words

DocumentCode :

3156648

Title :

Semi-supervised text classification from unlabeled documents using class associated words

Author :

Hong-qi Han ; Dong-Hua Zhu ; Xue-Feng Wang

Author_Institution :

Sch. of Manage. & Econ., Beijing Inst. of Technol., Beijing, China

fYear :

2009

fDate :

6-9 July 2009

Firstpage :

1255

Lastpage :

1260

Abstract :

Automatically classifying text documents is an important field in machine learning. Unsupervised text classification needs no training data but is often criticized for clustering blindly. Supervised text classification needs large quantities of labeled training data to achieve high accuracy. In practice, however, labeled samples are often difficult, expensive, or time-consuming to obtain. Meanwhile, unlabeled documents can be collected easily thanks to the rapidly developing Internet. Class associated words are words that represent the subject of a class and provide prior knowledge for training a classifier. A learning algorithm that combines Expectation-Maximization (EM) with a Naive Bayes classifier is introduced to classify fully unlabeled documents using class associated words. Experimental results show that it achieves high classification accuracy, especially for categories with few samples. In the algorithm, class associated words set classification constraints during the learning process, restricting documents to the corresponding class labels and improving classification accuracy.
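The abstract does not spell out the algorithm, so the following is only a minimal sketch of how EM and a multinomial Naive Bayes model might be combined with class associated words. The function name `constrained_em_nb`, its parameters, the Laplace smoothing, and in particular the form of the constraint (a document containing associated words may only be assigned to the classes whose words it contains) are assumptions made for illustration, not the authors' published procedure.

```python
import math
from collections import Counter


def constrained_em_nb(docs, class_words, n_iter=10, alpha=1.0):
    """Hypothetical sketch: EM over multinomial Naive Bayes, seeded and
    constrained by class associated words. Not the paper's exact method."""
    classes = sorted(class_words)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]

    # Assumed constraint: a document that contains associated words may only
    # take the classes whose words it contains; other documents are free.
    allowed = []
    for c_doc in counts:
        hit = [c for c in classes if any(w in c_doc for w in class_words[c])]
        allowed.append(hit or classes)

    # Initial soft labels: uniform over each document's allowed classes.
    post = [{c: (1.0 / len(a) if c in a else 0.0) for c in classes}
            for a in allowed]

    for _ in range(n_iter):
        # M-step: Laplace-smoothed priors and word probabilities
        # estimated from the current soft labels.
        prior, word_prob = {}, {}
        for c in classes:
            weight = sum(p[c] for p in post)
            prior[c] = (weight + alpha) / (len(docs) + alpha * len(classes))
            wc = Counter()
            for p, c_doc in zip(post, counts):
                for w, n in c_doc.items():
                    wc[w] += p[c] * n
            total = sum(wc.values()) + alpha * len(vocab)
            word_prob[c] = {w: (wc[w] + alpha) / total for w in vocab}

        # E-step: posteriors in log space, restricted to allowed classes.
        for i, c_doc in enumerate(counts):
            logp = {
                c: math.log(prior[c])
                + sum(n * math.log(word_prob[c][w]) for w, n in c_doc.items())
                for c in allowed[i]
            }
            m = max(logp.values())
            z = sum(math.exp(v - m) for v in logp.values())
            post[i] = {c: (math.exp(logp[c] - m) / z if c in logp else 0.0)
                       for c in classes}

    # Hard label per document: the class with the highest posterior.
    return [max(classes, key=lambda c: p[c]) for p in post]


# Toy usage: no document is labeled; each class has one associated word.
docs = [
    ["goal", "match", "team"],
    ["team", "match", "score"],
    ["election", "vote", "party"],
    ["party", "vote", "minister"],
]
labels = constrained_em_nb(docs, {"sport": {"goal"}, "politics": {"election"}})
```

In this toy setup the second and fourth documents contain no associated word, so any label they receive comes only from vocabulary shared with the constrained documents, which is the role the abstract assigns to unlabeled data.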

Keywords :

Bayes methods; classification; expectation-maximisation algorithm; learning (artificial intelligence); text analysis; automatic text document classification; class associated words; expectation-maximization; machine learning; naive Bayes classifier; semi-supervised text classification; unlabeled documents; unsupervised text classification; Energy management; Knowledge management; Machine learning; Machine learning algorithms; Power generation economics; Technology management; Testing; Text categorization; Training data; Water conservation; Expectation-Maximization; Naïve Bayes; class associated words; semi-supervised; text classification

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computers & Industrial Engineering, 2009. CIE 2009. International Conference on

Conference_Location :

Troyes

Print_ISBN :

978-1-4244-4135-8

Electronic_ISBN :

978-1-4244-4136-5

Type :

conf

DOI :

10.1109/ICCIE.2009.5223918

Filename :

5223918

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3156648