DocumentCode :
3026646
Title :
Text categorization using the semi-supervised fuzzy c-means algorithm
Author :
Benkhalifa, Mohammed ; Bensaid, Amine ; Mouradi, Abdelhak
Author_Institution :
Sch. of Sci. & Eng., Al Akhawayn Univ., Morocco
fYear :
1999
fDate :
Jul 1999
Firstpage :
561
Lastpage :
565
Abstract :
Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has become very important in the information retrieval area, where information needs have increased tremendously with the rapid growth of textual information sources such as the Internet. We compare, for text categorization, two partially supervised (or semi-supervised) clustering algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm (A. Amar et al., 1997) and the Semi-Supervised Fuzzy-c-Means (ssFCM) algorithm (M. Amine et al., 1996). This semi-supervised learning paradigm falls between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both the class information contained in labeled data (training documents) and the structure information possessed by unlabeled data (test documents) in order to produce better partitions for test documents. Our experiments make use of the Reuters-21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms the Fuzzy-c-Means (FCM) algorithm and takes less time. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC. Finally, ssFCM yields improved performance and faster execution time as more weight is given to training documents.
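The information gain criterion mentioned in the abstract can be illustrated with a minimal sketch. It assumes each document is a set of tokens, labels are binary (1 if the document belongs to the category, 0 otherwise), and a term is scored by its presence or absence. The function names `entropy`, `information_gain`, and `select_features` are hypothetical; the record does not describe the authors' actual implementation.

```python
import math

def entropy(pos, total):
    """Binary entropy in bits of a set with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(docs, labels, term):
    """Information gain of the presence/absence of `term` for binary labels.

    Assumption: `docs` is a list of token sets, `labels` a list of 0/1 ints.
    """
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_class = entropy(sum(labels), n)
    # Class entropy remaining after splitting on the term, weighted by split size.
    h_cond = (len(with_term) / n) * entropy(sum(with_term), len(with_term)) \
        + (len(without_term) / n) * entropy(sum(without_term), len(without_term))
    return h_class - h_cond

def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]
```

Varying `k` corresponds to experimenting with different numbers of features; each selected term would then become one dimension of the document vector.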
Keywords :
fuzzy set theory; information retrieval; knowledge based systems; learning (artificial intelligence); text analysis; Internet; Reuters 21578 database; Semi-Supervised Agglomerative Hierarchical Clustering algorithm; Semi-Supervised Fuzzy-c-Means algorithm; automated assignment; binary classification; class information; document contents; information gain criterion; information retrieval; labeled data; predefined categories; semi-supervised fuzzy c-means algorithm; structure information; text categorization; text documents; textual information sources; training documents; unlabeled data; Classification algorithms; Clustering algorithms; Information retrieval; Internet; Partitioning algorithms; Spatial databases; Supervised learning; Testing; Text categorization; Unsupervised learning;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
18th International Conference of the North American Fuzzy Information Processing Society (NAFIPS), 1999
Conference_Location :
New York, NY
Print_ISBN :
0-7803-5211-4
Type :
conf
DOI :
10.1109/NAFIPS.1999.781756
Filename :
781756