DocumentCode
3026646
Title
Text categorization using the semi-supervised fuzzy c-means algorithm
Author
Benkhalifa, Mohammed ; Bensaid, Amine ; Mouradi, Abdelhak
Author_Institution
Sch. of Sci. & Eng., AlAkhawayn, Morocco
fYear
1999
fDate
36342
Firstpage
561
Lastpage
565
Abstract
Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has become very important in the information retrieval area, where information needs have increased tremendously with the rapid growth of textual information sources such as the Internet. We compare, for text categorization, two partially supervised (or semi-supervised) clustering algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm (A. Amar et al., 1997) and the Semi-Supervised Fuzzy-c-Means (ssFCM) algorithm (M. Amine et al., 1996). This semi-supervised learning paradigm falls between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both the class information contained in labeled data (training documents) and the structure information possessed by unlabeled data (test documents) in order to produce better partitions for test documents. Our experiments use the Reuters 21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms and takes less time than the Fuzzy-c-Means (FCM) algorithm. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC. Finally, ssFCM results in improved performance and faster execution time as more weight is given to training documents.
Keywords
fuzzy set theory; information retrieval; knowledge based systems; learning (artificial intelligence); text analysis; Internet; Reuters 21578 database; Semi-Supervised Agglomerative Hierarchical Clustering algorithm; Semi-Supervised Fuzzy-c-Means algorithm; automated assignment; binary classification; class information; document contents; information gain criterion; information retrieval; labeled data; predefined categories; semi-supervised fuzzy c-means algorithm; structure information; text categorization; text documents; textual information sources; training documents; unlabeled data; Classification algorithms; Clustering algorithms; Information retrieval; Internet; Partitioning algorithms; Spatial databases; Supervised learning; Testing; Text categorization; Unsupervised learning;
fLanguage
English
Publisher
ieee
Conference_Titel
Fuzzy Information Processing Society, 1999. NAFIPS. 18th International Conference of the North American
Conference_Location
New York, NY
Print_ISBN
0-7803-5211-4
Type
conf
DOI
10.1109/NAFIPS.1999.781756
Filename
781756