Text categorization using the semi-supervised fuzzy c-means algorithm

Author

Benkhalifa, Mohammed ; Bensaid, Amine ; Mouradi, Abdelhak

Author_Institution

Sch. of Sci. & Eng, AlAkhawayn, Morocco

fYear

1999

fDate

Jul 1999

Firstpage

561

Lastpage

565

Abstract

Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has become very important in the information retrieval area, where information needs have increased tremendously with the rapid growth of textual information sources such as the Internet. We compare, for text categorization, two partially supervised (or semi-supervised) clustering algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm (A. Amar et al., 1997) and the Semi-Supervised Fuzzy-c-Means (ssFCM) algorithm (M. Amine et al., 1996). This semi-supervised learning paradigm falls between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both the class information contained in labeled data (training documents) and the structure information possessed by unlabeled data (test documents) in order to produce better partitions for the test documents. Our experiments make use of the Reuters 21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms and takes less time than the Fuzzy-c-Means (FCM) algorithm. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC. Finally, ssFCM results in improved performance and faster execution time as more weight is given to training documents.
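
The abstract mentions selecting features by an information gain criterion for a binary (in-category vs. not-in-category) task. The sketch below illustrates one common form of that criterion; it is not taken from the paper. It assumes each document is represented as a set of terms and each label is 0 or 1, and the function names (entropy, information_gain, select_features) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs.

    docs   : list of sets of terms (assumed representation)
    labels : list of 0/1 category indicators, one per document
    """
    n = len(docs)
    pos = sum(labels)
    base = entropy([pos / n, (n - pos) / n])

    # Split the labels by presence or absence of the term.
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]

    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:
            p = sum(subset) / len(subset)
            conditional += len(subset) / n * entropy([p, 1 - p])
    return base - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(
        vocab,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]
```

With the selected terms, each document could then be turned into a vector (for example, one binary or weighted component per term) before clustering; the exact weighting used in the paper is not stated in this record.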

Keywords

fuzzy set theory; information retrieval; knowledge based systems; learning (artificial intelligence); text analysis; Internet; Reuters 21578 database; Semi-Supervised Agglomerative Hierarchical Clustering algorithm; Semi-Supervised Fuzzy-c-Means algorithm; automated assignment; binary classification; class information; document contents; information gain criterion; information retrieval; labeled data; predefined categories; semi-supervised fuzzy c-means algorithm; structure information; text categorization; text documents; textual information sources; training documents; unlabeled data; Classification algorithms; Clustering algorithms; Information retrieval; Internet; Partitioning algorithms; Spatial databases; Supervised learning; Testing; Text categorization; Unsupervised learning;

fLanguage

English

Publisher

IEEE

Conference_Titel

Fuzzy Information Processing Society, 1999. NAFIPS. 18th International Conference of the North American

Conference_Location

New York, NY

Print_ISBN

0-7803-5211-4

Type

conf

DOI

10.1109/NAFIPS.1999.781756

Filename

781756