Title :
A Simple and Fast Term Selection Procedure for Text Clustering
Author :
Gonzaga, Luiz ; Grivet, Marco ; Tereza Vasconcelos, A.
Author_Institution :
Laboratorio Nacional de Computacao Cientifica, Rio de Janeiro
Abstract :
Text clustering is currently receiving considerable attention in areas such as text mining and information retrieval. A starting point for clustering methods applied to unstructured document collections is the creation of a vector-space model, usually known as the bag-of-words model [1]. Documents are then usually described by a matrix that is huge and extremely sparse, owing to the very large number of terms describing the set of documents. Although several techniques can be employed to reduce this number, the final figure is still high, leading to a feature space of high dimensionality. This paper presents a simple procedure that not only considerably reduces the dimensionality of the feature space, and hence the processing time, but also produces clustering performance comparable to or even better than that obtained with the full set of terms.
Keywords :
data mining; data reduction; information retrieval; pattern clustering; sparse matrices; text analysis; dimensionality reduction; feature space; information retrieval; sparse matrix; text clustering; text mining; unstructured document collection; vector-space model; Abstracts; Broadcasting; Clustering algorithms; Clustering methods; Frequency; Information retrieval; Intelligent systems; Neodymium; Sparse matrices; Text mining;
Conference_Titel :
Intelligent Systems Design and Applications, 2007. ISDA 2007. Seventh International Conference on
Conference_Location :
Rio de Janeiro
Print_ISBN :
978-0-7695-2976-9
DOI :
10.1109/ISDA.2007.15