A Simple and Fast Term Selection Procedure for Text Clustering

Author

Gonzaga, Luiz ; Grivet, Marco ; Tereza Vasconcelos, A.

Author_Institution

Laboratorio Nacional de Computacao Cientifica, Rio de Janeiro

fYear

2007

fDate

20-24 Oct. 2007

Firstpage

777

Lastpage

781

Abstract

Text clustering is currently receiving considerable attention in areas such as text mining and information retrieval. A starting point for clustering methods applied to unstructured document collections is the creation of a vector-space model, usually known as the bag-of-words model [1]. Documents are then usually described by a matrix that is huge and extremely sparse because of the very large number of terms describing the set of documents. Although several techniques can be employed to reduce this number, the final figure is still high, leading to a feature space of high dimensionality. This paper presents a simple procedure that not only considerably reduces the dimensionality of the feature space, and hence the processing time, but also produces clustering performance comparable to, or even better than, that obtained with the full set of terms.
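
As a rough illustration of the bag-of-words matrix and its sparsity mentioned in the abstract, the following minimal Python sketch builds a small document-term matrix and then drops rare terms. The abstract does not describe the paper's actual selection procedure, so the document-frequency filter here is only a generic stand-in, and the function names, the `min_df` parameter, and the sample documents are all assumptions made for this example.

```python
from collections import Counter


def build_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def sparsity(rows):
    """Fraction of zero entries in the document-term matrix."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0


def select_terms_by_doc_frequency(vocabulary, rows, min_df=2):
    """Keep terms that occur in at least min_df documents.

    Illustrative filter only; NOT the procedure proposed in the paper.
    """
    keep = [
        j for j in range(len(vocabulary))
        if sum(1 for row in rows if row[j] > 0) >= min_df
    ]
    reduced_vocabulary = [vocabulary[j] for j in keep]
    reduced_rows = [[row[j] for j in keep] for row in rows]
    return reduced_vocabulary, reduced_rows


# Hypothetical toy collection, invented for this sketch.
docs = [
    "text clustering groups similar documents",
    "text mining extracts patterns from documents",
    "sparse matrices store mostly zero entries",
]

vocabulary, rows = build_term_matrix(docs)
reduced_vocabulary, reduced_rows = select_terms_by_doc_frequency(vocabulary, rows)

print(len(vocabulary), round(sparsity(rows), 3))
print(reduced_vocabulary)
```

Even on three short documents most matrix entries are zero, and a simple filter shrinks the vocabulary sharply. Whether clustering quality is preserved after such a reduction is the question the paper addresses; this sketch does not evaluate it.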

Keywords

data mining; data reduction; information retrieval; pattern clustering; sparse matrices; text analysis; dimensionality reduction; feature space; information retrieval; sparse matrix; text clustering; text mining; unstructured document collection; vector-space model; Abstracts; Broadcasting; Clustering algorithms; Clustering methods; Frequency; Information retrieval; Intelligent systems; Neodymium; Sparse matrices; Text mining;

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Systems Design and Applications, 2007. ISDA 2007. Seventh International Conference on

Conference_Location

Rio de Janeiro

Print_ISBN

978-0-7695-2976-9

Type

conf

DOI

10.1109/ISDA.2007.15

Filename

4389702