Regional Information Center for Science and Technology - Document Representation and Dimension Reduction for Text Clustering

DocumentCode :

2358890

Title :

Document Representation and Dimension Reduction for Text Clustering

Author :

Shafiei, Mahdi ; Wang, Singer ; Zhang, Roger ; Milios, Evangelos ; Tang, Bin ; Tougas, Jane ; Spiteri, Ray

Author_Institution :

Dalhousie Univ., Halifax

fYear :

2007

fDate :

17-20 April 2007

Firstpage :

770

Lastpage :

779

Abstract :

Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted, in which three different document representation methods for text are used, together with three Dimension Reduction Techniques (DRT), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are based on the vector space model, and they include word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on Document Frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representation, ICA generally gives better results compared with LSI. Experiments also show that the word representation gives better clustering results compared to term and N-gram representation. Finally, for the N-gram representation, it is demonstrated that a profile length (before dimensionality reduction) of 2000 is sufficient to capture the information and, in most cases, a 4-gram representation gives better performance than 3-gram representation.
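
As a rough illustration of two ideas named in the abstract, the sketch below builds a character N-gram vector space representation and then keeps only the N-grams with the highest document frequency. It is a minimal sketch and not the authors' implementation: the function names, the `keep` parameter, the use of raw counts as weights, and the three toy documents are all assumptions made for this example. The record does not describe the paper's preprocessing, weighting, or profile construction.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def select_by_document_frequency(docs, n=4, keep=2000):
    """Keep the `keep` n-grams that occur in the largest number of documents.

    `keep` is a hypothetical parameter for this sketch.
    """
    df = Counter()
    for doc in docs:
        # A set counts each n-gram at most once per document.
        df.update(set(char_ngrams(doc, n)))
    # Highest document frequency first; ties broken alphabetically.
    ranked = sorted(df, key=lambda gram: (-df[gram], gram))
    return ranked[:keep]


def vectorize(doc, vocabulary, n=4):
    """Map a document to raw n-gram counts over the selected vocabulary."""
    counts = Counter(char_ngrams(doc, n))
    return [counts[gram] for gram in vocabulary]


# Toy documents, invented for illustration only.
docs = ["text mining", "text clustering", "data mining"]
vocab = select_by_document_frequency(docs, n=4, keep=5)
vectors = [vectorize(d, vocab) for d in docs]
```

Each entry of `vectors` has one component per retained N-gram, so the vectors could then be passed to a clustering routine such as k-means. LSI and ICA, the other two reduction methods the abstract names, project the vectors onto new axes rather than dropping features, and are not shown here.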

Keywords :

data mining; data reduction; data structures; feature extraction; independent component analysis; natural languages; pattern clustering; text analysis; character N-gram representation; dimension reduction; document frequency; document representation; feature selection; independent component analysis; k-means clustering; latent semantic indexing; multiword term; natural language; text clustering; text mining; vector space model; word representation; Clustering algorithms; Communications technology; Computer science; Data mining; Frequency; Independent component analysis; Indexing; Large scale integration; Natural languages; Text mining;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Engineering Workshop, 2007 IEEE 23rd International Conference on

Conference_Location :

Istanbul

Print_ISBN :

978-1-4244-0832-0

Electronic_ISBN :

978-1-4244-0832-0

Type :

conf

DOI :

10.1109/ICDEW.2007.4401066

Filename :

4401066

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2358890