Regional Information Center for Science and Technology - A New Approach for Clustering Variable Length Documents

DocumentCode :

3075157

Title :

A New Approach for Clustering Variable Length Documents

Author :

Kumar, Niraj ; Srinathan, Kannan

Author_Institution :

IIIT, Hyderabad

fYear :

2009

fDate :

6-7 March 2009

Firstpage :

982

Lastpage :

987

Abstract :

This paper proposes a method for clustering documents of variable length. The main idea is to (a) automatically identify 1-, 2-, and 3-grams, which reduces the dependency on large background vocabularies, learning, or complex probabilistic approaches; (b) order them by a measure of relevance developed with the help of TF-IDF and a term-weighting approach; and (c) use them, instead of a bag-of-words representation, to build a vector space model and apply known clustering methods, namely bisecting K-means, K-means, a hierarchical method (single link), and a graph-based method. Experimental results on publicly available text datasets (Cogprints and NewsGroup20) show remarkable improvements in the performance of these clustering algorithms with the new approach.
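
The pipeline outlined in the abstract (n-gram extraction, relevance ordering, vector space model, clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give their relevance measure or n-gram identification procedure, so the sketch assumes plain smoothed TF-IDF, whitespace tokenization, and a naive cosine K-means seeded with the first k documents. The function names (`ngrams`, `tfidf_vectors`, `kmeans`), the `top_k` cutoff, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def ngrams(text, max_n=3):
    """Return all 1-, 2-, and 3-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, top_k=20):
    """Build L2-normalized sparse TF-IDF vectors over n-grams.

    Assumption: plain smoothed TF-IDF stands in for the paper's relevance
    measure, and only the top_k highest-weighted terms per document are kept.
    """
    counts = [Counter(ngrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        weights = {t: (f / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, f in c.items()}
        # Order terms by relevance (ties broken alphabetically) and truncate.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        norm = math.sqrt(sum(w * w for _, w in top)) or 1.0
        vectors.append({t: w / norm for t, w in top})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iters=20):
    """Naive K-means with cosine similarity; the first k vectors seed the centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            summed = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        summed[t] += w
            if summed:  # leave an empty cluster's centroid unchanged
                norm = math.sqrt(sum(w * w for w in summed.values())) or 1.0
                centroids[j] = {t: w / norm for t, w in summed.items()}
    return labels


# Hypothetical sample documents of differing length and topic.
docs = [
    "clustering algorithms group similar documents",
    "the football team won the match",
    "document clustering uses clustering algorithms",
    "the team played a football match",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

Because each vector is L2-normalized, a long document does not dominate a short one simply by containing more terms, which is one simple way to handle variable document length. The other methods named in the abstract (bisecting K-means, single-link hierarchical, and graph-based clustering) could be applied to the same vectors.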

Keywords :

document handling; learning (artificial intelligence); pattern clustering; vocabulary; K-means clustering; automatic identification; background vocabulary support; complex probabilistic approach; learning; term-weighting approach; variable length documents clustering; Classification tree analysis; Clustering algorithms; Clustering methods; Extraterrestrial measurements; Partitioning algorithms; Vocabulary; Bisecting K-means; Clustering algorithms; Document clustering; K-means; Vector Space Model; hierarchical methods

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Advance Computing Conference, 2009. IACC 2009. IEEE International

Conference_Location :

Patiala

Print_ISBN :

978-1-4244-2927-1

Electronic_ISBN :

978-1-4244-2928-8

Type :

conf

DOI :

10.1109/IADCC.2009.4809148

Filename :

4809148

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3075157