Title :
A New Approach for Clustering Variable Length Documents
Author :
Kumar, Niraj ; Srinathan, Kannan
Author_Institution :
IIIT, Hyderabad
Abstract :
This paper proposes a method for clustering documents of variable length. The main idea is to (a) automatically identify 1-, 2-, and 3-grams (to reduce the dependency on huge background vocabulary support, learning, or complex probabilistic approaches), (b) order them by a measure of relevance developed with the help of Tf-Idf and a term-weighting approach, and finally (c) use them (instead of a bag-of-words approach) to create a vector space model and apply known clustering methods, namely Bisecting K-means, K-means, the hierarchical method (single link), and a graph-based method. Our experimental results on publicly available text datasets (Cogprints and NewsGroup20) show remarkable improvements in the performance of these clustering algorithms with this new approach.
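The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the relevance measure, so plain smoothed TF-IDF stands in for it, and the function names (`ngrams`, `tfidf_vectors`, `cosine`, `kmeans`), whitespace tokenization, and first-k seeding are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(text, max_n=3):
    """Return all 1-, 2-, and 3-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, max_n=3):
    """Map each document to a sparse {ngram: weight} vector.

    Assumption: smoothed TF-IDF stands in for the paper's relevance measure.
    """
    counts = [Counter(ngrams(d, max_n)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({t: (f / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t, f in c.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def kmeans(vectors, k, iters=10):
    """Basic K-means using cosine similarity.

    Assumption: the first k documents seed the centroids, for determinism.
    """
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                acc = Counter()
                for m in members:
                    for t, w in m.items():
                        acc[t] += w
                centroids[j] = {t: w / len(members) for t, w in acc.items()}
    return labels

docs = [
    "k-means clustering of text documents",
    "stock market prices fell sharply",
    "hierarchical clustering of text documents",
    "stock market prices rose today",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

Because the n-gram vectors are ordinary sparse term-weight maps, the same `tfidf_vectors` output could be passed to any of the other clustering methods the abstract lists; only the `kmeans` step would change.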
Keywords :
document handling; learning (artificial intelligence); pattern clustering; vocabulary; K-means clustering; automatic identification; background vocabulary support; complex probabilistic approach; learning; term-weighting approach; variable length documents clustering; Classification tree analysis; Clustering algorithms; Clustering methods; Extraterrestrial measurements; Partitioning algorithms; Vocabulary; Bisecting K-means; Document clustering; K-means; Vector Space Model; hierarchical methods
Conference_Title :
Advance Computing Conference, 2009. IACC 2009. IEEE International
Conference_Location :
Patiala
Print_ISBN :
978-1-4244-2927-1
Electronic_ISBN :
978-1-4244-2928-8
DOI :
10.1109/IADCC.2009.4809148