DocumentCode
1900202
Title
Improved Document Clustering using k-means algorithm
Author
Bide, Pramod ; Shedge, Rajashree
Author_Institution
Dept. Comput. Eng., Ramrao Adik Inst. of Technol., Navi Mumbai, India
fYear
2015
fDate
5-7 March 2015
Firstpage
1
Lastpage
5
Abstract
Searching for similar documents plays a crucial role in document management. Because the number of documents increases tremendously day by day, it is essential to segregate them into proper clusters. Forensic investigation requires fast categorization of documents, but analyzing these documents is very difficult. There is therefore a need to separate multiple collections of documents into groups of similar ones through clustering. Existing partitioning algorithms require the number of clusters to be specified in advance, and their output depends entirely on that input. Over-clustering is the major problem in document clustering. The proposed algorithm takes as input the keywords found after extraction and solves the over-clustering problem by dividing the documents into small groups using a divide-and-conquer strategy. This paper presents an Improved Document Clustering algorithm that generates the number of clusters for any set of text documents and uses cosine similarity measures to place similar documents in proper clusters. Experimental results show that the proposed algorithm outperforms the existing algorithm in terms of F-measure and time complexity.
Keywords
digital forensics; divide and conquer methods; pattern clustering; text analysis; cosine similarity measures; divide and conquer strategy; document categorization; document clustering algorithm; document management; forensic investigation; k-means algorithm; partitioning algorithms; similar document searching; text documents; Clustering algorithms; Cosine Similarity; Divide and Conquer; Document Clustering; Tf-Idf; Threshold
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical, Computer and Communication Technologies (ICECCT), 2015 IEEE International Conference on
Conference_Location
Coimbatore
Print_ISBN
978-1-4799-6084-2
Type
conf
DOI
10.1109/ICECCT.2015.7226065
Filename
7226065