• DocumentCode
    1900202
  • Title
    Improved Document Clustering using k-means algorithm
  • Author
    Bide, Pramod ; Shedge, Rajashree
  • Author_Institution
    Dept. Comput. Eng., Ramrao Adik Inst. of Technol., Navi Mumbai, India
  • fYear
    2015
  • fDate
    5-7 March 2015
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    Searching for similar documents plays a crucial role in document management. Because the number of documents grows tremendously day by day, it is essential to segregate them into proper clusters. Forensic investigation requires fast categorization of documents, but analyzing these documents is very difficult, so multiple collections of documents need to be separated into groups of similar ones through clustering. Existing partitioning algorithms require the number of clusters to be specified in advance, and their output depends entirely on that input. Over-clustering is the major problem in document clustering. The proposed algorithm takes as input the keywords found after extraction and solves the over-clustering problem by dividing the documents into small groups using a divide-and-conquer strategy. This paper presents an Improved Document Clustering algorithm that generates the number of clusters for any set of text documents and uses cosine similarity measures to place similar documents in the proper clusters. Experimental results show that the proposed algorithm achieves higher accuracy than the existing algorithm in terms of F-measure and time complexity.
  • Keywords
    digital forensics; divide and conquer methods; pattern clustering; text analysis; cosine similarity measures; divide and conquer strategy; document categorization; document clustering algorithm; document management; forensic investigation; k-means algorithm; partitioning algorithms; similar document searching; text documents; Clustering algorithms; Cosine Similarity; Divide and Conquer; Document Clustering; Tf-Idf; Threshold;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electrical, Computer and Communication Technologies (ICECCT), 2015 IEEE International Conference on
  • Conference_Location
    Coimbatore
  • Print_ISBN
    978-1-4799-6084-2
  • Type
    conf
  • DOI
    10.1109/ICECCT.2015.7226065
  • Filename
    7226065
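
The abstract names three ingredients: Tf-Idf keyword weights, cosine similarity, and a threshold that lets the number of clusters emerge from the data instead of being fixed in advance. The record does not give the authors' actual procedure, so the following is only a minimal sketch of that general idea, not the paper's algorithm. The function names (`tfidf_vectors`, `cosine`, `threshold_cluster`), the smoothed idf formula, and the rule of comparing each document with the first member of every existing cluster are all assumptions made for illustration; the divide-and-conquer step is not modelled.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumption: a smoothed idf, log((1 + n) / (1 + df)) + 1, is used so that
    terms occurring in every document keep a non-zero weight.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_cluster(vectors, threshold):
    """Group document indices without fixing the number of clusters up front.

    Hypothetical rule: each document joins the existing cluster whose first
    member is most similar to it, provided that similarity reaches the
    threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, vectors[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters
```

Called on a list of keyword lists, for example `threshold_cluster(tfidf_vectors(docs), 0.3)`, the sketch returns groups of document indices. A higher threshold yields more, smaller clusters and a lower one yields fewer, larger clusters, which is the knob that replaces the fixed cluster count of plain k-means in this illustration.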