• DocumentCode
    638034
  • Title

    Clustering documents using tagging communities and semantic proximity

  • Author

    Cunha, Eugenia ; Figueira, A. ; Mealha, Oscar

  • Author_Institution
    CRACS&INESC TEC, Porto, Portugal
  • fYear
    2013
  • fDate
    19-22 June 2013
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Euclidean distance and cosine similarity are frequently used measures for implementing the k-means clustering algorithm. Cosine similarity is widely used because of its independence from document length, which allows patterns to be identified: two documents can be seen as identical if they share the same words but have different frequencies. However, during each clustering iteration the new centroids are still computed following Euclidean distance. Based on a consideration of these two measures, we propose the k-Communities clustering algorithm (k-C), which changes how new centroids are computed when cosine similarity is used. It begins by selecting the seeds from a network of tags on which a community detection algorithm has been run. Each seed is the document with the greatest degree inside its community. The experimental results, obtained with external evaluation measures, show that the k-C algorithm is more effective than both k-means and k-means++. In addition, we computed all the external evaluation measures using both a manual and an automatic “Ground Truth”, and the results show a strong correlation, which is a strong indicator that tests can be performed with this kind of measure even if the dataset structure is unknown.
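The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It only shows (a) why cosine similarity treats two documents with the same words but different frequencies as identical while Euclidean distance does not, and (b) one plausible reading of the seeding step, where the node with the highest degree inside each community becomes a seed. The names `cosine_similarity`, `euclidean_distance`, `pick_seeds`, the toy vectors, and the toy graph are all hypothetical; the community assignment is assumed to be given, since the abstract does not name the community detection algorithm.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two term-frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_seeds(edges, community_of):
    """Return {community: node} with the highest intra-community degree.

    Assumption: nodes stand for documents and `community_of` comes from
    some community detection step that is not reproduced here.
    """
    degree = {node: 0 for node in community_of}
    for u, v in edges:
        # Only edges inside a community count towards the degree.
        if community_of[u] == community_of[v]:
            degree[u] += 1
            degree[v] += 1
    seeds = {}
    for node in sorted(degree):  # sorted so that ties break deterministically
        community = community_of[node]
        if community not in seeds or degree[node] > degree[seeds[community]]:
            seeds[community] = node
    return seeds


# Same words, doubled frequencies: same direction, different length.
doc_a = [1, 2, 0, 3]
doc_b = [2, 4, 0, 6]

# Toy graph with two communities, "A" and "B".
edges = [("d1", "d2"), ("d1", "d3"), ("d3", "d4"), ("d4", "d5"), ("d4", "d6")]
community_of = {"d1": "A", "d2": "A", "d3": "A", "d4": "B", "d5": "B", "d6": "B"}

if __name__ == "__main__":
    print(cosine_similarity(doc_a, doc_b))
    print(euclidean_distance(doc_a, doc_b))
    print(pick_seeds(edges, community_of))
```

In this toy data the two documents point in the same direction, so their cosine similarity is 1 (up to floating-point rounding) even though their Euclidean distance is non-zero. The seeds chosen are `d1` and `d4`, the nodes with the most edges inside their own communities.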
  • Keywords
    document handling; pattern clustering; Euclidean distance; cosine similarity; document clustering; document length; k-C algorithm; k-communities clustering algorithm; k-means clustering algorithm; semantic proximity; tagging communities; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Communities; Euclidean distance; Indexes; Partitioning algorithms; clustering; community detection; cosine similarity; effectiveness; k-Communities; k-means; tagging
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Systems and Technologies (CISTI), 2013 8th Iberian Conference on
  • Conference_Location
    Lisboa
  • Type

    conf

  • Filename
    6615753