• DocumentCode
    866739
  • Title
    Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization
  • Author
    Hammouda, Khaled M. ; Kamel, Mohamed S.
  • Author_Institution
    Desire2Learn Inc., Kitchener, ON
  • Volume
    21
  • Issue
    5
  • fYear
    2009
  • fDate
    5/1/2009
  • Firstpage
    681
  • Lastpage
    698
  • Abstract
    In distributed data mining, adopting a flat node distribution model can limit scalability. To address the problems of modularity, flexibility, and scalability, we propose a Hierarchically Distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher-level neighborhoods. Within a given level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up the hierarchy, where increasingly global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing an interpretation of the clusters. Results show a speedup of up to 165 times over centralized clustering for a 250-node simulated network, with clustering quality comparable to the centralized approach. We also provide a comparison to the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88% accuracy.
  • Keywords
    data mining; distributed processing; document handling; pattern clustering; peer-to-peer computing; distributed cluster summarization; distributed data mining; distributed document clusters; distributed k-means variant; distributed keyphrase extraction algorithm; flat node distribution; hierarchically distributed peer-to-peer document clustering; higher level neighborhoods; multilayer overlay network; Abstracting methods; Clustering; Data mining; Distributed systems; Text mining;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2008.189
  • Filename
    4626955
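
The abstract describes solving the clustering problem per neighborhood with a distributed K-means variant and then combining the clusterings up the hierarchy. The following is a minimal sketch of that general idea only, not the HP2PC algorithm from the paper. All function names, the greedy nearest-centroid matching in `merge_up`, the shared initial centroids, and the toy 2-D data are assumptions made for illustration.

```python
from math import dist

def local_stats(points, centroids):
    """One peer's contribution: per-cluster coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Assign the point to its nearest centroid.
        k = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return sums, counts

def neighborhood_kmeans(peers, centroids, iters=10):
    """K-means inside one neighborhood; peers share only sums and counts."""
    counts = [0] * len(centroids)
    for _ in range(iters):
        stats = [local_stats(pts, centroids) for pts in peers]
        counts = [sum(c[k] for _, c in stats) for k in range(len(centroids))]
        updated = []
        for k, old in enumerate(centroids):
            if counts[k] == 0:
                updated.append(list(old))  # keep an empty cluster in place
            else:
                total = [sum(s[k][d] for s, _ in stats) for d in range(len(old))]
                updated.append([t / counts[k] for t in total])
        centroids = updated
    return centroids, counts

def merge_up(children):
    """Supernode step (assumed): size-weighted merge of child centroids.

    Each child centroid is matched greedily to the nearest centroid of the
    first child; this matching rule is an illustrative simplification.
    """
    base, base_counts = children[0]
    sums = [[c * n for c in cen] for cen, n in zip(base, base_counts)]
    counts = list(base_counts)
    for cents, cnts in children[1:]:
        for cen, n in zip(cents, cnts):
            k = min(range(len(base)), key=lambda i: dist(cen, base[i]))
            counts[k] += n
            for d in range(len(cen)):
                sums[k][d] += cen[d] * n
    merged = [([s / n for s in row] if n else list(b))
              for row, n, b in zip(sums, counts, base)]
    return merged, counts

# Toy data: two neighborhoods, each with two peers holding 2-D points.
neighborhood_a = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
neighborhood_b = [[(2.0, 0.0), (2.0, 2.0)], [(12.0, 10.0), (12.0, 12.0)]]
init = [[0.0, 0.0], [10.0, 10.0]]

level0 = [neighborhood_kmeans(nb, init) for nb in (neighborhood_a, neighborhood_b)]
global_centroids, global_counts = merge_up(level0)
print(global_centroids, global_counts)
```

On this toy data the merged centroids are (1, 1) and (11, 11) with four points each, which equals the centralized cluster means. In general, merging per-neighborhood centroids is only an approximation of a centralized solution, which is consistent with the abstract reporting comparable rather than identical quality.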