• DocumentCode
    2724890
  • Title
    Distributed Document Clustering Using Word-clusters
  • Author
    Deb, Debzani ; Angryk, Rafal A.
  • Author_Institution
    Dept. of Comput. Sci., Montana State Univ., Bozeman, MT
  • fYear
    2007
  • fDate
    March 1 2007-April 5 2007
  • Firstpage
    376
  • Lastpage
    383
  • Abstract
    Document clustering has become an increasingly important task in analyzing large numbers of documents distributed among various sites. The challenge is to analyze this enormous number of extremely high-dimensional distributed documents and to organize them in a way that improves search and knowledge extraction without introducing much extra cost and complexity. This paper presents a distributed document clustering approach called distributed information bottleneck (DIB). DIB adopts a two-stage agglomerative information bottleneck (aIB) algorithm to generate local clusters. In the first stage, the dimensionality of the document vector is significantly reduced by finding word-clusters. These word-clusters are then used to obtain document-clusters in the second stage. DIB then extracts compact but informative local models from these document-clusters and transfers them to a central site. At the central site, local models that are likely to describe the same document set are first combined. The resulting local models are then clustered using the aIB algorithm to produce a hierarchical organization of all distributed documents. Our experimental results demonstrate the robustness, efficiency, and effectiveness of the DIB approach for clustering distributed documents.
  • Keywords
    distributed processing; document handling; pattern clustering; agglomerative information bottleneck; distributed document clustering; distributed information bottleneck; high dimensional distributed documents; high-dimensional document vector; knowledge extraction; local models; word-clusters; Clustering algorithms; Computational intelligence; Computer science; Costs; Data mining; Distributed computing; IEEE online publications; Robustness; Software libraries; USA Councils;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Data Mining, 2007. CIDM 2007. IEEE Symposium on
  • Conference_Location
    Honolulu, HI
  • Print_ISBN
    1-4244-0705-2
  • Type
    conf
  • DOI
    10.1109/CIDM.2007.368899
  • Filename
    4221323