• DocumentCode
    3099990
  • Title

    Web Document Clustering with Multi-view Information Bottleneck

  • Author

    Gao, Yan ; Gu, Shiwen ; Xia, Liming ; Fei, Yaoping

  • Author_Institution
    Fac. of Inf. Sci. & Eng., Central South Univ., Changsha
  • fYear
    2006
  • fDate
    Nov. 28 2006-Dec. 1 2006
  • Firstpage
    148
  • Lastpage
    148
  • Abstract
    Clustering is an important way to organize the large amount of information on the Web. In this paper, we study how to incorporate multiple kinds of information about a Web document, such as content, anchor text, and URL, to improve clustering performance. We propose a novel algorithm, multi-view information bottleneck (MVIB), to cluster Web documents with multi-type features. In this algorithm, a compatibility constraint that maximizes the agreement between clustering hypotheses on different views is imposed on the individual views when clustering instances. Based on these constraints, a set of clustering hypotheses that reveals much information about the correct one is obtained, and the final hypothesis is deduced from these hypotheses. We study the performance of MVIB under different view settings. Experiments on two real datasets indicate that MVIB with a 3-view setting based on content, anchor text, and URL can improve the quality of clusters more effectively.
  • Keywords
    Internet; document handling; pattern clustering; text analysis; Web document clustering; anchor text; clustering hypotheses; multitype features; multiview information bottleneck; Clustering algorithms; Computational intelligence; Data compression; Data mining; Image processing; Information science; Mutual information; Natural language processing; Random variables; Uniform resource locators;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence for Modelling, Control and Automation, 2006 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    0-7695-2731-0
  • Type

    conf

  • DOI
    10.1109/CIMCA.2006.232
  • Filename
    4052777