• DocumentCode
    2521436
  • Title

    An online clustering algorithm for Chinese web snippets based on Generalized Suffix Array

  • Author

    Hui, Zhang ; Han, Wang ; Gao, Yang ; Jingmin, Zhou

  • Author_Institution
    State Key Lab. of Software Dev. Environ., Beihang Univ., Beijing, China
  • fYear
    2009
  • fDate
    10-11 Oct. 2009
  • Firstpage
    148
  • Lastpage
    154
  • Abstract
    As the information on the Internet grows dramatically, Web search engines have become an indispensable tool for locating required information. Web snippet clustering can classify search results and help users narrow the search scope. This paper presents an online clustering algorithm for Chinese Web snippets based on common substrings. The algorithm first preprocesses the results of a search engine and extracts common substrings using a Generalized Suffix Array. It then builds a snippet-snippet similarity matrix by computing the similarity between every pair of snippets under a common-substring-based dimensional model. Finally, the algorithm groups the Web snippets using an improved hierarchical clustering algorithm. Theoretical analysis and experiments show that, compared with traditional Chinese Web snippet clustering algorithms based on Chinese word segmentation, the algorithm performs better in both clustering efficiency and the readability of the generated cluster labels.
  • Keywords
    pattern clustering; search engines; unsupervised learning; word processing; Chinese Web snippets; Chinese word segmentation; Web search engine; generalized suffix array; hierarchical clustering algorithm; online clustering algorithm; snippet-snippet similarity matrix; Algorithm design and analysis; Clustering algorithms; Data mining; Internet; Programming; Search engines; Software algorithms; Sorting; Tin; Web search
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cyber-Enabled Distributed Computing and Knowledge Discovery, 2009. CyberC '09. International Conference on
  • Conference_Location
    Zhangjiajie
  • Print_ISBN
    978-1-4244-5218-7
  • Electronic_ISBN
    978-1-4244-5219-4
  • Type

    conf

  • DOI
    10.1109/CYBERC.2009.5342183
  • Filename
    5342183
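
The three stages named in the abstract (common-substring extraction with a generalized suffix array, a snippet-snippet similarity matrix, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no details, so the naive suffix sort, the binary substring vectors, the cosine similarity, the average-linkage merge rule, the threshold value, and all function names below are assumptions made only for illustration.

```python
import math
from itertools import combinations


def generalized_suffix_array(snippets):
    """Sorted (suffix, doc_id) pairs over all snippets (naive build, for illustration)."""
    suffixes = [(s[i:], d) for d, s in enumerate(snippets) for i in range(len(s))]
    suffixes.sort()
    return suffixes


def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_substrings(snippets, min_len=2):
    """Map each candidate common substring to the set of snippets containing it.

    Candidates come from adjacent suffixes that belong to different snippets,
    so this finds shared substrings without any word segmentation.
    """
    sa = generalized_suffix_array(snippets)
    candidates = set()
    for (s1, d1), (s2, d2) in zip(sa, sa[1:]):
        k = lcp(s1, s2)
        if d1 != d2 and k >= min_len:
            candidates.add(s1[:k])
    return {c: {d for d, s in enumerate(snippets) if c in s} for c in candidates}


def similarity_matrix(snippets, subs):
    """Cosine similarity between binary common-substring vectors (assumed model)."""
    keys = sorted(subs)
    vecs = [[1.0 if d in subs[k] else 0.0 for k in keys] for d in range(len(snippets))]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    n = len(snippets)
    return [[cos(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]


def cluster(sim, threshold=0.3):
    """Plain average-linkage agglomerative clustering, stopped at a similarity threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
            score = total / (len(clusters[a]) * len(clusters[b]))
            if score > best:
                best, pair = score, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


# Made-up example snippets, not data from the paper.
snippets = ["苹果手机新品发布", "苹果手机价格下降", "北京天气预报晴", "北京天气预报雨"]
subs = common_substrings(snippets)
groups = cluster(similarity_matrix(snippets, subs))
print(groups)
```

The longest substring shared by a cluster's members (here, for instance, 苹果手机) is a natural candidate for a readable cluster label, which is the kind of label the abstract contrasts with segmentation-based approaches. A production version would replace the naive suffix sort with a linear or O(n log n) suffix-array construction.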