• DocumentCode
    2850899
  • Title

    The anatomy of a hierarchical clustering engine for Web-page, news and book snippets

  • Author

    Ferragina, Paolo ; Gullì, Antonio

  • Author_Institution
    Dipt. di Informatica, Università di Pisa, Italy
  • fYear
    2004
  • fDate
    1-4 Nov. 2004
  • Firstpage
    395
  • Lastpage
    398
  • Abstract
    In this paper, we investigate the Web snippet hierarchical clustering problem in its full extent by devising an algorithmic solution, and a software prototype called SnakeT (accessible at http://roquefort.di.unipi.it/), that: (1) draws the snippets from 16 Web search engines, the Amazon collection of books (a9.com), the news of Google News and the blogs of Blogline; (2) builds the clusters on-the-fly (ephemeral clustering; Maarek et al., 2000) in response to a user query, without adopting any predefined organization in categories; (3) labels the clusters with sentences of variable length, drawn from the snippets and possibly missing some terms, provided the missing terms are not too many; (4) uses ranking functions that exploit two knowledge bases, built by our engine at preprocessing time, for the sentence-selection and cluster-assignment processes; (5) organizes the clusters into a hierarchy, and assigns intelligible sentences to the nodes in order to allow post-navigation for query refinement. Our clustering algorithm may let clusters overlap at different levels of the hierarchy.
  • Keywords
    Web sites; information retrieval; knowledge based systems; pattern clustering; search engines; Amazon collection; Blogline; Google News; SnakeT; Web page; Web search engines; book snippets; cluster assignment; clusters on-the-fly; ephemeral clustering; hierarchical clustering engine; knowledge bases; query refinement; ranking functions; sentences selection; Anatomy; Blogs; Books; Clustering algorithms; Data mining; Search engines; Software algorithms; Software architecture; Surges; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
  • Print_ISBN
    0-7695-2142-8
  • Type

    conf

  • DOI
    10.1109/ICDM.2004.10027
  • Filename
    1410319