Author :
Dias, Gaël ; Cleuziou, Guillaume ; Machado, David
Author_Institution :
HULTIG, Univ. of Beira Interior, Covilha, Portugal
Abstract :
Ephemeral clustering has been studied for more than a decade, yet user acceptance remains low. In our view, this is mainly due to (1) an excessive number of generated clusters, which makes browsing difficult, and (2) low-quality labeling, which introduces imprecision into the search process. In this paper, our motivation is twofold. First, we propose to reduce the number of clusters of Web page results while preserving all the different query meanings. For that purpose, we propose a new polythetic methodology based on an informative similarity measure, the InfoSimba, and a new hierarchical clustering algorithm, the HISGK-means. Second, we propose a theoretical background for defining meaningful cluster labels, embedded in the definition of the HISGK-means algorithm, which may select as the best label words from outside the given cluster. To confirm our intuitions, we propose a new evaluation framework, which shows that we are able to extract most of the important query meanings while generating far fewer clusters than state-of-the-art systems.
Keywords :
information retrieval; pattern clustering; HISGK-means algorithm; InfoSimba; Web page results; cluster number reduction; evaluation framework; informative polythetic hierarchical ephemeral clustering; informative similarity measure; query meanings; Clustering algorithms; Context; Convergence; Equations; Partitioning algorithms; Taxonomy; Web pages; Automatic Cluster and Label Evaluation; Hierarchical Ephemeral Clustering; Informative Similarity Measure; Polythetic Web Snippet Representation;