• DocumentCode
    1085928
  • Title
    Efficient Phrase-Based Document Similarity for Clustering
  • Author
    Chim, Hung; Deng, Xiaotie
  • Author_Institution
    City Univ. of Hong Kong, Hong Kong
  • Volume
    20
  • Issue
    9
  • fYear
    2008
  • Firstpage
    1217
  • Lastpage
    1229
  • Abstract
    In this paper, we propose a phrase-based document similarity to compute the pair-wise similarities of documents based on the suffix tree document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the vector space document (VSD) model, the phrase-based document similarity naturally inherits the tf-idf term weighting scheme when computing document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective at clustering the documents of two standard document benchmark corpora, OHSUMED and RCV1. The quality of the clustering results significantly surpasses that of the traditional single-word tf-idf similarity measure in the same HAC algorithm, especially on large document data sets. Furthermore, by studying the properties of the STD model, we conclude that the feature vector of phrase terms in the STD model can be considered an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion sufficiently explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.
  • Keywords
    pattern clustering; text analysis; trees (mathematics); document clustering; feature vector; group-average hierarchical agglomerative clustering algorithm; pair-wise document similarity; phrase-based document similarity; suffix tree document model; vector space document model; Clustering; Linguistic processing; Trees
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2008.50
  • Filename
    4459328
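
The pipeline summarized in the abstract (phrases as vector-space features, tf-idf weighting, then group-average HAC) can be illustrated with a minimal Python sketch. This is not the authors' implementation. As a stated simplification, it enumerates contiguous word sequences of up to `max_len` words instead of building a suffix tree, so it only approximates the feature set of the STD model. The weighting `(1 + log tf) * log(1 + N / df)` is one common tf-idf variant chosen here as an assumption, and "group average" is taken to mean the mean pairwise similarity between members of the two clusters being compared. All function and parameter names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def phrase_features(text, max_len=3):
    """Count every contiguous word sequence of 1..max_len words.

    Simplified stand-in for the nodes of a suffix tree (assumption).
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def tfidf_vectors(docs, max_len=3):
    """Weight each phrase by (1 + log tf) * log(1 + N / df) (assumed variant)."""
    tfs = [phrase_features(d, max_len) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per phrase
    n_docs = len(docs)
    return [
        {p: (1 + math.log(c)) * math.log(1 + n_docs / df[p]) for p, c in tf.items()}
        for tf in tfs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_average_hac(vectors, n_clusters):
    """Repeatedly merge the two clusters with the highest average
    pairwise similarity until n_clusters remain."""
    sim = {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(sim[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: avg_sim(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


if __name__ == "__main__":
    docs = [
        "cat sat on the mat",
        "cat sat on the rug",
        "stock markets fell sharply",
        "stock markets fell today",
    ]
    print(group_average_hac(tfidf_vectors(docs), n_clusters=2))
```

Because multi-word phrases are added alongside single words, two documents that share a long word sequence gain extra matching features beyond their shared vocabulary, which is the sense in which the abstract describes the phrase vector as an expansion of the single-word vector. The naive merge loop recomputes cluster averages on every pass and is meant for small illustrations only.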