  • DocumentCode
    2513659
  • Title
    Word Clustering Using PLSA Enhanced with Long Distance Bigrams
  • Author
    Bassiou, Nikoletta ; Kotropoulos, Constantine
  • Author_Institution
    Dept. of Inf., Aristotle Univ. of Thessaloniki, Thessaloniki, Greece
  • fYear
    2010
  • fDate
    23-26 Aug. 2010
  • Firstpage
    4226
  • Lastpage
    4229
  • Abstract
    Probabilistic latent semantic analysis (PLSA) is enhanced with long distance bigram models in order to improve word clustering. The long distance bigram probabilities and the interpolated long distance bigram probabilities at varying distances within a context capture different aspects of contextual information. In addition, the baseline bigram, which incorporates trigger-pairs for various histories, is tested in the same framework. The experimental results collected on publicly available corpora (CISI, Cranfield, Medline, and NPL) demonstrate the superiority of the long distance bigrams over the baseline bigrams, as well as the superiority of the interpolated long distance bigrams over both the long distance bigrams and the baseline bigram with trigger-pairs, in yielding more compact clusters containing fewer outliers.
  • Keywords
    interpolation; natural language processing; pattern clustering; statistical analysis; word processing; PLSA; baseline bigram; interpolated long distance bigram probabilities; long distance bigram models; probabilistic latent semantic analysis; word clustering; Clustering algorithms; Dispersion; Entropy; Harmonic analysis; History; Probabilistic logic; Semantics;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition (ICPR), 2010 20th International Conference on
  • Conference_Location
    Istanbul
  • ISSN
    1051-4651
  • Print_ISBN
    978-1-4244-7542-1
  • Type
    conf
  • DOI
    10.1109/ICPR.2010.1027
  • Filename
    5597737