• DocumentCode
    2330606
  • Title
    Direct and latent modeling techniques for computing spoken document similarity
  • Author
    Hazen, Timothy J.
  • Author_Institution
    MIT Lincoln Lab., Lexington, MA, USA
  • fYear
    2010
  • fDate
    12-15 Dec. 2010
  • Firstpage
    366
  • Lastpage
    371
  • Abstract
    Document similarity measures are required for a variety of data organization and retrieval tasks, including document clustering, document link detection, and query-by-example document retrieval. In this paper, we examine existing and novel document similarity measures for use with spoken document collections processed with automatic speech recognition (ASR) technology. We compare direct vector space approaches, which apply the cosine similarity measure to feature vectors constructed with various forms of term frequency-inverse document frequency (TF-IDF) normalization, against latent topic modeling approaches based on latent Dirichlet allocation (LDA). In document link detection experiments on the Fisher Corpus, we find that an approach that applies bagging to models derived from LDA substantially outperforms the direct vector space approach.
  • Keywords
    document handling; file organisation; information retrieval; speech recognition; Fisher corpus; automatic speech recognition technology; data organization; direct modeling techniques; document link detection experiments; latent Dirichlet allocation; latent modeling techniques; query-by-example document retrieval; spoken document similarity computing; term frequency inverse document frequency normalization; document link detection; document similarity; latent topic modeling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Spoken Language Technology Workshop (SLT), 2010 IEEE
  • Conference_Location
    Berkeley, CA
  • Print_ISBN
    978-1-4244-7904-7
  • Electronic_ISBN
    978-1-4244-7902-3
  • Type
    conf
  • DOI
    10.1109/SLT.2010.5700880
  • Filename
    5700880