• DocumentCode
    323765
  • Title
    Language-model optimization by mapping of corpora
  • Author
    Klakow, Dietrich
  • Author_Institution
    Philips GmbH Forschungslab., Aachen, Germany
  • Volume
    2
  • fYear
    1998
  • fDate
    12-15 May 1998
  • Firstpage
    701
  • Abstract
    It is questionable whether words are really the best basic units for estimating stochastic language models; grouping frequent word sequences into phrases can improve language models. More generally, we have investigated various coding schemes for a corpus. In this paper, this framework is applied to optimize the perplexity of n-gram language models. In tests on two large corpora (WSJ and BNA), the bigram perplexity was reduced by up to 29%. Furthermore, this approach makes it possible to tackle the problem of an open vocabulary with no unknown word.
  • Keywords
    grammars; natural languages; optimisation; speech processing; speech recognition; stochastic processes; BNA; WSJ; automatic speech recognition; bigram perplexity; coding schemes; corpora mapping; correlation; frequent word sequences grouping; language-model optimization; n-gram language models; open vocabulary; phrases; stochastic language models; tests; Frequency; Law; Legal factors; Mutual information; Testing; Vocabulary
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98)
  • Conference_Location
    Seattle, WA
  • ISSN
    1520-6149
  • Print_ISBN
    0-7803-4428-6
  • Type
    conf
  • DOI
    10.1109/ICASSP.1998.675361
  • Filename
    675361