• DocumentCode
    2253312
  • Title

    Scalable backoff language models

  • Author

    Seymore, Kristie ; Rosenfeld, Ronald

  • Author_Institution
    Sch. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA
  • Volume
    1
  • fYear
    1996
  • fDate
    3-6 Oct 1996
  • Firstpage
    232
  • Abstract
    When a trigram backoff language model is created from a large body of text, trigrams and bigrams that occur few times in the training text are often excluded from the model in order to decrease the model size. Generally, the elimination of n-grams with very low counts is believed not to affect model performance significantly. This project investigates the degradation of a trigram backoff model's perplexity and word error rates as bigram and trigram cutoffs are increased. The advantage of reduction in model size is compared to the increase in word error rate and perplexity scores. More importantly, this project also investigates alternative ways of excluding bigrams and trigrams from a backoff language model, using criteria other than the number of times an n-gram occurs in the training text. Specifically, a difference method has been investigated where the difference in the logs of the original and backed-off trigram and bigram probabilities is used as a basis for n-gram exclusion from the model. We show that excluding trigrams and bigrams based on a weighted version of this difference method results in better perplexity and word error rate performance than excluding trigrams and bigrams based on counts alone.
  • Keywords
    computational linguistics; errors; natural language interfaces; probability; speech recognition; bigrams; difference method; model performance; model size; n-grams; natural language; probabilities; pruning; scalable backoff language models; text; training text; trigram backoff language model; word error rates; Computer science; Degradation; Error analysis; Predictive models; Sampling methods; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on
  • Conference_Location
    Philadelphia, PA
  • Print_ISBN
    0-7803-3555-4
  • Type

    conf

  • DOI
    10.1109/ICSLP.1996.607084
  • Filename
    607084
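
The difference method described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the difference is weighted, so the sketch assumes the weight is the n-gram's training count, and the function names, threshold, and toy probabilities are all hypothetical. Here `p_backed_off` stands for the probability the model would assign if the n-gram were removed and the lower-order estimate were used instead.

```python
import math


def weighted_difference(count, p_original, p_backed_off):
    """Weighted log-probability difference for one n-gram.

    Assumption: the weight is the raw training count. The abstract only
    states that a weighted version of the difference was used.
    """
    return count * (math.log(p_original) - math.log(p_backed_off))


def prune(ngrams, threshold):
    """Keep only n-grams whose weighted difference reaches the threshold.

    `ngrams` maps an n-gram tuple to (count, p_original, p_backed_off).
    """
    return {
        ngram: stats
        for ngram, stats in ngrams.items()
        if weighted_difference(*stats) >= threshold
    }


# Hypothetical toy values, not taken from the paper.
toy = {
    ("new", "york", "city"): (50, 0.40, 0.05),   # backing off loses a lot
    ("of", "the", "a"): (2, 0.010, 0.009),       # backing off loses little
}
kept = prune(toy, threshold=1.0)
```

With these toy values only the first trigram survives, because removing the second one would barely change the probability the model assigns. A count cutoff looks at the first element of each tuple only, whereas this criterion also accounts for how well the backed-off estimate already approximates the original one.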