• DocumentCode
    302105
  • Title

    Variable-order N-gram generation by word-class splitting and consecutive word grouping

  • Author

    Masataki, Hirokazu ; Sagisaka, Yoshinori

  • Author_Institution
    ATR Interpreting Telephony Res. Labs., Kyoto, Japan
  • Volume
    1
  • fYear
    1996
  • fDate
    7-10 May 1996
  • Firstpage
    188
  • Abstract
    In this paper, a generation scheme for variable-order N-grams is proposed to attain reliable statistical constraints from a given language corpus. Starting from part-of-speech (POS) bigrams, the proposed scheme creates variable-order N-grams by splitting a POS into finer groups and by adding frequent consecutive word sequences as word-classes. This word-class splitting and consecutive word grouping are carried out incrementally by minimizing the total entropy. Experiments showed that the perplexity of the proposed model on the test corpus is lower than that of a conventional trigram and that the model requires considerably fewer statistical parameters. Applied to speech recognition, the model yields a better recognition rate than conventional bigrams.
  • Keywords
    minimum entropy methods; natural languages; speech recognition; statistical analysis; POS bigrams; consecutive word grouping; consecutive word sequences; language corpus; perplexity; statistical constraints; test corpus; total entropy; variable-order N-gram generation; word-class splitting; Bellows; Data mining; Entropy; History; Probability; Smoothing methods; Statistical distributions; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on
  • Conference_Location
    Atlanta, GA
  • ISSN
    1520-6149
  • Print_ISBN
    0-7803-3192-3
  • Type

    conf

  • DOI
    10.1109/ICASSP.1996.540322
  • Filename
    540322
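
The abstract's central idea, that splitting a word-class can lower the total entropy of a class bigram model, can be illustrated with a small sketch. This is not the authors' procedure: the toy corpus, the class names, the function names, and the use of unsmoothed maximum-likelihood estimates on the training text itself are all assumptions made for illustration. The model approximates P(w_i | w_{i-1}) as P(c_i | c_{i-1}) * P(w_i | c_i), where each word belongs to exactly one class.

```python
import math
from collections import Counter


def class_bigram_entropy(words, word_class):
    """Total entropy in bits of `words` under a class bigram model.

    Illustrative assumption: probabilities are unsmoothed
    maximum-likelihood estimates taken from `words` itself, and
    every word maps to exactly one class in `word_class`.
    """
    classes = [word_class[w] for w in words]
    class_count = Counter(classes)
    prev_count = Counter(classes[:-1])          # classes that have a successor
    pair_count = Counter(zip(classes[:-1], classes[1:]))
    word_count = Counter(words)

    total = 0.0
    for i in range(1, len(words)):
        prev_c, cur_c = classes[i - 1], classes[i]
        p_class = pair_count[(prev_c, cur_c)] / prev_count[prev_c]
        p_word = word_count[words[i]] / class_count[cur_c]
        total -= math.log2(p_class * p_word)
    return total


def perplexity(total_entropy_bits, n_predictions):
    """Perplexity is 2 raised to the per-prediction entropy."""
    return 2.0 ** (total_entropy_bits / n_predictions)


# Hypothetical toy corpus and POS-like classes.
corpus = "the cat sat on the mat the dog sat on the rug".split()

coarse = {"the": "DET", "sat": "VERB", "on": "PREP",
          "cat": "NOUN", "dog": "NOUN", "mat": "NOUN", "rug": "NOUN"}

# Split NOUN into two finer classes according to what follows them.
split = dict(coarse, cat="NOUN_A", dog="NOUN_A", mat="NOUN_B", rug="NOUN_B")

entropy_coarse = class_bigram_entropy(corpus, coarse)
entropy_split = class_bigram_entropy(corpus, split)

print(f"coarse classes: {entropy_coarse:.3f} bits, "
      f"perplexity {perplexity(entropy_coarse, len(corpus) - 1):.3f}")
print(f"split classes:  {entropy_split:.3f} bits, "
      f"perplexity {perplexity(entropy_split, len(corpus) - 1):.3f}")
```

On this toy text the split lowers the total entropy because the two noun groups are followed by different classes. A scheme like the one described in the abstract would presumably compare many candidate splits and word groupings and keep the one with the largest entropy reduction; the selection and stopping criteria are not given in this record, so they are left out here.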