• DocumentCode
    45708
  • Title
    Large Scale Distributed Acoustic Modeling With Back-Off N-Grams
  • Author
    Chelba, C.; Xu, Peng; Pereira, Fernando; Richardson, Tom
  • Author_Institution
    Google, Inc., Mountain View, CA, USA
  • Volume
    21
  • Issue
    6
  • fYear
    2013
  • fDate
    June 2013
  • Firstpage
    1158
  • Lastpage
    1169
  • Abstract
    The paper revives an older approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data and the model size (as measured by the number of parameters in the model) to approximately 100 times larger than the sizes currently used in automatic speech recognition. In such a data-rich setting, we can expand the phonetic context significantly beyond triphones, as well as increase the number of Gaussian mixture components for the context-dependent states that allow it. We have experimented with contexts that span seven or more context-independent phones, and up to 620 mixture components per state. Unseen phonetic contexts are handled with the familiar back-off technique from language modeling, chosen for its implementation simplicity. The back-off acoustic model is estimated, stored, and served using MapReduce distributed computing infrastructure. Speech recognition experiments are carried out in an N-best list rescoring framework for Google Voice Search. Training big models on large amounts of data proves to be an effective way to increase the accuracy of a state-of-the-art automatic speech recognition system. We use 87,000 hours of training data (speech along with transcription) obtained by filtering utterances in Voice Search logs on automatic speech recognition confidence. Models ranging in size from 20 to 40 million Gaussians are estimated using maximum likelihood training. They achieve relative reductions in word error rate of 11% and 6% when combined with first-pass models trained using maximum likelihood and boosted maximum mutual information, respectively. Increasing the context size beyond five phones (quinphones) does not help.
  • Keywords
    Gaussian distribution; filtering theory; speech recognition; Gaussian mixture components; Gaussians estimation; Google voice search; MapReduce distributed computing infrastructure; N-best list rescoring framework; automatic speech recognition confidence; back-off acoustic model; back-off n-grams; back-off technique; context size beyond five phones; context-dependent states; context-independent phones; data-rich setting; filtering utterances; first-pass models; implementation simplicity; large scale distributed acoustic modeling; maximum likelihood training; maximum mutual information; n-gram language modeling; quinphones; speech recognition experiments; state-of-the-art automatic speech recognition system; training data; triphones; unseen phonetic contexts; voice search logs; word-error-rate; Acoustics; Context; Data models; Hidden Markov models; Speech; Training; Training data; Automatic speech recognition; acoustic modeling; back-off; distributed storage; hidden Markov models; n-gram; phonetic context;
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type
    jour
  • DOI
    10.1109/TASL.2013.2245649
  • Filename
    6451161