DocumentCode :
672381
Title :
Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription
Author :
Liao, Hank ; McDermott, Erik ; Senior, Andrew
fYear :
2013
fDate :
8-12 Dec. 2013
Firstpage :
368
Lastpage :
373
Abstract :
YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hearing impaired and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009, YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now supports ten different languages. This paper describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories. Applying an “island of confidence” filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context-dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13% relative compared to previously reported sequence-trained DNN results for this task.
Keywords :
Gaussian processes; mixture models; neural nets; social networking (online); speech recognition; DNN; English speech; YouTube video transcription; automatic speech recognition systems; large scale deep neural network acoustic modeling; neural networks acoustic models; semi-supervised training data; standard adapted Gaussian mixture model; video sharing website; Acoustics; Approximation methods; Context; Data models; Hidden Markov models; Training; YouTube; Large vocabulary speech recognition; audio indexing; deep learning; deep neural networks
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on
Conference_Location :
Olomouc
Type :
conf
DOI :
10.1109/ASRU.2013.6707758
Filename :
6707758