DocumentCode :
1760584
Title :
Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition
Author :
Weng, Chao ; Yu, Dong ; Seltzer, Michael L. ; Droppo, Jasha
Author_Institution :
Microsoft Res., Redmond, WA, USA
Volume :
23
Issue :
10
fYear :
2015
fDate :
Oct. 2015
Firstpage :
1670
Lastpage :
1679
Abstract :
We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data, a separate DNN to estimate senone posterior probabilities of the louder and softer speakers at each frame, a weighted finite-state transducer (WFST)-based two-talker decoder to jointly estimate and correlate the speaker and speech, a speaker switching penalty estimated from the energy pattern change in the mixed speech, and a confidence-based system combination strategy. Experiments on the 2006 speech separation and recognition challenge task demonstrate that our proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker. The best setup of our proposed systems achieves an average word error rate (WER) of 18.8% across different SNRs and outperforms the state-of-the-art IBM superhuman system by 2.8% absolute with fewer assumptions.
Keywords :
decoding; finite state machines; neural nets; probability; speaker recognition; DNN; WER; WFST-based two-talker decoder; artificially mixed speech data; confidence based system combination strategy; deep neural networks; energy pattern change; multi-style training strategy; senone posterior probabilities; single-channel multi-talker speech recognition problem; speaker switching penalty; speech separation; weighted finite-state transducer-based two-talker decoder; word error rate; Acoustics; Decoding; Hidden Markov models; Joints; Speech; Speech recognition; Training; Deep neural network (DNN); joint decoding; multi-talker automatic speech recognition (ASR); noise robustness; single-channel; weighted finite-state transducer (WFST);
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2015.2444659
Filename :
7122291