DocumentCode :
1691486
Title :
Audio-visual deep learning for noise robust speech recognition
Author :
Huang, Jing ; Kingsbury, Brian
Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
fYear :
2013
Firstpage :
7596
Lastpage :
7599
Abstract :
Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models for automatic speech recognition. In this work we use DBNs for audio-visual speech recognition; in particular, we use deep learning from audio and visual features for noise-robust speech recognition. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative to a baseline multi-stream audio-visual GMM/HMM system.
Keywords :
Gaussian distribution; belief networks; hidden Markov models; learning (artificial intelligence); speech recognition; DBN; Gaussian mixture models; audio visual deep learning; audio visual speech recognition; automatic speech recognition; decision fusion method; deep belief networks; feature fusion method; multistream audio visual GMM/HMM system; noise robust speech recognition; word error rate; Acoustics; Hidden Markov models; Noise measurement; Speech; Speech recognition; Training; Visualization; Audio-visual speech recognition; Deep belief networks; Noise robustness;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
Conference_Location :
Vancouver, BC
ISSN :
1520-6149
Type :
conf
DOI :
10.1109/ICASSP.2013.6639140
Filename :
6639140