Title :
Audio-visual deep learning for noise robust speech recognition
Author :
Huang, Jing; Kingsbury, Brian
Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
Abstract :
Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models for automatic speech recognition. In this work we apply DBNs to audio-visual speech recognition; in particular, we use deep learning on audio and visual features for noise robust speech recognition. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative to a baseline multi-stream audio-visual GMM/HMM system.
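
The two fusion strategies named in the abstract can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' implementation: the SimpleDBN class, the layer sizes, the number of HMM states, and the stream weight of 0.7 are all hypothetical stand-ins. Decision fusion combines the per-stream state posteriors in the log domain with a stream weight; feature fusion concatenates the last-hidden-layer (mid-level) activations of both single-modality networks and classifies them with a joint output layer.

"""Sketch of the two DBN fusion strategies described in the abstract.

Hypothetical placeholders throughout: network sizes, random weights, the
number of HMM states, and the stream weight are illustrative only.
"""
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class SimpleDBN:
    """A feed-forward stack standing in for a pretrained single-modality DBN."""

    def __init__(self, dims, n_states):
        self.weights = [rng.standard_normal((a, b)) * 0.1
                        for a, b in zip(dims[:-1], dims[1:])]
        self.out = rng.standard_normal((dims[-1], n_states)) * 0.1

    def hidden(self, x):
        """Mid-level features: activations of the last hidden layer."""
        h = x
        for w in self.weights:
            h = sigmoid(h @ w)
        return h

    def posteriors(self, x):
        """HMM-state posteriors from a softmax output layer."""
        return softmax(self.hidden(x) @ self.out)


n_states = 100                                           # hypothetical HMM state inventory
audio_dbn = SimpleDBN([40 * 9, 1024, 1024], n_states)    # 9-frame window of 40-dim audio features
video_dbn = SimpleDBN([30 * 9, 1024, 1024], n_states)    # 9-frame window of 30-dim visual features

audio_frames = rng.standard_normal((5, 40 * 9))
video_frames = rng.standard_normal((5, 30 * 9))

# Decision fusion: weighted log-domain combination of per-stream scores,
# as in multi-stream audio-visual decoding (stream weight w is a placeholder).
w = 0.7
log_fused = (w * np.log(audio_dbn.posteriors(audio_frames) + 1e-10)
             + (1 - w) * np.log(video_dbn.posteriors(video_frames) + 1e-10))

# Feature fusion: concatenate mid-level features from both single-modality
# DBNs and classify the joint representation with a shared output layer.
joint_feats = np.concatenate([audio_dbn.hidden(audio_frames),
                              video_dbn.hidden(video_frames)], axis=1)
joint_out = rng.standard_normal((joint_feats.shape[1], n_states)) * 0.1
fused_posteriors = softmax(joint_feats @ joint_out)

print(log_fused.shape, fused_posteriors.shape)   # (5, 100) (5, 100)

In practice the joint classifier over the concatenated mid-level features would itself be trained (e.g., fine-tuned with backpropagation), and the fused posteriors would feed an HMM decoder; the random joint_out above only marks where that component sits.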
Keywords :
Gaussian distribution; belief networks; hidden Markov models; learning (artificial intelligence); speech recognition; DBN; Gaussian mixture models; audio-visual deep learning; audio-visual speech recognition; automatic speech recognition; decision fusion method; deep belief networks; feature fusion method; multi-stream audio-visual GMM/HMM system; noise robust speech recognition; word error rate; Acoustics; Hidden Markov models; Noise measurement; Speech; Speech recognition; Training; Visualization; Audio-visual speech recognition; Deep belief networks; Noise robustness
Conference_Title :
2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference_Location :
Vancouver, BC, Canada
DOI :
10.1109/ICASSP.2013.6639140