DocumentCode :
730723
Title :
Speech acoustic modeling from raw multichannel waveforms
Author :
Hoshen, Yedid ; Weiss, Ron J. ; Wilson, Kevin W.
Author_Institution :
Hebrew Univ. of Jerusalem, Jerusalem, Israel
fYear :
2015
fDate :
19-24 April 2015
Firstpage :
4624
Lastpage :
4628
Abstract :
Standard deep neural network-based acoustic models for automatic speech recognition (ASR) rely on hand-engineered input features, typically log-mel filterbank magnitudes. In this paper, we describe a convolutional neural network - deep neural network (CNN-DNN) acoustic model which takes raw multichannel waveforms as input, i.e. without any preceding feature extraction, and learns a similar feature representation through supervised training. By operating directly in the time domain, the network is able to take advantage of the signal's fine time structure that is discarded when computing filterbank magnitude features. This structure is especially useful when analyzing multichannel inputs, where timing differences between input channels can be used to localize a signal in space. The first convolutional layer of the proposed model naturally learns a filterbank that is selective in both frequency and direction of arrival, i.e. a bank of bandpass beamformers with an auditory-like frequency scale. When trained on data corrupted with noise coming from different spatial locations, the network learns to filter them out by steering nulls in the directions corresponding to the noise sources. Experiments on a simulated multichannel dataset show that the proposed acoustic model outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
Keywords :
channel bank filters; neural nets; speech recognition; ASR; automatic speech recognition; bandpass beamformers; convolutional neural network - deep neural network acoustic model; feature extraction; feature representation; log-mel filterbank magnitudes; raw multichannel waveforms; speech acoustic modeling; standard deep neural network-based acoustic models; Acoustics; Computational modeling; Feature extraction; Indexes; Speech; Speech enhancement; Training; Automatic speech recognition; acoustic modeling; beamforming; convolutional neural networks;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Conference_Location :
South Brisbane, QLD
Type :
conf
DOI :
10.1109/ICASSP.2015.7178847
Filename :
7178847