DocumentCode :
730720
Title :
An investigation of augmenting speaker representations to improve speaker normalisation for DNN-based speech recognition
Author :
Hengguan Huang ; Khe Chai Sim
Author_Institution :
Sch. of Comput., Nat. Univ. of Singapore, Singapore, Singapore
fYear :
2015
fDate :
19-24 April 2015
Firstpage :
4610
Lastpage :
4613
Abstract :
The conventional short-term interval features used by Deep Neural Networks (DNNs) cannot capture longer-term information. This poses a challenge for training a speaker-independent (SI) DNN, since the short-term features do not provide sufficient information for the DNN to robustly estimate the factors underlying speaker-level variation. The key to this problem is to obtain a sufficiently robust and informative speaker representation. This paper compares several speaker representations. Firstly, a DNN speaker classifier is used to extract bottleneck features as the speaker representation, called the Bottleneck Speaker Vector (BSV). To further improve the robustness of this representation, a first-order Bottleneck Speaker Super Vector (BSSV) is also proposed, in which the BSV is expanded into a super vector space by incorporating the phoneme posterior probabilities. Finally, a finer-grained speaker representation based on the FMLLR-shifted features is examined. Experimental results on the WSJ0 and WSJ1 datasets show that the proposed speaker representations are useful in normalising speaker effects for robust DNN-based automatic speech recognition. The best performance is achieved by augmenting with both the BSSV and the FMLLR-shifted representations, yielding relative performance gains of 10.0% to 15.3% over the SI DNN baseline.
Keywords :
feature extraction; neural nets; probability; speech recognition; BSSV; DNN speaker classifier; FMLLR-shifted features; deep neural networks; feature extraction; first-order bottleneck speaker super vector; phoneme posterior probabilities; robust DNN-based automatic speech recognition; short-term features; speaker normalisation; speaker representations; speaker-independent DNN; Acoustics; Feature extraction; Hidden Markov models; Silicon; Speech; Speech recognition; Training; augmented speaker representation; deep neural network; speaker normalisation; speech recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Conference_Location :
South Brisbane, QLD
Type :
conf
DOI :
10.1109/ICASSP.2015.7178844
Filename :
7178844