An investigation of augmenting speaker representations to improve speaker normalisation for DNN-based speech recognition

Author

Hengguan Huang ; Khe Chai Sim

Author_Institution

School of Computing, National University of Singapore, Singapore

fYear

2015

fDate

19-24 April 2015

Firstpage

4610

Lastpage

4613

Abstract

The conventional short-term interval features used by Deep Neural Networks (DNNs) lack the ability to capture longer-term information. This poses a challenge for training a speaker-independent (SI) DNN, since the short-term features do not provide sufficient information for the DNN to robustly estimate the underlying factors of speaker-level variation. The key to this problem is to obtain a sufficiently robust and informative speaker representation. This paper compares several speaker representations. Firstly, a DNN speaker classifier is used to extract bottleneck features as the speaker representation, called the Bottleneck Speaker Vector (BSV). To further improve the robustness of this representation, a first-order Bottleneck Speaker Super Vector (BSSV) is also proposed, where the BSV is expanded into a super vector space by incorporating the phoneme posterior probabilities. Finally, a more fine-grained speaker representation based on the FMLLR-shifted features is examined. The experimental results on the WSJ0 and WSJ1 datasets show that the proposed speaker representations are useful in normalising the speaker effects for robust DNN-based automatic speech recognition. The best performance is achieved by augmenting the features with both the BSSV and the FMLLR-shifted representations, yielding relative performance gains of 10.0% to 15.3% over the SI DNN baseline.

Keywords

feature extraction; neural nets; probability; speech recognition; BSSV; DNN speaker classifier; FMLLR-shifted features; deep neural networks; first-order bottleneck speaker super vector; phoneme posterior probabilities; robust DNN-based automatic speech recognition; short-term features; speaker normalisation; speaker representations; speaker-independent DNN; Acoustics; Hidden Markov models; Silicon; Speech; Training; augmented speaker representation

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Conference_Location

South Brisbane, QLD

Type

conf

DOI

10.1109/ICASSP.2015.7178844

Filename

7178844