• DocumentCode
    730720
  • Title
    An investigation of augmenting speaker representations to improve speaker normalisation for DNN-based speech recognition
  • Author
    Hengguan Huang ; Khe Chai Sim
  • Author_Institution
    Sch. of Comput., Nat. Univ. of Singapore, Singapore, Singapore
  • fYear
    2015
  • fDate
    19-24 April 2015
  • Firstpage
    4610
  • Lastpage
    4613
  • Abstract
    The conventional short-term interval features used by Deep Neural Networks (DNNs) lack the ability to learn longer-term information. This poses a challenge for training a speaker-independent (SI) DNN since the short-term features do not provide sufficient information for the DNN to estimate the real robust factors of speaker-level variations. The key to this problem is to obtain a sufficiently robust and informative speaker representation. This paper compares several speaker representations. Firstly, a DNN speaker classifier is used to extract the bottleneck features as the speaker representation, called the Bottleneck Speaker Vector (BSV). To further improve the robustness of this representation, a first-order Bottleneck Speaker Super Vector (BSSV) is also proposed, where the BSV is expanded into a super vector space by incorporating the phoneme posterior probabilities. Finally, a more fine-grained speaker representation based on the FMLLR-shifted features is examined. The experimental results on the WSJ0 and WSJ1 datasets show that the proposed speaker representations are useful in normalising the speaker effects for robust DNN-based automatic speech recognition. The best performance is achieved by augmenting both the BSSV and the FMLLR-shifted representations, yielding relative performance gains of 10.0% to 15.3% over the SI DNN baseline.
  • Keywords
    feature extraction; neural nets; probability; speech recognition; BSSV; DNN speaker classifier; FMLLR-shifted features; deep neural networks; first-order bottleneck speaker super vector; phoneme posterior probabilities; robust DNN-based automatic speech recognition; short-term features; speaker normalisation; speaker representations; speaker-independent DNN; Acoustics; Hidden Markov models; Silicon; Speech; Training; augmented speaker representation; deep neural network
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
  • Conference_Location
    South Brisbane, QLD
  • Type
    conf
  • DOI
    10.1109/ICASSP.2015.7178844
  • Filename
    7178844