• DocumentCode
    730711
  • Title
    Unsupervised speaker adaptation of deep neural network based on the combination of speaker codes and singular value decomposition for speech recognition
  • Author
    Shaofei Xue; Hui Jiang; Lirong Dai; Qingfeng Liu
  • Author_Institution
    Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
  • fYear
    2015
  • fDate
    19-24 April 2015
  • Firstpage
    4555
  • Lastpage
    4559
  • Abstract
    Recently, we proposed a general adaptation scheme for deep neural networks (DNNs) based on discriminant condition codes. We applied it to supervised speaker adaptation in speech recognition, using either the frame-level cross-entropy or the sequence-level maximum mutual information training criterion [1, 2, 3, 4]. In this setting, each condition code is associated with one speaker in the data, so for convenience it is called a speaker code. Our previous work has shown that speaker-code-based methods are quite effective at adapting DNNs even when only a very small amount of adaptation data is available. However, obtaining the best ASR performance required a large speaker code size and complex procedures, because good initialization of the speaker codes and connection weights is very important. In this paper, we propose a method that uses singular value decomposition (SVD), as in [5], to initialize the speaker codes and connection weights. It achieves ASR performance comparable to our earlier results, but with a smaller speaker code size and much lower computational complexity. We also evaluate unsupervised speaker adaptation with the proposed method on the Switchboard large vocabulary speech recognition task. Experimental results show that the method provides good initializations and is suitable for adapting large DNN models.
  • Keywords
    entropy; neural nets; singular value decomposition; speech recognition; unsupervised learning; DNN; SVD; connection weights; deep neural network; discriminant condition codes; frame-level cross-entropy; large vocabulary speech recognition; sequence-level maximum mutual information training criterion; singular value decomposition; speaker code; unsupervised speaker adaptation; Adaptation models; Hidden Markov models; Matrix decomposition; Neural networks; Speech; Speech recognition; Training; Deep Neural Network (DNN); Speaker Adaptation; Speaker Code; singular value decomposition (SVD);
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
  • Conference_Location
    South Brisbane, QLD
  • Type
    conf
  • DOI
    10.1109/ICASSP.2015.7178833
  • Filename
    7178833