DocumentCode :
16271
Title :
Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors
Author :
Yajie Miao; Hao Zhang; Florian Metze
Author_Institution :
Sch. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA
Volume :
23
Issue :
11
fYear :
2015
fDate :
Nov. 2015
Firstpage :
1938
Lastpage :
1949
Abstract :
In acoustic modeling, speaker adaptive training (SAT) has been a long-standing technique for traditional Gaussian mixture models (GMMs). Acoustic models trained with SAT become independent of the training speakers and generalize better to unseen testing speakers. This paper ports the idea of SAT to deep neural networks (DNNs) and proposes a framework to perform feature-space SAT for DNNs. Using i-vectors as speaker representations, our framework learns an adaptation neural network that derives speaker-normalized features. Speaker adaptive models are obtained by fine-tuning DNNs in this feature space. The framework can be applied to various feature types and network structures, making it a very general SAT solution. In this paper, we fully investigate how to build SAT-DNN models effectively and efficiently. First, we study the optimal configurations of SAT-DNNs for large-scale acoustic modeling tasks. Then, after presenting detailed comparisons between SAT-DNNs and existing DNN adaptation methods, we propose to combine SAT-DNNs with model-space DNN adaptation during decoding. Finally, to accelerate the learning of SAT-DNNs, a simple yet effective strategy, frame skipping, is employed to reduce the size of the training data. Our experiments show that, compared with a strong DNN baseline, the SAT-DNN model achieves 13.5% and 17.5% relative improvements in word error rate (WER), without and with model-space adaptation applied, respectively. Data reduction based on frame skipping yields a 2× speed-up in SAT-DNN training while causing negligible WER loss on the testing data.
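The abstract's two key ingredients, an i-vector-driven adaptation step that produces speaker-normalized features and frame skipping to shrink the training set, can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the single linear adaptation layer, and the function names are hypothetical stand-ins for the adaptation neural network described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 40-dim acoustic
# features, 100-dim i-vectors.
FEAT_DIM, IVEC_DIM = 40, 100

# Adaptation "network" reduced to one linear layer for illustration:
# it maps the speaker's i-vector to a per-speaker offset that is
# added to every frame, yielding speaker-normalized features.
W = rng.standard_normal((FEAT_DIM, IVEC_DIM)) * 0.01

def speaker_normalize(frames, ivector, weights=W):
    """Add an i-vector-derived offset to each acoustic frame."""
    offset = weights @ ivector      # shape (FEAT_DIM,)
    return frames + offset          # broadcasts over all frames

def skip_frames(frames, rate=2):
    """Frame skipping: keep every `rate`-th frame to reduce data size."""
    return frames[::rate]

# Usage: 100 frames from one speaker, normalized then subsampled 2x.
frames = rng.standard_normal((100, FEAT_DIM))
ivec = rng.standard_normal(IVEC_DIM)
normalized = speaker_normalize(frames, ivec)
reduced = skip_frames(normalized)   # roughly half the frames remain
```

In the paper's framework the DNN acoustic model would then be fine-tuned on such speaker-normalized features; here the fine-tuning step is omitted.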
Keywords :
Gaussian processes; acoustic signal processing; learning (artificial intelligence); mixture models; neural nets; signal representation; speaker recognition; vectors; GMM; Gaussian mixture models; SAT-DNN models; deep neural network acoustic models; feature types; feature-space SAT; frame skipping; i-vectors; large-scale acoustic modeling tasks; model-space DNN adaptation; negligible WER loss; network structures; speaker adaptive training; speaker representations; speaker-normalized features; training data size reduction; word error rates; Acoustics; Adaptation models; IEEE transactions; Speech; Speech processing; Testing; Training; Acoustic modeling; deep neural networks (DNNs); speaker adaptive training (SAT);
fLanguage :
English
Journal_Title :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2015.2457612
Filename :
7160703