Title :
Speaker Adaptation With Limited Data Using Regression-Tree-Based Spectral Peak Alignment
Author :
Wang, Shizhen ; Cui, Xiaodong ; Alwan, Abeer
Author_Institution :
Dept. of Electr. Eng., Univ. of California at Los Angeles, Los Angeles, CA
Abstract :
Spectral mismatch between training and testing utterances can cause significant degradation in the performance of automatic speech recognition (ASR) systems. Speaker adaptation and speaker normalization techniques are usually applied to address this issue. One way to reduce spectral mismatch is to reshape the spectrum by aligning corresponding formant peaks. There are various levels of mismatch in formant structures. In this paper, regression-tree-based phoneme- and state-level spectral peak alignment is proposed for rapid speaker adaptation using linearization of the vocal tract length normalization (VTLN) technique. This method is investigated in a maximum-likelihood linear regression (MLLR)-like framework, taking advantage of both the efficiency of frequency warping (VTLN) and the reliability of statistical estimation (MLLR). Two types of regression classes are investigated: one based on phonetic classes (using combined knowledge and data-driven techniques) and the other based on Gaussian mixture classes. Compared to MLLR, VTLN, and global peak alignment, improved performance can be obtained for both supervised and unsupervised adaptation on both medium-vocabulary (the RM1 database) and connected-digit recognition (the TIDIGITS database) tasks. Performance improvements are largest when adaptation data is limited, which is often the case in ASR applications, and these improvements are shown to be statistically significant.
Keywords :
Gaussian processes; maximum likelihood estimation; regression analysis; speaker recognition; trees (mathematics); Gaussian mixture class; automatic speech recognition system; frequency warping; maximum-likelihood linear regression; regression-tree-based spectral peak alignment; speaker adaptation; speaker normalization; spectral mismatch reduction; statistical estimation; unsupervised adaptation; vocal tract length normalization; Automatic speech recognition; Automatic testing; Databases; Degradation; Frequency estimation; Linear regression; Maximum likelihood estimation; Maximum likelihood linear regression; System testing; Vocabulary; Peak alignment; regression tree; speaker adaptation; speech recognition; vocal tract length normalization (VTLN)
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2007.906740