Regression-Based Context-Dependent Modeling of Deep Neural Networks for Speech Recognition

Author

Guangsen Wang ; Khe Chai Sim

Author_Institution

Dept. of Comput. Sci., Nat. Univ. of Singapore, Singapore, Singapore

Volume

22

Issue

11

fYear

2014

fDate

Nov. 2014

Firstpage

1660

Lastpage

1669

Abstract

The data sparsity problem is addressed by using the decision tree state clusters as the training targets for the state-of-the-art context-dependent (CD) deep neural network (DNN) systems. The CD states within a cluster cannot be distinguished at the frame level. We surmise that the state clustering may cause an issue for the standard CD-DNNs, which has so far not been addressed in the literature. In this paper, a logistic regression framework is proposed for the CD-DNNs based on a set of broad phone classes to address both the data sparsity and the clustering problems. To address the data sparsity issue, the triphones are clustered into shorter biphones with broad phone contexts under multiple articulatory categories. A DNN is trained to discriminate the disjoint biphone clusters within each articulatory category. The regression bases are formed by the concatenated log posterior probabilities of all the broad phone DNNs. Logistic regression is used to transform the regression bases into the triphone state posteriors. Clustering of the regression parameters is used to reduce the regression model complexity while still achieving unique acoustic scores for all possible triphones. Based on some approximations, the regression model can be trained as a sparse softmax layer and its parameters can be learned by optimizing the cross-entropy criterion. The experimental results on a broadcast news transcription task reveal that the proposed regression-based CD-DNN significantly outperforms the standard CD-DNN. The best system provides a 1.3% absolute word error rate reduction compared to the best standard CD-DNN system.
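
The regression step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the class sizes, the number of triphone states, the random weights, and the binary `mask` used to imitate a sparse softmax layer are all assumptions made for illustration. The sketch concatenates the log posteriors of several broad phone DNNs into a regression basis and maps that basis to state posteriors through a masked linear layer followed by a softmax.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def regression_posteriors(basis_logits, W, b, mask):
    # basis_logits: one (frames, n_clusters_k) array of output-layer
    # activations per broad phone DNN (hypothetical stand-ins here).
    # The regression basis is the concatenation of their log posteriors.
    basis = np.concatenate([log_softmax(z) for z in basis_logits], axis=-1)
    # Masked weights imitate a sparse softmax layer (assumed sparsity pattern).
    scores = basis @ (W * mask) + b
    return np.exp(log_softmax(scores))

rng = np.random.default_rng(0)
frames = 4
class_sizes = [5, 3, 6]   # hypothetical cluster counts per articulatory category
n_states = 10             # hypothetical number of triphone states
dim = sum(class_sizes)

basis_logits = [rng.normal(size=(frames, k)) for k in class_sizes]
W = rng.normal(size=(dim, n_states))
b = np.zeros(n_states)
mask = (rng.random((dim, n_states)) < 0.3).astype(float)

post = regression_posteriors(basis_logits, W, b, mask)
```

Each row of `post` is a distribution over the hypothetical triphone states for one frame. In training, `W` and `b` would be learned by minimizing cross-entropy against state labels, as the abstract states.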

Keywords

approximation theory; decision trees; neural nets; regression analysis; speech recognition; CD-DNN; Logistic regression; approximations; clustering problems; concatenated log posterior probabilities; context-dependent deep neural network systems; cross-entropy criterion; data sparsity; decision tree state clusters; deep neural networks; multiple articulatory categories; regression-based context-dependent modeling; sparse softmax layer; speech recognition; Approximation methods; Context; Context modeling; Detectors; Equations; Mathematical model; Training; Articulatory features; context dependent modeling; deep neural network; logistic regression;

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE/ACM Transactions on

Publisher

IEEE

ISSN

2329-9290

Type

jour

DOI

10.1109/TASLP.2014.2344855

Filename

6872780