• DocumentCode
    15707
  • Title
    Regression-Based Context-Dependent Modeling of Deep Neural Networks for Speech Recognition
  • Author
    Guangsen Wang; Khe Chai Sim
  • Author_Institution
    Dept. of Comput. Sci., Nat. Univ. of Singapore, Singapore, Singapore
  • Volume
    22
  • Issue
    11
  • fYear
    2014
  • fDate
    Nov. 2014
  • Firstpage
    1660
  • Lastpage
    1669
  • Abstract
    The data sparsity problem is addressed by using the decision tree state clusters as the training targets for the state-of-the-art context-dependent (CD) deep neural network (DNN) systems. The CD states within a cluster cannot be distinguished at the frame level. We surmise that the state clustering may cause an issue for the standard CD-DNNs, which has so far not been addressed in the literature. In this paper, a logistic regression framework is proposed for the CD-DNNs based on a set of broad phone classes to address both the data sparsity and the clustering problems. To address the data sparsity issue, the triphones are clustered into shorter biphones with broad phone contexts under multiple articulatory categories. A DNN is trained to discriminate the disjoint biphone clusters within each articulatory category. The regression bases are formed by the concatenated log posterior probabilities of all the broad phone DNNs. Logistic regression is used to transform the regression bases into the triphone state posteriors. Clustering of the regression parameters is used to reduce the regression model complexity while still achieving unique acoustic scores for all possible triphones. Based on some approximations, the regression model can be trained as a sparse softmax layer and its parameters can be learned by optimizing the cross-entropy criterion. The experimental results on a broadcast news transcription task reveal that the proposed regression-based CD-DNN significantly outperforms the standard CD-DNN. The best system provides a 1.3% absolute word error rate reduction compared to the best standard CD-DNN system.
  • Keywords
    approximation theory; decision trees; neural nets; regression analysis; speech recognition; CD-DNN; Logistic regression; approximations; clustering problems; concatenated log posterior probabilities; context-dependent deep neural network systems; cross-entropy criterion; data sparsity; decision tree state clusters; deep neural networks; multiple articulatory categories; regression-based context-dependent modeling; sparse softmax layer; Approximation methods; Context; Context modeling; Detectors; Equations; Mathematical model; Training; Articulatory features; context dependent modeling; deep neural network; logistic regression
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type
    jour
  • DOI
    10.1109/TASLP.2014.2344855
  • Filename
    6872780
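
The regression step summarized in the abstract (concatenated log posteriors from the broad phone DNNs, mapped to triphone state posteriors by a sparse softmax layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three category sizes, the number of states, the random weights, and the random sparsity mask are all hypothetical placeholders chosen only to make the example run.

```python
import math
import random


def log_softmax(scores):
    """Return log posteriors for one vector of unnormalized scores."""
    peak = max(scores)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def regression_bases(category_scores):
    """Concatenate the log posteriors of every broad phone DNN output."""
    bases = []
    for scores in category_scores:
        bases.extend(log_softmax(scores))
    return bases


def state_posteriors(bases, weights, biases, mask):
    """Sparse softmax layer: masked linear map followed by normalization."""
    activations = [
        b + sum(w * m * x for w, m, x in zip(w_row, m_row, bases))
        for w_row, m_row, b in zip(weights, mask, biases)
    ]
    return [math.exp(v) for v in log_softmax(activations)]


# Hypothetical sizes: three articulatory categories and six target states.
random.seed(0)
cluster_sizes = [3, 5, 2]
num_states = 6

# Stand-ins for the output-layer scores of each broad phone DNN (one frame).
category_scores = [[random.gauss(0, 1) for _ in range(n)] for n in cluster_sizes]
bases = regression_bases(category_scores)
dim = len(bases)

# Placeholder parameters; the 0/1 mask only imitates a sparse weight matrix.
weights = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_states)]
mask = [[1.0 if random.random() < 0.5 else 0.0 for _ in range(dim)]
        for _ in range(num_states)]
biases = [0.0] * num_states

posteriors = state_posteriors(bases, weights, biases, mask)
```

In a trained system the weights and biases would be learned by minimizing cross-entropy against state labels, as the abstract states; here they are random, so only the shapes and the normalization are meaningful.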