• DocumentCode
    134281
  • Title
    Mandarin speech recognition using convolution neural network with augmented tone features
  • Author
    Xinhui Hu; Xugang Lu; Chiori Hori
  • Author_Institution
    Nat. Inst. of Inf. & Commun. Technol., Kyoto, Japan
  • fYear
    2014
  • fDate
    12-14 Sept. 2014
  • Firstpage
    15
  • Lastpage
    18
  • Abstract
    Because it can reduce spectral variations and model the spectral correlations present in speech signals, the convolutional neural network (CNN) has been shown to be more effective at modeling speech than the deep neural network (DNN). In this study, we explore applying CNNs to Mandarin speech recognition. Besides exploring an appropriate CNN architecture for recognition performance, we focus on investigating effective acoustic features and on the effectiveness of adding tonal information, which has been verified to be helpful in other types of acoustic models, to the acoustic features used by the CNN. We conduct experiments on Mandarin broadcast speech recognition to test the effectiveness of the proposed approaches. The CNN shows clear superiority over the DNN, with relative reductions in character error rate (CER) of 7.7-13.1% for broadcast news speech (BN) and 5.4-9.9% for broadcast conversation speech (BC). As in Gaussian mixture model (GMM) and DNN systems, the tonal information characterized by the fundamental frequency (F0) and fundamental frequency variations (FFV) is still found to be helpful in CNN models: it achieves relative CER reductions of over 6.7% for BN and 4.3% for BC, respectively, compared with the baseline Mel-filter bank feature.
  • Keywords
    Gaussian processes; acoustic signal processing; convolution; mixture models; neural nets; speech recognition; CNN models; DNN systems; Gaussian mixture model; Mandarin broadcast speech recognition; acoustic features; acoustic models; augmented tone features; broadcast conversation speech; broadcast news speech; character error rate reductions; convolution neural network; deep neural network; fundamental frequency variations; spectral correlations; spectral variations; speech modeling; speech signals; Decision support systems; Radio frequency; Rail to rail inputs; CNN; F0; FFV; Mandarin speech recognition; tonal feature;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on
  • Conference_Location
    Singapore
  • Type
    conf
  • DOI
    10.1109/ISCSLP.2014.6936674
  • Filename
    6936674