• DocumentCode
    134182
  • Title
    The Vietnamese speech recognition based on rectified linear units deep neural network and spoken term detection system combination
  • Author
    Shifu Xiong ; Wu Guo ; Diyuan Liu
  • Author_Institution
    Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
  • fYear
    2014
  • fDate
    12-14 Sept. 2014
  • Firstpage
    183
  • Lastpage
    186
  • Abstract
    In this paper, we report our recent progress on automatic speech recognition (ASR) for an under-resourced language and the subsequent spoken term detection (STD). The experiments are carried out on the Vietnamese corpus of the National Institute of Standards and Technology (NIST) Open Keyword Search 2013 (OpenKWS13) evaluation. Compared with a conventional ASR system, we make the following modifications to improve recognition accuracy. First, pitch features and tone modeling are applied to capture pitch and tone information, since Vietnamese is a tonal language. Second, automatic question generation for decision trees is used in state tying to compensate for the lack of linguistic knowledge. Finally, we investigate the rectified linear unit (ReLU) activation function and cross-lingual pre-training in deep neural network (DNN) acoustic model training. In the STD procedure, we adopt term-dependent score normalization and combine the outputs of diverse ASR systems to increase the actual term weighted value (ATWV). With these methods, our current best single system achieves 48.32% word accuracy, and we obtain 0.398 ATWV after STD system combination on the OpenKWS13 Vietnamese development set.
  • Keywords
    decision trees; natural language processing; neural nets; speech recognition; ASR system; ATWV; DNN acoustic model training; NIST; National Institute of Standards and Technology; Open Keyword Search 2013 evaluation Vietnamese corpus; OpenKWS13 Vietnamese development set; ReLU activation function; STD procedure; Vietnamese speech recognition; actual term weighted value; cross-lingual pretraining; decision tree; deep neural network acoustic model training; linguistic knowledge; pitch features; rectified linear unit activation function; rectified linear unit deep neural network; spoken term detection; spoken term detection system; term-dependent score normalization; tone modeling; under-resource language automatic speech recognition; Accuracy; Feature extraction; Neural networks; Noise; Speech; Speech recognition; Training; deep neural network; rectified linear units; spoken term detection; system combination; under-resource speech recognition;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on
  • Conference_Location
    Singapore
  • Type
    conf
  • DOI
    10.1109/ISCSLP.2014.6936574
  • Filename
    6936574