Title :
The Vietnamese speech recognition based on rectified linear units deep neural network and spoken term detection system combination
Author :
Shifu Xiong ; Wu Guo ; Diyuan Liu
Author_Institution :
Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
Abstract :
In this paper, we report our recent progress on automatic speech recognition (ASR) for an under-resourced language and on the subsequent spoken term detection (STD). The experiments are carried out on the Vietnamese corpus of the National Institute of Standards and Technology (NIST) Open Keyword Search 2013 (OpenKWS13) evaluation. Compared with a conventional ASR system, we make the following modifications to improve recognition accuracy. First, pitch features and tone modeling are applied to capture pitch and tone information, since Vietnamese is a tonal language. Second, automatic question generation for the decision tree is used in state tying to compensate for the lack of linguistic knowledge. Finally, we investigate the rectified linear unit (ReLU) activation function and cross-lingual pre-training in deep neural network (DNN) acoustic model training. In the STD procedure, we adopt term-dependent score normalization and combine the outputs of diverse ASR systems to increase the actual term weighted value (ATWV). With these methods, our current best single system achieves 48.32% word accuracy, and STD system combination achieves 0.398 ATWV, on the OpenKWS13 Vietnamese development set.
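As a rough illustration of the STD scoring terms used in the abstract, the following Python sketch shows one common form of term-dependent score normalization (sum-to-one within each term) and a term weighted value computation. The abstract does not say which normalization the authors used, so sum-to-one is an assumption here, as are the function names and input layouts, which are hypothetical. The false-alarm weight of 999.9 is the value used in NIST keyword search evaluations. ATWV is this term weighted value measured at the system's actual decision threshold.

```python
from collections import defaultdict

BETA = 999.9  # false-alarm weight used in NIST keyword search evaluations


def sum_to_one_normalize(hits):
    """Divide each detection score by the total score of its own term.

    hits: list of (term, score) pairs with non-negative scores.
    Assumed normalization; the abstract only says "term-dependent".
    """
    totals = defaultdict(float)
    for term, score in hits:
        totals[term] += score
    return [
        (term, score / totals[term] if totals[term] > 0 else 0.0)
        for term, score in hits
    ]


def term_weighted_value(counts, speech_seconds, beta=BETA):
    """Average 1 - (P_miss + beta * P_fa) over terms present in the reference.

    counts: dict mapping term -> (n_true, n_correct, n_false_alarm),
            counted after applying the detection threshold.
    speech_seconds: total duration of the searched audio in seconds.
    """
    values = []
    for n_true, n_correct, n_false_alarm in counts.values():
        if n_true == 0:
            continue  # terms with no reference occurrence are not scored
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_false_alarm / (speech_seconds - n_true)
        values.append(1.0 - (p_miss + beta * p_fa))
    if not values:
        raise ValueError("no term has a reference occurrence")
    return sum(values) / len(values)
```

Because the false-alarm weight is large and every term counts equally, rare terms with low raw scores matter as much as frequent ones, which is the usual motivation for normalizing scores per term before thresholding.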
Keywords :
decision trees; natural language processing; neural nets; speech recognition; ASR system; ATWV; DNN acoustic model training; NIST; National Institute of Standards and Technology; Open Keyword Search 2013 evaluation Vietnamese corpus; OpenKWS13 Vietnamese development set; ReLU activation function; STD procedure; Vietnamese speech recognition; actual term weighted value; cross-lingual pretraining; decision tree; deep neural network acoustic model training; linguistic knowledge; pitch features; rectified linear unit activation function; rectified linear unit deep neural network; spoken term detection; spoken term detection system; term-dependent score normalization; tone modeling; under-resource language automatic speech recognition; Accuracy; Feature extraction; Neural networks; Noise; Speech; Speech recognition; Training; deep neural network; rectified linear units; spoken term detection; system combination; under-resource speech recognition;
Conference_Title :
Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on
Conference_Location :
Singapore
DOI :
10.1109/ISCSLP.2014.6936574