The Vietnamese speech recognition based on rectified linear units deep neural network and spoken term detection system combination

Author

Shifu Xiong ; Wu Guo ; Diyuan Liu

Author_Institution

Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China

fYear

2014

fDate

12-14 Sept. 2014

Firstpage

183

Lastpage

186

Abstract

In this paper, we report our recent progress on automatic speech recognition (ASR) for an under-resourced language and on the subsequent spoken term detection (STD). The experiments are carried out on the Vietnamese corpus of the National Institute of Standards and Technology (NIST) Open Keyword Search 2013 (OpenKWS13) evaluation. Compared with a conventional ASR system, we make the following modifications to improve recognition accuracy. First, because Vietnamese is a tonal language, pitch features and tone modeling are applied to capture pitch and tone information. Second, automatic question generation for decision trees is used in state tying to address the lack of linguistic knowledge. Finally, we investigate the rectified linear unit (ReLU) activation function and cross-lingual pre-training in deep neural network (DNN) acoustic model training. In the STD procedure, we adopt term-dependent score normalization and combine the outputs of diverse ASR systems to increase the actual term weighted value (ATWV). With these methods, our current best single system achieves 48.32% word accuracy, and STD system combination reaches an ATWV of 0.398, both on the OpenKWS13 Vietnamese development set.
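
For readers unfamiliar with the ATWV metric named in the abstract, the following is a minimal sketch of how a term weighted value is commonly computed from hard YES decisions, assuming the usual NIST keyword search convention (weight beta = 999.9, one non-target trial per second of audio, terms without reference occurrences excluded). The function name `atwv` and its argument layout are hypothetical and are not taken from the paper or from any NIST scoring tool.

```python
from collections import defaultdict

BETA = 999.9  # assumed: weight from the NIST KWS term weighted value definition


def atwv(decisions, n_true, speech_seconds, beta=BETA):
    """Term weighted value at the system's actual decision threshold.

    decisions: iterable of (term, is_correct) pairs, one per detection
               that the system marked YES (hypothetical input layout).
    n_true:    dict mapping each term to its number of reference occurrences.
    speech_seconds: total duration of the searched audio, in seconds.
    """
    hits = defaultdict(int)
    false_alarms = defaultdict(int)
    for term, is_correct in decisions:
        if is_correct:
            hits[term] += 1
        else:
            false_alarms[term] += 1

    # Terms with no reference occurrence are left out of the average.
    scored = [t for t, n in n_true.items() if n > 0]
    if not scored:
        raise ValueError("no term has a reference occurrence")

    loss = 0.0
    for term in scored:
        p_miss = 1.0 - hits[term] / n_true[term]
        # Non-target trials are approximated as one per second of audio.
        p_fa = false_alarms[term] / (speech_seconds - n_true[term])
        loss += p_miss + beta * p_fa
    return 1.0 - loss / len(scored)
```

Under these assumptions a system that finds every occurrence with no false alarms scores 1.0, and a system that outputs nothing scores 0.0. Because beta is large, a single false alarm on a term can cost as much as several misses, which is one reason per-term score normalization matters for this metric.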

Keywords

decision trees; natural language processing; neural nets; speech recognition; ASR system; ATWV; DNN acoustic model training; NIST; National Institute of Standards and Technology; Open Keyword Search 2013 evaluation Vietnamese corpus; OpenKWS13 Vietnamese development set; ReLU activation function; STD procedure; Vietnamese speech recognition; actual term weighted value; cross-lingual pretraining; decision tree; deep neural network acoustic model training; linguistic knowledge; pitch features; rectified linear unit activation function; rectified linear unit deep neural network; spoken term detection; spoken term detection system; term-dependent score normalization; tone modeling; under-resource language automatic speech recognition; Accuracy; Feature extraction; Neural networks; Noise; Speech; Speech recognition; Training; deep neural network; rectified linear units; spoken term detection; system combination; under-resource speech recognition;

fLanguage

English

Publisher

IEEE

Conference_Titel

Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on

Conference_Location

Singapore

Type

conf

DOI

10.1109/ISCSLP.2014.6936574

Filename

6936574