Regional Information Center for Science and Technology - Improving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM

Abstract:

Phoneme recognition underlies many speech processing applications, and improving it has attracted considerable research attention. Recent results show that deep neural networks (DNNs) significantly improve the performance of speech recognition systems. A DNN-based phoneme recognition system has two phases: training and testing. Most previous work has sought to improve the training phase, for example through training algorithms, network types, network architectures and feature types. In this work, we focus instead on the test phase, in which the phoneme sequence is generated and which is equally essential for good phoneme recognition accuracy. Previous work has generated phoneme sequences with the Viterbi algorithm on a hidden Markov model (HMM). We address an important problem with this method: an HMM implicitly assumes a geometric distribution of state durations. To overcome this, we use the real duration probability distribution of each phoneme by means of a hidden semi-Markov model (HSMM). We also represent each phoneme with a single state, so that phoneme duration information can be used directly in the HSMM. Furthermore, we investigate a post-processing method that corrects the phoneme sequence obtained from the neural network using prior knowledge about phonemes. Experimental results on the Persian FarsDat corpus show that the extended Viterbi algorithm on the HSMM improves phoneme recognition accuracy by 2.68% and 0.56% over conventional methods using Gaussian mixture model-hidden Markov models (GMM-HMMs) and Viterbi on HMM, respectively. Applying the post-processing method also increases accuracy.
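The decoding idea in the abstract, one state per phoneme with an explicit duration distribution instead of the geometric one implied by HMM self-loops, can be illustrated with a generic explicit-duration Viterbi search. The sketch below is not the authors' implementation: the function name `hsmm_viterbi`, its argument layout, the ban on phoneme self-transitions and all toy probabilities are assumptions made for illustration, and frame scores are assumed to be finite log values such as DNN outputs.

```python
import math

def hsmm_viterbi(log_post, log_dur, log_trans, log_init, max_dur):
    """Explicit-duration Viterbi with one state per phoneme (illustrative sketch).

    log_post[t][j]    : finite frame-level log score of phoneme j at frame t
    log_dur[j][d - 1] : log probability that phoneme j lasts d frames (1 <= d <= max_dur)
    log_trans[i][j]   : log probability of moving from phoneme i to phoneme j
    log_init[j]       : log probability of starting in phoneme j
    Returns a list of (phoneme, start_frame, end_frame) with inclusive end frames.
    """
    T, N = len(log_post), len(log_init)
    NEG = -math.inf

    # Prefix sums so the score of any segment is obtained in O(1).
    cum = [[0.0] * (T + 1) for _ in range(N)]
    for j in range(N):
        for t in range(T):
            cum[j][t + 1] = cum[j][t] + log_post[t][j]

    # delta[t][j]: best log score of segmenting frames 0..t-1
    # such that the last segment is phoneme j and ends at frame t-1.
    delta = [[NEG] * N for _ in range(T + 1)]
    back = [[None] * N for _ in range(T + 1)]

    for t in range(1, T + 1):
        for j in range(N):
            for d in range(1, min(max_dur, t) + 1):
                s = t - d  # first frame of the candidate segment
                seg = log_dur[j][d - 1] + cum[j][t] - cum[j][s]
                if s == 0:
                    score, prev = log_init[j] + seg, None
                else:
                    score, prev = NEG, None
                    for i in range(N):
                        if i == j:
                            continue  # assumption: duration replaces self-loops
                        cand = delta[s][i] + log_trans[i][j]
                        if cand > score:
                            score, prev = cand, i
                    score += seg
                if score > delta[t][j]:
                    delta[t][j] = score
                    back[t][j] = (d, prev)

    # Backtrace from the best final phoneme.
    j = max(range(N), key=lambda k: delta[T][k])
    t, segments = T, []
    while t > 0:
        d, prev = back[t][j]
        segments.append((j, t - d, t - 1))
        t, j = t - d, prev
    segments.reverse()
    return segments


# Toy example (all numbers are made up): two phonemes, six frames,
# with a one-frame "blip" of phoneme 1 at frame 2.
hi, lo = math.log(0.7), math.log(0.3)
log_post = [[hi, lo], [hi, lo], [lo, hi], [hi, lo], [hi, lo], [hi, lo]]
dur = [0.02, 0.08, 0.3, 0.3, 0.2, 0.1]  # very short phonemes are unlikely
log_dur = [[math.log(p) for p in dur] for _ in range(2)]
log_trans = [[-math.inf, 0.0], [0.0, -math.inf]]
log_init = [math.log(0.5), math.log(0.5)]

framewise = [max(range(2), key=lambda j: row[j]) for row in log_post]
segments = hsmm_viterbi(log_post, log_dur, log_trans, log_init, max_dur=6)
print(framewise)
print(segments)
```

With these toy values the frame-wise argmax switches to phoneme 1 for a single frame, whereas the duration-aware search returns one six-frame segment of phoneme 0, because a one-frame phoneme is improbable under the assumed duration distribution. A standard HMM could not express this preference directly, since a self-loop probability a yields the geometric duration P(d) = a^(d-1) (1 - a), whose most likely duration is always one frame.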