Title :
Supervised Detection and Unsupervised Discovery of Pronunciation Error Patterns for Computer-Assisted Language Learning
Author :
Yow-Bang Wang ; Lin-Shan Lee
Author_Institution :
Nuance Commun., Inc., Burlington, MA, USA
Abstract :
Pronunciation error patterns (EPs) are patterns of mispronunciation frequently produced by language learners, and usually differ for different pairs of target and native languages. Accurate information about EPs can offer helpful feedback to learners to improve their language skills. However, the major difficulty of EP detection comes from the fact that EPs are intrinsically similar to their corresponding canonical pronunciation, and different EPs corresponding to the same canonical pronunciation are also intrinsically similar to each other. As a result, distinguishing EPs from their corresponding canonical pronunciation, and distinguishing between different EPs of the same phoneme, is a difficult task, perhaps even more difficult than distinguishing between different phonemes in one language. On the other hand, the cost of deriving all EPs for each pair of target and native languages is high, usually requiring extensive expert knowledge or high-quality annotated data. Unsupervised EP discovery from a corpus of learner recordings would thus be an attractive addition to the field. In this paper, we propose new frameworks for both supervised EP detection and unsupervised EP discovery. For supervised EP detection, we use hierarchical multi-layer perceptrons (MLPs) as the EP classifiers, integrated with an HMM/GMM baseline in a two-pass Viterbi decoding architecture. Experimental results show that the new framework enhances the power of EP diagnosis. For unsupervised EP discovery, we propose the first known framework, using the hierarchical agglomerative clustering (HAC) algorithm to explore sub-segmental variation within phoneme segments and produce fixed-length segment-level feature vectors in order to distinguish different EPs. We tested K-means (assuming a known number of EPs) and the Gaussian mixture model with the minimum description length principle (estimating an unknown number of EPs) for EP discovery. Preliminary experiments offered very encouraging results, although there is still a long way to go to approach the performance of human experts. We also propose to use the universal phoneme posteriorgram (UPP), derived from an MLP trained on corpora of mixed languages, as frame-level features in both supervised detection and unsupervised discovery of EPs. Experimental results show that using UPP not only achieves the best performance, but is also useful in analyzing the mispronunciations produced by language learners.
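The abstract's idea of using HAC to turn a variable-length phoneme segment into a fixed-length segment-level vector can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment_vector`, the parameter `n_sub`, the restriction to merging only temporally adjacent clusters, and the squared-distance-between-means merge cost are all assumptions made for illustration.

```python
import numpy as np


def segment_vector(frames, n_sub=3):
    """Collapse a (T, D) array of frame-level features into a vector of length n_sub * D.

    Hypothetical sketch: every frame starts as its own cluster, and the two
    temporally adjacent clusters with the closest means are merged until
    n_sub sub-segments remain. The sub-segment means are then concatenated.
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 2 or len(frames) < n_sub:
        raise ValueError("need a (T, D) array with T >= n_sub")

    # Each cluster is stored as (sum of its frames, number of frames).
    clusters = [(f.copy(), 1) for f in frames]

    while len(clusters) > n_sub:
        # Assumed merge cost: squared Euclidean distance between neighbouring means.
        costs = [
            np.sum((clusters[i][0] / clusters[i][1]
                    - clusters[i + 1][0] / clusters[i + 1][1]) ** 2)
            for i in range(len(clusters) - 1)
        ]
        i = int(np.argmin(costs))
        merged = (clusters[i][0] + clusters[i + 1][0],
                  clusters[i][1] + clusters[i + 1][1])
        clusters[i:i + 2] = [merged]

    # Concatenate sub-segment means in temporal order.
    return np.concatenate([s / n for s, n in clusters])


# Segments of different durations map to vectors of the same length.
short = segment_vector(np.array([[0, 0], [5, 5], [10, 10]]), n_sub=3)
long_ = segment_vector(
    np.array([[0, 0], [0, 0], [5, 5], [5, 5], [5, 5], [10, 10]]), n_sub=3)
print(short.shape, long_.shape)
```

Under these assumptions, vectors produced this way for all realizations of one phoneme could then be grouped with K-means when the number of EPs is known, or with a Gaussian mixture model whose size is chosen by a criterion such as minimum description length when it is not.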
Keywords :
Gaussian processes; Viterbi decoding; computer aided instruction; feature selection; hidden Markov models; learning (artificial intelligence); linguistics; mixture models; multilayer perceptrons; natural language processing; pattern clustering; signal classification; speech recognition; EP classifiers; EP diagnosis; GMM; Gaussian mixture model; HAC algorithm; HMM; K-means; MLP training; UPP; canonical pronunciation; computer-assisted language learning; fixed-length segment-level feature vectors; hierarchical agglomerative clustering algorithm; hierarchical multilayer perceptrons; language skills improvement; minimum description length principle; mispronunciation pattern; native language; phoneme segments; subsegmental variation; supervised pronunciation error pattern detection; target language; two-pass Viterbi decoding architecture; universal phoneme posteriorgram; unsupervised EP discovery; unsupervised pronunciation error pattern discovery; Acoustics; Feature extraction; IEEE transactions; Speech; Speech processing; Training; Vectors; Computer-aided pronunciation training; computer-assisted language learning; error pattern detection; error pattern discovery; universal phoneme posteriorgram;
Journal_Title :
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
DOI :
10.1109/TASLP.2014.2387413