DocumentCode :
730739
Title :
Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search
Author :
Chongjia Ni ; Cheung-Chi Leung ; Lei Wang ; Nancy F. Chen ; Bin Ma
Author_Institution :
Inst. for Infocomm Res. (I2R), A*STAR, Singapore, Singapore
fYear :
2015
fDate :
19-24 April 2015
Firstpage :
4714
Lastpage :
4718
Abstract :
This paper considers an unsupervised data selection problem for the training data of an acoustic model and the vocabulary coverage of a keyword search system in low-resource settings. We propose to use Gaussian component index based n-grams as acoustic features in a submodular function for unsupervised data selection. The submodular function provides a near-optimal solution in terms of the objective being optimized. Moreover, to further resolve the high out-of-vocabulary (OOV) rate for morphologically rich languages like Tamil, word-morph mixed language modeling is also considered. Our experiments are conducted on the Tamil speech provided by the IARPA Babel program for the 2014 NIST Open Keyword Search Evaluation (OpenKWS14). We show that the selection of data plays an important role in the word error rate of the speech recognition system and the actual term weighted value (ATWV) of the keyword search system. The 10 hours of speech selected from the full language pack (FLP) using the proposed algorithm provides a relative 23.2% and 20.7% ATWV improvement over two other data subsets, the 10-hour data from the limited language pack (LLP) defined by IARPA and the 10 hours of speech randomly selected from the FLP, respectively. The proposed algorithm also increases the vocabulary coverage, implicitly alleviating the OOV problem: the number of OOV search terms drops from 1,686 and 1,171 in the two baseline conditions to 972.
Keywords :
Gaussian processes; information retrieval; natural language processing; optimisation; speech recognition; unsupervised learning; vocabulary; 2014 NIST Open Keyword Search Evaluation; Gaussian component index based n-grams; IARPA Babel program; OOV problem; OpenKWS14; Tamil low-resource keyword search; acoustic features; acoustic model; active learning; actual term weighted value; full language pack; limited language pack; low-resource settings; near-optimal solution; out-of-vocabulary rate; speech recognition; submodular function; submodular optimization; training data; unsupervised data selection problem; vocabulary coverage; word error rate; word-morph mixed language model; Acoustics; Data models; Indexes; Keyword search; Speech; Speech recognition; Training; Submodular optimization; active learning; keyword spotting; spoken term detection;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Conference_Location :
South Brisbane, QLD
Type :
conf
DOI :
10.1109/ICASSP.2015.7178865
Filename :
7178865