Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search

Author

Chongjia Ni ; Cheung-Chi Leung ; Lei Wang ; Nancy F. Chen ; Bin Ma

Author_Institution

Inst. for Infocomm Res. (I2R), A*STAR, Singapore, Singapore

fYear

2015

fDate

19-24 April 2015

Firstpage

4714

Lastpage

4718

Abstract

This paper considers an unsupervised data selection problem for the training data of an acoustic model and the vocabulary coverage of a keyword search system in low-resource settings. We propose to use Gaussian component index based n-grams as acoustic features in a submodular function for unsupervised data selection. The submodular function provides a near-optimal solution in terms of the objective being optimized. Moreover, to further reduce the high out-of-vocabulary (OOV) rate for morphologically rich languages like Tamil, word-morph mixed language modeling is also considered. Our experiments are conducted on the Tamil speech provided by the IARPA Babel program for the 2014 NIST Open Keyword Search Evaluation (OpenKWS14). We show that the selection of data plays an important role in the word error rate of the speech recognition system and the actual term weighted value (ATWV) of the keyword search system. The 10 hours of speech selected from the full language pack (FLP) using the proposed algorithm provide relative ATWV improvements of 23.2% and 20.7% over two other data subsets, the 10-hour data from the limited language pack (LLP) defined by IARPA and the 10 hours of speech randomly selected from the FLP, respectively. The proposed algorithm also increases the vocabulary coverage, implicitly alleviating the OOV problem: the number of OOV search terms drops from 1,686 and 1,171 in the two baseline conditions to 972.
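
The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objective or optimizer, so everything below is an assumption made for illustration: each utterance is represented by a sequence of per-frame Gaussian component indices, n-gram counts over those indices serve as features, the objective is a square-root feature-coverage function (a standard monotone submodular form), and utterances are picked by a cost-scaled greedy rule under a duration budget. The names `ngram_counts` and `greedy_select` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def ngram_counts(indices, n=2):
    """Count n-grams over a sequence of per-frame Gaussian component indices."""
    return Counter(tuple(indices[i:i + n]) for i in range(len(indices) - n + 1))


def greedy_select(utterances, budget, n=2):
    """Greedy selection under a duration budget (illustrative, not the paper's code).

    utterances: dict mapping utterance id -> (duration, index sequence),
                with every duration strictly positive.
    budget:     total duration allowed for the selected subset.
    Assumed objective: f(S) = sum over n-grams g of sqrt(count of g in S).
    """
    feats = {u: ngram_counts(seq, n) for u, (_, seq) in utterances.items()}
    totals = Counter()          # accumulated n-gram counts of the selected set
    selected, used = [], 0.0
    remaining = set(utterances)
    while remaining:
        best, best_ratio = None, 0.0
        for u in sorted(remaining):          # sorted for deterministic ties
            dur = utterances[u][0]
            if used + dur > budget:
                continue                     # would exceed the budget
            # Marginal gain of adding u to the current selection.
            gain = sum(math.sqrt(totals[g] + c) - math.sqrt(totals[g])
                       for g, c in feats[u].items())
            ratio = gain / dur               # gain per unit of duration
            if ratio > best_ratio:
                best, best_ratio = u, ratio
        if best is None:
            break                            # nothing fits or nothing adds value
        selected.append(best)
        used += utterances[best][0]
        totals.update(feats[best])
        remaining.remove(best)
    return selected
```

Because the square root is concave, an utterance whose n-grams are already well represented in the selected set contributes a smaller gain, so the sketch favors acoustically diverse utterances. Calling `greedy_select(utts, budget=36000.0)` on a dictionary of utterances with durations in seconds would select at most 10 hours of speech.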

Keywords

Gaussian processes; information retrieval; natural language processing; optimisation; speech recognition; unsupervised learning; vocabulary; 2014 NIST Open Keyword Search Evaluation; Gaussian component index based n-grams; IARPA Babel program; OOV problem; OpenKWS14; Tamil low-resource keyword search; acoustic features; acoustic model; active learning; actual term weighted value; full language pack; limited language pack; low-resource settings; near-optimal solution; out-of-vocabulary rate; submodular function; submodular optimization; training data; unsupervised data selection problem; vocabulary coverage; word error rate; word-morph mixed language model; Acoustics; Data models; Indexes; Keyword search; Speech; Training; keyword spotting; spoken term detection

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Conference_Location

South Brisbane, QLD

Type

conf

DOI

10.1109/ICASSP.2015.7178865

Filename

7178865