DocumentCode
3423179
Title
Sample selection for automatic language identification
Author
Farris, David ; White, Chris ; Khudanpur, Sanjeev
Author_Institution
Center for Language & Speech Process., Johns Hopkins Univ., Baltimore, MD
fYear
2008
fDate
March 31 2008-April 4 2008
Firstpage
4225
Lastpage
4228
Abstract
Current approaches to automatic spoken language identification (LID) assume the availability of a large corpus of manually language-labeled speech samples for training statistical classifiers. We investigate two methods of active learning to significantly reduce the amount of labeled speech needed for training LID systems. Starting with a small training set, an automated method is used to select samples from a corpus of unlabeled speech, which are then labeled and added to the training pool. One selection method is based on a previously known entropy criterion, and the other on a novel likelihood-ratio criterion. We demonstrate LID performance comparable to that obtained with a large training corpus using only a tenth of the training data. A further 40% improvement in LID performance is obtained using a third of the training data. Finally, we show that our novel selection method is more robust to variance in the unlabeled pool than the entropy-based method.
Keywords
entropy; natural language processing; speech recognition; automatic language identification; entropy criterion; language-labeled speech samples; likelihood-ratio criterion; sample selection; spoken language identification; statistical classifiers; Costs; Error analysis; Iterative algorithms; Iterative methods; Natural languages; Partitioning algorithms; Sampling methods; Speech processing; Training data; Uncertainty; natural languages; speech processing; unsupervised learning;
fLanguage
English
Publisher
IEEE
Conference_Titel
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on
Conference_Location
Las Vegas, NV
ISSN
1520-6149
Print_ISBN
978-1-4244-1483-3
Electronic_ISBN
1520-6149
Type
conf
DOI
10.1109/ICASSP.2008.4518587
Filename
4518587