• DocumentCode
    3423179
  • Title
    Sample selection for automatic language identification
  • Author
    Farris, David ; White, Chris ; Khudanpur, Sanjeev
  • Author_Institution
    Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD
  • fYear
    2008
  • fDate
    March 31 2008-April 4 2008
  • Firstpage
    4225
  • Lastpage
    4228
  • Abstract
    Current approaches to automatic spoken language identification (LID) assume the availability of a large corpus of manually language-labeled speech samples for training statistical classifiers. We investigate two active learning methods for significantly reducing the amount of labeled speech needed to train LID systems. Starting with a small training set, an automated method selects samples from a corpus of unlabeled speech, which are then labeled and added to the training pool. One selection method is based on a previously known entropy criterion, and the other on a novel likelihood-ratio criterion. We demonstrate LID performance comparable to that obtained with a large training corpus using only a tenth of the training data. A further 40% improvement in LID performance is obtained using a third of the training data. Finally, we show that our novel selection method is more robust to variance in the unlabeled pool than the entropy-based method.
  • Keywords
    entropy; natural language processing; speech recognition; automatic language identification; entropy criterion; language-labeled speech samples; likelihood-ratio criterion; sample selection; spoken language identification; statistical classifiers; Costs; Error analysis; Iterative algorithms; Iterative methods; Natural languages; Partitioning algorithms; Sampling methods; Speech processing; Training data; Uncertainty; natural languages; speech processing; unsupervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on
  • Conference_Location
    Las Vegas, NV
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4244-1483-3
  • Electronic_ISBN
    1520-6149
  • Type
    conf
  • DOI
    10.1109/ICASSP.2008.4518587
  • Filename
    4518587
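
The entropy criterion mentioned in the abstract can be illustrated with a minimal sketch. This is generic uncertainty sampling, not the authors' implementation: the record does not give the paper's exact formulation, and the novel likelihood-ratio criterion is not specified here, so it is not attempted. The function names `entropy` and `select_by_entropy` and the example posteriors are hypothetical. The sketch assumes a classifier already trained on the small seed set has produced a posterior distribution over languages for each unlabeled utterance.

```python
import math


def entropy(posteriors):
    """Shannon entropy (in nats) of a discrete posterior distribution."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def select_by_entropy(pool_posteriors, k):
    """Return indices of the k pool samples with the highest posterior entropy.

    High entropy means the current classifier is least certain about the
    language of that sample, so labeling it is assumed to be most useful.
    """
    ranked = sorted(
        range(len(pool_posteriors)),
        key=lambda i: entropy(pool_posteriors[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical posteriors over three languages for four unlabeled utterances.
pool = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
]
chosen = select_by_entropy(pool, 2)
```

With these example posteriors, `chosen` holds indices 3 and 1, the two utterances whose posteriors are closest to uniform. In an active learning loop, those samples would be sent for manual labeling, added to the training pool, and the classifier retrained before the next selection round.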