• DocumentCode
    2702047
  • Title

    Unsupervised Lexicon Acquisition from Speech and Text

  • Author

    Kurata, Gakuto ; Mori, Shinsuke ; Itoh, N. ; Nishimura, M.

  • Author_Institution
    IBM Res., IBM Japan Ltd., Kanagawa, Japan
  • Volume
    4
  • fYear
    2007
  • fDate
    15-20 April 2007
  • Abstract
    When introducing a large vocabulary continuous speech recognition (LVCSR) system into a specific domain, it is preferable to add the necessary domain-specific words and their correct pronunciations to the lexicon selectively, especially in areas where the LVCSR system must be updated frequently with new words. In this paper, we propose an unsupervised method of word acquisition for Japanese, in which no spaces exist between words. Our method takes advantage of speech from the target domain to select the domain-specific words from among an enormous number of word candidates extracted from raw corpora. Experiments showed that the acquired lexicon was of good quality and that it contributed to the performance of the LVCSR system for the target domain.
  • Keywords
    natural languages; speech recognition; Japanese word acquisition; large vocabulary continuous speech recognition system; text; unsupervised lexicon acquisition; Laboratories; Magnetooptic recording; Natural languages; Speech recognition; Vocabulary; Large Vocabulary Continuous Speech Recognition; Lexicon acquisition; Stochastically segmented corpus;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on
  • Conference_Location
    Honolulu, HI
  • ISSN
    1520-6149
  • Print_ISBN
    1-4244-0727-3
  • Type

    conf

  • DOI
    10.1109/ICASSP.2007.366939
  • Filename
    4218127