• DocumentCode
    1652305
  • Title
    Construction of Chinese conversational corpora for spontaneous speech recognition and comparative study on the trilingual parallel corpora
  • Author
    Hu, Xinhui ; Isotani, Ryosuke ; Nakamura, Satoshi
  • Author_Institution
    Nat. Inst. of Inf. & Commun. Technol., Koganei, Japan
  • fYear
    2009
  • Firstpage
    56
  • Lastpage
    59
  • Abstract
    In this paper, we describe the development of the segmented and POS-tagged Chinese conversational corpora currently used in the NICT/ATR speech-to-speech translation system. Over 500 K manually checked utterances provide 3.5 M words of Chinese corpora. As far as we know, they are the largest conversational textual corpora in the domain of travel. Together with the corresponding Japanese and English texts from which the Chinese was translated, they form a set of three parallel corpora. Based on these parallel corpora, we investigate the statistics of each language and the performance of language models and speech recognition, and identify the differences among the languages. Problems in the present Chinese corpora and their solutions are also analyzed and discussed.
  • Keywords
    natural language processing; speech recognition; Chinese conversational corpora; English words; Japanese words; NICT-ATR speech-to-speech translation system; POS-tagged corpora; corpus-based natural language processing; trilingual parallel corpora; Communications technology; Computational linguistics; Guidelines; Information science; Natural language processing; Natural languages; Speech processing; Speech recognition; Statistics; Stress
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Speech Database and Assessments, 2009 Oriental COCOSDA International Conference on
  • Conference_Location
    Urumqi
  • Print_ISBN
    978-1-4244-4400-7
  • Electronic_ISBN
    978-1-4244-4400-7
  • Type
    conf
  • DOI
    10.1109/ICSDA.2009.5278375
  • Filename
    5278375