• DocumentCode
    1937712
  • Title

    Corpus building for data-driven TTS systems

  • Author

    Zhu, Weibin ; Zhang, Wei ; Shi, Qin ; Chen, Fangxin ; Li, Haiping ; Ma, Xijun ; Shen, Liqin

  • Author_Institution
    IBM China Res. Lab, Beijing, China
  • fYear
    2002
  • fDate
    11-13 Sept. 2002
  • Firstpage
    199
  • Lastpage
    202
  • Abstract
    To build a data-driven Mandarin TTS system, we constructed a large, balanced Mandarin text-and-speech corpus, named the IBM Mandarin TTS Corpus. The corpus is designed to support both statistical prosody modeling and the modeling of context dependence of phonemic features. In the script-design stage, we investigated the choice of a proper synthetic unit. Based on that choice, we developed a numerical criterion for the coverage and balance of variants of the synthetic units. In the speech-recording stage, we paid attention to speaking style, which is essential to an effective concatenative speech synthesis system. We formulated a specification of speaking style and guided the speaker to follow it strictly. Corpus processing is another important step, in which we carefully carried out pronunciation marking, segment alignment, and prosodic event labeling, among other tasks. We defined a set of hierarchical prosodic layers to describe various prosodic events. Because these steps often involve manual effort, the quality of the processed corpus depends on both proper specifications for each step and the training of the operating team.
  • Keywords
    speech processing; speech synthesis; statistical analysis; IBM Mandarin TTS Corpus; concatenative speech synthesis; context dependence; corpus building; corpus processing; coverage; data-driven TTS systems; numerical criterion; phonemic features; pronunciation marking; prosodic events labeling; prosodic hierarchical layers; script design; segment aligning; speaking style; speech recording; statistical prosody modeling; synthetic unit; variant balance; Concatenated codes; Context modeling; Degradation; Humans; Predictive models; Signal processing; Signal synthesis; Spatial databases; Speech processing; Speech synthesis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Speech Synthesis, 2002. Proceedings of 2002 IEEE Workshop on
  • Print_ISBN
    0-7803-7395-2
  • Type

    conf

  • DOI
    10.1109/WSS.2002.1224408
  • Filename
    1224408