• DocumentCode
    1440922
  • Title

    Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora

  • Author

    Yamagishi, Junichi ; Usabaev, Bela ; King, Simon ; Watts, Oliver ; Dines, John ; Tian, Jilei ; Guan, Yong ; Hu, Rile ; Oura, Keiichiro ; Wu, Yi-Jian ; Tokuda, Keiichi ; Karhila, Reima ; Kurimo, Mikko

  • Author_Institution
    Centre for Speech Technol. Res. (CSTR), Univ. of Edinburgh, Edinburgh, UK
  • Volume
    18
  • Issue
    5
  • fYear
    2010
  • fDate
    7/1/2010
  • Firstpage
    984
  • Lastpage
    1004
  • Abstract
    In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high-quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, GlobalPhone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
  • Keywords
    hidden Markov models; speaker recognition; speech synthesis; ASR corpora; GlobalPhone; SPEECON databases; TTS systems; automatic speech recognition; hidden Markov model; microphones; perceptual evaluation; resource management; speaker-adaptive HMM-based speech synthesis; text-to-speech synthesis systems; wall street journal; Automatic speech recognition (ASR); H Triple S (HTS); SPEECON database; WSJ database; average voice; hidden Markov model (HMM)-based speech synthesis; speaker adaptation; voice conversion
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type

    jour

  • DOI
    10.1109/TASL.2010.2045237
  • Filename
    5431023