• DocumentCode
    2174385
  • Title

    Improved F0 modeling and generation in voice conversion

  • Author

    Kunikoshi, Aki ; Qian, Yao ; Soong, Frank ; Minematsu, Nobuaki

  • Author_Institution
    Microsoft Res. Asia, Beijing, China
  • fYear
    2011
  • fDate
    22-27 May 2011
  • Firstpage
    4568
  • Lastpage
    4571
  • Abstract
    F0 is an acoustic feature that varies widely from one speaker to another. F0 is characterized by a discontinuity in the transition between voiced and unvoiced sounds, which presents an obstacle to GMM modeling for use in voice conversion. A Multi-Space Distribution (MSD) [5] can be used to model unvoiced and voiced F0 regions in a linearly weighted mixture. However, the use of two incompatible probabilistic spaces, for example a continuous probability density for voiced observations and a discrete probability for unvoiced observations, may result in an imprecise voiced/unvoiced (v/u) conversion in a maximum likelihood (ML) sense. In this paper we propose to use voicing strength, characterized by the normalized correlation coefficient magnitude as calculated during F0 feature extraction, as an additional feature for improving F0 modeling and the v/u decision in the context of voice conversion. The proposed method was evaluated on male-to-female voice conversion tasks in both Mandarin and English. Objective tests showed that the approach is effective in reducing the Root Mean Square Error, while subjective tests, including AB preference and ABX speaker similarity tests, also showed gains.
  • Keywords
    feature extraction; maximum likelihood estimation; mean square error methods; speech synthesis; ABX speaker; F0 modeling; F0 feature extraction; GMM modeling; continuous probability density; maximum likelihood sense; multispace distribution; normalized correlation coefficient; root mean square error; voice conversion; Correlation; Databases; Feature extraction; Hidden Markov models; Speech; Speech synthesis; Trajectory; F0 generation; Voice Conversion; Voicing Strength; v/u decision model
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on
  • Conference_Location
    Prague
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4577-0538-0
  • Electronic_ISBN
    1520-6149
  • Type

    conf

  • DOI
    10.1109/ICASSP.2011.5947371
  • Filename
    5947371