Conditional restricted Boltzmann machine for voice conversion

Author

Zhizheng Wu ; Eng Siong Chng ; Haizhou Li

Author_Institution

Sch. of Comput. Eng., Nanyang Technol. Univ. (NTU), Singapore, Singapore

fYear

2013

fDate

6-10 July 2013

Firstpage

104

Lastpage

108

Abstract

Conventional statistical transformation functions for voice conversion have been shown to suffer from over-smoothing and over-fitting problems. Over-smoothing arises from the statistical averaging performed when estimating the model parameters of the transformation function. In addition, the large number of parameters in the statistical model cannot be estimated reliably from limited parallel training data, which results in over-fitting. In this work, we investigate a robust transformation function for voice conversion using a conditional restricted Boltzmann machine (CRBM). The CRBM, which performs linear and non-linear transformations simultaneously, is proposed to learn the relationship between source and target speech. The CMU ARCTIC corpus is used for experimental validation, with the number of parallel training utterances varied from 2 to 40. Across these training conditions, two objective evaluation measures, mel-cepstral distortion and correlation coefficient, both show that the proposed method consistently outperforms the mainstream joint density Gaussian mixture model (JD-GMM) method.
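The two objective measures named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's evaluation code. It makes three assumptions: the target and converted mel-cepstral frame sequences are already time-aligned (for example by dynamic time warping), the 0th (energy) coefficient has been removed from each frame, and the correlation coefficient is a Pearson correlation computed per cepstral dimension across frames and then averaged over dimensions. The function names `mel_cepstral_distortion` and `mean_correlation` are hypothetical.

```python
import math
from typing import Sequence

Frame = Sequence[float]


def mel_cepstral_distortion(target: Sequence[Frame], converted: Sequence[Frame]) -> float:
    """Frame-averaged mel-cepstral distortion in dB.

    Assumption: frames are time-aligned and coefficient 0 (energy) is excluded.
    Per frame: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2).
    """
    if not target or len(target) != len(converted):
        raise ValueError("need two non-empty, equal-length frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t, c in zip(target, converted):
        squared = sum((a - b) ** 2 for a, b in zip(t, c))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(target)


def mean_correlation(target: Sequence[Frame], converted: Sequence[Frame]) -> float:
    """Pearson correlation per cepstral dimension across frames, averaged over dimensions.

    Assumption: this averaging scheme is one common choice, not necessarily the paper's.
    """
    if not target or len(target) != len(converted):
        raise ValueError("need two non-empty, equal-length frame sequences")
    n = len(target)
    dims = len(target[0])
    coeffs = []
    for d in range(dims):
        x = [frame[d] for frame in target]
        y = [frame[d] for frame in converted]
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        if vx == 0.0 or vy == 0.0:
            raise ValueError(f"dimension {d} has zero variance")
        coeffs.append(cov / math.sqrt(vx * vy))
    return sum(coeffs) / dims


if __name__ == "__main__":
    target = [[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]]
    # A converted sequence offset by +1 in every coefficient: perfectly correlated,
    # but with a constant non-zero distortion.
    converted = [[v + 1.0 for v in frame] for frame in target]
    print(mel_cepstral_distortion(target, converted))
    print(mean_correlation(target, converted))
```

The example shows why the two measures complement each other: a constant offset leaves the correlation at 1.0 while the distortion stays non-zero, so a method must do well on both to track the target spectra closely.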

Keywords

Boltzmann machines; parameter estimation; speech synthesis; statistical analysis; CMU ARCTIC corpus; Mel-cepstral distortion; conditional restricted Boltzmann machine; correlation coefficient; linear transformations; model parameter estimation; nonlinear transformations; objective evaluation measures; overfitting problems; oversmoothing problems; parallel training data; robust transformation function; statistical model; statistical-based transformation functions; mainstream joint density Gaussian mixture model method; target speech processing; voice conversion; Correlation; Distortion measurement; Speech; Speech processing; Training; Training data; Vectors;

fLanguage

English

Publisher

IEEE

Conference_Titel

Signal and Information Processing (ChinaSIP), 2013 IEEE China Summit & International Conference on

Conference_Location

Beijing

Type

conf

DOI

10.1109/ChinaSIP.2013.6625307

Filename

6625307