DocumentCode :
54103
Title :
Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training
Author :
Ling-Hui Chen ; Zhen-Hua Ling ; Li-Juan Liu ; Li-Rong Dai
Author_Institution :
Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
Volume :
22
Issue :
12
fYear :
2014
fDate :
Dec. 2014
Firstpage :
1859
Lastpage :
1872
Abstract :
This paper presents a new spectral envelope conversion method using deep neural networks (DNNs). Conventional joint density Gaussian mixture model (JDGMM) based spectral conversion methods perform stably and effectively. However, the speech generated by these methods suffers severe quality degradation due to two factors: 1) the inadequacy of the JDGMM in modeling the distribution of spectral features and the non-linear mapping relationship between the source and target speakers, and 2) the spectral detail loss caused by the use of high-level spectral features such as mel-cepstra. Previously, we proposed using the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. In this paper, we propose using a DNN to construct a global non-linear mapping between the spectral envelopes of two speakers. The proposed DNN is trained generatively by cascading two RBMs, which model the distributions of the spectral envelopes of the source and target speakers respectively, through a Bernoulli BAM (BBAM). The proposed training method therefore takes advantage of the strong ability of RBMs to model the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions used for conversion. Careful comparisons and analyses between the proposed method and several conventional methods are presented. The subjective evaluation results show that the proposed method significantly improves both similarity and naturalness compared to conventional methods.
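The sketch below is a minimal NumPy illustration of the layer-wise generative idea described in the abstract: one RBM is trained per speaker to model that speaker's spectral features, and the two hidden layers are then linked so that conversion passes from the source spectrum through the source hidden layer to the target hidden layer and back out. It is not the paper's implementation: the paper joins the hidden layers with a Bernoulli BAM and operates on full spectral envelopes, whereas here a plain least-squares linear map stands in for the BAM, the RBMs are trained with simple CD-1, and all data shapes, hyperparameters, and the random "training frames" are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_rbm(V, n_hidden=64, epochs=10, lr=1e-3):
    """One-step contrastive divergence (CD-1) for a Gaussian-Bernoulli RBM.
    V: (n_frames, n_dims) real-valued spectral features, assumed standardized."""
    n_vis = V.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_v = np.zeros(n_vis)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden activations driven by the data.
        p_h = sigmoid(V @ W + b_h)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Negative phase: one Gibbs step (Gaussian visible units use the mean).
        v_neg = h @ W.T + b_v
        p_h_neg = sigmoid(v_neg @ W + b_h)
        # CD-1 parameter updates.
        W += lr * (V.T @ p_h - v_neg.T @ p_h_neg) / len(V)
        b_v += lr * (V - v_neg).mean(axis=0)
        b_h += lr * (p_h - p_h_neg).mean(axis=0)
    return W, b_v, b_h

# Hypothetical parallel, time-aligned source/target spectral frames.
n_frames, n_dims = 2000, 40
X = rng.standard_normal((n_frames, n_dims))   # source-speaker features
Y = rng.standard_normal((n_frames, n_dims))   # target-speaker features

# Layer-wise generative pre-training: one RBM per speaker.
Wx, bx_v, bx_h = train_rbm(X)
Wy, by_v, by_h = train_rbm(Y)

# Link the two hidden spaces (the paper uses a Bernoulli BAM here;
# a least-squares linear map is used as a stand-in for illustration).
Hx = sigmoid(X @ Wx + bx_h)
Hy = sigmoid(Y @ Wy + by_h)
W_link = np.linalg.lstsq(Hx, Hy, rcond=None)[0]

def convert(x):
    """Source spectrum -> source hidden -> target hidden -> target spectrum."""
    hx = sigmoid(x @ Wx + bx_h)
    hy = sigmoid(hx @ W_link)
    return hy @ Wy.T + by_v   # Gaussian visible layer: output the mean

print(convert(X[:5]).shape)  # (5, 40)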
Keywords :
Boltzmann machines; speech synthesis; Bernoulli BAM; DNN; JDGMM; MoGBAM; MoRBM; deep neural networks; joint density Gaussian mixture model; layer-wise generative training; mel-cepstra; mixture of Gaussian bidirectional associative memories; mixture of restricted Boltzmann machines; quality degradation; spectral conversion methods; voice conversion; Covariance matrices; Joints; Neural networks; Speech; Speech processing; Stochastic processes; Training; Bidirectional associative memory; Gaussian mixture model; deep neural network; restricted Boltzmann machine; spectral envelope conversion; voice conversion;
fLanguage :
English
Journal_Title :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2014.2353991
Filename :
6891242