• DocumentCode
    3604242
  • Title

    Speaker-Adaptive Acoustic-Articulatory Inversion Using Cascaded Gaussian Mixture Regression

  • Author

    Hueber, Thomas ; Girin, Laurent ; Alameda-Pineda, Xavier ; Bailly, Gerard

  • Author_Institution
    GIPSA-Lab., Univ. of Grenoble Alpes, St. Martin d'Hères, France
  • Volume
    23
  • Issue
    12
  • fYear
    2015
  • Firstpage
    2246
  • Lastpage
    2259
  • Abstract
    This paper addresses the adaptation of an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data. In the context of pronunciation training, a virtual talking head displaying the internal speech articulators (e.g., the tongue) could be automatically animated by means of such a model using only the speaker's voice. In this study, the articulatory-acoustic relationship of the reference speaker is modeled by a Gaussian mixture model (GMM). To address the speaker adaptation problem, we propose a new framework called cascaded Gaussian mixture regression (C-GMR), and derive two implementations. The first one, referred to as Split-C-GMR, is a straightforward chaining of two distinct GMRs: one mapping the acoustic features of the source speaker into the acoustic space of the reference speaker, and the other estimating the articulatory trajectories with the reference model. In the second implementation, referred to as Integrated-C-GMR, the two mapping steps are tied together in a single probabilistic model. For this latter model, we present the full derivation of the exact EM training algorithm, which explicitly exploits the missing-data methodology of machine learning. Other adaptation schemes based on maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), and direct cross-speaker acoustic-to-articulatory GMR are also investigated. Experiments conducted on two speakers with different amounts of adaptation data show the value of the proposed C-GMR techniques.
  • Keywords
    Gaussian processes; acoustic signal processing; adaptive signal processing; learning (artificial intelligence); maximum likelihood estimation; mixture models; regression analysis; speaker recognition; C-GMR model; GMM; Gaussian mixture model; MAP scheme; MLLR scheme; cascaded Gaussian mixture regression; direct cross-speaker acoustic-to-articulatory GMR; internal speech articulator; machine learning; maximum likelihood linear regression scheme; maximum a posteriori scheme; missing data method; pronunciation training; single probabilistic model; source speaker acoustic feature mapping; speaker adaptive acoustic articulatory inversion; virtual talking head; Acoustics; Adaptation models; Context modeling; Data models; Hidden Markov models; Magnetic heads; Speech processing; Acoustic-articulatory inversion; EM algorithm; Gaussian mixture regression; pronunciation training; speaker adaptation; speech production; talking head
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type

    jour

  • DOI
    10.1109/TASLP.2015.2464702
  • Filename
    7180325