• DocumentCode
    1328483
  • Title

    Learning Speaker-Specific Characteristics With a Deep Neural Architecture

  • Author

    Chen, Ke ; Salman, Ahmad

  • Author_Institution
    Sch. of Comput. Sci., Univ. of Manchester, Manchester, UK
  • Volume
    22
  • Issue
    11
  • fYear
    2011
  • Firstpage
    1744
  • Lastpage
    1756
  • Abstract
    Speech signals convey various yet mixed information, ranging from linguistic to speaker-specific information. However, most acoustic representations characterize all these kinds of information as a whole, which could hinder either a speech or a speaker recognition (SR) system from achieving better performance. In this paper, we propose a novel deep neural architecture (DNA) specifically for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we devise an objective function consisting of contrastive losses in terms of speaker similarity/dissimilarity and data reconstruction losses used as regularization to normalize the interference of non-speaker-related information. Moreover, we employ a hybrid learning strategy for learning the parameters of the deep neural networks, i.e., local yet greedy layerwise unsupervised pretraining for initialization and global supervised learning for the ultimate discriminative goal. With four Linguistic Data Consortium (LDC) benchmarks and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, regardless of whether their utterances have been used in training our DNA, and highly insensitive to the text and languages spoken. Extensive comparative studies suggest that our approach yields favorable results in speaker verification and segmentation. Finally, we discuss several issues concerning our proposed approach.
  • Keywords
    cepstral analysis; learning (artificial intelligence); natural languages; neural net architecture; speaker recognition; acoustic representation; data reconstruction loss; deep neural architecture; hybrid learning strategy; languages spoken; learning speaker-specific characteristics; linguistic data consortium benchmark; mel-frequency cepstral coefficient; nonEnglish corpora; nonspeaker-related information; speaker recognition system; speaker segmentation; speaker similarity; speaker verification; speaker-specific characteristics learning; speech signal; supervised learning; DNA; Learning systems; Neurons; Speech; Speech processing; Speech recognition; Strontium; Deep neural architecture; hybrid learning strategy; overcomplete representation; speaker comparison; speaker segmentation; speaker verification; speaker-specific characteristics; Algorithms; Artificial Intelligence; Computer Systems; Discrimination Learning; Humans; Language; Models, Neurological; Neural Networks (Computer); Normal Distribution; Speech Recognition Software;
  • fLanguage
    English
  • Journal_Title
    Neural Networks, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9227
  • Type

    jour

  • DOI
    10.1109/TNN.2011.2167240
  • Filename
    6026951