• DocumentCode
    2019599
  • Title

    Comparison of feature extraction methods for speech recognition in noise-free and in traffic noise environment

  • Author

    Sárosi, Gellért ; Mozsáry, Mihály ; Mihajlik, Péter ; Fegyó, Tibor

  • Author_Institution
    Dept. of Telecommun. & Media Inf., Budapest Univ. of Technol. & Econ., Budapest, Hungary
  • fYear
    2011
  • fDate
    18-21 May 2011
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Acoustic feature extraction is a crucial part of a speech recognizer, especially when the application is intended for use in a noisy environment. In this paper we investigate several novel front-end techniques and compare them to multiple baselines. Recognition tests were performed on studio-quality wideband recordings in Hungarian as well as on narrowband telephone speech with real-life noise collected in six languages: English, German, French, Italian, Spanish and Hungarian. The following baseline feature types were used with several settings: Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) features, implemented in HTK, in SPHINX, or by ourselves. Novel methods include Perceptual Minimum Variance Distortionless Response (PMVDR) and multiple variations of the Power-Normalized Cepstral Coefficients (PNCC). Adaptive techniques are also applied to reduce convolutive distortions. We observed a significant difference between the MFCC implementations, and the PNCC variations that proved useful differed considerably across bandwidths and noise conditions.
  • Keywords
    cepstral analysis; feature extraction; speech recognition; traffic; MFCC; Mel Frequency Cepstral Coefficient; acoustic feature extraction; front end technique; narrow band telephone speech; perceptual linear prediction; perceptual minimum variance distortionless response; studio quality; traffic noise; wide band recording; Databases; Hidden Markov models; Noise; Training; multiple languages; multiple sample rates; real-life and white noise; varied SNR
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Speech Technology and Human-Computer Dialogue (SpeD), 2011 6th Conference on
  • Conference_Location
    Brasov
  • Print_ISBN
    978-1-4577-0440-6
  • Type

    conf

  • DOI
    10.1109/SPED.2011.5940729
  • Filename
    5940729