DocumentCode :
1479007
Title :
Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification
Author :
Kinnunen, Tomi ; Saeidi, Rahim ; Sedlák, Filip ; Lee, Kong Aik ; Sandberg, Johan ; Hansson-Sandsten, Maria ; Li, Haizhou
Author_Institution :
Sch. of Comput., Univ. of Eastern Finland, Joensuu, Finland
Volume :
20
Issue :
7
fYear :
2012
Firstpage :
1990
Lastpage :
2001
Abstract :
In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension to the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We first provide a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers: with a universal background model (GMM-UBM), a support vector machine (GMM-SVM), and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. Given these improvements and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing conventional MFCCs.
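The multitaper idea described in the abstract (several time-domain tapers, then averaging in the frequency domain) can be illustrated with a minimal sketch. The choices here are assumptions made for illustration only and are not taken from the record: the sine taper family, uniform averaging weights, the frame length, the number of tapers, and all function names are hypothetical. The sketch covers only the spectrum estimate; the mel filterbank, logarithm, and DCT steps that would turn it into MFCCs are omitted.

```python
import numpy as np

def sine_tapers(frame_len, num_tapers):
    """Orthonormal sine tapers (an assumed taper family, chosen for simplicity)."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, num_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))

def multitaper_spectrum(frame, tapers, weights=None):
    """Weighted average of the per-taper power spectra (uniform weights by default)."""
    num_tapers = tapers.shape[0]
    if weights is None:
        weights = np.full(num_tapers, 1.0 / num_tapers)
    # One windowed DFT per taper; rows are the individual spectrum estimates.
    eigenspectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return weights @ eigenspectra

def hamming_spectrum(frame):
    """Single-window baseline: unit-energy Hamming window followed by a DFT."""
    w = np.hamming(len(frame))
    w = w / np.sqrt(np.sum(w ** 2))
    return np.abs(np.fft.rfft(w * frame)) ** 2

def compare_variance(num_trials=500, frame_len=256, num_tapers=6, seed=0):
    """Estimate the per-bin variance of both estimators on white Gaussian noise."""
    rng = np.random.default_rng(seed)
    frames = rng.standard_normal((num_trials, frame_len))
    tapers = sine_tapers(frame_len, num_tapers)
    single = np.array([hamming_spectrum(f) for f in frames])
    multi = np.array([multitaper_spectrum(f, tapers) for f in frames])
    # Skip the DC and Nyquist bins, then average the variance over the rest.
    return single[:, 1:-1].var(axis=0).mean(), multi[:, 1:-1].var(axis=0).mean()

if __name__ == "__main__":
    var_single, var_multi = compare_variance()
    print("single-window variance:", var_single)
    print("multitaper variance:   ", var_multi)
```

On white noise the averaged estimate should show a clearly lower variance than the single-window one, which is the property the abstract attributes to multitaper features. The experiment is synthetic and does not reproduce any result reported in the paper.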
Keywords :
Gaussian processes; autoregressive processes; cepstral analysis; discrete Fourier transforms; estimation theory; frequency-domain analysis; speaker recognition; speech processing; statistical analysis; support vector machines; time-domain analysis; GMM-JFA; GMM-SVM; GMM-UBM; Gaussian mixture model; MFCC bias; MFCC extraction; MFCC variance; MinDCF; NIST 2002 SRE corpora; NIST 2008 SRE corpora; TIMIT corpus; audio applications; autoregressive process simulations; baseline windowed DFT; frequency-domain averaging; interview-interview condition; joint factor analysis; low-variance features; low-variance multitaper MFCC features; mel-frequency cepstral coefficients; multiple time-domain windows; multitaper method; noncritical parameter selection; robust speaker verification; short-term signal spectrum; speaker verification experiments; spectral leakage; spectrum estimate; speech applications; speech processing; statistical analysis; support vector machine; telephone data; universal background model; windowed discrete Fourier transform; Analytical models; Discrete Fourier transforms; Frequency domain analysis; Mel frequency cepstral coefficient; Robustness; Speech; Speech processing; Mel-frequency cepstral coefficient (MFCC); multitaper; small-variance estimation; speaker verification;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2012.2191960
Filename :
6175110