Author_Institution :
Dipt. di Autom. e Inf., Politec. di Torino, Turin, Italy
Abstract :
A low-dimensional representation of a speech segment, the so-called i-vector, in combination with probabilistic linear discriminant analysis (PLDA) models, is the current state of the art in speaker recognition. An i-vector is a compact representation of a Gaussian Mixture Model (GMM) supervector that captures most of the supervector variability. It is usually obtained as the MAP estimate of the mean of a posterior distribution. A new PLDA model has recently been presented that, unlike the standard one, exploits the intrinsic i-vector uncertainty. This approach, referred to in this paper as Full Posterior Distribution PLDA (FP-PLDA), is particularly effective for speaker detection on short speech segments of variable duration. It is, however, computationally far more expensive than standard PLDA, which makes it unattractive for real applications. This paper presents three simplifications of FP-PLDA based on approximate diagonalizations of the matrices involved in FP-PLDA scoring. Applying these approximations in sequence yields a computational cost comparable to that of PLDA models, with only a small performance degradation with respect to the more accurate, but less efficient, FP-PLDA models. In particular, up to 10% better performance than PLDA is obtained, at similar computational complexity, on short speech segments of variable duration randomly extracted from the interviews and telephone conversations included in the NIST SRE 2010 extended dataset. The benefits of the proposed diagonalization approaches have also been confirmed on a short-utterance text-independent verification task, where improvements of approximately 43% in EER and 34% in minimum DCF08 have been obtained with respect to PLDA.
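The diagonalization idea behind the abstract can be illustrated with a minimal sketch of generic Gaussian two-covariance PLDA scoring (not the paper's FP-PLDA, and all variable names and the toy data are hypothetical): simultaneously diagonalizing the between- and within-speaker covariances turns the matrix operations in the verification log-likelihood ratio into element-wise ones, without changing the score.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 8  # toy i-vector dimension

# Random SPD between-speaker (B) and within-speaker (W) covariances.
A1 = rng.standard_normal((n, n)); B = A1 @ A1.T + n * np.eye(n)
A2 = rng.standard_normal((n, n)); W = A2 @ A2.T + n * np.eye(n)

x1 = rng.standard_normal(n)  # enrollment i-vector (zero mean assumed)
x2 = rng.standard_normal(n)  # test i-vector

# --- Full-matrix two-covariance PLDA log-likelihood ratio ---
# Same-speaker hypothesis: [x1; x2] jointly Gaussian with cross-covariance B.
cov_same = np.block([[B + W, B], [B, B + W]])
score_full = (multivariate_normal.logpdf(np.concatenate([x1, x2]),
                                         mean=np.zeros(2 * n), cov=cov_same)
              - multivariate_normal.logpdf(x1, mean=np.zeros(n), cov=B + W)
              - multivariate_normal.logpdf(x2, mean=np.zeros(n), cov=B + W))

# --- Simultaneous diagonalization: T W T' = I, T B T' = diag(d) ---
w, E = np.linalg.eigh(W)
Wm = (E / np.sqrt(w)).T                 # whitens W
d, U = np.linalg.eigh(Wm @ B @ Wm.T)
T = U.T @ Wm

z1, z2 = T @ x1, T @ x2
# Per-dimension scalar LLR: every matrix operation is now element-wise.
var = 1.0 + d                           # marginal variance per dimension
det2 = 1.0 + 2.0 * d                    # det of the 2x2 same-speaker covariance
quad_same = (var * (z1**2 + z2**2) - 2.0 * d * z1 * z2) / det2
score_diag = np.sum(-0.5 * (np.log(det2) + quad_same)
                    + 0.5 * (np.log(var) + z1**2 / var)
                    + 0.5 * (np.log(var) + z2**2 / var))
```

The log-likelihood ratio is invariant under the invertible transform `T` (the Jacobian terms cancel between the numerator and denominator), so `score_diag` matches `score_full` while replacing matrix inversions and determinants with scalar operations per dimension.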
Keywords :
Gaussian processes; computational complexity; matrix algebra; maximum likelihood estimation; mixture models; speaker recognition; statistical distributions; EER; FP-PLDA; GMM supervector representation; Gaussian mixture model supervector representation; MAP estimation; NIST SRE 2010 extended dataset; full posterior distribution PLDA; i-vectors; interview conversation; matrix approximate diagonalization; posterior distribution; probabilistic linear discriminant analysis; speaker detection; speech segment low-dimensional representation; telephone conversation; Computational modeling; IEEE transactions; Speech; Speech processing; Speech recognition; Standards; Uncertainty; I-vector extraction; probabilistic linear discriminant analysis (PLDA)
Journal_Title :
IEEE/ACM Transactions on Audio, Speech, and Language Processing