Language recognition using deep-structured conditional random fields

Author

Yu, Dong ; Wang, Shizhen ; Karam, Zahi ; Deng, Li

Author_Institution

Microsoft Res., Redmond, WA, USA

fYear

2010

fDate

14-19 March 2010

Firstpage

5030

Lastpage

5033

Abstract

We present a novel language identification technique using our recently developed deep-structured conditional random fields (CRFs). The deep-structured CRF is a multi-layer CRF model in which each higher layer's input observation sequence consists of the lower layer's observation sequence and the resulting lower layer's frame-level marginal probabilities. In this paper we extend the original deep-structured CRF by allowing for distinct state representations at different layers and demonstrate its benefits. We propose an unsupervised algorithm to pre-train the intermediate layers by casting it as a multi-objective programming problem that is aimed at minimizing the average frame-level conditional entropy while maximizing the state occupation entropy. Empirical evaluation on a seven-language/dialect voice mail routing task showed that our approach can achieve a routing accuracy (RA) of 86.4% and average equal error rate (EER) of 6.6%. These results are significantly better than the 82.5% RA and 7.5% average EER obtained using the Gaussian mixture model trained with the maximum mutual information criterion, but slightly worse than the 87.7% RA and 6.4% EER achieved using the support vector machine with model pushing on the Gaussian super vector (GSV).
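
The two ideas named in the abstract (feeding a layer's frame-level marginals upward together with its observations, and the two entropy terms used for pre-training) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the nested-list data layout, the exact entropy definitions, and the single `weight` used to combine the two objectives are all assumptions made for illustration; the abstract only states that one entropy is minimized while the other is maximized.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def avg_frame_conditional_entropy(posteriors):
    """Mean over frames of the entropy of each frame's state posterior.

    posteriors[t][s] is assumed to hold the marginal probability of
    state s at frame t (each row sums to 1).
    """
    return sum(entropy(frame) for frame in posteriors) / len(posteriors)


def state_occupation_entropy(posteriors):
    """Entropy of the state distribution averaged over all frames."""
    n_frames = len(posteriors)
    n_states = len(posteriors[0])
    occupation = [
        sum(frame[s] for frame in posteriors) / n_frames
        for s in range(n_states)
    ]
    return entropy(occupation)


def next_layer_input(observations, posteriors):
    """Append each frame's marginals to that frame's observation vector."""
    return [list(obs) + list(post) for obs, post in zip(observations, posteriors)]


def pretraining_score(posteriors, weight=1.0):
    """Hypothetical scalarized objective (lower is better).

    Confident per-frame decisions lower the first term; spreading frames
    evenly over states raises the second term, which is subtracted.
    """
    return (avg_frame_conditional_entropy(posteriors)
            - weight * state_occupation_entropy(posteriors))
```

Under these assumed definitions, a layer that assigns each frame to one state with certainty, while using all states equally often across frames, scores best: its frame-level entropy is zero and its occupation entropy is at its maximum.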

Keywords

entropy; random processes; speech recognition; support vector machines; voice mail; Gaussian super vector; deep-structured conditional random fields; equal error rate; frame-level conditional entropy; frame-level marginal probabilities; language identification technique; language recognition; multiobjective programming problem; observation sequence; state occupation entropy; support vector machine; voice mail routing task; Automatic speech recognition; Casting; Entropy; Error analysis; Mutual information; Routing; Support vector machine classification; Support vector machines; Unsupervised learning; Voice mail; conditional random field; deep learning; deep-structure; language identification; unsupervised learning

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on

Conference_Location

Dallas, TX

ISSN

1520-6149

Print_ISBN

978-1-4244-4295-9

Electronic_ISBN

1520-6149

Type

conf

DOI

10.1109/ICASSP.2010.5495072

Filename

5495072