A fusion approach to spoken language identification based on combining multiple phone recognizers and speech attribute detectors

Author

Yannan Wang ; Jun Du ; Lirong Dai ; Chin-Hui Lee

Author_Institution

Nat. Eng. Lab. for Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China

fYear

2014

fDate

12-14 Sept. 2014

Firstpage

158

Lastpage

162

Abstract

We propose a fusion approach to spoken language recognition that combines multiple tokenizers built from phone and speech attribute models, trained on a collection of multilingual corpora with different front-end features. The speech attribute models are trained with bottleneck features extracted from deep neural networks, while the phone models are trained with temporal patterns neural network features. By exploiting different combinations of front-end features, fundamental speech units and tokenization models, we demonstrate that speech attribute units are complementary to phone units and yield improved performance when combined with conventional phone-based tokenizers. On the National Institute of Standards and Technology (NIST) 2009 language recognition evaluation task, exploiting diversity in system combination, we find that speech attribute recognition followed by language modeling achieves an additional average relative equal error rate reduction of more than 20% when fused with state-of-the-art systems based on phone recognition followed by language modeling.
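
As a rough illustration of the quantities named in the abstract, the following Python sketch computes an equal error rate (EER) from detection scores, fuses two systems' scores, and reports the relative EER reduction. The abstract does not say how the systems are fused, so the equal-weight linear score fusion shown here is an assumption, and all function names are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a target trial scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer


def fuse(scores_a, scores_b, weight=0.5):
    """Linear score-level fusion (assumed; not necessarily the paper's method)."""
    return [weight * a + (1 - weight) * b for a, b in zip(scores_a, scores_b)]


def relative_reduction(baseline_eer, fused_eer):
    """Relative EER reduction of the fused system over the baseline."""
    return (baseline_eer - fused_eer) / baseline_eer
```

Under this definition, a baseline EER of 5% that drops to 4% after fusion corresponds to a relative reduction of 20%. These numbers are illustrative only and are not results from the paper.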

Keywords

feature extraction; neural nets; speech recognition; bottleneck feature extraction; front-end features; fusion approach; language modeling; language recognition evaluation task; multilingual corpora; phone attribute models; phone based tokenizers; phone recognition; phone recognizers; phone units; speech attribute detectors; speech attribute models; speech attribute recognition; speech attribute units; spoken language identification; spoken language recognition; temporal pattern neural network features; tokenization models; Acoustics; Feature extraction; Hidden Markov models; NIST; Neural networks; Speech; Speech recognition; automatic speech attribute transcription; bottleneck features; deep neural network; manner and place of articulation; phone recognition followed by language modeling; phonetic features

fLanguage

English

Publisher

IEEE

Conference_Titel

Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on

Conference_Location

Singapore

Type

conf

DOI

10.1109/ISCSLP.2014.6936714

Filename

6936714