Deep neural network based acoustic model using speaker-class information for short time utterance

Author

Hiroshi Seki; Kazumasa Yamamoto; Seiichi Nakagawa

Author_Institution

Toyohashi University of Technology, Aichi, Japan

Year

2015

Firstpage

1222

Lastpage

1225

Abstract

In speech recognition, it is preferable not to assume details of the target user, such as age or gender. However, speaker independence is one of the factors that degrade ASR performance. In this work, we propose a speaker adaptation method for recognizing short utterances. Several studies on speaker-independent DNN-HMMs compute an i-vector and combine this additional information with the acoustic features. However, it is difficult to estimate an i-vector accurately or to apply speaker adaptation (e.g., fMLLR) when the utterance is short (0.5 s and up). In our approach, we calculate a similarity score between each speaker class and the target utterance and use speaker-class information prepared in advance. As a precondition for recognizing short utterances, we restrict the available data to the first 50 frames of each utterance. In experiments, we obtained a 4.0% relative WER improvement over a conventional DNN-HMM.
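
The abstract does not say how the similarity score is computed or how it is passed to the DNN, so the following is only a minimal sketch under stated assumptions: each speaker class is modeled by a single diagonal Gaussian, the score is the average log-likelihood of the first 50 frames normalized with a softmax, and the score vector is appended to every acoustic frame. The names `speaker_class_scores`, `augment_features`, and `class_models` are hypothetical and not taken from the paper.

```python
import math

MAX_FRAMES = 50  # the abstract restricts scoring to the first 50 frames


def avg_log_likelihood(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian.

    Assumption: one Gaussian per speaker class; the paper may use another model.
    """
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def speaker_class_scores(utterance, class_models, max_frames=MAX_FRAMES):
    """Return one softmax-normalized similarity score per speaker class."""
    frames = utterance[:max_frames]  # ignore everything after the first frames
    if not frames:
        raise ValueError("utterance has no frames")
    logliks = [avg_log_likelihood(frames, mean, var) for mean, var in class_models]
    peak = max(logliks)  # subtract the maximum for numerical stability
    weights = [math.exp(ll - peak) for ll in logliks]
    norm = sum(weights)
    return [w / norm for w in weights]


def augment_features(utterance, scores):
    """Append the class-score vector to every acoustic frame (assumed input form)."""
    return [list(frame) + list(scores) for frame in utterance]


if __name__ == "__main__":
    # Two hypothetical speaker classes in a toy 2-dimensional feature space.
    class_models = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]
    utterance = [[0.1, -0.2] for _ in range(60)]
    scores = speaker_class_scores(utterance, class_models)
    augmented = augment_features(utterance, scores)
    print(scores, len(augmented[0]))
```

Because only a fixed, short prefix of the utterance is scored, the class scores are available after 50 frames and do not require a per-speaker i-vector or fMLLR transform to be estimated.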

Keywords

"Training data","Acoustics","Hidden Markov models","Speech recognition","Speech","Data models","Databases"

Publisher

IEEE

Conference_Title

2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)

Type

conf

DOI

10.1109/APSIPA.2015.7415467

Filename

7415467