DocumentCode :
1694312
Title :
Audiovisual corpus to analyze whisper speech
Author :
Tran, Thomas ; Mariooryad, S. ; Busso, Carlos
Author_Institution :
Dept. of Electr. Eng., Multimodal Signal Process. (MSP) Lab., Univ. of Texas at Dallas, Richardson, TX, USA
fYear :
2013
Firstpage :
8101
Lastpage :
8105
Abstract :
Current automatic speech recognition (ASR) systems cannot recognize whisper speech with high accuracy. ASR systems are trained with neutral speech, which has significant acoustic differences from whisper speech (i.e., energy, duration, harmonics structure, and spectral slope). Given the limitations of speech-based systems in processing whisper speech, we propose to explore the benefits of visual features describing the orofacial area. We hypothesize that the lips' articulation is similar in whisper and neutral speech, providing a valuable whisper-invariant modality. This paper introduces the first audiovisual corpus of whisper speech. While we are targeting over 40 speakers, the current corpus has recordings from eleven subjects who were asked to read TIMIT sentences and isolated digits, alternating between neutral and whisper speech. The corpus also includes spontaneous recordings, in which the subjects answered a series of general questions. The paper also analyzes an exhaustive set of audiovisual features, including action units (AUs), lip spreading, fundamental frequency, intensity, MFCCs, and formants. We study the differences in the features' distributions between whisper and neutral speech using Kullback-Leibler divergence (KLD). Then, we conduct statistical tests to determine whether the differences in the features are statistically significant. The results support our hypothesis that visual features are less affected by whisper speech.
Keywords :
speech recognition; statistical analysis; ASR systems; AU; KLD; Kullback-Leibler divergence; TIMIT sentences; acoustic differences; action units; audiovisual corpus; automatic speech recognition systems; features distributions; general questions; harmonics structure; isolated digits; lips articulation; neutral speech; orofacial area; spectral slope; statistical test; whisper speech analysis; whisper-invariant modality; Acoustics; Feature extraction; Gold; Speech; Speech processing; Speech recognition; Visualization; Audiovisual corpus; whisper speech;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
Conference_Location :
Vancouver, BC
ISSN :
1520-6149
Type :
conf
DOI :
10.1109/ICASSP.2013.6639243
Filename :
6639243