Regional Information Center for Science and Technology - Audiovisual corpus to analyze whisper speech

DocumentCode :

1694312

Title :

Audiovisual corpus to analyze whisper speech

Author :

Tran, Thomas ; Mariooryad, S. ; Busso, Carlos

Author_Institution :

Dept. of Electr. Eng., Multimodal Signal Process. (MSP) Lab., Univ. of Texas at Dallas, Richardson, TX, USA

fYear :

2013

Firstpage :

8101

Lastpage :

8105

Abstract :

Current automatic speech recognition (ASR) systems cannot recognize whisper speech with high accuracy. ASR systems are trained with neutral speech, which has significant acoustic differences from whisper speech (i.e., energy, duration, harmonics structure, and spectral slope). Given the limitations of speech-based systems in processing whisper speech, we propose to explore the benefits of visual features describing the orofacial area. We hypothesize that the lips' articulation is similar in whisper and neutral speech, providing a valuable whisper-invariant modality. This paper introduces the first audiovisual corpus of whisper speech. While we are targeting over 40 speakers, the current corpus has recordings from eleven subjects who were asked to read TIMIT sentences and isolated digits, alternating between neutral and whisper speech. The corpus also includes spontaneous recordings, in which the subjects answered a series of general questions. The paper also analyzes an exhaustive set of audiovisual features, including action units (AUs), lip spreading, fundamental frequency, intensity, MFCCs, and formants. We study the differences in the features' distributions between whisper and neutral speech using Kullback-Leibler divergence (KLD). Then, we conduct statistical tests to determine whether the differences in the features are statistically significant. The results support our hypothesis that visual features are less affected by whisper speech.
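
The KLD comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the histogram-based estimator, the bin count, the smoothing constant `eps`, the function name `kl_divergence`, and the sample values are all assumptions made for illustration only.

```python
import math


def kl_divergence(p_samples, q_samples, bins=10, eps=1e-9):
    """Histogram-based estimate of KL(P || Q), in nats, for two 1-D samples.

    Assumption: both samples are binned on a shared range, and a small
    constant `eps` is added to every bin so that empty bins do not cause
    a division by zero or log(0).
    """
    lo = min(min(p_samples), min(q_samples))
    hi = max(max(p_samples), max(q_samples))
    # Fall back to a unit width when all values are identical.
    width = (hi - lo) / bins or 1.0

    def histogram(samples):
        counts = [0] * bins
        for x in samples:
            # Clamp the top edge into the last bin.
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        total = len(samples)
        return [(c + eps) / (total + eps * bins) for c in counts]

    p = histogram(p_samples)
    q = histogram(q_samples)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical feature values (made up, not taken from the corpus).
neutral = [62.0, 64.5, 66.0, 63.2, 65.1, 67.3]
whisper = [48.0, 50.2, 47.5, 51.0, 49.3, 52.1]
print(kl_divergence(neutral, whisper))
```

A larger value indicates that the feature's distribution shifts more between the two speech modes. Note that KLD is asymmetric, so `kl_divergence(neutral, whisper)` and `kl_divergence(whisper, neutral)` generally differ.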

Keywords :

speech recognition; statistical analysis; ASR systems; AU; KLD; Kullback-Leibler divergence; TIMIT sentences; acoustic differences; action units; audiovisual corpus; automatic speech recognition systems; features distributions; general questions; harmonics structure; isolated digits; lips articulation; neutral speech; orofacial area; spectral slope; statistical test; whisper speech analysis; whisper-invariant modality; Acoustics; Feature extraction; Gold; Speech; Speech processing; Speech recognition; Visualization; Audiovisual corpus; whisper speech;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on

Conference_Location :

Vancouver, BC

ISSN :

1520-6149

Type :

conf

DOI :

10.1109/ICASSP.2013.6639243

Filename :

6639243

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1694312