DocumentCode :
1411520
Title :
Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help
Author :
Petridis, S. ; Pantic, Maja
Author_Institution :
Dept. of Comput., Imperial Coll. London, London, UK
Volume :
13
Issue :
2
fYear :
2011
fDate :
4/1/2011
Firstpage :
216
Lastpage :
234
Abstract :
Past research on automatic laughter classification/detection has focused mainly on audio-based approaches. Here we present an audiovisual approach to discriminating laughter from speech and show that integrating information from the audio and video channels may lead to improved performance over single-modal approaches. Each channel consists of two streams (cues): facial expressions and head pose for video, and cepstral and prosodic features for audio. Two types of experiments were performed: 1) subject-independent cross-validation on the AMI dataset and 2) cross-database experiments on the AMI and SAL datasets. We experimented with different combinations of cues, the most informative being the combination of facial expressions, cepstral, and prosodic features. Our results suggest that the audiovisual approach performs better on average than the single-modal approaches, and that the addition of visual information is particularly beneficial for female subjects. When the training conditions are less diverse in terms of head movements than the testing conditions (training on the SAL dataset, testing on the AMI dataset), no improvement is observed from adding visual information. When the training conditions are similar to (cross-validation on the AMI dataset) or more diverse than (training on the AMI dataset, testing on the SAL dataset) the testing conditions in terms of head movements, an absolute increase of about 3% in the F1 rate for laughter is observed when visual information is added to the audio information.
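The following is a minimal, hypothetical sketch (not the authors' implementation) of the kind of feature-level audiovisual fusion the abstract describes: facial-expression, cepstral, and prosodic features are concatenated, reduced with PCA, and fed to a small neural network, with laughter-versus-speech performance scored by the F1 rate. All feature dimensions, the synthetic data, and the classifier settings below are illustrative assumptions.

# Minimal sketch of feature-level audiovisual fusion for laughter-vs-speech
# classification (PCA + small neural network, F1 evaluation).
# Feature dimensions and data are illustrative placeholders, not the paper's.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400                                  # number of (synthetic) episodes
cepstral = rng.normal(size=(n, 13))      # e.g., MFCC statistics per episode
prosodic = rng.normal(size=(n, 6))       # e.g., pitch/energy statistics
facial   = rng.normal(size=(n, 20))      # e.g., facial-expression features
labels   = rng.integers(0, 2, size=n)    # 1 = laughter, 0 = speech

# Most informative cue combination reported: facial expressions + cepstral + prosodic.
fused = np.hstack([facial, cepstral, prosodic])

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),              # keep 95% of variance (illustrative choice)
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)

# Subject-independent evaluation would group folds by subject; plain 5-fold CV shown here.
f1_scores = cross_val_score(clf, fused, labels, cv=5, scoring="f1")
print("mean F1 (laughter):", f1_scores.mean())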
Keywords :
face recognition; feature extraction; human computer interaction; speech recognition; video signal processing; audio channel; audio information; audiovisual discrimination; cepstral feature; facial expression feature; head pose feature; laughter information; prosodic feature; speech information; subject-independent cross-validation; video channel; visual information; Human behavior analysis; laughter-versus-speech discrimination; neural networks; principal components analysis (PCA);
fLanguage :
English
Journal_Title :
IEEE Transactions on Multimedia
Publisher :
IEEE
ISSN :
1520-9210
Type :
jour
DOI :
10.1109/TMM.2010.2101586
Filename :
5674087