Detection of Inconsistency Between Subject and Speaker Based on the Co-occurrence of Lip Motion and Voice Towards Speech Scene Extraction from News Videos

Author

Kumagai, Shogo ; Doman, Keisuke ; Takahashi, Tomokazu ; Deguchi, Daisuke ; Ide, Ichiro ; Murase, Hiroshi

Author_Institution

Grad. Sch. of Inf. Sci., Nagoya Univ., Nagoya, Japan

fYear

2011

fDate

5-7 Dec. 2011

Firstpage

311

Lastpage

318

Abstract

We propose a method to detect inconsistency between a subject and the speaker in order to extract speech scenes from news videos. Speech scenes in news videos contain a wealth of multimedia information and are valuable as archived material. One approach to extracting speech scenes from news videos uses the position and size of a face region. However, this approach alone is insufficient, since news videos contain non-speech scenes where the speaker is not the subject, such as narrated scenes. To solve this problem, we propose a method to discriminate between speech scenes and narrated scenes based on the co-occurrence between a subject's lip motion and the speaker's voice. The proposed method uses lip shape and degree of lip opening as visual features representing a subject's lip motion, and voice volume and phoneme as audio features representing a speaker's voice. It then discriminates between speech scenes and narrated scenes based on the correlations between these features. We report the results of experiments on videos captured under laboratory conditions and on actual broadcast news videos. The results showed the effectiveness of our method and the feasibility of our research goal.
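
The correlation-based discrimination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes only two of the four features (per-frame degree of lip opening and voice volume), uses a plain Pearson correlation as the co-occurrence measure, and applies a hypothetical threshold of 0.5. The function names `pearson` and `is_speech_scene` are invented for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # no variation means no measurable co-occurrence
    return cov / math.sqrt(var_x * var_y)


def is_speech_scene(lip_opening: Sequence[float],
                    voice_volume: Sequence[float],
                    threshold: float = 0.5) -> bool:
    """Label a shot as a speech scene when lip motion and voice co-vary.

    The threshold is an assumed value, not one taken from the paper.
    """
    return pearson(lip_opening, voice_volume) >= threshold


# Toy per-frame values (made up for illustration only).
lip = [0.1, 0.8, 0.2, 0.9, 0.1, 0.7]
synced_voice = [0.2, 0.9, 0.3, 1.0, 0.1, 0.8]     # voice follows the lips
narrator_voice = [0.9, 0.1, 0.8, 0.2, 0.9, 0.3]   # voice unrelated to the lips

print(is_speech_scene(lip, synced_voice))
print(is_speech_scene(lip, narrator_voice))
```

In this toy setting, a shot whose voice volume rises and falls with the subject's lip opening is treated as a speech scene, while a shot where the two signals do not move together is treated as a narrated scene. A fuller version would also need the lip shape and phoneme features named in the abstract.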

Keywords

feature extraction; object detection; video signal processing; lip motion co-occurrence; lip opening degree; lip shape; news video; phoneme; speech scene extraction; subject-speaker inconsistency detection; voice co-occurrence; voice volume; Accuracy; Face; Speech; Vectors; Videos; Visualization; audiovisual integration; correlation; lip motion; news videos

fLanguage

English

Publisher

IEEE

Conference_Titel

Multimedia (ISM), 2011 IEEE International Symposium on

Conference_Location

Dana Point, CA

Print_ISBN

978-1-4577-2015-4

Type

conf

DOI

10.1109/ISM.2011.56

Filename

6123363