Designing a multimodal corpus of audio-visual speech using a high-speed camera

Author

Karpov, Aleksey ; Ronzhin, Anatoly ; Kipyatkova, Irina

Author_Institution

Speech & Multimodal Interfaces Lab., St. Petersburg Inst. for Inf. & Autom., St. Petersburg, Russia

Volume

1

fYear

2012

fDate

21-25 Oct. 2012

Firstpage

519

Lastpage

522

Abstract

This paper presents research on designing and processing an audio-visual speech database for an automatic Russian speech recognition system, recorded with an Oktava MK-012 microphone and a JAI Pulnix RMC-6740GE high-speed camera (200 frames per second). We describe the audio-visual speech recording system we developed, which synchronizes and fuses audio and video data recorded by independent sensors. The system automatically detects voice activity in the audio signal and stores only speech fragments, discarding non-informative signals. It also accounts for the natural asynchrony between the two speech modalities. We present methods for extracting acoustic features (based on Mel-frequency cepstral coefficients) and visual speech features (pixel-based features of the mouth region), and for temporal segmentation of the multimodal data (by forced alignment).
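
The abstract does not say how voice activity is detected or how audio is matched to the 200 fps video, so the following is only a minimal sketch under stated assumptions: a short-time energy threshold stands in for the detector, and the 16 kHz sample rate, frame length, threshold, and the function names `detect_speech_segments` and `to_video_frames` are all hypothetical choices made for illustration, not details from the paper.

```python
from typing import List, Sequence, Tuple

VIDEO_FPS = 200        # camera frame rate stated in the abstract
SAMPLE_RATE = 16000    # assumed audio sample rate (not given in the abstract)


def detect_speech_segments(
    samples: Sequence[float],
    frame_len: int = 160,      # assumed 10 ms analysis frame at 16 kHz
    threshold: float = 0.01,   # assumed mean-square energy threshold
    min_frames: int = 2,       # ignore runs shorter than this many frames
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of high-energy frames."""
    segments = []
    run_start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold:
            if run_start is None:
                run_start = i          # a speech run begins here
        elif run_start is not None:
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len, i * frame_len))
            run_start = None           # the run ended on this quiet frame
    # Close a run that lasts until the end of the signal.
    if run_start is not None and n_frames - run_start >= min_frames:
        segments.append((run_start * frame_len, n_frames * frame_len))
    return segments


def to_video_frames(segment: Tuple[int, int]) -> Tuple[int, int]:
    """Map an audio sample range to the video frame range that covers it."""
    start, end = segment
    first = start * VIDEO_FPS // SAMPLE_RATE
    last = -(-end * VIDEO_FPS // SAMPLE_RATE)  # ceiling division
    return first, last


# Synthetic signal: 20 ms of silence, 30 ms of "speech", 10 ms of silence.
signal = [0.0] * 320 + [0.5] * 480 + [0.0] * 160
speech = detect_speech_segments(signal)
video_ranges = [to_video_frames(seg) for seg in speech]
```

Under these assumptions one video frame spans 5 ms, or 80 audio samples, so keeping only the detected segments means keeping the matching video frame ranges as well. A real recording system would also need to handle the asynchrony between modalities that the abstract mentions, which this sketch does not attempt.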

Keywords

audio databases; image sensors; speech recognition; JAI Pulnix RMC-6740GE; Mel frequency cepstral coefficients; Oktava MK-012 microphone; Russian speech recognition system; audio signal; audio visual speech database; high speed camera; independent sensors; informative signals; mouth region; multimodal corpus design; multimodal data temporal segmentation; pixel based features; speech fragments; voice activity; audio-visual speech; automatic speech recognition; computer vision; high-speed camera; multimodal system;

fLanguage

English

Publisher

IEEE

Conference_Titel

Signal Processing (ICSP), 2012 IEEE 11th International Conference on

Conference_Location

Beijing

ISSN

2164-5221

Print_ISBN

978-1-4673-2196-9

Type

conf

DOI

10.1109/ICoSP.2012.6491539

Filename

6491539