DocumentCode
1843177
Title
Designing a multimodal corpus of audio-visual speech using a high-speed camera
Author
Karpov, Aleksey ; Ronzhin, Anatoly ; Kipyatkova, Irina
Author_Institution
Speech & Multimodal Interfaces Lab., St. Petersburg Inst. for Inf. & Autom., St. Petersburg, Russia
Volume
1
fYear
2012
fDate
21-25 Oct. 2012
Firstpage
519
Lastpage
522
Abstract
In this paper, we present research on designing and processing an audio-visual speech database for an automatic Russian speech recognition system, recorded with an Oktava MK-012 microphone and a JAI Pulnix RMC-6740GE high-speed camera (200 frames per second). We describe the developed audio-visual speech recording system, which provides synchronization and fusion of audio and video data recorded by the independent sensors. The system automatically detects voice activity in the audio signal and stores only speech fragments, discarding non-informative signals. It also takes into account and processes the natural asynchrony of the two speech modalities. We present methods for acoustic feature extraction (based on Mel-frequency cepstral coefficients), visual speech feature extraction (pixel-based features of the mouth region), and temporal segmentation of the multimodal data (by forced alignment).
Keywords
audio databases; image sensors; speech recognition; JAI Pulnix RMC-6740GE; Mel frequency cepstral coefficients; Oktava MK-012 microphone; Russian speech recognition system; audio signal; audio visual speech database; high speed camera; independent sensors; informative signals; mouth region; multimodal corpus design; multimodal data temporal segmentation; pixel based features; speech fragments; voice activity; audio-visual speech; automatic speech recognition; computer vision; high-speed camera; multimodal system;
fLanguage
English
Publisher
IEEE
Conference_Titel
Signal Processing (ICSP), 2012 IEEE 11th International Conference on
Conference_Location
Beijing
ISSN
2164-5221
Print_ISBN
978-1-4673-2196-9
Type
conf
DOI
10.1109/ICoSP.2012.6491539
Filename
6491539