DocumentCode
1763742
Title
Multilevel Depth and Image Fusion for Human Activity Detection
Author
Bingbing Ni ; Yong Pei ; Moulin, Philippe ; Shuicheng Yan
Author_Institution
Adv. Digital Sci. Center, Singapore, Singapore
Volume
43
Issue
5
fYear
2013
fDate
Oct. 2013
Firstpage
1383
Lastpage
1394
Abstract
Recognizing complex human activities usually requires the detection and modeling of individual visual features and the interactions between them. Current methods rely only on visual features extracted from 2-D images, which often leads to unreliable detection of salient visual features and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining data from a conventional camera and a depth sensor (e.g., Microsoft Kinect). We propose a novel complex activity recognition and localization framework that effectively fuses information from both grayscale and depth image channels at multiple levels of the video processing pipeline. At the individual visual feature detection level, depth-based filters are applied to the detected human/object rectangles to remove false detections. At the next level, interaction modeling, 3-D spatial and temporal contexts among human subjects or objects are extracted by integrating information from both grayscale and depth images. Depth information is also used to distinguish different types of indoor scenes. Finally, a latent structural model is developed to integrate the information from multiple levels of video processing for activity detection. Extensive experiments on two activity recognition benchmarks (one with depth information) and a challenging grayscale + depth human activity database that contains complex human-human, human-object, and human-surroundings interactions demonstrate the effectiveness of the proposed multilevel grayscale + depth fusion scheme. Higher recognition and localization accuracies are obtained relative to previous methods.
Keywords
feature extraction; gesture recognition; image fusion; object recognition; pipeline processing; spatiotemporal phenomena; visual databases; 2D image visual feature extraction; 3D spatial contexts; 3D temporal contexts; Microsoft Kinect; complex activity localization framework; complex human activity detection; depth image channels; depth sensor; depth-based filters; grayscale image channels; grayscale+depth human activity database; human-human interactions; human-object interactions; human-object rectangle detection; human-surrounding interactions; individual visual feature detection; individual visual feature modeling; indoor scenes; latent structural model; multilevel depth fusion; multilevel grayscale+depth fusion scheme; salient visual feature detection; video processing pipeline; Accuracy; Context modeling; Feature extraction; Gray-scale; Image recognition; Joints; Visualization; Action recognition and localization; spatial and temporal context; Actigraphy; Algorithms; Artificial Intelligence; Computer Peripherals; Computer Simulation; Computer Systems; Humans; Image Enhancement; Imaging, Three-Dimensional; Pattern Recognition, Automated; Subtraction Technique; Transducers; Video Games; Whole Body Imaging
fLanguage
English
Journal_Title
Cybernetics, IEEE Transactions on
Publisher
IEEE
ISSN
2168-2267
Type
jour
DOI
10.1109/TCYB.2013.2276433
Filename
6587272