Title :
Selection and context for action recognition
Author :
Han, Dong ; Bo, Liefeng ; Sminchisescu, Cristian
Author_Institution :
Univ. of Bonn, Bonn, Germany
fDate :
Sept. 29 2009-Oct. 2 2009
Abstract :
Recognizing human action in non-instrumented video is challenging not only because of the variability produced by general scene factors such as illumination, background, occlusion, or intra-class variability, but also because of subtle behavioral patterns among interacting people or between people and objects in images. To improve recognition, a system may need to use not only low-level spatio-temporal video correlations but also relational descriptors between people and objects in the scene. In this paper we present contextual scene descriptors and Bayesian multiple kernel learning methods for recognizing human action in complex non-instrumented video. Our contribution is threefold: (1) we introduce bag-of-detector scene descriptors that encode presence/absence and structural relations between object parts; (2) we derive a novel Bayesian classification method based on Gaussian processes with multiple kernel covariance functions (MKGPC), in order to automatically select and weight multiple features, both low-level and high-level, out of a large collection in a principled way; and (3) we perform a large-scale evaluation using a variety of features on the KTH dataset and a recently introduced, challenging Hollywood movie dataset. On the KTH dataset, we obtain 94.1% accuracy, the best result reported to date. On the Hollywood dataset, we obtain promising results in several action classes using fewer descriptors, and an improvement of about 9.1% on a previous benchmark test.
Keywords :
Bayes methods; Gaussian processes; covariance analysis; image classification; image coding; image motion analysis; learning (artificial intelligence); video signal processing; Bayesian classification method; Bayesian multiple kernel learning method; Gaussian process; Hollywood movie dataset; KTH dataset; absence encoding; bag-of-detector scene descriptors; contextual scene descriptors; human action recognition; low-level spatio-temporal video correlations; multiple kernel covariance function; noninstrumented video; presence encoding; relational descriptors; structural relation encoding; Bayesian methods; Humans; Image recognition; Kernel; Layout; Learning systems; Lighting; Pattern recognition; Performance evaluation
Conference_Title :
Computer Vision, 2009 IEEE 12th International Conference on
Conference_Location :
Kyoto
Print_ISBN :
978-1-4244-4420-5
Electronic_ISBN :
1550-5499
DOI :
10.1109/ICCV.2009.5459427