Author_Institution :
Beijing Lab. of Intell. Inf. Technol., Beijing Inst. of Technol., Beijing, China
Abstract :
In action recognition, objects and scenes provide a rich source of contextual information for analyzing human actions, since actions often occur in particular scene settings with certain related objects. We therefore exploit contextual objects and scenes to improve action recognition performance. Specifically, a latent structural SVM is introduced to model the co-occurrence relationships among action, object and scene, in which the object class label and the scene class label are treated as latent variables. Using this framework, we can simultaneously predict action, object and scene class labels. Moreover, we use a mid-level discriminative feature to describe action, object and scene information separately. The feature is a set of decision values from the pre-learned classifiers of each class, measuring the likelihood that the input video belongs to the corresponding class. In this paper, we use SVMs as the action and scene pre-learned classifiers, and a deformable part-based object detector as the object pre-learned classifier, so that object location is obtained as a by-product. Experimental results on the UCF Sports, YouTube and UCF50 datasets demonstrate the effectiveness of the proposed approach.
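The joint prediction described in the abstract can be illustrated with a minimal sketch. It is not the paper's actual model: the additive scoring function, the co-occurrence matrices `w_ao`, `w_as`, `w_os`, and the function name `joint_inference` are all assumptions made for illustration. The inputs `phi_a`, `phi_o` and `phi_s` stand in for the mid-level features, that is, the per-class decision values from the pre-learned classifiers. Learning the weights (the latent structural SVM training step) is not shown.

```python
import numpy as np


def joint_inference(phi_a, phi_o, phi_s, w_ao, w_as, w_os):
    """Pick the (action, object, scene) labels with the highest joint score.

    Assumed scoring function (hypothetical, for illustration only):
        score(a, o, s) = phi_a[a] + phi_o[o] + phi_s[s]
                         + w_ao[a, o] + w_as[a, s] + w_os[o, s]
    phi_* are per-class decision values; w_* are pairwise co-occurrence weights.
    """
    # Broadcast every term to a dense (A, O, S) score tensor.
    score = (
        phi_a[:, None, None]
        + phi_o[None, :, None]
        + phi_s[None, None, :]
        + w_ao[:, :, None]
        + w_as[:, None, :]
        + w_os[None, :, :]
    )
    # Exhaustive search over all label triples, including the latent ones.
    a, o, s = np.unravel_index(np.argmax(score), score.shape)
    return int(a), int(o), int(s), float(score[a, o, s])


# Toy example: 2 action, 2 object and 2 scene classes (made-up numbers).
phi_a = np.array([0.2, 0.1])   # action classifier decision values
phi_o = np.array([0.0, 0.5])   # object detector decision values
phi_s = np.array([0.3, 0.0])   # scene classifier decision values
w_ao = np.zeros((2, 2))
w_ao[1, 1] = 1.0               # action 1 tends to co-occur with object 1
w_as = np.zeros((2, 2))
w_os = np.zeros((2, 2))

action, obj, scene, best = joint_inference(phi_a, phi_o, phi_s, w_ao, w_as, w_os)
```

In this toy setting the action classifier alone would favor action 0, but the co-occurrence weight between action 1 and the strongly detected object 1 changes the joint decision to action 1. Exhaustive enumeration is only practical for small label sets.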
Keywords :
image classification; image motion analysis; image recognition; object detection; support vector machines; video signal processing; UCF Sports; UCF50 datasets; YouTube; action pre-learned classifiers; action recognition; deformable part-based object detector; latent structural SVM; object pre-learned classifier; scene pre-learned classifiers; Accuracy; Context; Context modeling; Correlation; Feature extraction; Training; LSSVM; scene recognition