Sum-max video pooling for complex event recognition

Author

Sang Phan; Duy-Dinh Le; Satoh, S.

Author_Institution

Grad. Univ. for Adv. Studies (SOKENDAI), Yokosuka, Japan

fYear

2014

fDate

27-30 Oct. 2014

Firstpage

1026

Lastpage

1030

Abstract

A video can be viewed as a layered structure in which the lowest layer consists of frames, the top layer is the entire video, and the middle layers are sequences of consecutive frames or concatenations of lower layers. While it is easy to find local discriminative features in the lower layers of a video, it is non-trivial to aggregate these features into a discriminative video representation. In the literature, sum pooling is often used and obtains reasonable recognition performance on artificial videos. However, sum pooling does not work well on complex videos because the regions of interest may reside within some middle layers. In this paper, we leverage the layered structure of video to propose a new pooling method, named sum-max video pooling, to handle this problem. Specifically, we apply sum pooling at the low-layer representation and max pooling at the high-layer representation. Sum pooling keeps sufficient relevant features at the low layer, while max pooling retrieves the most relevant features at the high layer and can therefore discard irrelevant features in the final video representation. Experimental results on the TRECVID Multimedia Event Detection 2010 dataset show the effectiveness of our method.
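The pooling scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sum_max_pool`, the fixed-length non-overlapping segments, and the `segment_len` parameter are assumptions made for illustration, since the abstract does not specify how the middle layers are formed.

```python
import numpy as np


def sum_max_pool(frame_features, segment_len):
    """Illustrative sum-max pooling (hypothetical helper, not from the paper).

    frame_features: array of shape (n_frames, n_dims), one feature vector per frame.
    segment_len: assumed number of consecutive frames per low-layer segment.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    if frame_features.ndim != 2 or frame_features.shape[0] == 0:
        raise ValueError("frame_features must be a non-empty 2-D array")
    if segment_len < 1:
        raise ValueError("segment_len must be at least 1")

    n_frames = frame_features.shape[0]
    # Low layer: sum pooling over each run of consecutive frames.
    # The last segment may be shorter than segment_len.
    segment_sums = np.stack([
        frame_features[start:start + segment_len].sum(axis=0)
        for start in range(0, n_frames, segment_len)
    ])
    # High layer: element-wise max pooling across the segment representations.
    return segment_sums.max(axis=0)


# Toy example: 5 frames, 2 feature dimensions, segments of 2 frames.
features = [[1, 0], [1, 0], [0, 3], [0, 1], [5, 0]]
video_repr = sum_max_pool(features, segment_len=2)
```

In this sketch, setting `segment_len` to the number of frames reduces to plain sum pooling over the whole video, and setting it to 1 reduces to max pooling over individual frames.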

Keywords

image recognition; image representation; video signal processing; TRECVID Multimedia Event Detection 2010 dataset; artificial videos; complex event recognition; complex videos; discriminative video representation; feature aggregation; frame sequences; high-layer representation; layered structure; local discriminative features; low-layer representation; lower-layer concatenation; middle layers; recognition performance; region of interests; sum-max video pooling; top layer; video representation; Aggregates; Computer vision; Event detection; Feature extraction; Multimedia communication; Noise measurement; Visualization; max-pooling; multimedia event detection; sum-max video pooling; sum-pooling; video representation;

fLanguage

English

Publisher

IEEE

Conference_Titel

Image Processing (ICIP), 2014 IEEE International Conference on

Conference_Location

Paris

Type

conf

DOI

10.1109/ICIP.2014.7025204

Filename

7025204