• DocumentCode
    70219
  • Title
    STAP: Spatial-Temporal Attention-Aware Pooling for Action Recognition
  • Author
    Nguyen, Troy V. ; Zheng Song ; Shuicheng Yan
  • Author_Institution
    Dept. for Technol., Innovation & Enterprise, Singapore Polytech., Singapore, Singapore
  • Volume
    25
  • Issue
    1
  • fYear
    2015
  • fDate
    Jan. 2015
  • Firstpage
    77
  • Lastpage
    86
  • Abstract
    Human action recognition is valuable for numerous practical applications, e.g., gaming, video surveillance, and video search. In this paper, we hypothesize that action classification can be improved by designing a smart feature pooling strategy under the prevalent bag-of-words-based representation. Building on automatic video saliency analysis, we propose the spatial-temporal attention-aware pooling (STAP) scheme for feature pooling. First, video saliency is predicted using a video saliency model. The localized spatial-temporal features are then pooled at different saliency levels to form video-saliency-guided channels. Saliency-aware matching kernels are then derived as the similarity measure for these channels. Intuitively, the proposed kernels compute the similarities of the video foreground (salient areas) or background (nonsalient areas) at different levels. Finally, the kernels are fed into support vector machines for action classification. Extensive experiments on three popular data sets for action classification validate the effectiveness of the proposed method, which outperforms state-of-the-art methods with 95.3% on UCF Sports (better by 4.0%) and 87.9% on the YouTube data set (better by 2.5%), and achieves comparable results on the Hollywood2 data set.
  • Keywords
    gesture recognition; image classification; support vector machines; Hollywood2 dataset; STAP scheme; YouTube data set; action classification; automatic video saliency analysis; bag-of-words-based representation; gaming; human action recognition; localized spatial-temporal features; saliency-aware matching kernels; smart feature pooling strategy; spatial-temporal attention-aware pooling scheme; video background; video foreground; video search; video surveillance; video-saliency-guided channels; Computational modeling; Feature extraction; Kernel; Predictive models; Visualization; YouTube; Action recognition; Feature pooling; Visual attention
  • fLanguage
    English
  • Journal_Title
    Circuits and Systems for Video Technology, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1051-8215
  • Type
    jour
  • DOI
    10.1109/TCSVT.2014.2333151
  • Filename
    6844027