• DocumentCode
    3427206
  • Title
    Pedestrian Parsing via Deep Decompositional Network
  • Author
    Ping Luo ; Xiaogang Wang ; Xiaoou Tang
  • Author_Institution
    Dept. of Inf. Eng., Chinese Univ. of Hong Kong, Hong Kong, China
  • fYear
    2013
  • fDate
    1-8 Dec. 2013
  • Firstpage
    2648
  • Lastpage
    2655
  • Abstract
    We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which can accurately estimate complex pose variations and is robust to occlusions and background clutter. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is a large-scale benchmark dataset for human parsing that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.
  • Keywords
    image segmentation; neural nets; pedestrians; video surveillance; Bayesian inference; DDN; binary mask; completion layers; decomposition layers; deep decompositional network; low-level visual features; occlusion estimation layers; pedestrian image parsing; stochastic gradient descent; template matching; Clutter; Estimation; Noise; Shape; Training; Transforms; Vectors; deep learning; pedestrian parsing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Computer Vision (ICCV), 2013 IEEE International Conference on
  • Conference_Location
    Sydney, NSW
  • ISSN
    1550-5499
  • Type
    conf
  • DOI
    10.1109/ICCV.2013.329
  • Filename
    6751440
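
The three-stage pipeline described in the abstract (occlusion estimation, completion, decomposition) can be illustrated with a toy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the layer sizes, the random weights, the single sigmoid layer per stage, the convention that a mask value of 1 means "visible", and the helper names `dense`, `random_layer`, and `ddn_forward` are all hypothetical. Pre-training and fine-tuning are omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, bias):
    """Fully connected layer followed by a sigmoid: sigmoid(W x + b)."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def random_layer(rng, n_out, n_in):
    """Hypothetical untrained layer: small random weights, zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def ddn_forward(x, params, num_parts):
    # 1. Occlusion estimation: predict a binary mask over the features.
    #    Assumption: 1 marks a visible feature, 0 an occluded one.
    soft_mask = dense(x, *params["occlusion"])
    mask = [1.0 if m > 0.5 else 0.0 for m in soft_mask]

    # 2. Completion: synthesize features for the occluded part from the
    #    original features and the mask. Assumption: the stage sees the
    #    masked features concatenated with the mask itself.
    masked = [v * m for v, m in zip(x, mask)]
    completed = dense(masked + mask, *params["completion"])

    # 3. Decomposition: map completed features to one label map per part.
    flat = dense(completed, *params["decomposition"])
    n_pixels = len(flat) // num_parts
    label_maps = [flat[i * n_pixels:(i + 1) * n_pixels]
                  for i in range(num_parts)]
    return mask, completed, label_maps


rng = random.Random(0)
n_features, n_pixels, num_parts = 8, 6, 5  # toy sizes, chosen arbitrarily
params = {
    "occlusion": random_layer(rng, n_features, n_features),
    "completion": random_layer(rng, n_features, 2 * n_features),
    "decomposition": random_layer(rng, num_parts * n_pixels, n_features),
}
x = [rng.random() for _ in range(n_features)]
mask, completed, label_maps = ddn_forward(x, params, num_parts)
print(len(mask), len(completed), len(label_maps), len(label_maps[0]))
```

Each entry of `label_maps` holds per-pixel scores in [0, 1] for one body part. With untrained weights the values are meaningless; the sketch only shows how data flows through the three stages.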