A Sentence Is Worth a Thousand Pixels

Author

Fidler, Sanja ; Sharma, Abhishek ; Urtasun, Raquel

Author_Institution

TTI, Chicago, IL, USA

fYear

2013

fDate

23-28 June 2013

Firstpage

1995

Lastpage

2002

Abstract

We are interested in holistic scene understanding where images are accompanied by text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing that reasons jointly about which objects are present in the scene, their spatial extent, and semantic segmentation, and that takes both text and image information as input. We automatically parse the sentences, extract objects and their relationships, and incorporate them into the model both via potentials and by re-ranking candidate detections. We demonstrate the effectiveness of our approach on the challenging UIUC sentences dataset and show segmentation improvements of 12.5% over the visual-only model and detection improvements of 5% AP over deformable part-based models.

Keywords

image segmentation; object detection; text analysis; UIUC sentences dataset; complex sentential descriptions; holistic scene; image information; object extraction; semantic parsing; semantic segmentation; spatial extent; thousand pixels; Boats; Deformable models; Image recognition; Image segmentation; Object detection; Semantics; Visualization; Holistic scene models; Images and text; Scene understanding

fLanguage

English

Publisher

IEEE

Conference_Title

Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on

Conference_Location

Portland, OR

ISSN

1063-6919

Type

conf

DOI

10.1109/CVPR.2013.260

Filename

6619104