• DocumentCode
    3748733
  • Title

    Multimodal Convolutional Neural Networks for Matching Image and Sentence

  • Author

    Lin Ma;Zhengdong Lu;Lifeng Shang;Hang Li

  • Author_Institution
    Noah's Ark Lab., Huawei Technol., Hong Kong, China
  • fYear
    2015
  • Firstpage
    2623
  • Lastpage
    2631
  • Abstract
    In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN modeling the joint representation of image and sentence. The matching CNN composes different semantic fragments from words and learns the inter-modal relations between the image and the composed fragments at different levels, thus fully exploiting the matching relations between image and sentence. Experimental results demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. More specifically, our proposed m-CNNs significantly outperform the state-of-the-art approaches for bidirectional image and sentence retrieval on the Flickr8K and Flickr30K datasets.
  • Keywords
    "Convolution","Image representation","Semantics","Neural networks","Computer architecture","Natural languages","Grounding"
  • Publisher
    ieee
  • Conference_Titel
    Computer Vision (ICCV), 2015 IEEE International Conference on
  • Electronic_ISBN
    2380-7504
  • Type

    conf

  • DOI
    10.1109/ICCV.2015.301
  • Filename
    7410658