Abstract
We propose a novel attention-based deep learning architecture for the visual question answering (VQA) task. Given an image and a natural language question about the image, VQA generates a natural language answer to the question. Producing the correct answer requires the model's attention to focus on the image regions relevant to the question, because different questions inquire about the attributes of different regions. We introduce an attention-based configurable convolutional neural network (ABC-CNN) to learn such question-guided attention. ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question's semantics. We evaluate the ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR, and the VQA dataset, where it achieves significant improvements over state-of-the-art methods. The question-guided attention generated by ABC-CNN is also shown to highlight the regions that are most relevant to the questions.
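To make the core mechanism concrete, here is a minimal PyTorch sketch of a question-configured convolution in the spirit of the abstract: a kernel is derived from the question embedding and convolved with the image feature map to produce a spatial attention map. The module name, dimensions, and linear projection are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionConfiguredAttention(nn.Module):
    """Sketch: derive a conv kernel from the question embedding and
    convolve it with the image feature map to get an attention map."""

    def __init__(self, q_dim=256, feat_channels=512, kernel_size=3):
        super().__init__()
        self.feat_channels = feat_channels
        self.kernel_size = kernel_size
        # Hypothetical projection from question embedding to the weights
        # of a single-output-channel convolutional kernel.
        self.kernel_proj = nn.Linear(
            q_dim, feat_channels * kernel_size * kernel_size)

    def forward(self, image_feats, q_embedding):
        # image_feats: (B, C, H, W) CNN feature map of the image
        # q_embedding: (B, q_dim) sentence embedding of the question
        B, C, H, W = image_feats.shape
        # One configurable kernel per example in the batch.
        kernels = self.kernel_proj(q_embedding).view(
            B, 1, C, self.kernel_size, self.kernel_size)
        maps = []
        for b in range(B):  # convolve each image with its own kernel
            m = F.conv2d(image_feats[b:b + 1], kernels[b],
                         padding=self.kernel_size // 2)  # (1, 1, H, W)
            maps.append(m)
        logits = torch.cat(maps, dim=0).view(B, H * W)
        # Softmax over spatial positions yields the attention map.
        return F.softmax(logits, dim=1).view(B, 1, H, W)
```

The resulting map can then weight the image features before answer prediction; the per-example loop keeps the batch-dependent kernels explicit at the cost of speed.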
URL
https://arxiv.org/abs/1511.05960