Abstract
We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method shows significant improvements on questions such as “what color,” where a specific location must be evaluated, and “what room,” where it selectively identifies informative image regions. We evaluate our model on the VQA dataset, which is, to our knowledge, the largest human-annotated visual question answering dataset.
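The region-selection idea can be illustrated with a small attention sketch: score each candidate region’s feature vector against an embedding of the question, normalize the scores into weights, and pool the region features by those weights. This is only a minimal sketch of that general mechanism, not the paper’s exact architecture; the bilinear scoring matrix `W`, the feature dimensions, and the `attend_to_regions` helper below are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend_to_regions(region_feats, query_emb, W):
    """Score each image region against the text query and pool features.

    region_feats: (R, D_img) array, one feature vector per candidate region.
    query_emb:    (D_txt,) embedding of the question text.
    W:            (D_img, D_txt) scoring matrix (hypothetical; a trained
                  model would learn this parameterization).
    Returns the attention weights over regions and the weighted image feature.
    """
    scores = region_feats @ W @ query_emb   # relevance of each region to the query
    weights = softmax(scores)               # distribution over candidate regions
    pooled = weights @ region_feats         # attention-weighted sum of region features
    return weights, pooled

# Toy usage with random vectors standing in for real CNN / text embeddings.
rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 16))  # 5 candidate regions, 16-D features
query = rng.standard_normal(8)          # 8-D question embedding
W = rng.standard_normal((16, 8)) * 0.1
weights, pooled = attend_to_regions(regions, query, W)
print(weights.round(3), pooled.shape)
```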
URL
https://arxiv.org/abs/1511.07394