Abstract
We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple novel, game-inspired attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Using these interfaces, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.
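The quantitative comparison via rank-order correlation can be illustrated with a minimal sketch: assuming both the machine-generated and human attention maps have been resampled to the same spatial resolution, one can flatten them and compute Spearman's rank correlation with SciPy. The function name `rank_correlation` and the toy arrays below are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(machine_map, human_map):
    """Spearman rank-order correlation between two attention maps.

    Sketch under the assumption that both maps are 2-D arrays over the
    same image grid; we flatten them and compare the rank ordering of
    attention mass rather than the raw values.
    """
    m = np.asarray(machine_map, dtype=float).ravel()
    h = np.asarray(human_map, dtype=float).ravel()
    rho, _ = spearmanr(m, h)
    return rho

# Toy example: a machine map whose rank ordering mostly agrees
# with the human map (yields rho = 0.8, not 1.0, because two
# cells are ranked in the opposite order).
human = np.array([[0.0, 0.1], [0.2, 0.7]])
machine = np.array([[0.05, 0.15], [0.1, 0.7]])
print(rank_correlation(machine, human))
```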
URL
https://arxiv.org/abs/1606.05589