
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

2016-06-17
Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra

Abstract

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple novel game-inspired attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.
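The quantitative comparison rests on rank-order correlation between model-generated and human attention maps. Below is a minimal sketch of how such a score could be computed with Spearman's rank correlation; the map sizes, resizing assumption, and function name are illustrative and not the paper's exact evaluation code.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(model_att, human_att):
    """Spearman rank-order correlation between two attention maps.

    Assumes both maps are 2-D arrays over the same spatial grid; in
    practice one map would first be resized to match the other.
    """
    # Flatten both maps and compare the rank ordering of their values.
    rho, _ = spearmanr(model_att.ravel(), human_att.ravel())
    return rho

# Illustrative example with random 14x14 maps (hypothetical sizes).
model_att = np.random.rand(14, 14)
human_att = np.random.rand(14, 14)
print(rank_correlation(model_att, human_att))
```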


URL

https://arxiv.org/abs/1606.05589

PDF

https://arxiv.org/pdf/1606.05589

