Abstract
Recently, a number of deep-learning-based models have been proposed for the task of Visual Question Answering (VQA). The performance of most of these models clusters around 60-70%. In this paper we propose systematic methods to analyze the behavior of these models as a first step towards recognizing their strengths and weaknesses, and towards identifying the most fruitful directions for progress. We analyze two models, one each from the two major classes of VQA models (with attention and without attention), and show the similarities and differences in their behavior. We also analyze the winning entry of the VQA Challenge 2016. Our behavior analysis reveals that despite recent progress, today's VQA models are "myopic" (tend to fail on sufficiently novel instances), often "jump to conclusions" (converge on a predicted answer after 'listening' to just half the question), and are "stubborn" (do not change their answers across images).
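The "jump to conclusions" finding suggests a simple probing procedure: feed a model progressively longer question prefixes and record the point at which its predicted answer first matches the answer it gives for the full question. Below is a minimal sketch of such a probe; the `vqa_model.predict(image, question)` interface is an assumption for illustration, not the authors' released code, and a faithful reproduction would follow the paper's exact protocol.

```python
def convergence_point(vqa_model, image, question):
    """Return the fraction of question words after which the model's
    predicted answer first matches its full-question answer.

    Assumes a hypothetical interface vqa_model.predict(image, question)
    that returns a single answer string.
    """
    words = question.split()
    full_answer = vqa_model.predict(image, question)
    for k in range(1, len(words) + 1):
        prefix = " ".join(words[:k])
        if vqa_model.predict(image, prefix) == full_answer:
            # Model already produces its final answer from this prefix.
            return k / len(words)
    return 1.0
```

Averaging this fraction over a dataset would quantify how early a model commits to an answer; a value near 0.5 would correspond to the paper's observation that models often converge after hearing only half the question.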
URL
https://arxiv.org/abs/1606.07356