Abstract
Part of the appeal of Visual Question Answering (VQA) is its promise to answer new questions about previously unseen images. Most current methods demand training questions that illustrate every possible concept, and will therefore never achieve this capability, since the volume of required training data would be prohibitive. Answering general questions about images requires methods capable of Zero-Shot VQA, that is, methods able to answer questions beyond the scope of the training questions. We propose a new evaluation protocol for VQA methods which measures their ability to perform Zero-Shot VQA, and in doing so highlights significant practical deficiencies of current approaches, some of which are masked by the biases in current datasets. We propose and evaluate several strategies for achieving Zero-Shot VQA, including methods based on pretrained word embeddings, object classifiers with semantic embeddings, and test-time retrieval of example images. Our extensive experiments are intended to serve as baselines for Zero-Shot VQA, and they also achieve state-of-the-art performance in the standard VQA evaluation setting.
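One of the strategies named above, answering via pretrained word embeddings, can be illustrated with a minimal sketch. The idea (a generic zero-shot technique, not necessarily the paper's exact method) is that a model predicts a vector in embedding space and the answer is chosen as the nearest candidate, which allows answers that never appeared in the training questions. The tiny 3-d vectors and the `zero_shot_answer` helper below are hypothetical; real systems would use pretrained embeddings such as GloVe or word2vec.

```python
import math

# Toy "pretrained" word embeddings (hypothetical 3-d vectors; real systems
# would use e.g. 300-d GloVe or word2vec vectors).
embeddings = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "zebra": [0.1, 0.9, 0.2],  # an answer with no training examples
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_answer(predicted_vec, candidates):
    """Pick the candidate answer whose embedding is closest to the model's
    predicted vector -- even a candidate unseen during training."""
    return max(candidates, key=lambda w: cosine(embeddings[w], predicted_vec))

# A model trained only on cat/dog questions can still emit a vector that
# lands near "zebra" in embedding space, recovering the unseen answer:
print(zero_shot_answer([0.2, 0.85, 0.25], ["cat", "dog", "zebra"]))  # zebra
```

Because similarity is computed in a shared embedding space rather than over a fixed softmax of training answers, the answer vocabulary can grow at test time without retraining.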
URL
https://arxiv.org/abs/1611.05546