
VQA: Visual Question Answering

2016-10-27
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

Abstract

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (this http URL).
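
The abstract notes that short open-ended answers make VQA amenable to automatic evaluation. As an illustration, here is a minimal Python sketch of the consensus-style accuracy used for VQA, where a predicted answer scores min(#humans who gave that answer / 3, 1). The sketch assumes a list of human answers is available for each question. The `normalize` helper is a hypothetical, simplified stand-in: the official evaluation script applies further answer normalization and averaging that this sketch omits.

```python
import re


def normalize(answer: str) -> str:
    """Hypothetical simplified normalization: lowercase, drop punctuation,
    collapse whitespace. The official VQA script does more than this."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)
    return " ".join(answer.split())


def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified consensus accuracy: min(#matching human answers / 3, 1)."""
    pred = normalize(predicted)
    matches = sum(1 for a in human_answers if normalize(a) == pred)
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    # Example human answers for a single counting question.
    humans = ["2", "2", "two", "2", "3", "2", "2", "2", "2", "2"]
    print(vqa_accuracy("2", humans))  # 1.0 (eight exact matches, capped at 1)
    print(vqa_accuracy("3", humans))  # one match, so 1/3
```

The cap at three matches means an answer counts as fully correct once enough annotators agree with it, which tolerates some disagreement among human answers. Note that this simplified normalization treats "two" and "2" as different answers.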


URL

https://arxiv.org/abs/1505.00468

PDF

https://arxiv.org/pdf/1505.00468

