Abstract
As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks at which humans excel but machines find difficult. However, an ideal task should also be easy to evaluate and not easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering, which tests a machine's ability to reason about language and vision. We describe a dataset of unprecedented size created for this task, containing over 760,000 human-generated questions about images. Using around 10 million human-generated answers, machines can be easily evaluated.
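The evaluation claim rests on the roughly ten human answers collected per question: a machine's answer can be scored by consensus rather than by a single ground-truth string. Below is a minimal sketch of such consensus scoring, assuming the standard VQA accuracy rule min(#agreeing humans / 3, 1); the function name and the string normalization are illustrative, not the authors' released evaluation code.

```python
def vqa_accuracy(machine_answer: str, human_answers: list[str]) -> float:
    """Score one machine answer against the set of human answers
    for the same question (assumed ~10 answers per question)."""
    normalize = lambda s: s.strip().lower()
    matches = sum(normalize(a) == normalize(machine_answer)
                  for a in human_answers)
    # An answer that agrees with at least 3 humans counts as fully correct;
    # partial agreement earns partial credit.
    return min(matches / 3.0, 1.0)

if __name__ == "__main__":
    humans = ["red", "red", "red", "dark red", "red",
              "red", "maroon", "red", "red", "red"]
    print(vqa_accuracy("red", humans))     # 1.0
    print(vqa_accuracy("maroon", humans))  # ~0.33
```

Because multiple plausible answers each receive graded credit, the metric is cheap to compute yet hard to game with a single memorized response, which is the property the abstract highlights.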
URL
https://arxiv.org/abs/1608.08716