Abstract
Visual question answering (VQA) systems emerge from a desire to empower users to ask any natural language question about visual content and receive a valid answer in response. However, close examination of the VQA problem reveals an unavoidable, entangled issue: multiple humans may not always agree on a single answer to a visual question. We train a model to predict automatically, from a visual question alone, whether a crowd would agree on a single answer. We then propose how to exploit this system in a novel application that efficiently allocates human effort to collect answers to visual questions. Specifically, we propose a crowdsourcing system that automatically solicits fewer human responses when answer agreement is expected and more human responses when answer disagreement is expected. Our system improves upon existing crowdsourcing systems, typically eliminating at least 20% of human effort with no loss to the information collected from the crowd.
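The allocation policy described above can be sketched as a simple thresholding rule. This is a minimal illustration, not the paper's implementation: the agreement classifier, the threshold, and the budget sizes (`few`, `many`) are all hypothetical placeholders.

```python
def allocate_answers(p_agree: float, threshold: float = 0.5,
                     few: int = 1, many: int = 5) -> int:
    """Decide how many crowd answers to solicit for one visual question.

    p_agree: predicted probability that the crowd would agree on a
    single answer (output of a hypothetical agreement classifier;
    the paper trains such a model, but its form is not shown here).
    Returns a small budget when agreement is expected, a larger one
    when disagreement is expected.
    """
    return few if p_agree >= threshold else many

# High predicted agreement -> fewer responses; low -> more.
print(allocate_answers(0.9))  # -> 1
print(allocate_answers(0.2))  # -> 5
```

The 20% effort reduction reported in the abstract would come from questions routed to the small budget instead of a fixed large one.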
URL
https://arxiv.org/abs/1608.08188