Abstract
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question about it in natural language, the task requires reasoning over the visual elements of the image and over general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by the mechanism they use to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions into a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. These datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs of the Visual Genome project, and evaluate the relevance of its structured image annotations with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
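To make the joint-embedding approach mentioned in the abstract concrete, the following is a minimal sketch of a CNN+RNN VQA model that maps both modalities into a common feature space and classifies over a fixed answer vocabulary. It is an illustration only, not the survey's reference implementation: the layer sizes, the element-wise-product fusion, and the class name JointEmbeddingVQA are our own illustrative assumptions, and the pooled CNN image features are assumed to be precomputed elsewhere (e.g., by a pretrained ResNet).

```python
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Sketch of the common CNN+RNN joint-embedding approach to VQA:
    image and question features are projected into a common space,
    fused, and classified over a fixed set of candidate answers.
    All dimensions below are illustrative, not taken from the survey."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048, common_dim=1024, num_answers=1000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.question_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Linear projections that map each modality into the common space.
        self.img_proj = nn.Linear(img_feat_dim, common_dim)
        self.q_proj = nn.Linear(hidden_dim, common_dim)
        self.classifier = nn.Linear(common_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_feat_dim) pooled CNN features, assumed to be
        # precomputed by a pretrained image encoder outside this sketch.
        # question_tokens: (B, T) integer word indices.
        _, (h_n, _) = self.question_rnn(self.word_embed(question_tokens))
        q = torch.tanh(self.q_proj(h_n[-1]))       # (B, common_dim)
        v = torch.tanh(self.img_proj(img_feats))   # (B, common_dim)
        fused = q * v                               # element-wise fusion
        return self.classifier(fused)               # answer logits
```

Element-wise multiplication is only one possible fusion operator; concatenation followed by a fully connected layer, or bilinear pooling, fits the same skeleton.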
URL
https://arxiv.org/abs/1607.05910