Abstract
We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question with CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it performs comparably to many recent approaches that use recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.
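The concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the toy vocabulary, the answer list, the 8-dimensional random vector standing in for CNN image features, the untrained random weights, and the function names `bow_vector` and `predict_answer` are all assumptions made for illustration. The sketch also uses raw word counts as the word features, which may differ from the features used in the paper.

```python
import numpy as np

def bow_vector(question, vocab):
    """Count how often each vocabulary word occurs in the question."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for token in question.lower().replace("?", " ").split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec

def predict_answer(question, image_feat, vocab, W, b, answers):
    """Concatenate question and image features, then apply a softmax classifier."""
    x = np.concatenate([bow_vector(question, vocab), image_feat])
    logits = W @ x + b
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return answers[int(np.argmax(probs))], probs

# Hypothetical toy setup: tiny vocabulary, four candidate answers.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
answers = ["black", "white", "yes", "no"]

rng = np.random.default_rng(0)
image_feat = rng.standard_normal(8).astype(np.float32)  # stand-in for CNN features
W = rng.standard_normal((len(answers), len(vocab) + 8)).astype(np.float32)  # untrained
b = np.zeros(len(answers), dtype=np.float32)

answer, probs = predict_answer("What color is the cat?", image_feat, vocab, W, b, answers)
print(answer, probs)
```

Because the weights here are random, the predicted answer is meaningless; in a real system `W` and `b` would be learned from question, image, and answer triples, and `image_feat` would come from a pretrained CNN.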
URL
https://arxiv.org/abs/1512.02167