
Leveraging Visual Question Answering for Image-Caption Ranking

2016-08-31
Xiao Lin, Devi Parikh

Abstract

Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a “feature extraction” module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
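
To make the score-level fusion idea concrete, here is a minimal sketch, not the paper's actual model. It assumes a VQA model has already produced, for a bank of N question-answer probes, the plausibility of each answer given the image and given the caption; the function names, the cosine-similarity consistency term, and the mixing weight alpha are all illustrative assumptions (the paper learns its fusion rather than hand-picking a blend).

import numpy as np

# Each of N question-answer pairs acts as one feature dimension:
#   p_img[k] ~ plausibility of answer_k given the image   (from a VQA model)
#   p_cap[k] ~ plausibility of answer_k given the caption (from a VQA model)
# All names below are hypothetical, for illustration only.

def vqa_consistency_score(p_img: np.ndarray, p_cap: np.ndarray) -> float:
    """Agreement between image- and caption-grounded answer beliefs.
    Cosine similarity keeps the term on a scale comparable to base_score."""
    denom = np.linalg.norm(p_img) * np.linalg.norm(p_cap) + 1e-8
    return float(np.dot(p_img, p_cap) / denom)

def fused_score(base_score: float, p_img: np.ndarray, p_cap: np.ndarray,
                alpha: float = 0.7) -> float:
    """Score-level fusion: blend a VQA-agnostic ranking score with the
    VQA consistency term. alpha is an assumed mixing weight that would
    be tuned on validation data."""
    return alpha * base_score + (1.0 - alpha) * vqa_consistency_score(p_img, p_cap)

# Toy usage: re-rank two candidate captions for one image.
rng = np.random.default_rng(0)
p_img = rng.random(100)  # image's beliefs over 100 QA probes
candidates = {
    "a dog catching a frisbee": (0.82, np.clip(p_img + 0.05 * rng.standard_normal(100), 0, 1)),
    "a plate of pasta":         (0.80, rng.random(100)),
}
ranked = sorted(candidates,
                key=lambda c: fused_score(candidates[c][0], p_img, candidates[c][1]),
                reverse=True)
print(ranked)  # the caption whose imagined facts agree with the image ranks first

The representation-level variant described in the abstract would instead concatenate (or otherwise combine) the VQA feature vectors with the base model's embeddings before scoring, letting the ranking model learn how to weigh the QA-probe dimensions.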


URL

https://arxiv.org/abs/1605.01379

PDF

https://arxiv.org/pdf/1605.01379

