Abstract
Much of the recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. We propose here a method of incorporating high-level concepts into the very successful CNN-RNN approach, and show that it achieves a significant improvement over the state of the art in both image captioning and visual question answering. We also show that the same mechanism can be used to introduce external semantic information, and that doing so further improves performance. In the process, we provide an analysis of the value of high-level semantic information in V2L problems.
URL
https://arxiv.org/abs/1506.01144
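The abstract describes inserting an explicit high-level concept representation between the CNN and the RNN, so that the language model is conditioned on predicted semantic attributes rather than on raw image features. Below is a minimal PyTorch sketch of that general idea; the backbone choice, attribute vocabulary size, and decoder dimensions are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
from torchvision import models


class AttributePredictor(nn.Module):
    """Multi-label classifier mapping an image to probabilities over a
    fixed vocabulary of high-level semantic concepts (attributes)."""

    def __init__(self, num_attributes: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any CNN backbone could be used
        backbone.fc = nn.Linear(backbone.fc.in_features, num_attributes)
        self.backbone = backbone

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Sigmoid gives independent per-attribute probabilities.
        return torch.sigmoid(self.backbone(images))


class AttributeConditionedCaptioner(nn.Module):
    """LSTM decoder whose initial state is derived from the attribute
    probability vector instead of raw CNN features."""

    def __init__(self, num_attributes: int, vocab_size: int,
                 embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.init_h = nn.Linear(num_attributes, hidden_dim)
        self.init_c = nn.Linear(num_attributes, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, attribute_probs: torch.Tensor,
                captions: torch.Tensor) -> torch.Tensor:
        # Initialize the LSTM state from the concept probabilities.
        h0 = torch.tanh(self.init_h(attribute_probs)).unsqueeze(0)
        c0 = torch.tanh(self.init_c(attribute_probs)).unsqueeze(0)
        emb = self.embed(captions)            # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))  # (B, T, hidden_dim)
        return self.out(hidden)               # per-step word logits


if __name__ == "__main__":
    # Random tensors, just to show the data flow end to end.
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 1000, (2, 12))
    predictor = AttributePredictor(num_attributes=256)
    captioner = AttributeConditionedCaptioner(256, vocab_size=1000)
    attrs = predictor(images)              # (2, 256) concept probabilities
    logits = captioner(attrs, captions)    # (2, 12, 1000)
    print(logits.shape)
```

The same attribute vector could in principle also be used as input to a question-answering decoder, or be augmented with externally retrieved semantic information, which is the mechanism the abstract alludes to for further gains.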