Abstract
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given the previous words and an image, and image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieve significant performance improvements over state-of-the-art methods that directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html
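The generation procedure described in the abstract (predict a distribution over the next word from the previous words and the image, then sample) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ToyMRNN`, the layer sizes, the choice of activations, the random untrained weights, and the hand-written image feature vector (standing in for the output of a convolutional network) are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMRNN:
    """Hypothetical, untrained stand-in for an m-RNN-style captioner."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=8, image_dim=4, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.hidden_dim = hidden_dim
        self.embed = rand_matrix(len(vocab), embed_dim, rng)  # one row per word
        self.w_in = rand_matrix(hidden_dim, embed_dim, rng)    # word -> recurrent
        self.w_rec = rand_matrix(hidden_dim, hidden_dim, rng)  # recurrent -> recurrent
        self.v_word = rand_matrix(hidden_dim, embed_dim, rng)  # word -> multimodal
        self.v_rec = rand_matrix(hidden_dim, hidden_dim, rng)  # recurrent -> multimodal
        self.v_img = rand_matrix(hidden_dim, image_dim, rng)   # image -> multimodal
        self.w_out = rand_matrix(len(vocab), hidden_dim, rng)  # multimodal -> vocab

    def step(self, word_id, state, image_feat):
        """Return P(next word | previous words, image) and the new state."""
        word = self.embed[word_id]
        # Recurrent layer: summarizes the words seen so far.
        new_state = [
            max(0.0, a + b)
            for a, b in zip(matvec(self.w_in, word), matvec(self.w_rec, state))
        ]
        # Multimodal layer: where the sentence and image sub-networks meet.
        fused = [
            math.tanh(a + b + c)
            for a, b, c in zip(
                matvec(self.v_word, word),
                matvec(self.v_rec, new_state),
                matvec(self.v_img, image_feat),
            )
        ]
        return softmax(matvec(self.w_out, fused)), new_state

    def sample_caption(self, image_feat, max_len=10, seed=0):
        """Sample words one at a time until <end> or max_len is reached."""
        rng = random.Random(seed)
        end_id = self.vocab.index("<end>")
        word_id = self.vocab.index("<start>")
        state = [0.0] * self.hidden_dim
        caption = []
        for _ in range(max_len):
            probs, state = self.step(word_id, state, image_feat)
            word_id = rng.choices(range(len(self.vocab)), weights=probs, k=1)[0]
            if word_id == end_id:
                break
            caption.append(self.vocab[word_id])
        return caption


if __name__ == "__main__":
    model = ToyMRNN(["<start>", "<end>", "a", "dog", "runs"])
    # Placeholder image feature; a real system would use CNN activations.
    print(model.sample_caption([0.5, 0.1, 0.0, 0.9], max_len=6))
```

Because the weights are random, the sampled words are meaningless; the sketch only shows the control flow of conditioning each word on the running sentence state and a fixed image representation.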
URL
https://arxiv.org/abs/1412.6632