Abstract
Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and dataset overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.
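To make the second approach concrete, below is a minimal sketch (assuming PyTorch; all names such as CaptionRNN are hypothetical and not from the paper) of an RNN decoder conditioned on penultimate-layer CNN features, in the spirit of the architecture described above. It is illustrative only, not the authors' implementation.

import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    """Sketch: RNN caption decoder conditioned on CNN image features."""
    def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Project CNN penultimate-layer features to the RNN's initial hidden state
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # next-word logits

    def forward(self, cnn_feats, captions):
        # cnn_feats: (batch, feat_dim) penultimate CNN activations
        # captions:  (batch, seq_len) token ids of the target caption
        h0 = torch.tanh(self.init_h(cnn_feats)).unsqueeze(0)  # (1, batch, hidden)
        emb = self.embed(captions)                            # (batch, seq, embed)
        hidden, _ = self.rnn(emb, h0)
        return self.out(hidden)                               # (batch, seq, vocab)

# Toy usage with random tensors (shapes only; no trained weights)
model = CaptionRNN(feat_dim=4096, vocab_size=10000)
feats = torch.randn(2, 4096)                    # e.g. fc7-style CNN features
toks = torch.randint(0, 10000, (2, 12))
print(model(feats, toks).shape)                 # torch.Size([2, 12, 10000])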
URL
https://arxiv.org/abs/1505.01809