Abstract
We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the “consensus” of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.
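The selection procedure sketched in the abstract amounts to: retrieve the k training images closest to the query in some feature space, pool their captions as candidates, and return the candidate with the highest average similarity to the other candidates. The snippet below is a minimal illustrative sketch of that pipeline, not the paper's implementation; the feature vectors, cosine-similarity retrieval, k value, and the unigram-overlap stand-in `unigram_similarity` (in place of a metric such as BLEU or CIDEr) are all assumptions introduced here for illustration.

```python
import numpy as np


def unigram_similarity(a: str, b: str) -> float:
    """Crude stand-in for a caption-to-caption metric such as BLEU or CIDEr."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def consensus_caption(query_feat, train_feats, train_captions, k=5):
    """Borrow a caption for the query image from its k nearest training images.

    query_feat:     (d,) feature vector of the query image (assumed precomputed)
    train_feats:    (n, d) feature matrix of the training images
    train_captions: list of n lists of reference captions (one list per image)
    """
    # Retrieve the k nearest training images by cosine similarity.
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    neighbor_ids = np.argsort(-(t @ q))[:k]

    # Pool the neighbors' captions into one candidate set.
    candidates = [c for i in neighbor_ids for c in train_captions[i]]

    # "Consensus" selection: return the candidate with the highest average
    # similarity to the other candidates.
    def consensus_score(idx):
        others = [c for j, c in enumerate(candidates) if j != idx]
        if not others:
            return 0.0
        return sum(unigram_similarity(candidates[idx], c) for c in others) / len(others)

    return candidates[max(range(len(candidates)), key=consensus_score)]


if __name__ == "__main__":
    # Toy demonstration with random features and made-up captions.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))
    caps = [["a dog runs on the grass", "a brown dog in a field"],
            ["a man rides a bicycle", "a cyclist on a road"],
            ["a dog plays with a ball", "a puppy on green grass"],
            ["a plate of food on a table", "a sandwich and fries"]]
    query = feats[0] + 0.01 * rng.normal(size=8)
    print(consensus_caption(query, feats, caps, k=2))
```

The paper's actual retrieval features and caption similarity metric would slot into the same structure in place of the stand-ins used above.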
URL
https://arxiv.org/abs/1505.04467