Abstract
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals more complex state-of-the-art models in accuracy, we show that its gains cannot easily be parlayed into improvements on tasks such as image-sentence retrieval, underlining the limitations of current methods and the need for further research.
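As a rough illustration of the cue combination the abstract describes, the sketch below scores candidate boxes for a phrase by a weighted sum of an embedding similarity, a detector score, a color-classifier score, and a size bias. This is a minimal sketch under assumptions: the linear scoring form, the weights, and all function and parameter names (`score_boxes`, `w_emb`, etc.) are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def score_boxes(phrase_emb, box_embs, det_scores, color_scores, box_areas,
                w_emb=1.0, w_det=0.5, w_color=0.5, w_size=0.1):
    """Rank candidate boxes for one phrase by a weighted sum of cues.

    phrase_emb:   (d,) embedding of the phrase in a joint image-text space
    box_embs:     (n, d) embeddings of n candidate region proposals
    det_scores:   (n,) object-detector confidences for the phrase's category
    color_scores: (n,) color-classifier scores for a color mention (zeros if none)
    box_areas:    (n,) box areas, used as a bias toward larger objects
    """
    # Cosine similarity between the phrase and each region in the joint space.
    sim = box_embs @ phrase_emb
    sim /= (np.linalg.norm(box_embs, axis=1) * np.linalg.norm(phrase_emb) + 1e-8)

    # Normalize areas so the size bias is comparable across images.
    size_bias = box_areas / box_areas.max()

    total = (w_emb * sim + w_det * det_scores
             + w_color * color_scores + w_size * size_bias)
    return int(np.argmax(total))  # index of the predicted box for this phrase
```

In practice one would tune the cue weights on a validation split and fall back to the size bias alone when no detector or color cue applies to a phrase.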
URL
https://arxiv.org/abs/1505.04870