Abstract
Image segmentation from referring expressions is a joint vision and language modeling task: the input is an image and a textual expression describing a particular region of the image, and the goal is to localize and segment that region based on the given expression. One major difficulty in training such language-based image segmentation systems is the lack of datasets with joint vision and text annotations. Although existing vision datasets such as MS COCO provide image captions, few datasets offer region-level textual annotations for images, and those that do are often small in scale. In this paper, we explore how existing large-scale vision-only and text-only datasets can be leveraged to train models for image segmentation from referring expressions. We propose a method to address this problem and show in experiments that it improves this joint vision and language modeling task using vision-only and text-only data, outperforming previous results.
URL
https://arxiv.org/abs/1608.08305