Abstract
Motivated by applications in fact-level image understanding, we present an automatic method for collecting structured visual facts from images with captions. Example structured facts include attributed objects (e.g., <flower, red>), actions (e.g., <baby, smile>), interactions (e.g., <man, walking, dog>), and positional information (e.g., <vase, on, table>). The collected annotations take the form of fact-image pairs (e.g., <man, walking, dog> paired with an image region containing this fact). Using a language-based approach, the proposed method collects hundreds of thousands of visual fact annotations with 83% accuracy according to human judgment. Our method automatically collected more than 380,000 visual fact annotations and more than 110,000 unique visual facts from images with captions, and localized them in the images, in less than one day of processing time on standard CPU platforms.
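The fact-image pairs described above could be represented as tuples of varying arity (two elements for attributed objects and actions, three for interactions and positional facts) together with an image region. A minimal sketch in Python, where all field names and the bounding-box convention are assumptions rather than the paper's actual schema:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class FactAnnotation:
    """One fact-image pair: a structured visual fact localized in an image.

    `fact` is a 2-tuple (attributed object or action) or a 3-tuple
    (interaction or positional fact). `region` is a bounding box,
    assumed here to be (x, y, width, height) in pixels.
    """
    fact: Tuple[str, ...]
    region: Tuple[int, int, int, int]
    image_id: str

    def order(self) -> int:
        # Arity of the fact: 2 for <flower, red>, 3 for <man, walking, dog>
        return len(self.fact)

# Example annotations mirroring the facts mentioned in the abstract
red_flower = FactAnnotation(("flower", "red"), (5, 5, 60, 80), "img_0001")
walking_dog = FactAnnotation(("man", "walking", "dog"), (10, 20, 200, 150), "img_0002")
```

A frozen dataclass makes annotations hashable, so deduplicating the collected facts (e.g., counting the 110,000+ unique facts among 380,000+ annotations) can use ordinary sets.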
URL
https://arxiv.org/abs/1604.00466