Abstract
In this paper we describe a novel framework and algorithms for discovering image patch patterns from a large corpus of weakly supervised image-caption pairs generated from news events. Current pattern mining techniques attempt to find patterns that are representative and discriminative; we stipulate that our discovered patterns must also be recognizable by humans, preferably with meaningful names. We propose a new multimodal pattern mining approach that leverages the descriptive captions that often accompany news images to learn semantically meaningful image patch patterns. The multimodal patterns are then named using words mined from the image captions associated with each pattern. A novel evaluation framework demonstrates that our patterns are 26.2% more semantically meaningful than those discovered by a state-of-the-art vision-only pipeline, and that we can tag the discovered image patches with 54.5% accuracy with no direct supervision. Our method also discovers named patterns beyond those covered by existing image datasets such as ImageNet. To the best of our knowledge, this is the first algorithm developed to automatically mine image patch patterns that carry strong semantic meaning specific to high-level news events, and to evaluate the discovered patterns against that criterion.
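As an illustrative sketch of the naming step described above, the following Python snippet assigns each discovered visual pattern the caption word that is frequent within that pattern but rare across patterns, using a simple TF-IDF-style score. This is not the authors' implementation; the input mapping pattern_to_captions, the example captions, and the stopword list are all hypothetical.

from collections import Counter
import math

# Hypothetical input: for each discovered visual pattern, the captions of the
# news images whose patches support that pattern.
pattern_to_captions = {
    "pattern_01": ["Soldiers patrol the streets of Kabul",
                   "Afghan soldiers stand guard near a checkpoint"],
    "pattern_02": ["Protesters wave flags in Tahrir Square",
                   "Crowds of protesters gather in central Cairo"],
}

STOPWORDS = {"the", "of", "a", "in", "near", "and", "to"}

def tokenize(caption):
    # Lowercase, keep alphabetic tokens, drop stopwords.
    return [w for w in caption.lower().split()
            if w.isalpha() and w not in STOPWORDS]

# Document frequency of each word across patterns (each pattern counts once),
# used to down-weight words that are common to many patterns.
df = Counter()
for captions in pattern_to_captions.values():
    df.update({w for c in captions for w in tokenize(c)})

n_patterns = len(pattern_to_captions)

def name_pattern(captions):
    """Score words by within-pattern frequency times a smoothed inverse
    pattern frequency, and return the highest-scoring word as the name."""
    tf = Counter(w for c in captions for w in tokenize(c))
    scores = {w: count * math.log((1 + n_patterns) / (1 + df[w]))
              for w, count in tf.items()}
    return max(scores, key=scores.get)

for pid, captions in pattern_to_captions.items():
    print(pid, "->", name_pattern(captions))

Run on the toy input above, this prints "soldiers" for pattern_01 and "protesters" for pattern_02; the paper's actual naming procedure operates over mined caption words per pattern, of which this scoring is only one plausible instantiation.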
URL
https://arxiv.org/abs/1601.00022