Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents

2019-04-16

Jack Hessel, Lillian Lee, David Mimno

arXiv_CV

arXiv_CV Caption Relation


Abstract

Images and text co-occur everywhere on the web, but explicit links between images and sentences (or other intra-document textual units) are often not annotated by users. We present algorithms that successfully discover image-sentence relationships without relying on any explicit multimodal annotation. We explore several variants of our approach on seven datasets of varying difficulty, ranging from images that were captioned post hoc by crowd-workers to naturally-occurring user-generated multimodal documents, wherein correspondences between illustrations and individual textual units may not be one-to-one. We find that a structured training objective based on identifying whether sets of images and sentences co-occur in documents can be sufficient to predict links between specific sentences and specific images within the same document at test time.
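As a rough illustration of the last sentence above, the following is a minimal sketch of training on document-level co-occurrence and then reading off sentence-image links at test time. It is not the authors' implementation. The cosine scoring, the "best sentence per image, averaged" aggregation, the hinge margin of 0.2, the toy 2-d embeddings, and all function names are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sim_matrix(sentences, images):
    """Rows are sentences, columns are images."""
    return [[cosine(s, v) for v in images] for s in sentences]

def doc_similarity(sentences, images):
    """Set-level score (assumed form): mean over images of the best sentence match."""
    m = sim_matrix(sentences, images)
    best_per_image = [max(row[j] for row in m) for j in range(len(images))]
    return sum(best_per_image) / len(best_per_image)

def hinge_loss(sentences, images, neg_images, margin=0.2):
    """Co-occurring sets should outscore a mismatched image set by `margin`."""
    pos = doc_similarity(sentences, images)
    neg = doc_similarity(sentences, neg_images)
    return max(0.0, margin - pos + neg)

def predict_links(sentences, images):
    """Test time: link each image to its highest-scoring sentence in the same document."""
    m = sim_matrix(sentences, images)
    return [max(range(len(sentences)), key=lambda i: m[i][j])
            for j in range(len(images))]

# Toy embeddings (hypothetical): two sentences and two images from one document,
# plus one image sampled from a different document as the negative.
sentences = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.1, 0.9], [0.9, 0.1]]
neg_images = [[-1.0, -1.0]]

print(hinge_loss(sentences, images, neg_images))
print(predict_links(sentences, images))  # [1, 0]
```

Only document-level supervision enters the loss: no sentence-image pair is ever labeled. The links come from reusing the same similarity matrix within a single document at prediction time.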

URL

http://arxiv.org/abs/1904.07826

PDF

http://arxiv.org/pdf/1904.07826
