Abstract
This technical report provides additional details of the deep multimodal similarity model (DMSM) proposed in Fang et al. (2015, arXiv:1411.4952). The model is trained by maximizing the global semantic similarity between images and their natural-language captions, using the public Microsoft COCO dataset, which consists of a large set of images paired with corresponding captions. The learned representations attempt to capture combinations of various visual concepts and cues.
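To make the training objective concrete, the following is a minimal numpy sketch of this kind of similarity-based objective, not the authors' implementation: it assumes the image-side and text-side networks have already produced fixed-dimension embeddings, scores image-caption pairs by cosine similarity, and minimizes the negative log posterior of the matching caption against randomly sampled non-matching captions. The smoothing factor `gamma` and the negative-sampling scheme follow the DSSM-style formulation the DMSM builds on and are assumptions here rather than details stated in this abstract.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between embeddings (broadcasts over rows)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def dmsm_loss(img_vec, pos_cap, neg_caps, gamma=10.0):
    """Negative log posterior of the matching caption given the image.

    img_vec:  (d,)   image embedding from the image-side network
    pos_cap:  (d,)   embedding of the matching caption
    neg_caps: (n, d) embeddings of sampled non-matching captions
    gamma:    softmax smoothing factor (assumed hyperparameter)
    """
    sims = np.concatenate([
        [cosine_similarity(img_vec, pos_cap)],
        cosine_similarity(img_vec[None, :], neg_caps),
    ])
    logits = gamma * sims
    logits -= logits.max()  # numerical stability for the log-sum-exp
    # softmax posterior of the positive caption among all candidates
    log_posterior = logits[0] - np.log(np.exp(logits).sum())
    return -log_posterior

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=64)
    pos = img + 0.1 * rng.normal(size=64)  # caption embedding near the image
    negs = rng.normal(size=(4, 64))        # unrelated caption embeddings
    print(f"loss = {dmsm_loss(img, pos, negs):.4f}")
```

Minimizing this loss over many image-caption pairs pushes matching pairs toward high cosine similarity relative to the sampled negatives, which is the sense in which the global semantic similarity is maximized.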
URL
https://arxiv.org/abs/1504.03083