Abstract
Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
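To make the idea of a partial order over representations concrete, here is a minimal sketch of how an order relation between vectors could be scored. It assumes a coordinate-wise order in which the more specific item has coordinates at least as large as the more general one. That choice, the function name `order_violation`, and the hand-picked vectors are illustrative assumptions, not details stated in the abstract.

```python
def order_violation(specific, general):
    """Squared penalty for violating the assumed order 'specific is below general'.

    Assumption (for illustration only): 'specific' entails 'general' when every
    coordinate of 'specific' is at least the matching coordinate of 'general'.
    The penalty is zero exactly when that holds.
    """
    if len(specific) != len(general):
        raise ValueError("vectors must have the same length")
    return sum(max(0.0, g - s) ** 2 for s, g in zip(specific, general))


# Hypothetical 2-d embeddings, chosen by hand rather than learned.
animal = [1.0, 0.0]
dog = [2.0, 1.0]

print(order_violation(dog, animal))  # 0.0: consistent with "a dog is an animal"
print(order_violation(animal, dog))  # 2.0: "an animal is a dog" violates the order
```

Unlike a symmetric distance, this penalty is asymmetric, so it can represent that a hypernym or entailment relation holds in one direction but not the other.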
URL
https://arxiv.org/abs/1511.06361