
Deeply Supervised Multimodal Attentional Translation Embeddings for Visual Relationship Detection

2019-02-15
Nikolaos Gkanatsios, Vassilis Pitsikalis, Petros Koutras, Athanasia Zlatintsi, Petros Maragos

Abstract

Detecting visual relationships, i.e., <Subject, Predicate, Object> triplets, is a challenging Scene Understanding task that has previously been approached via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, Multimodal Attentional Translation Embeddings, in which the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of experiments comparing against all related approaches in the literature, including re-implemented and fine-tuned versions of several of them. Results on the commonly employed VRD dataset [1] show that the proposed method clearly outperforms all others, and we justify our claims both quantitatively and qualitatively.
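The core idea, a translation embedding in which the subject plus the predicate should land near the object in a low-dimensional space, with visual features reweighted by a spatio-linguistic attention signal, can be sketched roughly as below. This is a minimal illustration under assumed module names, dimensions, and fusion scheme; it is not the authors' implementation:

```python
# Hypothetical sketch of an attention-modulated translation-embedding
# scorer for visual relationship detection. All names and sizes are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class TranslationEmbeddingScorer(nn.Module):
    def __init__(self, visual_dim=512, context_dim=300,
                 embed_dim=64, num_predicates=70):
        super().__init__()
        # Project subject/object visual features into a low-dimensional space.
        self.subj_proj = nn.Linear(visual_dim, embed_dim)
        self.obj_proj = nn.Linear(visual_dim, embed_dim)
        # Attention over visual channels, driven by a spatio-linguistic
        # context vector (e.g. box geometry plus class word embeddings).
        self.attention = nn.Sequential(
            nn.Linear(context_dim, visual_dim),
            nn.Sigmoid(),
        )
        # Classify the predicate from the translation vector o - s.
        self.predicate_head = nn.Linear(embed_dim, num_predicates)

    def forward(self, subj_feat, obj_feat, context):
        attn = self.attention(context)        # per-channel weights in (0, 1)
        s = self.subj_proj(subj_feat * attn)  # attended subject embedding
        o = self.obj_proj(obj_feat * attn)    # attended object embedding
        return self.predicate_head(o - s)     # predicate scores from o - s
```

The translation formulation (classifying the predicate from the difference of the attended object and subject embeddings) follows the s + p ≈ o intuition behind translation embeddings; the deep supervision on intermediate branches mentioned in the abstract is omitted here for brevity.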

URL

http://arxiv.org/abs/1902.05829

PDF

http://arxiv.org/pdf/1902.05829

