Abstract
The attention mechanism is an important part of neural machine translation (NMT), where it has been reported to produce richer source representations than sequence-to-sequence models with fixed-length encoding. Recently, the effectiveness of attention has also been explored in the context of image captioning. In this work, we assess the feasibility of a multimodal attention mechanism that simultaneously focuses on an image and its natural language description to generate a description in another language. We train several variants of our proposed attention mechanism on the Multi30k multilingual image captioning dataset. We show that a dedicated attention for each modality achieves improvements of up to 1.6 points in BLEU and METEOR over a textual NMT baseline.
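To make the idea of "a dedicated attention for each modality" concrete, the following is a minimal sketch rather than the paper's implementation. It assumes dot-product scoring, a shared vector size across the decoder state and both modalities, and fusion by concatenation; none of these choices is stated in the abstract, and all function and variable names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, annotations):
    """Attention over one modality; returns (context, weights).

    Assumption: dot-product scoring, so query and annotations share a size.
    """
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(w * ann[i] for w, ann in zip(weights, annotations))
        for i in range(dim)
    ]
    return context, weights

def multimodal_context(query, text_annotations, image_annotations):
    """Run a separate attention per modality, then fuse the two contexts.

    Assumption: fusion by concatenation (list +); other fusions are possible.
    """
    text_ctx, text_w = attend(query, text_annotations)
    image_ctx, image_w = attend(query, image_annotations)
    return text_ctx + image_ctx, text_w, image_w

# Toy inputs (hypothetical): 3 source-word vectors and 2 image-region vectors.
decoder_state = [1.0, 0.0]
text_annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
image_annotations = [[0.2, 0.8], [0.9, 0.1]]

context, text_w, image_w = multimodal_context(
    decoder_state, text_annotations, image_annotations
)
```

Each modality gets its own weight distribution, so the decoder can attend sharply to one source word while spreading its focus across image regions, or the reverse. A single shared attention would instead force all words and regions to compete within one distribution.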
URL
https://arxiv.org/abs/1609.03976