Abstract
We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. The key idea is to perform image retrieval over a database of images that are captioned in the target language, and use the captions of the most similar images for crosslingual reranking of translation outputs. Our approach does not depend on the availability of large amounts of in-domain parallel data, but only relies on available large datasets of monolingually captioned images, and on state-of-the-art convolutional neural networks to compute image similarities. Our experimental evaluation shows improvements of 1 BLEU point over strong baselines.
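The retrieval-plus-reranking idea can be sketched in a few lines. The following minimal Python sketch is illustrative only, under stated assumptions: it presumes CNN image features have already been extracted into NumPy arrays, and it substitutes a simple token-overlap score for the paper's target-side relevance model. The function names (`retrieve_captions`, `rerank`), the interpolation weight `lam`, and the toy data are hypothetical, not taken from the paper.

```python
import numpy as np

def cosine_similarities(query, db):
    """Cosine similarity between one query vector and every row of db."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    return d @ q

def retrieve_captions(query_feat, db_feats, db_captions, k=5):
    """Captions of the k database images most similar to the query image."""
    sims = cosine_similarities(query_feat, db_feats)
    return [db_captions[i] for i in np.argsort(-sims)[:k]]

def caption_overlap(hypothesis, captions):
    """Fraction of hypothesis tokens found in the retrieved captions --
    a crude stand-in for the paper's relevance scoring."""
    tokens = hypothesis.lower().split()
    pool = {tok for cap in captions for tok in cap.lower().split()}
    return sum(tok in pool for tok in tokens) / max(len(tokens), 1)

def rerank(nbest, query_feat, db_feats, db_captions, k=5, lam=0.5):
    """Pick the best (hypothesis, mt_score) pair after interpolating each
    MT model score with a multimodal-pivot bonus."""
    captions = retrieve_captions(query_feat, db_feats, db_captions, k)
    scored = [(lam * mt + (1 - lam) * caption_overlap(hyp, captions), hyp)
              for hyp, mt in nbest]
    return max(scored)[1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db_feats = rng.normal(size=(50, 2048))        # toy stand-ins for CNN features
    db_captions = ["a dog runs on the beach"] * 25 + ["a man rides a bike"] * 25
    query = db_feats[3]                           # pretend this is the source image
    nbest = [("a dog runs at the beach", -2.1),   # toy n-best list with MT scores
             ("a dock runs on the beech", -2.0)]
    print(rerank(nbest, query, db_feats, db_captions))  # picks the first hypothesis
```

In the toy run, the hypothesis with the slightly worse MT score wins because more of its tokens appear in the captions of visually similar images, which is exactly the effect the multimodal pivot is meant to have.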
URL
https://arxiv.org/abs/1601.03916