Abstract
Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available: a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and b) automatically generated human activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge mining model. Our best model achieves Recall@10 of 19.2% on the annotation task and 18.9% on the video retrieval task for a subset of 1000 samples. For the multiple-choice test, our best model achieves an accuracy of 58.11% over the whole LSMDC16 public test set.
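The abstract reports Recall@10 for annotation (video-to-text) and retrieval (text-to-video) over a 1000-sample subset. As a rough illustration only (not the authors' code), the sketch below shows how Recall@K is commonly computed for a joint embedding: both modalities are assumed to be already projected into a shared space and L2-normalized, and a query counts as a hit if its ground-truth match (same index) falls in the top K by cosine similarity. The embedding dimension and random vectors in the usage lines are placeholders.

```python
# Minimal sketch, not the paper's implementation: Recall@K over a joint
# embedding, assuming paired, L2-normalized video and caption embeddings.
import numpy as np

def recall_at_k(query_emb, target_emb, k=10):
    """Fraction of queries whose ground-truth target (same row index)
    appears among the k most similar targets by cosine similarity."""
    sims = query_emb @ target_emb.T                 # (N, N) similarity matrix
    ranks = np.argsort(-sims, axis=1)               # targets sorted, most similar first
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical usage on a 1000-sample subset, mirroring the ranking tasks:
# caption -> video corresponds to retrieval, video -> caption to annotation.
rng = np.random.default_rng(0)
cap = rng.normal(size=(1000, 300)); cap /= np.linalg.norm(cap, axis=1, keepdims=True)
vid = rng.normal(size=(1000, 300)); vid /= np.linalg.norm(vid, axis=1, keepdims=True)
print("Recall@10 (video retrieval):", recall_at_k(cap, vid, k=10))
print("Recall@10 (annotation):    ", recall_at_k(vid, cap, k=10))
```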
URL
https://arxiv.org/abs/1609.08124