Abstract
We present our submission to the Microsoft Video to Language Challenge, the task of generating short captions describing videos in the challenge dataset. Our model is based on the encoder–decoder pipeline popular in image and video captioning systems. We propose to utilize two different kinds of video features: one to capture the video content in terms of objects and attributes, and the other to capture motion and action information. Using these diverse features, we train models specializing in two separate input sub-domains. We then train an evaluator model that picks the best caption from the pool of candidates generated by these domain-expert models. We argue that, given the diversity of the dataset, this approach is better suited to the current video captioning task than using a single model. The efficacy of our method is demonstrated by the fact that it was rated best in the MSR Video to Language Challenge according to human evaluation; additionally, we ranked second on the table based on automatic evaluation metrics.
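The key idea in the abstract is an ensemble of domain-expert captioning models whose outputs are pooled, with a separately trained evaluator choosing the final caption for each video. Below is a minimal sketch of that candidate-pool-and-select step, under the assumption that each expert and the evaluator are callables; all names here (the expert functions, the toy tag-overlap evaluator) are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of pooling candidate captions from domain-expert models
# and letting an evaluator pick the best one. Real experts would be
# encoder-decoder networks over appearance (object/attribute) and motion
# (action) features; the real evaluator is a learned compatibility model.

from typing import Callable, List, Sequence, Tuple


def generate_candidates(video_features: dict,
                        experts: Sequence[Callable[[dict], List[str]]]) -> List[str]:
    """Collect candidate captions from every domain-expert model."""
    pool: List[str] = []
    for expert in experts:
        pool.extend(expert(video_features))
    return pool


def select_best(video_features: dict,
                candidates: Sequence[str],
                evaluator: Callable[[dict, str], float]) -> Tuple[str, float]:
    """Return the caption the evaluator scores highest for this video."""
    scored = [(caption, evaluator(video_features, caption)) for caption in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy stand-ins for the two domain experts.
    appearance_expert = lambda feats: ["a man is playing a guitar"]
    motion_expert = lambda feats: ["a person is strumming an instrument"]

    # Toy evaluator: counts word overlap with video tags (purely illustrative).
    evaluator = lambda feats, caption: float(len(set(caption.split()) & feats["tags"]))

    video = {"tags": {"man", "guitar", "playing"}}
    pool = generate_candidates(video, [appearance_expert, motion_expert])
    best_caption, score = select_best(video, pool, evaluator)
    print(best_caption, score)
```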
URL
https://arxiv.org/abs/1608.04959