Abstract
Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject-verb-object (SVO) prediction accuracy, and a human evaluation.
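The abstract describes a pipeline that couples a convolutional image model with a recurrent sentence decoder. As a rough illustration only (not the authors' released code), the sketch below assumes per-frame CNN features (e.g. fc7-like 4096-dim activations) are mean-pooled into a single video vector that conditions an LSTM word decoder; all class names, dimensions, and the PyTorch framework choice are assumptions for the sketch.

```python
# Hypothetical sketch of a CNN-feature + LSTM video-to-sentence model.
# Frame features are mean-pooled into one vector, then an LSTM decodes
# a word sequence conditioned on that vector.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)   # project pooled CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # per-step word scores

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) per-frame CNN activations
        # captions:    (batch, seq_len) word indices of the target sentence
        video_vec = self.feat_proj(frame_feats.mean(dim=1))   # mean-pool over frames
        words = self.embed(captions)                          # (batch, seq_len, embed_dim)
        inputs = torch.cat([video_vec.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                               # logits per time step

# Toy usage with random tensors, just to show the shapes involved.
model = VideoCaptioner()
feats = torch.randn(2, 30, 4096)          # 2 clips, 30 frames of fc7-like features
caps = torch.randint(0, 10000, (2, 12))   # 2 reference sentences, 12 tokens each
logits = model(feats, caps)
print(logits.shape)                        # torch.Size([2, 13, 10000])
```

The knowledge transfer mentioned in the abstract would correspond to initializing the convolutional feature extractor from image classification (ImageNet-scale category labels) and the language side from image captioning data, rather than training the whole model from video descriptions alone.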
URL
https://arxiv.org/abs/1412.4729