Abstract
In this paper, we describe a system for generating textual descriptions of short video clips using recurrent neural networks (RNNs), which we used while participating in the Large Scale Movie Description Challenge 2015 at ICCV 2015. Our work builds on static image captioning systems with RNN-based language models and extends this framework to videos by utilizing both static image features and video-specific features. In addition, we study the usefulness of visual content classifiers as a source of additional information for caption generation. Our experimental results show that utilizing keyframe-based features, dense trajectory video features, and content classifier outputs together gives better performance than any one of them individually.
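The fusion idea in the abstract (keyframe CNN features, dense trajectory descriptors, and classifier outputs combined to condition an RNN language model) can be sketched as follows. This is a hypothetical toy illustration, not the authors' actual architecture: all dimensions, the simple Elman-style recurrence, and the tiny vocabulary are assumptions, and training is omitted entirely.

```python
# Toy sketch of feature fusion for RNN caption generation.
# All dimensions and parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Per-clip features (real systems would use e.g. CNN activations for the
# keyframe, encoded dense-trajectory descriptors, and classifier posteriors).
keyframe_feat = rng.standard_normal(8)     # static image features
trajectory_feat = rng.standard_normal(6)   # video-specific motion features
classifier_scores = rng.random(4)          # visual content classifier outputs

# Early fusion: concatenate into one conditioning vector for the language model.
video_vec = np.concatenate([keyframe_feat, trajectory_feat, classifier_scores])

vocab = ["<bos>", "a", "person", "walks", "<eos>"]
V, H = len(vocab), 10

# Randomly initialised toy parameters (a real model would learn these).
W_v = rng.standard_normal((H, video_vec.size)) * 0.1  # video -> initial state
W_h = rng.standard_normal((H, H)) * 0.1               # recurrent weights
W_e = rng.standard_normal((H, V)) * 0.1               # word embeddings
W_o = rng.standard_normal((V, H)) * 0.1               # state -> vocab logits

def generate(max_len=10):
    """Greedily decode a caption conditioned on the fused video vector."""
    h = np.tanh(W_v @ video_vec)          # initialise state from the video
    word = vocab.index("<bos>")
    out = []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_e[:, word])
        word = int(np.argmax(W_o @ h))    # greedy choice of next word
        if vocab[word] == "<eos>":
            break
        out.append(vocab[word])
    return out

caption = generate()
print(caption)
```

The point of the sketch is the conditioning step: a single fused vector initializes the decoder state, so all three feature sources influence every generated word.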
URL
https://arxiv.org/abs/1512.02949