Abstract
Generating descriptions for videos has many applications, including assisting blind people and human-robot interaction. The recent advances in image captioning, as well as the release of large-scale movie description datasets such as MPII Movie Description, allow us to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long Short-Term Memory recurrent networks (LSTMs) for generating descriptions. While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the challenging setting of movie description. In this work we show how to learn robust visual classifiers from the weak annotations of the sentence descriptions. Based on these visual classifiers we learn how to generate a description using an LSTM. We explore different design choices to build and train the LSTM and achieve the best performance to date on the challenging MPII-MD dataset. We compare and analyze our approach and prior work along various dimensions to better understand the key challenges of the movie description task.
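The abstract describes a two-stage pipeline: visual classifiers for verbs, objects, and places are learned from weak sentence-level labels, and their scores condition an LSTM that generates the description. Below is a minimal sketch of such a classifier-conditioned LSTM decoder; it is not the authors' code, and all layer sizes, class counts, and names are illustrative assumptions.

```python
# Sketch only: an LSTM decoder conditioned on verb/object/place classifier
# scores, in the spirit of the pipeline described in the abstract.
import torch
import torch.nn as nn

class ClassifierConditionedLSTM(nn.Module):
    def __init__(self, vocab_size, num_verbs=100, num_objects=100,
                 num_places=50, embed_dim=256, hidden_dim=512):
        super().__init__()
        visual_dim = num_verbs + num_objects + num_places
        # Project the concatenated classifier scores into the embedding space.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Each step sees the visual code plus the previous word embedding.
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, classifier_scores, word_ids):
        # classifier_scores: (batch, num_verbs + num_objects + num_places)
        # word_ids:          (batch, seq_len) previous words (teacher forcing)
        v = self.visual_proj(classifier_scores)              # (batch, embed_dim)
        v = v.unsqueeze(1).expand(-1, word_ids.size(1), -1)  # repeat per time step
        w = self.word_embed(word_ids)                        # (batch, seq_len, embed_dim)
        h, _ = self.lstm(torch.cat([v, w], dim=-1))
        return self.out(h)                                   # per-step vocabulary logits

# Usage (hypothetical shapes): classifier scores for a clip plus the shifted
# ground-truth sentence produce next-word logits for training.
model = ClassifierConditionedLSTM(vocab_size=10000)
scores = torch.rand(2, 250)                    # 100 verbs + 100 objects + 50 places
words = torch.randint(0, 10000, (2, 12))
logits = model(scores, words)                  # (2, 12, 10000)
```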
URL
https://arxiv.org/abs/1506.01698