Bidirectional Long-Short Term Memory for Video Description

2016-06-15
Yi Bin, Yang Yang, Zi Huang, Fumin Shen, Xing Xu, Heng Tao Shen

Abstract

Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or employ only local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures the bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass and a backward LSTM pass with visual features from Convolutional Neural Networks (CNNs). Then, we inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of our proposed video captioning framework on a commonly used benchmark, i.e., the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate the superiority of the proposed approach compared to several state-of-the-art methods.
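
To make the encoder described in the abstract concrete, below is a minimal sketch, assuming PyTorch (the paper does not specify a framework). It combines a forward LSTM pass, a backward LSTM pass, and the raw CNN frame features into one video representation, which would then initialize the sentence-generating language model. All module names, dimensions, and the mean-pooling fusion are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a bidirectional LSTM video
# encoder, assuming PyTorch. Names and dimensions are hypothetical.
import torch
import torch.nn as nn

class BiLSTMVideoEncoder(nn.Module):
    """Encodes a sequence of CNN frame features with forward and
    backward LSTM passes, then fuses both passes with the visual features."""

    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        # Separate forward and backward passes over the frame sequence.
        self.forward_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Joint projection of both passes plus the raw CNN features.
        self.fuse = nn.Linear(2 * hidden_dim + feat_dim, hidden_dim)

    def forward(self, cnn_feats):
        # cnn_feats: (batch, num_frames, feat_dim), e.g. fc7 activations.
        fwd_out, _ = self.forward_lstm(cnn_feats)
        bwd_out, _ = self.backward_lstm(torch.flip(cnn_feats, dims=[1]))
        bwd_out = torch.flip(bwd_out, dims=[1])  # realign to forward order
        fused = torch.tanh(
            self.fuse(torch.cat([fwd_out, bwd_out, cnn_feats], dim=-1)))
        # Mean-pool over time into one video vector; this vector would be
        # injected into the language model's initial hidden state.
        return fused.mean(dim=1)

# Usage: encode 16 frames of 4096-d CNN features for a batch of 2 videos.
video = torch.randn(2, 16, 4096)
video_repr = BiLSTMVideoEncoder()(video)  # shape: (2, 512)
```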


URL

https://arxiv.org/abs/1606.04631

PDF

https://arxiv.org/pdf/1606.04631

