
Spatio-Temporal Attention Models for Grounded Video Captioning

2016-10-18
Mihai Zanfir, Elisabeta Marinoiu, Cristian Sminchisescu

Abstract

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to high level textual description, bypassing localization and recognition, thus discarding potentially valuable information for content localization and generalization. In this work we present an automatic video captioning model that combines spatio-temporal attention and image classification by means of deep neural network structures based on long short-term memory. The resulting system is demonstrated to produce state-of-the-art results in the standard YouTube captioning benchmark while also offering the advantage of localizing the visual concepts (subjects, verbs, objects), with no grounding supervision, over space and time.
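
To make the mechanism described in the abstract concrete, below is a minimal sketch of soft temporal attention over per-frame features driving an LSTM word decoder; the attention weights are what lets each generated word be grounded in time. This is an illustration only, not the authors' implementation: the paper's full model also attends spatially over regions and fuses an image-classification pathway, which is omitted here, and all module names and dimensions below are assumptions.

```python
# Illustrative sketch (not the paper's released code): temporal soft attention
# over per-frame CNN features feeding an LSTM caption decoder.
import torch
import torch.nn as nn


class TemporalAttentionCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Layers that score each frame feature against the current decoder state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (B, T, feat_dim) per-frame features; h: (B, hidden_dim) decoder state.
        scores = self.att_out(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        weights = torch.softmax(scores.squeeze(-1), dim=1)           # (B, T)
        context = (weights.unsqueeze(-1) * feats).sum(dim=1)         # (B, feat_dim)
        # The weights give the temporal grounding of the word emitted at this step.
        return context, weights

    def step(self, word_id, feats, state):
        # One decoding step: attend over frames, then update the LSTM and predict a word.
        h, c = state
        context, weights = self.attend(feats, h)
        x = torch.cat([self.embed(word_id), context], dim=-1)
        h, c = self.lstm(x, (h, c))
        logits = self.word_out(h)
        return logits, (h, c), weights


# Example with illustrative shapes: 26 frames of 2048-d features, one decode step.
model = TemporalAttentionCaptioner()
feats = torch.randn(1, 26, 2048)
state = (torch.zeros(1, 512), torch.zeros(1, 512))
logits, state, weights = model.step(torch.tensor([1]), feats, state)
```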

URL

https://arxiv.org/abs/1610.04997

PDF

https://arxiv.org/pdf/1610.04997

