Abstract
This paper presents a restricted visual Turing test (VTT) for story-line based deep understanding of long-term, multi-camera captured videos. Given a set of videos of a scene (such as a multi-room office, a garden, or a parking lot) and a sequence of story-line based queries, the task is to provide answers either in binary form, “true/false” (for a polar query), or as an accurate natural language description (for a non-polar query). Queries, polar or non-polar, consist of view-based queries, which can be answered from a particular camera view, and scene-centered queries, which involve joint inference across different cameras. The story lines are collected to cover spatial, temporal, and causal understanding of the input videos. The data and queries distinguish our VTT from recently proposed visual question answering on images and from video captioning. A vision system is proposed to perform joint video and query parsing, integrating different vision modules, a knowledge base, and a query engine. The system provides unified interfaces for the modules so that individual modules can be reconfigured to test a new method. We provide a benchmark dataset and a toolkit for ontology-guided story-line query generation, consisting of about 93.5 hours of video captured at four different locations and 3,426 queries organized into 127 story lines. We also provide a baseline implementation and an analysis of its results.
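The abstract describes the system only at a high level. As a rough, hypothetical sketch of the query taxonomy it names (polar vs. non-polar, view-based vs. scene-centered) and of the unified module interface, the Python below illustrates one possible shape; all names, types, and signatures are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

# Placeholder types standing in for the system's internal representations
# (hypothetical; the paper does not specify these).
@dataclass
class VideoSegment:
    camera_id: int
    start_sec: float
    end_sec: float

@dataclass
class ParseResult:
    entities: list
    events: list

class VisionModule(Protocol):
    """Unified interface: any module (detector, tracker, etc.) implementing
    parse() can be plugged in, matching the abstract's claim that modules
    can be 'reconfigured to test a new method'."""
    def parse(self, segment: VideoSegment) -> ParseResult: ...

@dataclass
class Query:
    text: str
    polar: bool                      # True => expects a true/false answer
    camera_id: Optional[int] = None  # None => scene-centered: joint
                                     # inference across all cameras

def answer(query: Query, knowledge_base: dict) -> str:
    """Toy stand-in for the query engine: a real system would run joint
    inference over a knowledge base built from all camera views."""
    if query.polar:
        return "true" if query.text in knowledge_base else "false"
    return knowledge_base.get(query.text, "unknown")
```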
URL
https://arxiv.org/abs/1512.01715