
Cross-Modal and Hierarchical Modeling of Video and Text

2018-10-16
Bowen Zhang, Hexiang Hu, Fei Sha

Abstract

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, each of which depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks, and show their utility in zero-shot action recognition and video captioning.

URL

https://arxiv.org/abs/1810.07212

PDF

https://arxiv.org/pdf/1810.07212

