Seeing with Humans: Gaze-Assisted Neural Image Captioning

2016-08-18
Yusuke Sugano, Andreas Bulling

Abstract

Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous work demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear whether gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, allowing the algorithm to selectively allocate attention to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets, we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
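
To make the "split attention" idea concrete, below is a minimal PyTorch sketch of one attention step: regions are partitioned by a binary gaze mask into fixated and non-fixated sets, each set gets its own attention distribution, and a learned gate mixes the two context vectors. The additive scoring function, the sigmoid gate conditioned on the decoder hidden state, and all layer names and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Sketch of a gaze-conditioned split attention step (illustrative,
    not the paper's exact model). Assumes each image has at least one
    fixated and one non-fixated region."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.gate = nn.Linear(hidden_dim, 1)  # mixes the two contexts

    def _attend(self, feats, hidden, mask):
        # Additive (Bahdanau-style) attention restricted to regions in `mask`.
        e = self.score(torch.tanh(
            self.proj_feat(feats) + self.proj_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, regions)
        e = e.masked_fill(~mask, float('-inf'))  # ignore out-of-set regions
        alpha = F.softmax(e, dim=-1)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)

    def forward(self, feats, hidden, gaze_mask):
        # feats: (batch, regions, feat_dim) CNN region features
        # hidden: (batch, hidden_dim) decoder LSTM state
        # gaze_mask: (batch, regions) bool, True where humans fixated
        ctx_fix = self._attend(feats, hidden, gaze_mask)    # fixated regions
        ctx_non = self._attend(feats, hidden, ~gaze_mask)   # non-fixated regions
        g = torch.sigmoid(self.gate(hidden))                # per-example gate
        return g * ctx_fix + (1 - g) * ctx_non
```

In a full captioner, `hidden` would be the decoder LSTM state at each time step, and the returned context vector would feed the next word prediction; the gate lets the model lean on fixated regions when gaze is informative and fall back to non-fixated regions otherwise.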

URL

https://arxiv.org/abs/1608.05203

PDF

https://arxiv.org/pdf/1608.05203

