Abstract
Predicting attention is a popular topic at the intersection of human and computer vision. However, even though most of the available video saliency data sets and models claim to target human observers’ fixations, they fail to differentiate fixations from smooth pursuit (SP), a major eye movement type that is unique to the perception of dynamic scenes. In this work, we highlight the importance of SP and its prediction (which we call supersaliency, due to its greater selectivity compared to fixations), and aim to make its distinction from fixations explicit for computational models. To this end, we (i) use algorithmic and manual annotations of SP and fixations for two well-established video saliency data sets, (ii) train Slicing Convolutional Neural Networks for saliency prediction on either fixation- or SP-salient locations, and (iii) evaluate our models and 26 publicly available dynamic saliency models on three data sets against traditional saliency and supersaliency ground truth. Overall, our models outperform the state of the art in both the new supersaliency and the traditional saliency problem settings, even though the literature models are optimized for the latter. Importantly, on two independent data sets, our supersaliency model shows greater generalization ability and outperforms all other models, even for fixation prediction.
URL
http://arxiv.org/abs/1801.08925