Abstract
Detecting the temporal extents of human actions in videos is a challenging computer vision problem that requires detailed manual supervision, including frame-level labels. This expensive annotation process limits action detectors to a small number of categories. We propose a novel action recognition method, called WSGN, that learns to detect actions from “weak supervision”: video-level labels only. WSGN learns to exploit both video-specific and dataset-wide statistics to predict the relevance of each frame to an action category. We show that combining these local and global channels leads to significant gains on two standard benchmarks, THUMOS14 and Charades. Our method improves over a weakly supervised baseline by more than 12% mAP, outperforms other weakly supervised state-of-the-art methods, and trails the state-of-the-art fully supervised method by only 4% on the THUMOS14 dataset for action detection. Similarly, our method is only 0.3% mAP behind a state-of-the-art supervised method on the challenging Charades dataset for action localisation.
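The abstract only sketches the mechanism, so the following is a minimal, hypothetical PyTorch-style illustration of the general idea: a video-local channel and a dataset-global channel each score per-frame relevance, and their fused weights pool frame scores into a video-level prediction trainable from video-level labels alone. All names and layer choices here are assumptions for illustration, not the paper's actual WSGN architecture.

```python
import torch
import torch.nn as nn

class FrameWeightingSketch(nn.Module):
    """Hypothetical sketch of weakly supervised frame weighting.
    Not the authors' implementation; details are assumptions."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.local_fc = nn.Linear(feat_dim, 1)    # video-specific channel
        self.global_fc = nn.Linear(feat_dim, 1)   # dataset-wide channel
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor):      # frames: (T, feat_dim)
        # Local channel: softmax across the frames of one video, so a
        # frame's relevance is relative to that video's own statistics.
        w_local = torch.softmax(self.local_fc(frames), dim=0)
        # Global channel: sigmoid, so relevance is comparable across videos.
        w_global = torch.sigmoid(self.global_fc(frames))
        weights = w_local * w_global              # fused per-frame relevance
        frame_scores = self.classifier(frames)    # (T, num_classes)
        # Weighted pooling gives a video-level prediction; training needs
        # only video-level labels, while the weights localise the action.
        video_logits = (weights * frame_scores).sum(dim=0)
        return video_logits, weights
```

A video-level classification loss (e.g. binary cross-entropy against the video labels) would be applied to `video_logits`; at test time the per-frame `weights` and `frame_scores` can be thresholded to produce temporal detections.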
URL
http://arxiv.org/abs/1904.07774