Abstract
Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning, and speech recognition. For the task of capturing temporal structure in video, however, numerous research questions remain open. Current research suggests using a simple temporal feature pooling strategy to account for the temporal aspect of video. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative than in general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold: first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.
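The contrast drawn above between temporal feature pooling and an architecture with temporal convolutions and bidirectional recurrence can be illustrated with a toy sketch. This is not the network from the paper: it assumes one scalar feature per frame, fixed rather than trained weights, and a hypothetical frame-difference kernel, and all function names are illustrative. It only shows why an order-invariant pooling step may discard exactly the information a gesture classifier needs.

```python
import math

def mean_pool(frames):
    """Temporal feature pooling: average per-frame features over time."""
    return sum(frames) / len(frames)

def temporal_conv(frames, kernel):
    """1-D 'valid' cross-correlation along the time axis."""
    k = len(kernel)
    return [sum(kernel[j] * frames[t + j] for j in range(k))
            for t in range(len(frames) - k + 1)]

def recurrent_pass(frames, w_in=1.0, w_rec=0.5):
    """Simple tanh recurrence; returns the hidden state at every step.
    The weights are arbitrary fixed values (assumption), not learned."""
    h, states = 0.0, []
    for x in frames:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional(frames, **weights):
    """Pair the forward and backward hidden states for every time step."""
    fwd = recurrent_pass(frames, **weights)
    bwd = recurrent_pass(frames[::-1], **weights)[::-1]
    return list(zip(fwd, bwd))

# Toy per-frame features: a "gesture" and the same frames played backwards.
gesture = [0.0, 0.2, 0.9, 0.4]
reversed_gesture = gesture[::-1]

# Mean pooling ignores frame order, so both sequences look the same.
pooled_same = math.isclose(mean_pool(gesture), mean_pool(reversed_gesture))

# Temporal convolution followed by bidirectional recurrence is order-sensitive.
diff_kernel = [-1.0, 1.0]  # hypothetical kernel: frame-to-frame difference
out_a = bidirectional(temporal_conv(gesture, diff_kernel))
out_b = bidirectional(temporal_conv(reversed_gesture, diff_kernel))

print("pooling identical:", pooled_same)
print("conv + bidirectional recurrence identical:", out_a == out_b)
```

In a real model, the scalar features would be replaced by per-frame feature vectors from a convolutional network, and the kernel and recurrent weights would be learned end to end.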
URL
https://arxiv.org/abs/1506.01911