Abstract
This letter describes a network that captures spatiotemporal correlations over arbitrary timestamps. The proposed scheme operates as a complementary, extended network over spatiotemporal regions. Recently, multimodal fusion has been extensively studied in deep learning. For action recognition, the spatial and temporal streams are vital components of deep Convolutional Neural Networks (CNNs), but reducing overfitting and fusing these two streams remain open problems. The existing fusion approach simply averages the two streams. To address this, we propose a correlation network with Shannon fusion that learns on top of a pre-trained CNN. A long-range video may contain spatiotemporal correlations over arbitrary timestamps; these correlations can be captured with simple fully connected layers, which form the correlation network. This is found to be complementary to existing network-fusion methods. We evaluate our approach on the UCF-101 and HMDB-51 datasets, and the resulting improvements in accuracy demonstrate the importance of multimodal correlation.
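The fully-connected correlation idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, layer sizes, and random stand-in weights are all assumptions, and each stream is reduced to a single clip-level feature vector for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_relu(x, w, b):
    """A simple fully connected layer with ReLU activation."""
    return np.maximum(w @ x + b, 0.0)

# Hypothetical setup: each of the two streams of a pre-trained two-stream
# CNN yields a 512-d descriptor for a video clip.
d = 512
spatial_feat = rng.standard_normal(d)   # from the spatial (RGB) stream
temporal_feat = rng.standard_normal(d)  # from the temporal (flow) stream

# Correlation network sketch: concatenate the two streams and pass the
# result through simple fully connected layers.  The weights here are
# random stand-ins for learned parameters.
x = np.concatenate([spatial_feat, temporal_feat])               # (1024,)
w1, b1 = 0.01 * rng.standard_normal((256, 2 * d)), np.zeros(256)
w2, b2 = 0.01 * rng.standard_normal((101, 256)), np.zeros(101)  # 101 classes (UCF-101)

h = fc_relu(x, w1, b1)
logits = w2 @ h + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()        # softmax over action classes
```

In contrast to averaging the two streams' scores, the concatenated input lets the fully connected layers weight cross-stream (spatial-temporal) feature interactions.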
URL
http://arxiv.org/abs/1807.08291