Abstract
The temporal dynamics and the discriminative information in audio signals are crucial for Acoustic Scene Classification (ASC). In this work, we propose a temporal feature learning method with a hierarchical architecture, called Multi-Layer Temporal Pooling (MLTP). Through recursive non-linear feature mappings and temporal pooling operations, the proposed MLTP can effectively capture the high-level temporal dynamics of an entire audio signal of arbitrary duration in an unsupervised way. Taking as input the patch-level discriminative features extracted by a simple pre-trained convolutional neural network (CNN), our method learns temporal features for the entire audio sample, which are then used directly to train the classifier. Experimental results show that our method significantly improves ASC performance. Without using any data augmentation techniques or ensemble strategies, our method still achieves state-of-the-art performance with only one lightweight CNN and a single classifier.
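The abstract does not specify the mapping or pooling operators, so the following is only a rough sketch of the general idea: a non-linear mapping and a windowed temporal pooling are applied repeatedly until a variable-length sequence of patch-level features collapses into one fixed-size vector. The signed square root, the linearly increasing pooling weights, and every name and parameter here (`nonlinear_map`, `pool_window`, `pool_layer`, `multilayer_temporal_pool`, `num_layers`, `window`, `stride`) are illustrative assumptions and should not be read as the operators used in MLTP.

```python
import numpy as np

def nonlinear_map(x):
    # Signed square root: a placeholder non-linear mapping (assumption).
    return np.sign(x) * np.sqrt(np.abs(x))

def pool_window(frames):
    # Order-sensitive pooling (assumption): a weighted mean whose weights
    # grow linearly with time, so later frames count more than earlier ones.
    n = frames.shape[0]
    weights = np.arange(1, n + 1, dtype=float)
    return (weights / weights.sum()) @ frames

def pool_layer(seq, window, stride):
    # Pool overlapping windows along the time axis; seq has shape (T, D).
    # Trailing frames that do not fill a whole window are dropped.
    T = seq.shape[0]
    if T <= window:
        return pool_window(seq)[None, :]
    starts = range(0, T - window + 1, stride)
    return np.stack([pool_window(seq[s:s + window]) for s in starts])

def multilayer_temporal_pool(patch_features, num_layers=2, window=4, stride=2):
    # patch_features: (T, D) array, e.g. one CNN feature vector per patch.
    seq = np.asarray(patch_features, dtype=float)
    for _ in range(num_layers):
        seq = pool_layer(nonlinear_map(seq), window, stride)
    # Pool whatever sequence remains into a single D-dimensional vector.
    return pool_window(nonlinear_map(seq))

# Clips of different durations map to vectors of the same size.
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(10, 8))
long_clip = rng.normal(size=(37, 8))
print(multilayer_temporal_pool(short_clip).shape)
print(multilayer_temporal_pool(long_clip).shape)
```

Because no labels are involved at any step, the resulting vector can be computed for each audio sample independently and passed to any standard classifier, which is in the spirit of the unsupervised pipeline the abstract describes.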
URL
http://arxiv.org/abs/1902.10063