Towards Visually Grounded Sub-Word Speech Unit Discovery

Abstract

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.

URL

http://arxiv.org/abs/1902.08213

PDF

http://arxiv.org/pdf/1902.08213
