Abstract
In most automatic speech recognition (ASR) systems, the audio signal is processed to produce a time series of sensor measurements (e.g., filterbank outputs). This time series encodes semantic information in a speaker-dependent way. An earlier paper showed how to use the sequence of sensor measurements to derive an “inner” time series that is unaffected by any previous invertible transformation of the sensor measurements. The current paper considers two or more speakers, who mimic one another in the following sense: when they say the same words, they produce sensor states that are invertibly mapped onto one another. It follows that the inner time series of their utterances must be the same when they say the same words. In other words, the inner time series encodes their speech in a manner that is speaker-independent. Consequently, the ASR training process can be simplified by collecting and labelling the inner time series of the utterances of just one speaker, instead of training on the sensor time series of the utterances of a large variety of speakers. A similar argument suggests that the inner time series of music is instrument-independent. This is demonstrated in experiments on monophonic electronic music.
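The core claim is that a representation derived from the sensor time series can be invariant under any prior invertible transformation of the sensor measurements. The paper's actual construction is not given here; as a minimal one-dimensional illustration of the invariance principle, the sketch below uses the fact that for a scalar signal any continuous invertible map is monotonic, so a rank-based (empirical-CDF) series is unchanged by such a remapping. The names `rank_series`, `signal_a`, and `signal_b` are hypothetical and are not from the paper.

```python
import numpy as np

def rank_series(x):
    """Map a scalar time series to its empirical ranks in [0, 1].

    For a 1-D signal, any continuous invertible (hence monotonic)
    transformation leaves these ranks unchanged (up to reversal),
    so the rank series is a toy example of a representation that
    does not depend on the particular sensor calibration.
    """
    order = np.argsort(np.argsort(x))
    return order / (len(x) - 1)

# Toy "sensor" signal and an invertible re-mapping of it, standing in
# for two speakers (or instruments) whose sensor states are invertibly
# mapped onto one another when they produce the same content.
rng = np.random.default_rng(0)
signal_a = np.cumsum(rng.standard_normal(200))   # speaker A's sensor series
signal_b = np.tanh(0.3 * signal_a) + 2.0         # invertible (monotonic) remapping

inner_a = rank_series(signal_a)
inner_b = rank_series(signal_b)

print(np.allclose(inner_a, inner_b))  # True: the derived series coincide
```

This is only a sketch of the invariance idea under the stated one-dimensional, monotonic-map assumption; the paper addresses the more general multidimensional invertible case.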
URL
http://arxiv.org/abs/1905.03278