TTS Skins: Speaker Conversion via ASR

Abstract

We present a fully convolutional wav-to-wav network for converting between speakers’ voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition (ASR), and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks, and demonstrate the ability to perform multi-voice TTS in those voices, by converting the voice of a TTS robot. We observe no degradation in the quality of the generated voices, in comparison to the reference TTS voice. The modularity of our approach, which separates the target voice generation from the TTS module, enables client-side personalized TTS in a privacy-aware manner.
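The data flow described in the abstract (a speaker-independent encoder feeding a speaker-conditioned autoregressive decoder) can be illustrated with a toy sketch. This is not the paper's network: `toy_encoder`, `toy_decoder`, `convert`, the `hop` size, and the per-speaker gain table are all hypothetical stand-ins, chosen only to show how the target voice is selected at the decoder while the encoder stays fixed.

```python
import math
from typing import Dict, List, Sequence


def toy_encoder(wave: Sequence[float], hop: int = 4) -> List[float]:
    """Stand-in for the pre-trained ASR encoder (assumption, not the real model).

    Emits one feature per `hop` input samples: the mean absolute amplitude.
    """
    return [
        sum(abs(s) for s in wave[i:i + hop]) / len(wave[i:i + hop])
        for i in range(0, len(wave), hop)
    ]


def toy_decoder(
    features: Sequence[float],
    speaker_id: str,
    speaker_table: Dict[str, float],
    hop: int = 4,
) -> List[float]:
    """Stand-in for the multi-speaker autoregressive decoder (assumption).

    Each output sample depends on the previous output sample, the current
    encoder feature, and a per-speaker parameter looked up by `speaker_id`.
    """
    gain = speaker_table[speaker_id]
    out: List[float] = []
    prev = 0.0
    for feat in features:
        for _ in range(hop):  # upsample each feature back to `hop` samples
            prev = math.tanh(0.5 * prev + gain * feat)
            out.append(prev)
    return out


def convert(
    wave: Sequence[float], target_speaker: str, speaker_table: Dict[str, float]
) -> List[float]:
    """Wav-to-wav conversion: encode once, decode in the target voice."""
    return toy_decoder(toy_encoder(wave), target_speaker, speaker_table)


# Hypothetical input standing in for the output of a reference TTS voice.
tts_wave = [math.sin(0.3 * n) for n in range(16)]
speakers = {"alice": 0.5, "bob": 1.5}

as_alice = convert(tts_wave, "alice", speakers)
as_bob = convert(tts_wave, "bob", speakers)
```

Because only the decoder consumes the speaker identity, the same encoder output can be rendered in any of the trained voices, which is the modularity the abstract refers to.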

URL

http://arxiv.org/abs/1904.08983

PDF

http://arxiv.org/pdf/1904.08983
