Abstract: As in many other fields, text-to-speech (TTS) has reached a new level with the recent advancements in Deep Learning. TTS is a seq2seq problem riddled with peculiarities and specific challenges. A three-step approach is the modern solution: first, a sequence-to-sequence model aligns text and audio; second, a feed-forward network predicts spectrograms from the text input; and last, a vocoder model synthesizes the final waveform from the predicted spectrograms.
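The three steps above can be sketched as a toy pipeline. All function names, shapes, and constants here are illustrative assumptions (simple stubs stand in for the actual Tacotron/Transformer and vocoder networks), not the real implementations.

```python
# Hypothetical sketch of the three-step TTS pipeline: align -> spectrogram -> vocode.
import numpy as np

N_MELS = 80        # mel-spectrogram bins (a common choice, assumed here)
HOP_LENGTH = 256   # audio samples per spectrogram frame (assumed)

def align(text_ids):
    """Step 1: a seq2seq model learns how many spectrogram frames each
    input token spans (durations). Stub: fixed duration per token."""
    return np.full(len(text_ids), 5, dtype=int)

def predict_spectrogram(text_ids, durations):
    """Step 2: a feed-forward network expands each token according to its
    duration and maps it to mel frames. Stub: random 'embeddings'
    repeated per duration stand in for the network."""
    rng = np.random.default_rng(0)
    embeddings = rng.standard_normal((len(text_ids), N_MELS))
    return np.repeat(embeddings, durations, axis=0)  # (total_frames, N_MELS)

def vocode(mel):
    """Step 3: a vocoder turns mel frames into a waveform. Stub: emits
    HOP_LENGTH silent samples per frame."""
    return np.zeros(len(mel) * HOP_LENGTH, dtype=np.float32)

text_ids = [3, 17, 42, 7]  # token ids for some input text (illustrative)
durations = align(text_ids)
mel = predict_spectrogram(text_ids, durations)
wav = vocode(mel)
print(mel.shape, wav.shape)  # (20, 80) (5120,)
```

Because the spectrogram predictor is feed-forward rather than autoregressive, all frames can be produced in parallel once durations are known, which is what makes inference fast.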
Some of the challenges of this problem: a high-dimensional target space, a non-bijective text-to-spectrogram correspondence, a large discrepancy between input and output sequence lengths, student-teacher training to avoid autoregression, very long output sequences, slow inference, long training times (over two weeks), and the lack of an explicit evaluation metric that correlates with perceived audio quality.
During the course of this project we, together with other teams (e.g. Mozilla), tackled many of these issues, successfully trained the current state-of-the-art architectures such as Tacotron and Transformer-based models, developed their feed-forward counterparts, and released all of it as open source.
Using these models, we created the first brand voice for Axel Springer, which now enables audio content on the news website.
Bio: Christian Schäfer is driving AI research within the Axel Springer group and helps to integrate machine learning systems in production. His goal is to create products that are both smart and pretty, so that everyone will like them. Having an academic background in theoretical physics, Chris is interested in understanding complex phenomena, and his professional interests have moved from network theory via neuroscience to deep learning, all of which share some interesting similarities.