Key Points

1. A new text-to-speech (TTS) model called BASE TTS has been introduced. It is the largest TTS model to date, trained on 100K hours of public domain speech data, and uses a 1-billion-parameter autoregressive Transformer to convert raw text into discrete speechcodes, followed by a convolution-based decoder that converts those speechcodes into waveforms (see the sketch after this list).

2. The model demonstrates natural prosody on complex textual inputs and achieves state-of-the-art naturalness compared to other large-scale text-to-speech systems. Like large language models, it shows emergent abilities when trained on increasing volumes of data, and its capability to render appropriate prosody for complex texts improves with larger dataset and model sizes.

3. BASE TTS uses novel speech tokenization techniques for speaker ID disentanglement and compression with byte-pair encoding, resulting in high-quality discrete speech representations that can be decoded into waveforms with a fast and streamable decoder.

4. The model's architecture simplifies the traditional text-to-speech pipeline: by training directly on a large amount of audio data, it eliminates the need for phoneme extraction, and it demonstrates high expressivity, as illustrated by a few audio examples in a multilingual setting.

5. BASE TTS shows improved speech naturalness compared to industry baselines while being highly expressive and data efficient. It also offers a streaming capability, allowing speech to be generated incrementally with minimal latency (see the sketch after this list).

6. The model leverages semantic information from self-supervised embeddings and disentangles speaker information from the acoustic tokens, delivering strong performance in English and Spanish. It also demonstrates emergent contextual understanding of text across a wide range of styles without supervised training or annotation.

7. The study introduces a specialized dataset to measure the emergent abilities of text-to-speech models and presents linguistic expert evaluations and MUSHRA subjective evaluations to assess the model's quality and abilities.

8. BASE TTS is compared against industry baselines in terms of word error rate, speaker similarity, and synthesis efficiency, demonstrating superior naturalness and more efficient synthesis than the baselines.

9. While BASE TTS performs strongly in English and Spanish, it has limitations, such as the occasional production of hallucinations and cutoffs, that require further research to address. Additionally, because of the potential for misuse, the model is not open-sourced. The authors also acknowledge the impact of speech data composition on inclusivity in voice products and advocate for further research in this area.
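For illustration, here is a minimal sketch (in PyTorch) of the two-stage pipeline described in point 3 and the incremental decoding mentioned in point 11: an autoregressive Transformer predicts discrete speechcodes from text, and a small convolution-based decoder turns each chunk of codes into waveform samples as soon as it is available. All module names, dimensions, the start code, and the greedy decoding loop are assumptions made for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SpeechcodePredictor(nn.Module):
    """Autoregressive Transformer: text tokens -> discrete speechcodes (assumed sizes)."""

    def __init__(self, text_vocab=256, code_vocab=1024, d_model=256, n_layers=4):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.code_emb = nn.Embedding(code_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, code_vocab)

    def forward(self, text_ids, code_ids):
        # Cross-attend the speechcodes generated so far to the embedded text.
        memory = self.text_emb(text_ids)
        tgt = self.code_emb(code_ids)
        causal = torch.triu(
            torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1
        )
        return self.head(self.decoder(tgt, memory, tgt_mask=causal))


class SpeechcodeDecoder(nn.Module):
    """Convolution-based decoder: speechcodes -> waveform samples (toy upsampler)."""

    def __init__(self, code_vocab=1024, d_model=256, upsample=320):
        super().__init__()
        self.emb = nn.Embedding(code_vocab, d_model)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(d_model, 64, kernel_size=upsample, stride=upsample),
            nn.Conv1d(64, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, code_ids):
        x = self.emb(code_ids).transpose(1, 2)  # (batch, d_model, codes)
        return self.net(x).squeeze(1)           # (batch, audio samples)


def synthesize_streaming(text_ids, predictor, decoder, max_codes=50, chunk=10):
    """Greedily predict speechcodes and yield waveform chunks as soon as they exist."""
    codes = torch.zeros(1, 1, dtype=torch.long)  # start code 0 is a placeholder assumption
    for _ in range(max_codes):
        with torch.no_grad():
            logits = predictor(text_ids, codes)
        next_code = logits[:, -1].argmax(-1, keepdim=True)
        codes = torch.cat([codes, next_code], dim=1)
        if codes.size(1) % chunk == 0:
            with torch.no_grad():
                audio = decoder(codes[:, -chunk:])  # decode only the newest chunk
            yield audio


if __name__ == "__main__":
    predictor, decoder = SpeechcodePredictor(), SpeechcodeDecoder()
    text = torch.randint(0, 256, (1, 12))  # dummy text token ids
    for i, audio_chunk in enumerate(synthesize_streaming(text, predictor, decoder)):
        print(f"chunk {i}: {audio_chunk.shape[-1]} samples")
```

Decoding chunk by chunk is what makes the streaming behaviour in point 11 possible: audio playback can begin after the first chunk of speechcodes is available rather than waiting for the full utterance.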


Summary

Abstract
The paper presents BASE TTS, a text-to-speech (TTS) model trained on 100K hours of public domain speech data, making it the largest TTS model to date. The model achieves a new state of the art in speech naturalness through the use of a 1-billion-parameter autoregressive Transformer. It introduces a novel speech tokenization technique featuring speaker ID disentanglement and compression with byte-pair encoding. The paper demonstrates that BASE TTS variants built with increasing data and parameters begin to showcase natural prosody on complex sentences, exhibiting emergent abilities as the model scales.
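The byte-pair-encoding compression mentioned above can be illustrated with a small, generic sketch: frequent adjacent pairs of discrete speech tokens are merged into new symbols, so the sequence the autoregressive model must predict becomes shorter. The merge-learning loop below is a standard BPE illustration written for this summary, not the paper's implementation, and the toy token values are invented.

```python
from collections import Counter


def apply_merge(seq, pair, new_symbol):
    """Replace every occurrence of `pair` in `seq` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe_merges(sequences, num_merges):
    """Learn merge rules: repeatedly fuse the most frequent adjacent token pair."""
    merges = []
    next_symbol = max(t for seq in sequences for t in seq) + 1
    for _ in range(num_merges):
        pair_counts = Counter(
            (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
        )
        if not pair_counts:
            break
        best_pair, _ = pair_counts.most_common(1)[0]
        merges.append((best_pair, next_symbol))
        sequences = [apply_merge(seq, best_pair, next_symbol) for seq in sequences]
        next_symbol += 1
    return merges


def encode(seq, merges):
    """Compress a new speechcode sequence using previously learned merges."""
    for pair, new_symbol in merges:
        seq = apply_merge(seq, pair, new_symbol)
    return seq


if __name__ == "__main__":
    # Toy "speechcode" sequences (e.g., quantizer indices per frame).
    corpus = [[3, 7, 7, 2, 3, 7, 7, 9], [3, 7, 7, 2, 2, 3, 7]]
    merges = learn_bpe_merges(corpus, num_merges=2)
    print(encode([3, 7, 7, 2, 3, 7], merges))  # shorter sequence after merging
```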


Introduction

The paper highlights the model's performance by evaluating it against publicly available large-scale text-to-speech systems: YourTTS, Bark, and TortoiseTTS. Audio samples generated by BASE TTS can be heard via a link provided in the paper. The authors also design a specialized dataset to measure emergent abilities for text-to-speech and showcase the model's improved capability to render appropriate prosody for complex texts.


The paper emphasizes the development and training of BASE TTS: its breakthrough in speech naturalness, its use of a large-scale dataset, its novel speech tokenization technique, and the emergent abilities it demonstrates when trained with increasing data. It also discusses how the model's performance compares with other publicly available text-to-speech systems and provides an overview of the specialized dataset designed to measure emergent abilities.


Additionally, the paper identifies the application potential of BASE TTS: it could mimic speaker characteristics from just a few seconds of reference audio, enabling enhanced user experiences and support for under-resourced languages.

Furthermore, the paper mentions the cautious decision against open-sourcing the model due to potential misuse, as well as the need for further research to quantify the impact of data composition and to identify methods to combat potential biases and foster inclusivity in voice products.



Reference: https://arxiv.org/abs/2402.080...