From Text to Talk: Analyzing Open Source TTS Alternatives

All the big cloud providers, such as AWS and Azure, offer an API for synthesizing text into speech. Young startups like ElevenLabs also offer innovative solutions in this space. A third option is open source software, for those who do not want to pay for a TTS (text-to-speech) service or who need on-device TTS. Privacy concerns can also play a role here.

That is why in this article I want to provide an overview of the most important open source TTS alternatives.

Piper

Piper is a project of the Open Home Foundation, which aims to create privacy-preserving technology for the home. The voices are trained with VITS, a model based on the paper "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" by Jaehyeon Kim, Jungil Kong and Juhee Son. What is new about this approach is that it does not use a separate vocoder to produce the audio waveform; everything is packaged inside one neural network. The authors claim that this improves the quality of the synthesized speech.

The VITS project uses PyTorch for training and inference.

Piper is simple to install. Just run:

pip install piper-tts

Before running Piper, you need to download the ONNX model for your language and voice. You can find the instructions in the Piper repository (see References).
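
As a rough sketch of what a run looks like once a voice has been downloaded, the Python helper below pipes text into the piper command on standard input and writes a WAV file. The function names, the voice file name en_US-lessac-medium.onnx and the output path are assumptions for illustration. The check for a matching .onnx.json config file reflects how Piper voices are usually distributed, so verify it against the instructions for your version.

```python
import subprocess
from pathlib import Path


def missing_voice_files(model_path: str) -> list[str]:
    """Return the voice files that are not on disk yet.

    Assumption: a voice consists of the ONNX model plus a JSON config
    stored next to it under the same name with ".json" appended.
    """
    model = Path(model_path)
    config = Path(model_path + ".json")
    return [str(p) for p in (model, config) if not p.is_file()]


def build_piper_command(model_path: str, output_file: str) -> list[str]:
    """Build the argument list for the piper command-line tool."""
    return ["piper", "--model", model_path, "--output_file", output_file]


def synthesize_with_piper(text: str, model_path: str, output_file: str) -> Path:
    """Speak `text` with the given voice and return the WAV file path."""
    missing = missing_voice_files(model_path)
    if missing:
        raise FileNotFoundError(f"Download these voice files first: {missing}")
    # piper reads the text to speak from standard input.
    subprocess.run(
        build_piper_command(model_path, output_file),
        input=text.encode("utf-8"),
        check=True,
    )
    return Path(output_file)
```

With piper-tts installed and the voice files in place, a call such as synthesize_with_piper("Hello from Piper.", "en_US-lessac-medium.onnx", "hello.wav") should leave hello.wav in the current directory.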

Piper supports a number of languages. I tried some of them, and the voices sound natural and are pleasant to listen to.

Coqui

Coqui was a project of the now-defunct startup of the same name. Since the code is open source, it is still accessible and free to use. How the project evolves from here depends on community contributions.

It is also easy to install using Python:

pip install TTS

You can list all the supported language and voice combinations with the following command:

tts --list_models

Then, to generate an audio file from text, select the appropriate model from the output of the command above.

For example:

tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav

If the specified model is not available locally, Coqui downloads it automatically and then generates the output WAV file.
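
The same steps can also be scripted from Python. The sketch below is a minimal example, assuming the TTS package from the install step above and its TTS.api module. The helper names are hypothetical, and tts_models/en/ljspeech/vits is just one example of a model name in the four-part format shown above.

```python
def coqui_model_name(model_type: str, language: str, dataset: str, model: str) -> str:
    """Join the four parts into "<model_type>/<language>/<dataset>/<model_name>"."""
    parts = [model_type, language, dataset, model]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"Invalid model name part: {part!r}")
    return "/".join(parts)


def synthesize_with_coqui(text: str, model_name: str, out_path: str) -> str:
    """Generate a WAV file with Coqui and return its path."""
    # Imported here so the helper above works without Coqui installed.
    from TTS.api import TTS

    # As with the command line, a model that is missing locally is downloaded first.
    tts = TTS(model_name=model_name)
    tts.tts_to_file(text=text, file_path=out_path)
    return out_path


# Example call (needs the TTS package and, on first use, a network connection):
# name = coqui_model_name("tts_models", "en", "ljspeech", "vits")
# synthesize_with_coqui("Text for TTS", name, "speech.wav")
```

Loading the model once and calling tts_to_file repeatedly avoids paying the model start-up cost for every sentence, which the one-shot command line cannot do.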

I tried Coqui with several languages. The quality was okay overall and natural sounding in some languages. For best results, use the VITS models.

Whisper Speech

Whisper Speech is an open source project that builds on Whisper, OpenAI's open source speech-to-text model. It currently supports only a limited number of languages and voices. It is not really ready for use yet, but the samples on its GitHub page are encouraging, even though they do not sound 100% natural.

The future will show how this project evolves, but the ideas behind it are promising.

Mimic 3

Mimic 3 is a neural TTS engine that Mycroft developed for its voice assistant. Its feature set is quite impressive and includes SSML support and an interactive mode.
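
To give an idea of what SSML input looks like, here is a small sketch that builds an SSML document with Python's standard library. It only uses the generic speak, s and break elements from the W3C SSML specification. Which elements and attributes Mimic 3 actually honors is not covered here, so treat the markup as an assumption to check against its documentation. The function name build_ssml is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentences: list[str], pause_ms: int = 500) -> str:
    """Wrap sentences in <s> elements, separated by <break> pauses."""
    speak = ET.Element("speak")
    for index, sentence in enumerate(sentences):
        if index > 0:
            # Insert a pause between consecutive sentences.
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
        s = ET.SubElement(speak, "s")
        s.text = sentence  # ElementTree escapes characters such as & and <
    return ET.tostring(speak, encoding="unicode")


print(build_ssml(["Hello.", "How are you?"]))
```

Building the markup with an XML library rather than string concatenation keeps the document well-formed even when the text contains characters that have a special meaning in XML.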

Mimic 3 offers a number of voices of reasonable quality. However, the project is no longer maintained: at the time of writing, the last source commits were three years old. The license is AGPL, which might limit the usefulness of Mimic 3 further.

Espeak-ng

The speech synthesizer espeak-ng is the open source project with the longest history on this list. It supports more than 100 languages. It uses formant synthesis rather than a neural network and therefore needs few system resources. It is also easy to install; on Linux, for example, you can install it with your distribution's package manager.
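
Once espeak-ng is installed, it can be driven from a script. The following sketch wraps the command line in Python, assuming the espeak-ng binary is on the PATH. The function names and the default voice and file name are hypothetical, while -v (voice) and -w (write a WAV file) are espeak-ng options.

```python
import subprocess


def build_espeak_command(text: str, voice: str, wav_path: str) -> list[str]:
    """Build the argument list for espeak-ng."""
    # -v selects the voice/language, -w writes a WAV file instead of playing audio.
    return ["espeak-ng", "-v", voice, "-w", wav_path, text]


def synthesize_with_espeak(text: str, voice: str = "en", wav_path: str = "speech.wav") -> str:
    """Run espeak-ng and return the path of the WAV file it wrote."""
    subprocess.run(build_espeak_command(text, voice, wav_path), check=True)
    return wav_path


# Example call (needs espeak-ng installed):
# synthesize_with_espeak("Hello from espeak-ng.", voice="en", wav_path="hello.wav")
```

Passing the arguments as a list instead of a shell string means the text does not need any quoting or escaping.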

Unfortunately, it produces very robotic-sounding speech, although the speech is still intelligible. It is therefore a fit for use cases such as certain embedded systems where natural-sounding speech is not that important. Where support for less mainstream languages is needed, it might also be the only open source option available.

Conclusion

Of the open source text-to-speech applications covered here, my favorite is clearly Piper. It has the most natural-sounding speech. Whisper Speech has potential, but only time will tell how it evolves.

With Coqui's main backer out of business, its future is uncertain and depends on community support. The same is true of Mimic 3. Espeak-ng clearly sounds robotic, but thanks to its broad language support and low resource needs it can be a good fit for certain niche applications.

I hope you have learned something from this article, and I welcome any feedback. Just click on the contact button below.

References

Piper: https://github.com/rhasspy/piper
Coqui: Coqui
Mimic 3: Mimic 3
Whisper Speech: Whisper Speech
Espeak-ng: espeak-ng

Cover image by BroneArtUlm from Pixabay

Published
21 Oct 2024

This work is licensed under a Creative Commons Attribution 4.0 International License.

Written by Thomas Derflinger

I’m a software entrepreneur who builds innovative solutions. On this blog, I share practical guides and insights on web development, IoT and AI.