From Text to Talk: Analyzing Open Source TTS Alternatives
All the big cloud providers, such as AWS and Azure, offer an API for synthesizing text into speech. Young startups like ElevenLabs also offer innovative solutions in this space. A third option is open source software, for those who do not want to pay for a TTS (text-to-speech) service or who need on-device TTS. Privacy concerns can also play a role here.
That is why in this article I want to provide an overview of the most important open source TTS alternatives.
Piper
Piper is a project of the Open Home Foundation, which aims to create privacy-preserving technology for the home. Its voices are trained using a project called VITS, which is based on the paper "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" by Jaehyeon Kim, Jungil Kong and Juhee Son. What is new about this approach is that it does not use a separate vocoder to produce the audio waveform; everything is packaged inside a single neural network. The authors claim that this improves the quality of the synthesized speech.
The VITS project uses PyTorch for training and inference.
Piper is simple to install. Just run:
pip install piper-tts
Before running it, you need to download the ONNX model for your language and voice. The instructions are in the Piper repository listed in the references.
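Once a voice model is downloaded, Piper can be driven from a script. The following is a minimal sketch, assuming the `piper` executable from the pip package is on your PATH and reads the text from standard input. The helper names and the file names in the usage note are made up for illustration; only the `--model` and `--output_file` flags come from Piper itself.

```python
import shutil
import subprocess
from pathlib import Path


def build_piper_command(model_path: str, output_path: str) -> list[str]:
    """Build the argument list for the piper command-line tool."""
    return ["piper", "--model", model_path, "--output_file", output_path]


def synthesize_with_piper(text: str, model_path: str, output_path: str) -> Path:
    """Send text to piper on stdin and return the path of the WAV file."""
    # Fail early with a readable message instead of a cryptic subprocess error.
    if not Path(model_path).is_file():
        raise FileNotFoundError(f"voice model not found: {model_path}")
    if shutil.which("piper") is None:
        raise RuntimeError("piper not found; run 'pip install piper-tts' first")
    subprocess.run(
        build_piper_command(model_path, output_path),
        input=text.encode("utf-8"),  # piper reads the text to speak from stdin
        check=True,                  # raise CalledProcessError if piper fails
    )
    return Path(output_path)
```

A call could look like `synthesize_with_piper("Hello from Piper", "my-voice.onnx", "hello.wav")`, where `my-voice.onnx` stands for whichever model file you downloaded.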
Piper supports a number of languages. I tried some of them, and they sound natural and are pleasant to listen to.
Coqui
Coqui was a project of the now-defunct startup of the same name. Because the code is open source, it is still accessible and free to use. How the project evolves in the future, however, depends on community contributions.
It is also easy to install using Python:
pip install TTS
You can list all the supported language and voice combinations with the following command:
tts --list_models
Then, to generate an audio file from text, select the appropriate model from the output of the command above.
For example:
tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
If the specified model is not locally available, it will download the model automatically and then generate the output WAV file.
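The four-part model name is easy to mistype, so it can help to validate it before calling the tool. Here is a small sketch that splits a name into the parts shown in the placeholder above and builds the same `tts` command. The helper names are my own, and `tts_models/en/ljspeech/vits` in the usage note is only an example of what the model list may contain.

```python
import subprocess
from typing import NamedTuple


class ModelId(NamedTuple):
    """The four parts of a Coqui model name."""
    model_type: str
    language: str
    dataset: str
    model_name: str


def parse_model_id(name: str) -> ModelId:
    """Split '<model_type>/<language>/<dataset>/<model_name>' into its parts."""
    parts = name.strip().split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected four '/'-separated parts, got: {name!r}")
    return ModelId(*parts)


def build_tts_command(text: str, model_name: str, out_path: str) -> list[str]:
    """Build the argument list for the tts command shown in the article."""
    parse_model_id(model_name)  # raises ValueError on a malformed name
    return ["tts", "--text", text, "--model_name", model_name, "--out_path", out_path]


def run_tts(text: str, model_name: str, out_path: str) -> None:
    """Run the tts command; it downloads the model first if it is missing."""
    subprocess.run(build_tts_command(text, model_name, out_path), check=True)
```

For example, `run_tts("Text for TTS", "tts_models/en/ljspeech/vits", "speech.wav")` would write `speech.wav`, provided the package is installed and that model is available.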
I tried Coqui with several languages. The quality was okay overall and, in some languages, natural sounding. For best results, use the VITS models.
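To find the VITS models quickly, you can filter the output of the model listing. The sketch below makes no assumption about the exact layout of that output: it scans any text for four-part model names and keeps the TTS models whose last part is `vits`. The function name and the sample text are hypothetical and only serve as illustration.

```python
def find_vits_models(listing: str) -> list[str]:
    """Return the four-part TTS model names in `listing` that end in 'vits'."""
    found = []
    for token in listing.split():
        parts = token.split("/")
        # Keep only well-formed names: exactly four non-empty parts.
        if len(parts) != 4 or not all(parts):
            continue
        model_type, _language, _dataset, model_name = parts
        if model_type == "tts_models" and model_name == "vits":
            found.append(token)
    return found


# Hypothetical sample; real output comes from `tts --list_models`.
sample = """
 1: tts_models/en/ljspeech/glow-tts
 2: tts_models/en/ljspeech/vits
 3: vocoder_models/en/ljspeech/hifigan_v2
"""
print(find_vits_models(sample))
```

In practice you would capture the output of the listing command, for example with `subprocess.run(..., capture_output=True, text=True)`, and pass its `stdout` to the function.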
Whisper Speech
Whisper Speech is an open source project that builds on Whisper, the open source speech-to-text model by OpenAI. It currently supports only a limited number of languages and voices. It is not really ready for use yet, but the samples offered on its GitHub page are encouraging, even if they do not sound 100% natural.
The future will show how this project evolves, but the ideas behind it are promising.
Espeak-ng
The speech synthesizer espeak-ng is the open source project with the longest history on this list. It supports more than 100 languages. It uses formant synthesis rather than a neural network, and therefore needs few system resources. It is easy to install; on Linux, for example, you can install it with a package manager.
Unfortunately, it produces very robotic-sounding speech, although the result is still intelligible. Its use case is therefore, for example, embedded systems where natural-sounding speech is not that important. Where support for less mainstream languages is needed, it might also be the only open source option available.
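For such a setup, espeak-ng can also be called from a script. This is a minimal sketch, assuming the `espeak-ng` executable is installed and using its `-v` (voice) and `-w` (write WAV file) options. The helper names are my own, and `en` in the usage note is just an example voice.

```python
import shutil
import subprocess


def build_espeak_command(text: str, voice: str, wav_path: str) -> list[str]:
    """Build the argument list that makes espeak-ng write speech to a WAV file."""
    return ["espeak-ng", "-v", voice, "-w", wav_path, text]


def speak_to_wav(text: str, voice: str, wav_path: str) -> None:
    """Run espeak-ng and raise an error if it is missing or fails."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng not found; install it with your package manager")
    subprocess.run(build_espeak_command(text, voice, wav_path), check=True)
```

A call such as `speak_to_wav("Hello world", "en", "hello.wav")` would write the audio to `hello.wav`. Running `espeak-ng --voices` on the command line lists the voices available on your system.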
Conclusion
Of the four open source text-to-speech applications I tested, my favorite is clearly Piper. It has the most natural-sounding speech. Whisper Speech has potential, but only time will tell how it evolves. With the main backer of Coqui out of business, its future is uncertain and depends on community support. Espeak-ng clearly sounds robotic, but thanks to its broad language support and low resource needs it can be a good fit for certain niche applications.
I hope you have learned something from this article, and I welcome any feedback. Just click on the contact button below.
References
- Piper: https://github.com/rhasspy/piper
- Coqui: Coqui
- Whisper Speech: Whisper Speech
- Espeak-ng: espeak-ng
Cover image by BroneArtUlm from Pixabay
Published
21 Oct 2024