TTS on Linux (espeak and flite)

A small exploration of text-to-speech on Linux.


Intro

Speech synthesis is difficult. If you're on the lookout for text-to-speech software, you don't have many options, at least not for a few more years, until computers get more powerful and machine learning makes a few more advances. Meanwhile you have to resort to old TTS software, or give up on yourself and use an online service.

Here we'll explore two options: espeak and Festival / Flite.

Browse the related snippets on the dedicated lab page.


Option #1: espeak

Usage

You can provide the text to read as stdin, as a file with -f, or as arguments. It's very useful to make espeak output to stdout with the --stdout option.

espeak [option]... ["<words>"]
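Here is a minimal sketch of the three input methods plus --stdout. The file names hello.txt and hello.wav are made up for illustration, and the espeak calls are skipped if espeak is not installed.

```shell
#!/bin/sh
# hello.txt and hello.wav are hypothetical file names used only for this demo.
printf 'text read from a file\n' > hello.txt

if command -v espeak >/dev/null 2>&1; then
    # 1. text as an argument
    espeak "text passed as an argument"
    # 2. text on stdin
    echo "text passed on stdin" | espeak
    # 3. text from a file
    espeak -f hello.txt
    # --stdout writes the audio to stdout instead of playing it,
    # so it can be redirected to a file or piped into another tool
    espeak --stdout "text captured to a file" > hello.wav
else
    echo "espeak is not installed, skipping the demo" >&2
fi
```

Redirecting --stdout to a file is handy when you want to post-process the audio instead of hearing it right away.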

Options

Partial list of espeak options.

option              description                                       note
-f file             file to read
-w file             write to a file such as out.wav instead of playing
-v string           change the used voice
--voices            list all voices
--voices=lang
-a number           amplitude, default=100                            0-200
-g number           word gap
-k number           1 - indicate capital letters with sound
                    2 - with the word "capitals"
-p number           pitch adjustment, default=50                      0-99
-s number           speed in words per minute, default=175            80-450
--punct             speak the names of punctuation characters
--punct="charset"
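To see how a few of the options above combine, here is a small sketch. The helper names clamp and say_tuned are my own invention, not part of espeak. The ranges are the ones from the notes in the list above.

```shell
#!/bin/sh
# clamp VALUE MIN MAX: print VALUE limited to the range MIN..MAX (integers only)
clamp() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    elif [ "$1" -gt "$3" ]; then
        echo "$3"
    else
        echo "$1"
    fi
}

# say_tuned SPEED PITCH AMPLITUDE TEXT (hypothetical helper)
# Keeps each value inside the range listed for -s, -p and -a before calling espeak.
say_tuned() {
    speed=$(clamp "$1" 80 450)
    pitch=$(clamp "$2" 0 99)
    amplitude=$(clamp "$3" 0 200)
    espeak -s "$speed" -p "$pitch" -a "$amplitude" "$4"
}

if command -v espeak >/dev/null 2>&1; then
    # slower, lower and a bit louder than the defaults (175, 50, 100)
    say_tuned 120 30 150 "you need to leave in 20 minutes"
fi
```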

Let's try it out

espeak "hello world"

This will greet us with a very robotic "Hello world". There's some kind of flanger effect going on. The result would be good enough if we were to build a Jarvis assistant in the 1980s.

Try making it say other things, and you'll notice that sometimes it's hard to understand what it's saying.

Explore other voices

You can install additional voices to use with espeak, for instance MBROLA voices.

Let's check out the list of voices available. To do that, run espeak --voices=en.

[Screenshot: listing espeak voices]

Let espeak know which voice to use with the -v option. Pass any of the voices that you discovered earlier (use the values from the VoiceName or the File column). To better test this, we'll switch from Hello world to something more useful.

espeak -v english-mb-en1 "you need to leave in 20 minutes"

Now that's interesting, it would make a fine voice for a Doom Jarvis. Still robotic, but deeper and at a slower pace. This particular voice is an MBROLA voice. Explore the other voices on your own.
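If you'd rather not type each voice name by hand, a loop can audition them one after another. This is only a sketch: voice_names is a hypothetical helper, and it assumes the --voices listing starts with one header line and has VoiceName as its fourth whitespace-separated column. Check that against your own output first.

```shell
#!/bin/sh
# voice_names: read an "espeak --voices" listing on stdin and print the
# VoiceName column. Assumes one header line and VoiceName in column 4.
voice_names() {
    awk 'NR > 1 { print $4 }'
}

if command -v espeak >/dev/null 2>&1; then
    espeak --voices=en | voice_names | while read -r voice; do
        echo "voice: $voice"
        # stdin is redirected so espeak cannot swallow the rest of the list
        espeak -v "$voice" "you need to leave in 20 minutes" < /dev/null
    done
fi
```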

You'll notice that all the voices (not pronunciations) sound kind of the same. The only good use case I've found for espeak is simulating Stephen Hawking's voice.


Option #2: Festival / Flite

Festival and Flite are the work of the CMU Speech Group. Flite is an alternative engine to Festival: it works with Festival voices, but is lighter and has a better interface. We'll focus on flite.

To get a deeper understanding of how Festival and the FestVox suite work, read the document.

Usage

flite gets the text to read from stdin, a file, or as arguments, and its output can be played or written to a wave file.

flite TEXT/FILE [output_file]

If [output_file] is omitted or set to "play", flite will just play the output.
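A quick sketch of the positional form. The file names note.txt and hello.wav are made up. As far as I know, flite treats a first argument that contains a space as text and anything else as a file name, so treat that detail as an assumption and test it on your version.

```shell
#!/bin/sh
# note.txt and hello.wav are hypothetical file names used only for this demo.
printf 'text read from a file\n' > note.txt

if command -v flite >/dev/null 2>&1; then
    # argument with a space: assumed to be treated as text, and played
    flite "hello world"
    # argument without a space: assumed to be treated as a file name
    flite note.txt
    # second argument: write a wave file instead of playing
    flite "hello world" hello.wav
    # no text argument: read from stdin
    echo "text passed on stdin" | flite
else
    echo "flite is not installed, skipping the demo" >&2
fi
```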

Options

Partial list of flite options.

option              description                         note
-o string           explicitly set the output file
-f string           explicitly set the input file
-t string           explicitly set the input text
-voice string       set the used voice                  voice name from -lv
                                                        or a voice file
-lv                 list available voices
Let's try it out

flite "hello world"

FestVox voices sound more human, so we're on the right track to find a more modern voice for our imaginary friend, Jarvis.

Explore other voices

Just like with espeak, you can get more voices for flite. You can grab these ones for testing.

You can pass the voice files directly to flite using the -voice option.

echo "today we will learn how to code" \
| flite -voice ./cmu_us_axb.flitevox

Oh, the shivers. It sounds like every video course I don't take because of the Indian accent (sorry guys, I'm not a native English speaker either, but your accent is like acid for my ears).
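To compare several downloaded voice files in one go, something like the following sketch should do. list_voice_files and VOICE_DIR are hypothetical names, and the script assumes the .flitevox files sit in a single directory (the current one by default).

```shell
#!/bin/sh
# list_voice_files DIR: print every .flitevox file found directly in DIR
list_voice_files() {
    for f in "$1"/*.flitevox; do
        # when nothing matches, the glob stays literal, so check it exists
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

if command -v flite >/dev/null 2>&1; then
    list_voice_files "${VOICE_DIR:-.}" | while read -r voice; do
        echo "voice: $voice"
        # -t passes the text explicitly; stdin is redirected so flite
        # cannot consume the remaining file names
        flite -voice "$voice" -t "today we will learn how to code" < /dev/null
    done
fi
```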


Audio manipulation with SoX

SoX - the Swiss Army knife of audio manipulation - is a tool we can use to make some small adjustments to the audio that comes out of espeak and flite.

We'll use the play command that SoX provides for this.

echo "very interestingly" \
| flite -voice ./cmu_indic_aup_mr.flitevox -o tmp.wav -pw
play - pitch -50 speed 1.1 treble 10 < tmp.wav

Yes, this is a poor attempt at synthesizing Bisqwit's voice. You'd think that it would be easier to synthesize an already synthesized voice, but this is the best I could do.
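The same effects chain can be used without a temporary file, or saved to disk with the sox command instead of play. This is a sketch under a few assumptions: it uses espeak's --stdout from earlier rather than flite, EFFECTS is just a variable name I picked, and tuned.wav is a made-up output name.

```shell
#!/bin/sh
# The effects from the example above, kept in one place (hypothetical variable).
EFFECTS="pitch -50 speed 1.1 treble 10"

# Pipe espeak straight into play; "-t wav -" tells play to read WAV data from stdin.
# $EFFECTS is left unquoted on purpose so it splits into separate arguments.
if command -v espeak >/dev/null 2>&1 && command -v play >/dev/null 2>&1; then
    espeak --stdout "very interestingly" | play -t wav - $EFFECTS
fi

# Apply the same effects to an existing file and write the result to disk.
if command -v sox >/dev/null 2>&1 && [ -f tmp.wav ]; then
    sox tmp.wav tuned.wav $EFFECTS
fi
```

Keeping the effects in a variable makes it easy to tweak one number and listen again.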


That's about it

Playing with TTS is always fun. Go write your personal assistant, put it on an embedded computer, and let it speak to your brain through some unicorn-powered tech (like this one - recently expired, and before you ask - NO, DO NOT CHEAT ON YOUR EXAMS, YOU HUFFLEPUFF).