TTS algorithms (Re: Comments on the Text to Speech "algorithm")
speechd at knopper.net
Sun Feb 28 14:04:48 CET 2010
On Sun, Feb 28, 2010 at 11:30:56AM +0100, marc wrote:
> Klaus Knopper wrote:
> >Maybe I'm missing something, but as far as I understood the question,
> >marc just asked about a common procedure called "unit selection" which
> >is the algorithm of many text-to-speech synthesizers.
> Well, I know, in TTS they use some kind of FFT (Fourier Transform)
> etc that mathematically construct a soundwave.
But my point was that "Text-To-Speech" can be done with different
algorithms, not only the mathematical approach you mentioned.
-> Wikipedia. ;-)
(But you have probably already read this.)
> My guess is that mp3/ogg/wav/whatever would sound better.
You MEAN -> prerecorded voices sound better (or "more natural") than
purely generated ones.
I agree that this is usually the case, no doubt: a "real voice" sounds
more "real" than a synthetic one (though various movies tell a different
story). As said before, there are different methods in TTS; your
proposal is already being done in the "unit selection" approach.
> >Back to unit selection: Because of time-critical issues, selection and
> >processing of real recordings requires a lot of IO throughput, so you
> >will need a very fast harddisk (maybe raid) or database, possibly cached
> >in RAM, or just accept the output data to be generated "offline" with
> >playback a few seconds or even minutes after the original text was sent,
> >output being in form of a WAV, Ogg or also the aforementioned MP3 if you
> >don't mind using a patented format with its problematic legal issues.
> Well, ever heard of a tree? If a word starts with:
> all words with a in a/
> all words with b in b/
> all words with aa in a/a/
> all words with ac in a/c/
> all words in aaa in /a/a/a
> all words in prononciation in /p/r/o/n/u/n/c/i/a/t/i/o/n/.words
> I suppose you have a fast access here, and after a while, you have a
> lot of these sound files in memory.
Sure, this is what a database does: giving you quick access to the data.
We would use a DB format, or maybe a B+tree-based filesystem with
hardlinks for the real data. Still, searching and merging (you just
can't simply "glue together" recordings; they always need CPU power for
processing and merging with the correct speech melody) requires
resources, more than most synthesized voices. The smaller the "real"
parts are, the better the "real time feeling". Many blind people set
the speed of the spoken words to incredibly high rates to optimize
information transfer with good hearing.
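The per-letter directory tree sketched above can be expressed as a small
path-mapping function. This is only a toy sketch of that layout; the root
directory name, the file extension and the nesting depth are invented here:

```python
import os

def word_path(word, root="voices", depth=4):
    """Map a word to a nested per-letter directory, following the
    tree layout described above: the first `depth` letters become
    directories, the full word names the recording file.
    (Hypothetical layout; names are made up for illustration.)"""
    prefix = word[:depth]
    return os.path.join(root, *prefix, word + ".wav")

# e.g. word_path("pronunciation") -> voices/p/r/o/n/pronunciation.wav
#      word_path("aaa")           -> voices/a/a/a/aaa.wav
```

A filesystem used this way is effectively a hand-rolled prefix index;
a database or B+tree filesystem does the same lookup with less fragility.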
The example you gave is not ideal, because you would never glue
together "letters": they simply don't match the "sound" you
associate with them when reading; how they are pronounced depends on
their surrounding context. Rather, combinations of sounds are
stored in the phoneme-based synthesis methods, which provide the
smallest recordings at still good quality. For phoneme-based recordings,
you need about 3000 recorded sounds (for German or English), which gives
databases of about 10 to 100 MB.
If you store entire words or sentences (combinations of words), you get
a huge database of hundreds of MB to several GB. Naturally, realtime
processing of this data places higher demands on the hardware. You
can again use databases, or B+tree-based filesystems, to access the data.
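To illustrate why recordings cannot simply be "glued together", here is a
minimal crossfade over plain sample lists. This is only a toy sketch: a
real unit-selection engine must also match pitch, duration and speech
melody at each joint, not just smooth the amplitudes as done here:

```python
def crossfade_concat(segments, fade=100):
    """Concatenate recorded segments (lists of samples), linearly
    crossfading `fade` samples at each joint so the units do not
    click where they meet. Toy sketch only; assumes every segment
    is longer than `fade` samples."""
    out = list(segments[0])
    for seg in segments[1:]:
        tail, head = out[-fade:], seg[:fade]
        # Ramp the old segment down while ramping the new one up.
        mixed = [t * (1 - i / fade) + h * (i / fade)
                 for i, (t, h) in enumerate(zip(tail, head))]
        out = out[:-fade] + mixed + list(seg[fade:])
    return out
```

Even this trivial smoothing costs CPU time per joint, which is the point
made above: the merging, not the lookup, is what needs the resources.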
> >"Mary" from the DFKI uses unit selection, it is open source and written
> >in JAVA, but there is no plugin for speechd yet. It may be possible to
> >use it as commandline-based external program for speechd still.
> I will take a look at it.
> My problem is that my own language (Dutch) does not sound very good
> in synthesised speech ..
Which TTS did you use?
- espeak uses the "pure synthetic" (formant) method.
- mbrola (proprietary, non-commercial) uses phonemes (see above).
- festival can use different methods (phoneme, half-syllable, unit
  selection) but is a little hard to operate if you don't know Lisp.
speechd itself, as mentioned before, does not do any speech synthesis on
its own. You probably used espeak as default when trying speechd.
> And I know some GPSses use ogg/mp3 words (eg the Mappy.fr uses Ogg).
Most navigation systems and telephone navigators do NOT use
Text-To-Speech; they have prerecorded sentences that are simply
played whenever they fit. If your navigation system can correctly
pronounce arbitrary addresses and street names, it MAY have an
additional TTS engine.
Btw, some Linux-based navigation systems from TomTom use concatenated
Ogg files with about 50 entries of prerecorded texts. This is of course
easy to process for the quite slow hardware, since no postprocessing is
needed.
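The prerecorded-phrase approach amounts to little more than a lookup
table of clips played back-to-back. A minimal sketch, with invented clip
names (a real device ships a small fixed set, like the ~50 entries
mentioned above):

```python
# Hypothetical mapping from phrase tokens to shipped audio clips.
CLIPS = {
    "in": "in.ogg",
    "200": "200.ogg",
    "meters": "meters.ogg",
    "turn": "turn.ogg",
    "left": "left.ogg",
}

def announce(words):
    """Return the playlist of prerecorded files for a phrase.
    Tokens with no shipped clip are silently dropped, which is
    exactly why such systems cannot speak arbitrary street names
    without an additional TTS engine."""
    return [CLIPS[w] for w in words if w in CLIPS]
```

No signal processing happens at all; the player just queues the files,
which is why even very slow hardware copes easily.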
> In Belgium, blind/handicapped people get 90% subsidies for ability
> technologies (speech etc), but would an European funding be an idea?
It would be great to have a European standard database of
free or Creative-Commons-licensed unit selection or at least
diphone/half-syllable voices, rather than proprietary non-commercial
ones like those of MBROLA. Unfortunately most projects tend to keep a
"closed-source" copyright and only allow non-commercial or educational
use of their data, which is not compatible with true open source
projects or engines. So much work is done twice or more, just because
licenses don't allow everyone to use - even publicly funded - work
freely. It's sad.