There are three approaches to producing digitized speech:
Approach [*] is space intensive. It works only in a limited number of cases, but it has the advantage of producing the most natural-sounding speech within a restricted domain.
Approach [*] is more widely applicable and provides an unlimited vocabulary. It is memory intensive, since the diphones (numbering about [tex2html_wrap5910] for English) need to be accessed frequently, but it is not compute intensive. Quality varies widely, from barely intelligible to human-intelligible. This approach has been applied commercially by Apple in the form of MacinTalk-[tex2html_wrap5912] and MacinTalk-[tex2html_wrap5914]. MacinTalk-[tex2html_wrap5916], also known as GalaTea, is fairly memory intensive, but its quality is among the best that has been achieved with this method of synthesis. The principal drawback of diphone synthesis is that the underlying model is fairly restrictive. Though systems like GalaTea achieve a fair amount of intonation, the intonational structure they generate still leaves much to be desired. The model also allows only minimal variations in voice, e.g., in pitch and speech rate, and changing these voice parameters produces a significant deterioration in output quality.
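To make the unit of concatenation concrete, the following is a minimal sketch of how a phoneme sequence maps onto the diphones that would have to be fetched from the stored inventory. It is an illustration only, not the method of any system named above: the function name `to_diphones`, the `_` silence marker, and the phoneme symbols are all assumptions made for this example.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone units needed to speak a phoneme sequence.

    A diphone spans the transition from one phone to the next, so an
    utterance of n phonemes, padded with silence on both sides, needs
    n + 1 units. The "_" silence marker is a hypothetical convention.
    """
    padded = [silence] + list(phonemes) + [silence]
    # Pair every phone with its successor.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Hypothetical phoneme symbols for a short word.
units = to_diphones(["h", "e", "l", "o"])
print(units)  # ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

Every unit in the result must be looked up in the recorded inventory, which is one way to see why the approach leans on memory rather than computation: the work per unit is a fetch and a splice, but the inventory has to cover every phone-to-phone transition that can occur.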
Approach [*], which models the human vocal tract, is compute
intensive but not memory intensive. It is also the most flexible
approach to speech synthesis. Because it is based on a
mathematical model of the human vocal tract, it permits a large
number of variations in voice quality (see
[Her91][Her90][Her89][Kla87] for details). Moreover, it allows us
to apply to the voice the same kinds of transformations, such as
scaling, that we apply to fonts in the visual setting. This is
particularly useful in conveying complex information, and we
exploit it in our own work on presenting spoken mathematics.
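
The flexibility described above can be illustrated with a small source-filter sketch: a pulse train at the desired pitch is passed through a cascade of two-pole resonators, one per formant. This is a simplified illustration under stated assumptions, not the synthesizer used in our work or the model of the cited references. The function names, the `scale` parameter, and the formant values in `AH` are hypothetical and chosen only for the example.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s, sample_rate):
    """Crude voicing source: one unit pulse per pitch period."""
    period = max(1, round(sample_rate / pitch_hz))
    n = round(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(pitch_hz, formants, duration_s=0.3,
                     sample_rate=16000, scale=1.0):
    """Filter the source through one resonator per (frequency, bandwidth).

    `scale` multiplies every formant frequency; it is a hypothetical
    stand-in for the kind of voice scaling discussed in the text.
    """
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz * scale, bandwidth_hz, sample_rate)
    return signal


# Illustrative formant (frequency, bandwidth) pairs in Hz, not from the source.
AH = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]

low_voice = synthesize_vowel(100.0, AH)               # lower pitch
high_voice = synthesize_vowel(200.0, AH, scale=1.15)  # higher pitch, shifted formants
```

Note that the two voices differ only in their numeric parameters; nothing is re-recorded. The cost is paid in arithmetic instead, since every output sample passes through every resonator, which matches the compute-intensive but not memory-intensive profile described above.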