6.2 Summary of Work in Audio Interfaces

This section presents a brief summary of research projects related to incorporating audio as an additional dimension of computer interfaces.

Speech Synthesis

There are three approaches to producing synthesized speech:

  1. Concatenative: Concatenate digitized utterances produced by a human to make up canned messages.
  2. Diphone: Use a library of diphones obtained by sampling a large number of utterances spoken by a human.
  3. Formant: Model the human vocal tract by using a series of cascading filters to produce the right waveforms and hence intelligible speech.

Approach 1 is space intensive. It works only in a limited number of cases, but it has the advantage of producing the most natural-sounding speech within a restricted domain.

Approach 2 is more widely applicable and provides an unlimited vocabulary. It is memory intensive, since the diphones (numbering about 3,000 for English) need to be accessed frequently. The approach is not compute intensive. Quality varies widely, from barely intelligible to as intelligible as human speech. This approach has been applied commercially by Apple in the form of MacinTalk-1 and MacinTalk-2. MacinTalk-2, also known as GalaTea, is fairly memory intensive, but its quality is among the best that has been achieved with this method of synthesis. The principal drawback of diphone synthesis is that the underlying model is fairly restrictive. Though systems like GalaTea achieve a fair amount of intonation, the intonational structure generated still leaves much to be desired. The model also allows only minimal variations in voice, e.g., in pitch and speech rate. Changing voice parameters produces a significant deterioration in the output.
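
To make the diphone approach concrete, the following sketch (in Python, using a toy five-entry table of placeholder waveforms in place of a real diphone library of roughly 3,000 entries) shows the core operation: a phoneme string is split into overlapping diphones, each diphone's stored waveform is looked up, and the samples are concatenated.

  import numpy as np

  # Toy diphone library: a real system holds roughly 3,000 entries for
  # English, each a short digitized recording; here they are placeholders.
  SAMPLE_RATE = 16000
  diphone_library = {
      "_h": np.zeros(800), "he": np.zeros(900),
      "el": np.zeros(850), "lo": np.zeros(900), "o_": np.zeros(700),
  }

  def to_diphones(phonemes):
      """Turn a phoneme string (with '_' marking silence) into diphone keys."""
      padded = "_" + phonemes + "_"
      return [padded[i:i + 2] for i in range(len(padded) - 1)]

  def synthesize(phonemes):
      """Concatenate the stored waveform for each diphone in sequence."""
      pieces = [diphone_library[d] for d in to_diphones(phonemes)]
      return np.concatenate(pieces)

  audio = synthesize("helo")   # crude phonemic spelling of "hello"
  print(len(audio) / SAMPLE_RATE, "seconds of audio")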

Approach 3, which models the human vocal tract, is compute intensive but not memory intensive. It is also the most flexible approach to speech synthesis. Since it is based on a mathematical model of the human vocal tract, it permits a large number of variations in voice quality (see [Kla87, Her89, Her90, Her91] for details). What is more, it allows us to perform on the voice the same kind of scaling that we perform in the visual setting when working with fonts. This is particularly useful in conveying complex information and is exploited in our own work on presenting spoken mathematics.
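
The cascade-filter idea behind formant synthesis can be illustrated with a minimal sketch, shown below in Python. A periodic impulse train stands in for the glottal source and is passed through a cascade of second-order resonators, one per formant; the formant frequencies and bandwidths are illustrative values for a vowel roughly like /a/, not parameters taken from any particular synthesizer.

  import numpy as np
  from scipy.signal import lfilter

  SAMPLE_RATE = 16000

  def resonator(signal, freq, bandwidth, fs=SAMPLE_RATE):
      """Second-order digital resonator realizing one formant."""
      r = np.exp(-np.pi * bandwidth / fs)
      theta = 2 * np.pi * freq / fs
      a = [1.0, -2 * r * np.cos(theta), r * r]   # feedback coefficients
      b = [sum(a)]                               # normalize gain at DC
      return lfilter(b, a, signal)

  def glottal_source(duration, f0=120, fs=SAMPLE_RATE):
      """Crude voiced excitation: an impulse train at the pitch frequency."""
      n = int(duration * fs)
      source = np.zeros(n)
      source[::fs // f0] = 1.0
      return source

  # Illustrative formant frequencies and bandwidths (Hz) for a vowel like /a/.
  formants = [(730, 90), (1090, 110), (2440, 170)]

  signal = glottal_source(0.5)
  for freq, bw in formants:            # cascade of resonators
      signal = resonator(signal, freq, bw)
  signal /= np.max(np.abs(signal))     # normalize amplitude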

Audio as a Data-Type

The practical problem of how audio data should be managed has been addressed by several research projects.

The work done at DEC CRL on the AudioFile project [LPT+93] is particularly significant in using audio resources effectively. AudioFile, following the same conceptual model as the X-windows system, provides a client/server model for accessing audio devices on a variety of platforms. Several applications, such as answering machines, can be built very easily on top of AudioFile, which is publicly available from FTP://crl.dec.com/pub/DEC/AF. The SpeechSkimmer project at the MIT Media Lab allows a listener to interactively skim recorded speech and listen to it at several levels of detail. See [Aro93b, Aro92b, Aro92a, SASH93, Aro91b, Aro93a, SA89, ABLS89, ASea88, Aro91a, Aro92c] for work carried out in the Speech Group at the MIT Media Lab on manipulating digitized speech and audio.
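
The sketch below illustrates only the general client/server split described above, not AudioFile's actual protocol or API: a server process stands in for the owner of the audio device, and a client application hands it raw audio bytes over a socket. The host, port, and byte counting are invented for illustration.

  import socket
  import threading
  import time

  HOST, PORT = "127.0.0.1", 9999             # made-up endpoint for illustration

  def audio_server():
      """The server owns the audio device; here it merely counts the bytes
      it would otherwise hand to the device driver."""
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
          srv.bind((HOST, PORT))
          srv.listen(1)
          conn, _ = srv.accept()
          with conn:
              total = 0
              while chunk := conn.recv(4096):
                  total += len(chunk)        # a real server plays the samples
          print("server received", total, "bytes of audio")

  def audio_client(samples: bytes):
      """A client is any application that wants to produce sound."""
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
          cli.connect((HOST, PORT))
          cli.sendall(samples)

  server = threading.Thread(target=audio_server)
  server.start()
  time.sleep(0.2)                            # let the server start listening
  audio_client(b"\x00\x01" * 8000)           # 16,000 bytes of fake samples
  server.join()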

CSOUND, a music synthesis system developed at MIT by Barry Vercoe, can be used for real-time synthesis of complex audio. Researchers at NASA Ames have developed the Convolvotron [WF90, WWK91], a system for real-time synthesis of directional audio. The Convolvotron is computationally intensive, but the computing power available on today's desktops has enabled the development of scaled-down versions of this technology, such as QSOUND for the Apple and Intel-486 platforms.
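
The core operation behind directional audio systems of this kind is convolution of a monaural signal with a pair of head-related impulse responses, one per ear. The sketch below illustrates the idea in Python, with synthetic impulse responses standing in for measured HRTF data.

  import numpy as np

  SAMPLE_RATE = 44100

  # Synthetic stand-ins for measured head-related impulse responses.
  # A source off to the right reaches the right ear earlier and louder.
  hrir_left = np.zeros(64);  hrir_left[20] = 0.6    # delayed, attenuated
  hrir_right = np.zeros(64); hrir_right[5] = 1.0    # early, full strength

  def spatialize(mono, hrir_l, hrir_r):
      """Convolve a mono signal with per-ear impulse responses to obtain
      a two-channel (left, right) signal carrying a directional cue."""
      left = np.convolve(mono, hrir_l)
      right = np.convolve(mono, hrir_r)
      return np.stack([left, right], axis=1)

  t = np.arange(0, 0.25, 1.0 / SAMPLE_RATE)
  mono = np.sin(2 * np.pi * 440 * t)                # a 440 Hz test tone
  stereo = spatialize(mono, hrir_left, hrir_right)  # shape: (samples, 2)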

Non-Speech Audio in User Interfaces

Non-speech audio can be used in innovative ways to augment conventional output devices such as a visual display. Today, most desktop computers can produce at least telephone-quality audio. Non-speech audio has long been used on the Apple platform to provide the user with audio cues for specific events. This work has been formalized by the human-computer interaction community through the notion of earcons [SMG90, JSBG86, BCK+93, BGP93, Ram89, RK92, BG93, Gav93, BGB88, Bux89]. A prototype screen access program for Presentation Manager under OS/2 demonstrated the effective use of such non-speech cues in providing the user with spatial information (see [F.92] for details). A similar approach is being used at Georgia Tech in developing Mercator [ME92], a screen access program for the X-windows system. The use of non-speech audio to display complex data sets has been investigated by the scientific visualization community, where audio provides an extra dimension (see [BGK92, BLJ86, Ram89, RK92, Bro92, Bro91, SB92] for several related examples).
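
As a concrete illustration of the earcon idea, the sketch below maps interface events to short tone motifs and renders one of them to samples; the event names, frequencies, and durations are invented for illustration and are not drawn from any of the systems cited above.

  import numpy as np

  SAMPLE_RATE = 22050

  def tone(freq, duration=0.12, fs=SAMPLE_RATE):
      """A short sine-wave beep with a linear fade-out to avoid clicks."""
      t = np.arange(int(duration * fs)) / fs
      return np.sin(2 * np.pi * freq * t) * np.linspace(1.0, 0.0, t.size)

  # Invented mapping from interface events to rising/falling tone motifs.
  EARCONS = {
      "file-opened":  [440, 554, 659],    # rising major triad
      "file-deleted": [659, 554, 440],    # the same motif, descending
      "error":        [220, 220],         # low repeated tone
  }

  def render_earcon(event):
      """Concatenate the tones of an event's motif into one waveform."""
      return np.concatenate([tone(f) for f in EARCONS[event]])

  cue = render_earcon("file-opened")      # samples ready for the audio device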

Information Presentation in Audio

Work in speech synthesis and linguistics has considered the problem of presenting information using speech. The question of achieving the right intonational structure is addressed by [Gro86, DH88, Pie81, HP86, HW84, HLPW87, PH90, HW91, Hir91, Hir90a, Hir90b, WH91]. See [LOS76, Str78, OKDA73, ZP86] for an analysis of the intonational cues used by human speakers when speaking mathematical expressions. A set of guidelines for presenting spoken mathematics is outlined in [Cha83] and has been used by Recording for the Blind (RFB) in producing mathematical texts in talking-book format.

Oral presentation of information is applicable in several situations. [Dav89] describes a system for providing oral instructions to an automobile driver. Refer to [DT87, Dav88, DS89, DS90] for related work on this project.

Information Browsing

With the advent of systems providing remote access to information databases, the need for effective browsing techniques has received attention in several research projects. Notable among these is Paul Resnick's PhD work [Res92] at MIT. His thesis proposes a flexible model for quick and effective information browsing by modeling the information structure as a series of linked lists.
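
A minimal sketch of that idea is given below, assuming a deliberately simplified structure rather than Resnick's actual design: each node carries a label plus "next" and "down" links, and a browser walks the structure in response to simple navigation commands.

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class Node:
      """One item in a browsable structure of linked lists."""
      label: str
      next: Optional["Node"] = None     # next item at this level
      down: Optional["Node"] = None     # first item of a nested list

  def browse(start: Node, commands):
      """Follow 'next'/'down' commands, speaking (here: printing) each item."""
      current = start
      print("at:", current.label)
      for cmd in commands:
          target = current.next if cmd == "next" else current.down
          if target is None:
              print("no", cmd, "from", current.label)
              continue
          current = target
          print("at:", current.label)
      return current

  # A toy directory of restaurant listings, modeled as nested lists.
  italian = Node("Luigi's", Node("Trattoria Roma"))
  menu = Node("Italian", Node("Thai"), down=italian)
  browse(menu, ["down", "next", "next"])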

Structure-based browsing has been in vogue in the hypertext community for several years. Many results from this area are directly relevant to information browsing in audio. Notable among these is the work in defining the Hypertext Markup Language (HTML), an SGML-based markup language for hypertext. The World-Wide Web (WWW), an HTML-based hypertext information retrieval system, is widely used on the Internet. WWW browsers allow a user to quickly access a wide variety of information sources. The Web currently contains textual as well as audio and video resources. At present, only primitive browsing of audio/video data is possible, since there is very little structure available in digitized audio/video data.

Browsing Digitized Audio/Video Data

Relatively little research has been carried out in this area so far. The potential presented by the availability of a large volume of digitized audio/video data was first outlined in [LB87]. The ability to browse such data efficiently will prove essential if we are to survive the age of 1,000 television channels!

Approaches used so far are characterized by the use of word spotting to identify context in the digitized audio data, with the resulting contextual information used to access portions of interest in the audio/video data stream. On the global Internet, the availability of Internet Talk Radio and other sources of large audio data provides an excellent research test-bed.
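
The indexing step can be sketched as follows, with a hypothetical spot_words routine standing in for a real word-spotting recognizer and a made-up file name: each spotted keyword and its timestamp go into an index that later lets a listener seek directly to the relevant portion of the recording.

  from collections import defaultdict

  def spot_words(audio_path, keywords):
      """Hypothetical word spotter: a real system would run a recognizer
      over the audio and yield (keyword, time_in_seconds) pairs."""
      # Placeholder results, for illustration only.
      return [("budget", 12.5), ("election", 47.0), ("budget", 63.2)]

  def build_index(audio_path, keywords):
      """Map each spotted keyword to the list of times at which it occurs."""
      index = defaultdict(list)
      for word, when in spot_words(audio_path, keywords):
          index[word].append(when)
      return index

  index = build_index("talk_radio_episode.au", ["budget", "election"])
  for when in index["budget"]:
      print(f"'budget' mentioned at {when:.1f} s; seek the player to this offset")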