3.2 The Speech Component

We present AFL in the simple context of an audio formatter having a single component: a speech synthesizer. The principal purpose of the speech component is to produce speech, for which it provides the statement

(send-text <text>)

This statement sends <text> to the speech device. In addition, the speech component provides primitives for producing the right intonational structure. The speech generation statements are summarized in Table 3.2.

The speaking voice can be varied with respect to several speech synthesis parameters. We define the speech space as a multi-dimensional space, where each dimension corresponds to a synthesizer parameter. At any given time, the speech state, a point in this space, determines the kind of voice used when speech generation statements are executed. Changing a voice parameter amounts to moving in this space. The abstraction of a speech space imposes structure on the set of discrete states provided by the voice synthesizer. This structure will be used to advantage in producing renderings of nested information structures. The abstraction of a speech space also keeps the speech component of AFL hardware independent: synthesizers vary widely both in the kinds of parameters they provide and in how these parameters are modified.

In any AFL program segment, global variable *global-speech-state* and local variable *current-speech-state* (modifiable by the program segment) each refer to one of these speech states. When a session of interactive rendering is begun, these variables are set to the initial state of the audio formatter.

The AFL Block

The AFL block introduces a local instance of variable *current-speech-state*. This instance is set to the instance of *current-speech-state* that was referenceable just before execution of the block, and <statements> are executed within this new local scope. Within the block, all free occurrences of *current-speech-state* refer to the new local variable. Further, this local variable describes the state of the audio formatter, so changes to it immediately affect the voice synthesizer. Upon termination of the block, local variable *current-speech-state* is destroyed, and the audio formatter is reset to its pre-existing state. The programmer has no control over the name of this local state variable and cannot create other local variables using the AFL block.

Execution of statement (terminate-block) causes the currently-executing block to terminate immediately. A browser can execute this statement when the audio rendering of an object is to be terminated prematurely because of an interrupt from the user.

AFL blocks are simpler than the standard block construct provided by full-blown programming languages. For the purpose of audio formatting, where the major task is to control the parameters of the speech space, our experience has shown that the AFL block is more than adequate. Further, our simplified version of the block prevents rendering rules from making changes to the state of the audio formatter that could persist after termination of a block. Such changes, which would be possible with the conventional block, would also complicate the implementation, which has to maintain the connection between the state of local variable *current-speech-state* and the voice synthesizer itself.
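The scoping behaviour described above can be illustrated with a minimal sketch. It is written in Python rather than Common Lisp and is not AFL's implementation: the names Formatter, block, set_state, and TerminateBlock are hypothetical, and the device list merely stands in for the voice synthesizer.

```python
from contextlib import contextmanager


class TerminateBlock(Exception):
    """Hypothetical stand-in for executing (terminate-block)."""


class Formatter:
    """Toy audio formatter; `device` records every state sent to the synthesizer."""

    def __init__(self, initial_state):
        self.current = dict(initial_state)   # plays the role of *current-speech-state*
        self.device = [dict(initial_state)]

    def set_state(self, **changes):
        # Changes to the current state reach the device immediately.
        self.current = {**self.current, **changes}
        self.device.append(dict(self.current))

    @contextmanager
    def block(self):
        saved = self.current                 # state visible just before the block
        self.current = dict(saved)           # new local instance, initialised from it
        try:
            yield self
        except TerminateBlock:
            pass                             # leave the block early, e.g. on a user interrupt
        finally:
            self.current = saved             # the local instance is discarded
            self.device.append(dict(saved))  # the formatter returns to its prior state


fmt = Formatter({"average_pitch": 122, "speech_rate": 180})
with fmt.block():
    fmt.set_state(average_pitch=150)         # takes effect at once inside the block
    with fmt.block():
        fmt.set_state(speech_rate=230)
        raise TerminateBlock                 # ends only the innermost block
    inner_restored = dict(fmt.current)       # the rate change did not persist
```

Because the only way to change state inside a block is through the local instance, nothing a rendering rule does can outlive the block, which is the property the simplified construct is meant to guarantee.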

Dimensions for the MultiVoice

In the previous subsection, we mentioned that the speech space has several dimensions, which might depend on the particular voice synthesizer being used. In order to make our presentation more concrete, we now describe the dimensions that can be used with the MultiVoice synthesizer.

The MultiVoice provides nine predefined voices, which are modeled as distinguished points (constants) in the speech space. We list below the MultiVoice names for the voices, together with the name used within AFL for them.


MultiVoice name	AFL name
Perfect Paul	'afl:paul
Huge Harry	'afl:harry
Frail Frank	'afl:frank
Doctor Dennis	'afl:dennis
Beautiful Betty	'afl:betty
Rough Rita	'afl:rita
Uppity Ursula	'afl:ursula
Whispering Wendy	'afl:wendy
Kit the Kid	'afl:kid

A user can save a particular speech state in a variable and refer to it later. For example, execution of the statement

retrieves the value saved in variable <name>. Male and female voices are to be thought of as lying in distinct disconnected components of the speech space, since it is not possible to move from a male voice to a female voice simply by changing parameters that affect voice quality. Switching from a male to a female voice is thus analogous to changing fonts, while modifying voice quality parameters is like scaling different features of a specific font.

The MultiVoice parameters (and their default values) that are implemented as dimensions in AFL are shown in Table 3.1. They are: the speech rate, the volumes of the speaker and the earphone port, five voice-quality parameters, and seven parameters that deal with pitch and intonation. The column labeled “step size” is discussed later, together with the operators that use it.

Table 3.1: Implemented MultiVoice parameters

Dimension	Min	Max	Initial	Step size	Units
afl:speech-rate	120	550	180	25	Words/Min
Volume
afl:left-volume	0	100	50	5	dB
afl:right-volume	0	100	50	5	dB
Voice quality
afl:breathiness	0	100	0	10	dB
afl:lax-breathiness	0	100	0	25	%
afl:smoothness	0	100	3	20	%
afl:richness	0	100	70	10	%
afl:laryngilization	0	100	0	10	%
Pitch and Intonation
afl:baseline-fall	0	40	18	10	Hz
afl:hat-rise	2	100	18	10	Hz
afl:stress-rise	1	100	32	20	Hz
afl:assertiveness	0	100	100	25	%
afl:quickness	0	100	0	10	%
afl:average-pitch	50	350	122	10	Hz
afl:pitch-range	0	100	100	10	%

We assume that these dimensions are available and can be changed by the statements that are defined in the next subsection.
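As a rough illustration of how the rows of Table 3.1 might be held as data, the sketch below encodes a few of them in Python. The Dimension record and the functions initial_state and in_range are hypothetical names introduced here; whether AFL itself checks values against the minimum and maximum is an assumption, not something stated in the text.

```python
from typing import NamedTuple


class Dimension(NamedTuple):
    """One row of Table 3.1 (hypothetical record type)."""
    minimum: float
    maximum: float
    initial: float
    step: float
    units: str


# A few rows copied from Table 3.1; the other dimensions follow the same pattern.
DIMENSIONS = {
    "afl:speech-rate": Dimension(120, 550, 180, 25, "words/min"),
    "afl:richness": Dimension(0, 100, 70, 10, "%"),
    "afl:average-pitch": Dimension(50, 350, 122, 10, "Hz"),
    "afl:pitch-range": Dimension(0, 100, 100, 10, "%"),
}


def initial_state():
    """Return the point in speech space given by the 'Initial' column."""
    return {name: dim.initial for name, dim in DIMENSIONS.items()}


def in_range(state):
    """Assumed check: every value lies within its dimension's Min and Max."""
    return all(
        DIMENSIONS[name].minimum <= value <= DIMENSIONS[name].maximum
        for name, value in state.items()
    )
```

A state is then just a mapping from dimension names to values, so a different synthesizer would only require a different table.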

AFL Statements

We now describe five other AFL statements used to change speech-space variables. In the descriptions, <point> denotes any expression that evaluates to a point in the speech space, <name> is a variable that may contain a point in the speech space, and <dimension> is the name of a dimension in speech space.

Statement initialize-speech-space must be executed before any operations are performed on speech-space variables. It assigns default initial values to AFL variables *current-speech-state* and *global-speech-state*.

Two assignment statements assign <point> to *current-speech-state* and *global-speech-state*, respectively. Assignment statement local-set-state synchronizes implicitly with events on the speech component; that is, execution of the assignment waits until all prior speech events have completed. This synchronization is necessary because, in general, the host computer controlling the audio formatter executes instructions much faster than the speech synthesizer.

The two statements given above are conventional assignment statements, but they are only used to change the two AFL state variables.

Languages like TeX and PostScript provide for the application of a global scaling to a rendering. The speech space provides similar functionality. The speech component uses a final filter with the scale factor for each dimension initially set to unity, and execution of

changes the final scale factor for dimension <dimension> to <value>. As an example of its use, interrupting an audio rendering and executing

and then resuming causes speech to be heard twice as fast. Since the final scaling is applied to the result of applying user-defined audio-rendering rules, the relative changes in state effected by rendering rules are preserved.
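The effect of the final filter can be sketched as follows, under the assumption that it simply multiplies each dimension by its scale factor after the rendering rules have produced a state. The function apply_final_scale and the two example states are hypothetical and serve only to illustrate the idea; this is not AFL code.

```python
def apply_final_scale(state, scale_factors):
    """Multiply each dimension by its final scale factor (unity if none is set)."""
    return {dim: value * scale_factors.get(dim, 1) for dim, value in state.items()}


# Two states that rendering rules might produce, differing only in speech rate.
normal = {"afl:speech-rate": 180, "afl:average-pitch": 122}
faster = {"afl:speech-rate": 205, "afl:average-pitch": 122}

# Doubling the final scale factor for the speech rate affects both states alike.
scale_factors = {"afl:speech-rate": 2}
heard_normal = apply_final_scale(normal, scale_factors)
heard_faster = apply_final_scale(faster, scale_factors)
```

Both states are heard at twice their original rate, so the ratio between them is the one the rendering rules established.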

AFL Expressions

We now define AFL expressions that yield a new speech state; these have no side effects. The simplest expressions are the names of the nine predefined voices (e.g., afl:paul) and the names of variables to which states have been assigned. Using only these expressions to move in the speech space makes the space a collection of discrete points. In addition, AFL provides four operators for generating new points in speech space. Each of these “move operators” expresses a change along a single dimension of the state space. While one move operator would have been sufficient, having multiple operators makes AFL easier to use. These operators allow us to express relative changes to the speech state. To give some intuition, they provide the same ability as scaling a font in the visual setting.

yields a state that is the same as <point> except that <offset> has been added to dimension <dimension>. For example, the following statement adds 50% to the assertiveness of 'afl:paul.

yields a new state, <point> with the value of dimension <dimension> multiplied by <factor>.

yields state <point> with the value for dimension <dimension> changed by <steps> steps. Each dimension has a default step size, which specifies the minimum change needed to be perceptible. The step sizes for the MultiVoice parameters are shown in the “step size” column of Table 3.1. Using step-by, one can change the value of a dimension by a multiple of the step size.

The step-size for a particular dimension can also be changed by supplying the additional keyword afl:step-size to any of the AFL operators. For example, while the expression

yields a new state with the step size for afl:average-pitch changed to 2. Note that this expression makes use of named parameters in Common Lisp.

The four move operators are shown in their simple form. In general, these operators take a point and a list of dimension-value pairs specifying how to move.
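The three operators described above can be modelled in a few lines. This is a sketch in Python, not AFL: only step-by is named in the text, so move_by and scale_by are hypothetical names for the offset and factor operators, and the fourth operator is omitted because its behaviour is not described here. The per-call step_size argument is a simplification of the afl:step-size keyword, which in AFL changes the step size recorded for the dimension.

```python
# Default step sizes taken from Table 3.1.
STEP_SIZE = {
    "afl:speech-rate": 25,
    "afl:assertiveness": 25,
    "afl:average-pitch": 10,
}


def move_by(point, dimension, offset):
    """New state equal to `point` with `offset` added along `dimension`."""
    return {**point, dimension: point[dimension] + offset}


def scale_by(point, dimension, factor):
    """New state equal to `point` with `dimension` multiplied by `factor`."""
    return {**point, dimension: point[dimension] * factor}


def step_by(point, dimension, steps, step_size=None):
    """New state with `dimension` changed by `steps` multiples of the step size."""
    size = STEP_SIZE[dimension] if step_size is None else step_size
    return {**point, dimension: point[dimension] + steps * size}


# A starting point built from the initial values in Table 3.1.
start = {"afl:speech-rate": 180, "afl:average-pitch": 122}
higher = step_by(start, "afl:average-pitch", 2)               # two default steps up
slightly_higher = step_by(start, "afl:average-pitch", 1, step_size=2)
doubled_rate = scale_by(start, "afl:speech-rate", 2)
```

Each function returns a new mapping and leaves its argument untouched, mirroring the statement that these expressions have no side effects.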

Summary of the Speech Component

Table 3.2 summarizes the speech generation statements provided by the speech component.

Table 3.2: AFL statements for generating speech events.

Statement	Description
(send-text <text>)	Send text.
(speak-number <number>)	Speak number.
(force-speech)	Force speech.
(pause <msec>)	Insert silence.
(subclause-boundary)	Clause boundary.
(comma-intonation)	Comma intonation.
(exclamation)	Exclamation intonation.
(interrogative)	Interrogative intonation.
(high-intonation)	Rising intonation.
(low-intonation)	Falling intonation.
(high-low-intonation)	Rise and fall.
(primary-stress)	Primary stress.
(secondary-stress)	Secondary stress.
(exclamatory-stress)	Exclamatory stress.

The intonational markers we use correspond to the description in [Pie81]. Note, however, that MultiVoice does not provide all of the intonational markers described therein; see Chapter 5 of the MultiVoice reference manual [Tec91] for a detailed discussion of intonational markers on MultiVoice.

To conclude this section, here is a summary of the rest of the statements provided by the speech component. Many of these statements will be extended to work in other component spaces in the next section.

Using Common Lisp

Thus far, we have described the parts of AFL that deal directly with manipulating the speech state. Remember, however, that AFL is implemented on top of Common Lisp, so all Common Lisp statements may be used when writing AFL program segments. Common Lisp has conditional statements, loops, recursion, etc., so the full power of a conventional language is available. Naturally, were the audio-rendering system to be implemented on another programming platform, AFL would have to be fleshed out to include general programming-language statements.