4.3 Rendering Mathematics

This section defines an audio notation for mathematics and presents rendering rules that produce this audio notation.

There is little similarity between developing a written notation and its audio counterpart. However, the evolution of written notation does show that any notational system is a combination of conventions and an intuitive use of the dimensions provided by the perceptual modality and by the means available to produce output for that modality. We use this insight to develop a concise audio notation for spoken mathematics that exploits the available audio dimensions. It is conceivable that the number of audio dimensions will increase as the relevant technology improves, enabling more sophisticated notational systems in the future.

We characterize all of written mathematical notation as follows:

The visual cues used to project the tree structure are independent of the cues used to produce the attributes. Hence, attributes may themselves contain arbitrarily complex tree structures. Thus, conventional mathematical notation uses a consistent set of visual layout primitives to construct complex displays.

Written notation provides the ability to render mathematical objects without understanding their meaning. The underlying structure can be recreated by a reader familiar with the subject matter at hand and the notational system in use. Internalizing and browsing this structure is helped by the use of different types of visual delimiters such as (, [, {, and over- and underbraces; these help the author mark off “interesting” subtrees within an expression.

In contrast, plain spoken renderings of mathematical expressions are completely linear, thereby losing much of this expressive power. Spoken descriptions of complex mathematics (found on talking books) compensate for this loss of expressive power by using extra-textual phrases, thereby making the renderings verbose.

To overcome these problems, we develop an equivalent audio notation. The first step is to identify dimensions in the audio space to parallel the functionality of the dimensions in the visual setting. The second step is to augment these audio dimensions with the use of pauses, intonational cues such as voice inflection, and descriptive phrases.

AsTeR implements this notational system by using fleeting and persistent cues, especially by exploiting the computer’s ability to vary the characteristics of a synthetic voice. Renderings produced are therefore much more concise.

Our audio notation minimizes the verbiage in math renderings. Concise renderings serve to convey the concepts involved succinctly, leaving the listener time to reason about the expression. More descriptive renderings (with explanatory phrases to cue structure) can be used when listening to unfamiliar material. Thus, there is a wide range of possible renderings of a math expression varying between fully descriptive and completely notational. The choice of how much to rely on the audio notation, and how descriptive renderings should be, is entirely subjective.

Here are the features we require of our audio notation for mathematics:

Producing Audio Notation for Mathematics

We exploit the abstraction of the audio space to define unique audio dimensions that make up the various pieces of the notation. These dimensions can be thought of as lines determined by a combination of the speech and non-speech dimensions described in Chapter 3. The AFL states used to produce different pieces of the audio notation are reached by “moving” along these dimensions. The functions used to generate new states are monotonic in the mathematical sense described in equation 4.1 on page 105.

We choose unique audio dimensions to map the quasi-prefix form into audio space. The quasi-prefix representation is a tree with attributes. We pick one audio dimension, denoted by dim-children (see Figure 4.2 on page 112), along which to vary the current AFL state as different levels of a tree are rendered. We next choose dimensions orthogonal to dim-children to cue the visual attributes as follows. Let x and y denote two speech-space dimensions that are orthogonal to dim-children. Select three lines in the speech space, x = 0, x + y = 0, and x - y = 0. Moving forward or backward along these three lines cues the six visual attributes.

Conventional mathematical notation has built up a strong association between the superscript and subscript, in that we intuitively think of them as opposites, i.e., the superscript moves up, and the subscript moves down. AsTeR takes advantage of this association by moving the AFL state “forward” along the line x - y = 0 before rendering superscripts and “backward” along this same line before rendering subscripts. States along the line x + y = 0 cue left superscripts and subscripts; states along x = 0 cue accents and underbars. By our choice of x and y, these variations are independent of dimension dim-children. See Figure 4.3 on page 115 and Figure 4.4 on page 118 for the audio dimensions that are currently used for cueing superscripting and subscripting.

(afl:multi-step-by state
   '(afl:smoothness 2) '(afl:richness -1)   ; softer
   '(afl:loudness 2) '(afl:quickness 1)     ; animated
   '(afl:hat-rise 2) '(afl:stress-rise 2))  ; animated

Figure 4.2: Audio dimension used for rendering subtrees.

The effect of moving along the audio dimension shown in Figure 4.2 on page 112 is to produce a softer, more animated voice. As deeper levels of nesting are entered, the change in voice characteristic produces a sense of falling off into the distance.

(afl:generalized-afl-operator state
   '(afl:step-by afl:average-pitch 1.5)
   '(afl:step-by afl:head-size -.5)
   '(afl:scale-by afl:average-pitch .5 :slot afl:step-size)
   '(afl:scale-by afl:head-size .5 :slot afl:step-size))

Figure 4.3: Audio dimension used for rendering superscripts.

A change along the audio dimension shown in Figure 4.3 on page 115 produces a higher pitched voice. The change in the head size keeps the voice from sounding unpleasant. The step size along both the average-pitch and head-size dimensions is halved, so that a script nested inside another script produces a smaller change in voice. This allows unambiguous rendering of subscripts in superscripts. The change in AFL state in Figure 4.4 on page 118 is the exact opposite of the change in Figure 4.3 on page 115.

(afl:generalized-afl-operator state
   '(afl:step-by afl:average-pitch -1.5)
   '(afl:step-by afl:head-size .5)
   '(afl:scale-by afl:average-pitch .5 :slot afl:step-size)
   '(afl:scale-by afl:head-size .5 :slot afl:step-size))

Figure 4.4: Audio dimension used for rendering subscripts.
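
To see why halving the step size keeps nested scripts distinguishable, the state changes of Figures 4.3 and 4.4 can be modelled on a single dimension. The following is a toy sketch, not AFL itself: the structure dim and the functions step-by, scale-step, superscript-pitch, and subscript-pitch are hypothetical names introduced here, and the sketch assumes that the step is taken before the step size is scaled.

```lisp
;;; Toy model of one audio dimension. All names are hypothetical
;;; stand-ins for illustration; this is not the AFL implementation.
(defstruct dim
  (value 0.0)       ; current setting along the dimension
  (step-size 1.0))  ; size of one step along the dimension

(defun step-by (d n)
  "Return a copy of D moved N steps along its dimension."
  (make-dim :value (+ (dim-value d) (* n (dim-step-size d)))
            :step-size (dim-step-size d)))

(defun scale-step (d factor)
  "Return a copy of D whose step size is multiplied by FACTOR."
  (make-dim :value (dim-value d)
            :step-size (* factor (dim-step-size d))))

(defun superscript-pitch (d)
  "Mirror Figure 4.3 for average-pitch: step by 1.5, then halve the step."
  (scale-step (step-by d 1.5) 0.5))

(defun subscript-pitch (d)
  "Mirror Figure 4.4 for average-pitch: step by -1.5, then halve the step."
  (scale-step (step-by d -1.5) 0.5))
```

Starting from a value of 0 with a step size of 1, a superscript is rendered at 1.5, and a subscript inside that superscript at 0.75. This differs from the base voice (0), from the superscript (1.5), and from a plain subscript (-1.5), which is the property the reduced step size is meant to provide.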

In cases where no contextual information is available, the visual attributes appearing on a math object are rendered in the following order:

  1. Subscript.
  2. Superscript.
  3. Underbar.
  4. Accent.
  5. Left-subscript.
  6. Left-superscript.

The above ordering is motivated by the fact that in traditional mathematical notation, the subscript binds the tightest. The order in which attributes are rendered is encapsulated in Lisp variable *attributes-reading-order* and may be changed by a user.
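
As an illustration of how such a variable could drive the rendering order, here is a minimal sketch. The variable name comes from the text above, but its value format (a list of symbols) and the helper order-attributes are assumptions made for this example.

```lisp
;;; Sketch only: the list-of-symbols format is assumed, and
;;; ORDER-ATTRIBUTES is a hypothetical helper, not part of AsTeR.
(defparameter *attributes-reading-order*
  '(subscript superscript underbar accent left-subscript left-superscript)
  "Default order in which visual attributes are rendered.")

(defun order-attributes (attributes)
  "Return ATTRIBUTES, a list of symbols, sorted into reading order.
Attributes missing from the order list are placed last."
  (sort (copy-list attributes) #'<
        :key (lambda (a)
               (or (position a *attributes-reading-order*)
                   (length *attributes-reading-order*)))))
```

Under these assumptions, (order-attributes '(accent superscript subscript)) returns (SUBSCRIPT SUPERSCRIPT ACCENT), and a user who rebinds the variable changes the order without touching any rendering rule.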

In style simple, a commonly used rendering style, subscripts and superscripts are rendered by first moving either backwards or forwards along the audio dimensions shown in Figure 4.3 on page 115 and Figure 4.4 on page 118. This produces extremely concise and unambiguous renderings. Consider the following expressions:

 eex exx
e  e  e

x+ y 2k+1-

Here, a plain verbal rendering produces an unnecessarily complicated description that makes it difficult to comprehend the inherent structure present in the expression.

Here is an example to illustrate the benefits of an audio notation when rendering unusual mathematical notation. In the following, $+_n$ denotes addition modulo n. Given this information, $x +_n y +_n z$ could be spoken as “x plus mod n y plus mod n z”. However, if this information is unavailable, AsTeR can still produce a rendering that can be correctly interpreted by a listener who is aware of the fact that the + sign can be subscripted. Further, the listener who is familiar with $+_n$ denoting modulo arithmetic can now understand the expression.

In style descriptive, new AFL states are used only if necessary when rendering superscripts and subscripts. Typically, “x 1” in traditional spoken math means $x_1$. Rendering style descriptive takes advantage of this convention to avoid using new AFL states when rendering subscripts that are simple. Note, however, that by doing so, rendering style descriptive does introduce ambiguity in the renderings; $x_{k_1}$ and $x_{k1}$ will sound the same. In our experience, this ambiguity is not a problem when rendering mathematical texts; few authors write $x_{k2}$ in place of the preferred $x_{2k}$.

Parenthesizing in audio.

The technique used by written mathematical notation to cue tree structure is insufficient for audio renderings. Using a wide array of delimiters to write mathematics works, since the eye is able to quickly traverse the written formulae and pair off matching delimiters. The situation is different in audio: merely announcing the delimiters as they appear is not enough, because the listener has to remember the enclosing delimiters while listening to a delimited expression. This insight was gained as a result of work in the summer of 1991, when we implemented a prototype audio formatter for mathematical expressions. Fleeting sound cues (with the pitch conveying nesting level) were used to “display” mathematical delimiters, but deeply nested expressions were difficult to understand.

AsTeR enables a listener to keep track of the nesting level by using a persistent speech cue, achieved by moving along dim-children, when rendering the contents of a delimited expression. This, in combination with fleeting cues for signalling the enclosing delimiters, permits a listener to better comprehend deeply nested expressions. This is because the “nesting level information” is implicitly cued by the currently active voice (a persistent cue) used to render the parenthesized expression.

To give some intuition, we can think of different visual delimiters as introducing different “functional colors” at different subtrees of the expression. Using different AFL states to render the various subtrees introduces an equivalent “audio coloring”. The structure imposed on the audio space by the AFL operators enables us to pick “audio colors” that introduce relative changes. This notion of relative change is vital in effectively conveying nested structures.

Mathematical expressions are spoken as infix or prefix depending on the operator and the currently active rendering style. The large operators, in addition to the mathematical functions like sin, are rendered as prefix. All other expressions are rendered as infix. A persistent speech cue indicates the nesting level: the AFL state is varied along audio dimension dim-children before rendering the children of an operator. The number of new states is minimized; the complexity of math objects and the precedence of mathematical operators determine if a new state is to be used (see Section 4.4 for details on the complexity measure used). Thus, while new AFL states are used when rendering the numerator and denominator of $\frac{a+b}{c+d}$, no new AFL state is introduced when rendering $\frac{a}{b} + c + d$. Similarly, when rendering $\sin x$, no new AFL state is used to speak x, but when rendering $\sin(x + y)$, a new AFL state is used to render the argument to sin.

In the context of rendering sub-expressions, introducing new AFL states can be thought of as parenthesizing in the visual context. In the light of this statement, the above assertion about minimizing AFL states can be interpreted as avoiding the use of unnecessary parentheses in the visual context. Thus, we write $a + bc + d$, rather than $a + (bc) + d$, but we use parentheses to write $(a + b)(c + d)$. Analogously, it is not necessary to introduce a new state for speaking the fraction when rendering $a + \frac{b}{c} + d$, whereas a new rendering state is introduced to speak the numerator and denominator of $\frac{a+b}{c+d}$.
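
The analogy with parenthesization can be made concrete with a small sketch. It models only the precedence half of the decision; the complexity measure of Section 4.4 is not modelled. The precedence table and the function names precedence and needs-new-state-p are hypothetical, and expressions are assumed to be prefix-form lists such as (+ a (* b c) d).

```lisp
;;; Sketch only: a hypothetical precedence test, by analogy with
;;; deciding where parentheses are needed in print.
(defparameter *precedence* '((+ . 1) (* . 2))
  "Assumed operator precedences; a larger number binds tighter.")

(defun precedence (op)
  "Return the assumed precedence of OP, which must be in *PRECEDENCE*."
  (cdr (assoc op *precedence*)))

(defun needs-new-state-p (parent-op child)
  "True when CHILD, a subtree under PARENT-OP, would need parentheses
in print, and so by analogy a new AFL state in audio."
  (and (consp child)
       (< (precedence (first child)) (precedence parent-op))))
```

With this table, the product inside $a + bc + d$ needs no new state, since (needs-new-state-p '+ '(* b c)) is false, whereas each factor of $(a + b)(c + d)$ does, since (needs-new-state-p '* '(+ a b)) is true. Leaves never need a new state.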

Dimension dim-children has been chosen to provide five to six unique points. This means that deeply nested structures such as continued fractions are rendered unambiguously.

Consider the following example:

$$1 + \frac{x}{1 + \frac{x}{1 + x}}$$

Here, the voice drops by one step as each level of the continued fraction is rendered. Since this effect is cumulative, the listener can perceive the deeply nested nature of the expression. The rendering rule for fractions is shown in Figure 4.5 on page 124. Notice that this rendering rule handles simple fractions differently. When rendering fractions of the form $\frac{a}{b}$, no new AFL states are used. In addition, there is a subtle verbal cue; when rendering simple fractions, AsTeR speaks “over” instead of “divided by”. This distinction seems to make the renderings more effective, and in some of the informal tests we have carried out, listeners disambiguated between expressions using this distinction without even being aware of it.

(def-reading-rule (fraction simple)
    "simple rendering rule for fractions"
  (let ((pause-amount (compute-pause fraction))
        (numerator (numerator-of fraction))
        (denominator (denominator-of fraction)))
    (read-aloud "fraction") (afl:comma-intonation)
    (afl:with-surrounding-pause pause-amount
      (cond
        ((and (leaf-p numerator)        ; simple fraction
              (leaf-p denominator))     ; of the form a/b
         (read-aloud numerator) (read-aloud "over")
         (read-aloud denominator))
        (t (afl:new-block
            (afl:local-set-state        ; move along
             (reading-state 'dim-children)) ; dim-children
            (read-aloud numerator))
           (read-aloud " divided by, ") ; old state
           (afl:new-block
            (afl:local-set-state        ; move along
             (reading-state 'dim-children)) ; dim-children
            (read-aloud denominator)))))))

Here, the statement (reading-state 'dim-children) generates an AFL state along dimension dim-children (see Figure 4.2 on page 112).

Figure 4.5: Rendering rule for fractions.
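
The cumulative effect of this rule on nested fractions can be traced without a speech synthesizer. The following self-contained mock is a sketch under stated assumptions, not AsTeR code: fractions are represented as lists (frac NUM DEN), leaves are symbols or numbers, an integer depth stands in for the number of steps taken along dim-children, and the function render is a hypothetical name. The leading word “fraction” and the pauses of the real rule are omitted.

```lisp
;;; Mock of the fraction rule. Assumes every non-atomic expression
;;; has the form (frac NUM DEN); DEPTH stands in for the AFL state.
(defun render (expr &optional (depth 0))
  "Return a list of (DEPTH . WORD) pairs describing how EXPR is spoken."
  (cond ((atom expr)
         (list (cons depth expr)))
        ((and (atom (second expr)) (atom (third expr)))
         ;; Simple fraction: stay in the current state and say "over".
         (list (cons depth (second expr))
               (cons depth "over")
               (cons depth (third expr))))
        (t
         ;; Complex fraction: children are spoken one step deeper,
         ;; while "divided by" is spoken in the enclosing state.
         (append (render (second expr) (1+ depth))
                 (list (cons depth "divided by"))
                 (render (third expr) (1+ depth))))))
```

For example, (render '(frac x (frac a b))) returns ((1 . X) (0 . "divided by") (1 . A) (1 . "over") (1 . B)): the numerator and denominator are spoken one step along the dimension, the connective is spoken in the original state, and the simple inner fraction introduces no further step.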

Using pauses.

The audio dimensions are supplemented by using pauses around subexpressions to indicate grouping. The duration of the pause is based on the weight of a subexpression (see Section 4.4 for details on the weight of an object, a complexity measure). If the weight of an object is 1, then no pause is inserted; otherwise the weight of the object is scaled by a constant factor given by *pause-around-child* to determine the number of milliseconds of pause to be inserted around the rendering.

Using the above, AsTeR speaks $a + \frac{b}{c} + d$ unambiguously by inserting a pause around the fraction. No pause is inserted in rendering the simple expression a, when it occurs by itself. Inserting a pause here is unnecessary and would have an adverse stuttering effect on the speech.
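
The pause rule described above amounts to a one-line computation, sketched here. The variable name *pause-around-child* comes from the text; its value of 5 and the function name pause-for-weight are made up for this illustration, and the weight is assumed to be a positive integer supplied by the complexity measure of Section 4.4.

```lisp
;;; Sketch only: the scale factor 5 is an invented value, and
;;; PAUSE-FOR-WEIGHT is a hypothetical name.
(defparameter *pause-around-child* 5
  "Milliseconds of pause per unit of weight (assumed value).")

(defun pause-for-weight (weight)
  "Return the pause, in milliseconds, inserted around an object of WEIGHT.
Objects of weight 1 get no pause, which avoids a stuttering effect."
  (if (= weight 1)
      0
      (* weight *pause-around-child*)))
```

Under these assumptions, a leaf such as a has weight 1 and receives no pause, while a subexpression of weight 4 would be surrounded by 20 milliseconds of silence.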