(ITDV01N4 ARTICLE2)
c. 1994 T.V. Raman
 
AsTeR: AUDIO SYSTEM FOR TECHNICAL READINGS
 
T. V. Raman
Digital Equipment Corporation
One Kendall Square, Building 650
Cambridge MA 02139
Tel: 1 617 621 6637
Email raman@crl.dec.com
 
WWW
http://www.research.digital.com/CRL/personal/raman/raman.html  
 
ABSTRACT
 
The advent of electronic documents makes information available in
more than its visual form; electronic information can now be
display-independent.  In this article, the author describes a
computing system, AsTeR, that audio formats electronic documents
to produce audio documents. AsTeR can speak both literary texts
and highly technical documents (presently in La)TeX) that contain
complex mathematics.  Visual communication is characterized by
the eye's ability to actively access parts of a two-dimensional 
display.  The reader is active, while the display is passive. 
This active-passive role is reversed by the temporal nature of
oral communication:  information flows actively past a passive
listener.  This prohibits multiple views - it is impossible to
first obtain a high-level view and then "look" at  details. 
These shortcomings become severe when presenting complex
mathematics orally. 
 
Audio formatting, which renders information structure in a manner
attuned to an auditory display, overcomes these problems.  AsTeR
is interactive, and the ability to browse information structure
and obtain multiple views enables active listening.
 
This article describes a system for producing audio renderings.
Print is not the ideal medium for describing such renderings
(and ASCII is an even poorer one!).  RFB members can acquire an
audio formatted version of the author's thesis (this article is
a slightly edited version of the first chapter), rendered by
AsTeR, from Recording for the Blind (RFB order number FB190).
Non-RFB customers may request a two-track (standard commercial
format) tape of AsTeR examples. Requests should be addressed to
info@RFB.org; ask for Raman's Math Examples Tape.
 
Finally, readers with access to the WWW can experience an
interactive demo of AsTeR at

http://www.cs.cornell.edu/Info/People/raman/aster/aster-toplevel.html

or

http://www.research.digital.com/CRL/personal/raman/aster/aster-toplevel.html
 
1. MOTIVATION
 
Documents encapsulate structured information.  Visual formatting
renders this structure on a two-dimensional display (paper or a
video screen) using accepted conventions.  The visual layout
helps the reader recreate, internalize and browse the underlying
structure.  The ability to selectively access portions of the
display, combined with the layout, enables multiple views. For
example, a reader can first skim a document to obtain a
high-level view and then read portions of it in detail. 
 
The rendering is attuned to the visual mode of communication,
which is characterized by the spatial nature of the display and
the eye's ability to actively access parts of this display.  The
reader is active, while the rendering itself is passive.
 
This active-passive role is reversed in oral communication:
information flows actively past a passive listener. This is
particularly evident in traditional forms of reproducing audio,
e.g., cassette tapes. Here, a listener can only browse the audio
with respect to the underlying time-line -- by rewinding or
forwarding the tape.  The passive nature of listening prohibits
multiple views -- it is impossible to first obtain a high-level
view and then "look" at portions of the information in detail. 
 
Traditionally, documents have been made available in audio by
trained readers speaking the contents onto a cassette tape to
produce "talking books."  Being non-interactive, these do not
permit browsing. They do have the advantage that the reader can
interpret the information and convey a particular view of the
structure to the listener.  However, the listener is restricted
to the single view present on the tape.  In the early 1980s,
text-to-speech technology was combined with OCR (Optical
Character Recognition) to produce "reading machines."  In
addition to being non-interactive, renderings produced from
scanning visually formatted text convey very little structure. 
Thus, the true audio document was non-existent when we started
our work.
 
We overcome these problems of oral communication by developing
the notion of audio formatting, and a computing system that
implements it. Audio formatting renders information structure
orally, using speech augmented by non-speech sound cues.  The
renderings produced by this process are attuned to an auditory
display: audio layout present in the output conveys information
structure.  Multiple audio views are enabled by making the
renderings interactive.  A listener can change how specific
information structures are rendered and browse them selectively.
Thus, the listener becomes an active participant in oral
communication.
 
In the past, information was available only in a visual form, and
it required a human to recreate its inherent structure. 
Electronic information has opened a new world: information can
now be captured in a display-independent manner -- using, e.g., 
tools like SGML and LaTeX (1). Though the principal mode of
display is still visual, we can now produce alternative
renderings, such as oral and tactile displays.  We take advantage
of this to audio-format information structure present in LaTeX
documents.  The resulting audio documents achieve effective oral
communication of structured information from a wide range of
sources, including literary texts and highly technical documents
containing complex mathematics.
 
The results of this work are equally applicable to producing
audio renderings of structured information from such diverse
sources as information databases and electronic libraries.  Audio
formatting clients can be developed to allow seamless access to a
variety of electronic information, available on both local and
remote servers. Thus, the server provides the information, and
various clients, such as visual or audio formatters, provide
appropriate views of the information.  Our work is therefore
significant in the area of developing adaptive computer
technologies. 
 
Today's computer interfaces are like the silent movies of the
past! As speech becomes a more integral part of human-computer
interaction, our work will become more relevant in the general
area of user-interface design, by adding audio as a new dimension
to computer interfaces.   
 
2. WHAT IS AsTeR?
  
AsTeR (2) is a computing system for producing audio renderings of
electronic documents. The present implementation works with
documents written in the TeX family of markup (3) languages,
i.e., TeX, LaTeX and AMSTeX. But the design of AsTeR is not
restricted to any single markup language. Though motivated by
the need to render technical documents, our system works equally
well on structured documents from non-technical subjects.
 
AsTeR is founded on the belief that all information is
display-independent. Information has structure, and this
structure is rendered on paper or on a visual display, but the
information itself is not restricted to these output modes. Thus,
AsTeR renders this same information in audio.

AsTeR recognizes the logical structure of a document as embodied
in the markup source and represents this structure internally.
The internal representation is then rendered in audio by applying
a collection of rendering rules written in AFL, a language for
audio formatting. Think of AFL as a high-level audio analogue to
a visual rendering language like PostScript. Rendering an
internalized high-level representation enables AsTeR to produce
different audio views of the information.  A user can either
listen to entire documents, or browse the internal structure and
selectively read portions of a document. The rendering and
browsing components of AsTeR can work equally well with
high-level representations we may get from sources such as
OCR-based document recognition.
 
This article gives a high-level view of how the various
components of AsTeR are used. AsTeR is implemented in CLOS (4)
with an Emacs front-end. The recommended way of using the system
is to run Lisp as a subprocess of Emacs. Throughout this article,
we will assume familiarity with basic Emacs concepts. Section 3
introduces the system by showing how simple documents can be read
and browsed. Section 4 explains how AsTeR can be extended to read
newly defined document structures in La)TeX (5). Section 5 gives
some examples of changing between different ways of rendering the
same information. Section 6 presents some advanced techniques
that can be used to advantage when reading complex documents such
as text books. AsTeR can render information produced by various
sources.  We give an example of this by demonstrating how AsTeR
can be used to interact with the Emacs calculator, a full-fledged
symbolic algebra system.   
 
3. READING DOCUMENTS
  
This section assumes that AsTeR has been installed and
initialized. At this point, text within any file being visited in
Emacs (in general, text in any Emacs buffer), can be rendered in
audio. To listen to a piece of text, mark it using standard Emacs
commands and invoke read-aloud-region (6). This results in the
marked text being audio formatted using a standard rendering
style.  The text can constitute an entire document or book; it
could also be a short paragraph or a single equation from a
document.  AsTeR renders both partial and complete documents.
 
This is the simplest and also the most common type of interaction
with AsTeR. All markup commands appearing in the text are
recognized to produce audio renderings that reflect the structure
represented by the markup. The input may be plain ASCII text; in
this case, AsTeR will still recognize the minimal document
structure present, i.e., paragraph breaks, quoted text etc. 
La)TeX markup helps the system recognize more of the document
logical structure, and as a consequence produce more
sophisticated renderings.
  
  
3.1 BROWSING THE DOCUMENT
    
Next to getting the system to speak, the most important thing is
to get it to stop speaking. Once an audio rendering has been
launched, rendering can be interrupted at any time by executing
reader-quit-reading (7). The listener can then traverse the
internal structure by moving the current selection, which
represents the current position in the document, by executing any
of the browser commands reader-move-previous, reader-move-next,
reader-move-up or reader-move-down.
  
To orient the user within the document structure, the current
selection is summarized by verbalizing a short message of the
form "<context> is <type>", e.g., moving down one level from
the top of the equation

        ABC = 0                                               (1)

produces the message "left hand side is a product". The user has
the option of either listening to just the current selection, or
reading the rest of the document. In the interest of brevity, we
will not give all of the browser key-bindings.
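
The browsing model can be illustrated with a small, self-contained
Common Lisp sketch. It is a toy model built around the example
equation above, not AsTeR's browser: the node structure, the
helper adopt, and the functions move-down, move-up, move-next and
summarize are hypothetical names, and the context and type strings
for every node other than the left hand side are invented for
illustration.

```lisp
;;; Toy model of browsing a document tree with a current selection.
;;; All names are hypothetical; this is not AsTeR's browser code.

(defstruct node
  context    ; role within the parent, e.g. "left hand side"
  type       ; kind of object, e.g. "a product"
  parent
  children)

(defmethod print-object ((n node) stream)
  ;; Print only the context, so the circular parent/child links
  ;; are never followed by the printer.
  (print-unreadable-object (n stream :type t)
    (format stream "~A" (node-context n))))

(defun adopt (parent &rest children)
  "Attach CHILDREN to PARENT and return PARENT."
  (dolist (child children)
    (setf (node-parent child) parent))
  (setf (node-children parent) children)
  parent)

(defun move-down (selection)
  "Return the first child of SELECTION, or SELECTION itself at a leaf."
  (or (first (node-children selection)) selection))

(defun move-up (selection)
  "Return the parent of SELECTION, or SELECTION itself at the root."
  (or (node-parent selection) selection))

(defun move-next (selection)
  "Return the next sibling of SELECTION, or SELECTION if there is none."
  (let* ((parent (node-parent selection))
         (siblings (and parent (node-children parent))))
    (or (second (member selection siblings)) selection)))

(defun summarize (selection)
  "Return a message of the form \"<context> is <type>\"."
  (format nil "~A is ~A" (node-context selection) (node-type selection)))

;; The equation ABC = 0, modelled two levels deep.
(defparameter *equation*
  (adopt (make-node :context "equation 1" :type "an equality")
         (make-node :context "left hand side" :type "a product")
         (make-node :context "right hand side" :type "a number")))
```

In this sketch, (summarize (move-down *equation*)) returns the
string "left hand side is a product", matching the message quoted
above.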
  
3.2 EXAMPLES OF USE
  
AsTeR can be used:
 
- To read technical articles and books: The files for such 
documents may be available on the local system or on the global
Internet (8). Resources retrieved over the network can be audio
formatted by AsTeR since they are just text in Emacs buffers.
Currently, the system audio formats 10 text books available to
the author on his local system. In addition, AsTeR also renders a
wide collection of technical documents available on the Internet
including technical reports and AMS bulletins.
 
- For entertainment: At present about 200 electronic texts are
available on the Internet, in addition to the complete works of
Shakespeare. The majority of these documents are in plain ASCII,
but the quality of audio renderings produced by AsTeR based on
the minimal document structure that can be recognized still
surpasses conventional reading machines. Increased availability
of electronic texts marked up in La)TeX, SGML and HTML will
enable better recognition of document structure, and as a
consequence, better audio renderings.
 
- In proof-reading: This feature is especially useful when
typesetting complex mathematical formulae. AsTeR can render both
partial and complete documents. Thus, although designed as a
system for reading documents, the flexible design, combined with
the power afforded by the Emacs editor, turns AsTeR into a very
useful document preparation aid. 
  
 
4. EXTENDING AsTeR
  
As explained in the previous section, the quality of audio
renderings produced by AsTeR is dependent on how much of the
document logical structure is recognized. Authors of La)TeX
documents often use their own macros (9) to encapsulate specific
structures. AsTeR of course does not know of these extensions to
start with. Occurrences of user-defined La)TeX macros are
initially rendered in a canonical way; typically, the
user-defined macros are read aloud as they appear in the running
text. 
 
Thus, given a document containing
  
        $A \kronecker B$
  
AsTeR would produce
  
        cap a kronecker cap b
  
In this case, this canonical rendering is quite acceptable.    
In general, how AsTeR renders such user-defined structures is
fully customizable. The first step is to extend the recognizer to
handle the new construct, in this case \kronecker. Here, we give
the reader a brief example of how this mechanism is used in
practice.
  
The recognizer is extended by calling the Lisp macro
define-text-object. In the case of the \kronecker macro, this
call takes the form:

(define-text-object :macro-name "kronecker"
  :number-args 0
  :processing-function kronecker-expand
  :object-name kronecker-product
  :supers (binary-operator)
  :precedence multiplication)
 
 
This extends the recognizer to represent instances of macro 
"kronecker" as instances of object kronecker-product. The user
can now define any number of ways in which an instance of object 
kronecker-product should be rendered.
 
 
AFL, our language for audio formatting, is used to define
rendering rules. Here, we give a rendering rule for object
kronecker-product. 
  
(def-reading-rule (kronecker-product simple)
  "Simple rendering rule for object kronecker-product."
  ;; Speak the left operand, then the operator name, then the
  ;; right operand, in the order they appear on paper.
  (read-aloud (first (children kronecker-product)))
  (read-aloud "kronecker product")
  (read-aloud (second (children kronecker-product))))
  
 which produces
 
cap a kronecker product cap b
  
 for the input text shown earlier.
 
Notice, however, that the rendering rule is free to render the
use of the kronecker product in more complex ways; in particular,
the order in which the expression is spoken can be completely
independent of how it appears on paper.  Thus, it is
straightforward to write a rendering rule that produces

        "The kronecker product of A and B"
 
AsTeR derives its power from representing document content 
internally as objects and by allowing several user-defined 
rendering rules for individual object types. Such rendering rules
can cause any number of audio events, ranging from speaking a
simple phrase to playing a digitized sound, when an instance of a
particular object type is rendered. The mechanism for extending
the recognizer affords this same power when rendering user-
defined constructs. Once the recognizer has been extended by an
appropriate call to define-text-object, such constructs can be
handled just as well as any standard La)TeX construct.
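
The idea of several named rendering rules per object type can be
modelled in plain CLOS. The sketch below is an illustration under
stated assumptions, not AFL or AsTeR code: the generic function
render, the helpers speak and rendering, and the class definition
are hypothetical stand-ins, and "speaking" is simulated by
collecting phrases in a list.

```lisp
;;; Toy model of multiple rendering rules for one object type.
;;; RENDER, SPEAK, RENDERING and the class below are hypothetical.

(defvar *spoken* '()
  "Phrases spoken so far, most recent first.")

(defun speak (phrase)
  "Simulate speech by recording PHRASE."
  (push phrase *spoken*))

(defclass kronecker-product ()
  ((children :initarg :children :reader children)))

(defgeneric render (object rule)
  (:documentation "Render OBJECT in audio using the rule named RULE."))

;; Leaves: a string is spoken as is, whatever the rule.
(defmethod render ((object string) rule)
  (declare (ignore rule))
  (speak object))

;; Rule SIMPLE: follow the written order.
(defmethod render ((object kronecker-product) (rule (eql 'simple)))
  (render (first (children object)) rule)
  (speak "kronecker product")
  (render (second (children object)) rule))

;; Rule DESCRIPTIVE: the spoken order differs from the order on paper.
(defmethod render ((object kronecker-product) (rule (eql 'descriptive)))
  (speak "the kronecker product of")
  (render (first (children object)) rule)
  (speak "and")
  (render (second (children object)) rule))

(defun rendering (object rule)
  "Return the phrases produced by rendering OBJECT with RULE as one string."
  (let ((*spoken* '()))
    (render object rule)
    (format nil "~{~A~^ ~}" (reverse *spoken*))))
```

Dispatching on both the object class and the rule name keeps each
rule independent, so a new way of speaking an object can be added
without touching the existing ones.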
 
 
5. PRODUCING DIFFERENT RENDERINGS OF THE SAME OBJECT
 
AsTeR can produce more than one kind of rendering for a given
object. When perusing printed information, a reader has the
luxury of viewing a complex piece of mathematics from different
perspectives, and AsTeR provides this same functionality. The
listener can switch between any of several pre-defined renderings
for a given object, or add to these by defining new rendering
rules. Switching between different rendering rules produces
different audio views of a given object.
  
Activating a rendering rule is the simplest way of changing how a
given object is rendered. The statement

        (activate-rule <object-name> <rule-name>)

activates rule <rule-name> for object <object-name>. Thus,
executing (activate-rule 'paragraph 'summarize) results in
paragraphs being summarized.
 
Suppose we wish to skip all instances of verbatim text in a LaTeX
document. We could define the following quiet rendering rule:

            (def-reading-rule (verbatim quiet) nil)
 
and activate it by executing
 
            (activate-rule 'verbatim 'quiet)
 
To later hear the verbatim text in a document, rule quiet is 
deactivated by executing
 
            (deactivate-rule 'verbatim)
 
Notice that at any given time, only one rendering rule is active 
for any object. Hence, we only need specify the object when
deactivating a rendering rule. AsTeR provides an Emacs interface 
to activating and deactivating rendering rules.
 
Activating a single rendering rule is a convenient way of
changing how a specific object is rendered. Rendering styles
allow making more global changes to the renderings. Activating
style style-1 by executing 
  
       (activate-style 'style-1)
 
makes the rendering rule named style-1 active for all objects for
which this rendering rule is defined. All other objects continue
to be rendered as before. This is also true when a sequence of
rendering styles is successively activated.
  
Thus, activating rendering styles is a convenient way of
progressively customizing the rendering of a complex document.
The effect of activating a style can be undone at any time by
executing
 
       (deactivate-style <style-name>)

AsTeR provides the following rendering styles:
  
- Variable-substitution: Use variable substitution when rendering
complex mathematical expressions.
 
- Use-special-pattern: Recognize special patterns in mathematical
expressions to produce context-specific renderings.
  
- Descriptive: Produce descriptive, context-specific renderings
for mathematical expressions.
  
- Simple: Produce a base-level audio notation for mathematical
expressions.
 
- Default: Produce default renderings.
 
- Summarize: Provide a short summary.
 
- Quiet: Skip objects.
  
When AsTeR is initialized, the following styles are active:    
 
       (use-special-pattern descriptive simple default) 
  
where the leftmost style is the most recently activated.

Defining a new rendering style amounts to defining a collection
of rendering rules having the same name. Note that a rendering
style need not provide rendering rules for all objects in the
document logical structure. As explained earlier, activating a
rendering style only affects the renderings of those objects for
which the style provides a rule.
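
How rules and styles interact can be sketched with two tables and
a list. This is a toy model, not AsTeR's implementation: every
function name below is hypothetical (the toy- prefix marks
stand-ins for the commands described above), and the lookup order
is an assumption made for illustration, namely that an explicitly
activated rule wins, and otherwise the most recently activated
style that defines a rule for the object is used.

```lisp
;;; Toy model of rule and style activation. All names are hypothetical,
;;; and the lookup order in EFFECTIVE-RULE is an assumption.

(defvar *defined-rules* (make-hash-table :test #'equal)
  "Keys are (object rule) lists for which a rendering rule exists.")

(defvar *active-rules* (make-hash-table)
  "Maps an object name to its explicitly activated rule, if any.")

(defvar *active-styles* '()
  "Active styles, most recently activated first.")

(defun toy-define-rule (object rule)
  (setf (gethash (list object rule) *defined-rules*) t))

(defun toy-activate-rule (object rule)
  (setf (gethash object *active-rules*) rule))

(defun toy-deactivate-rule (object)
  ;; Only the object is needed: at most one rule is active for it.
  (remhash object *active-rules*))

(defun toy-activate-style (style)
  (setf *active-styles* (cons style (remove style *active-styles*))))

(defun toy-deactivate-style (style)
  (setf *active-styles* (remove style *active-styles*)))

(defun effective-rule (object)
  "Return the name of the rule that would render OBJECT right now."
  (or (gethash object *active-rules*)
      (find-if (lambda (style)
                 (gethash (list object style) *defined-rules*))
               *active-styles*)))

;; Example setup: four rules spread over two object types.
(toy-define-rule 'verbatim 'default)
(toy-define-rule 'verbatim 'quiet)
(toy-define-rule 'paragraph 'default)
(toy-define-rule 'paragraph 'summarize)
(toy-activate-style 'default)
```

With this setup, activating style summarize changes the effective
rule for paragraph but leaves verbatim on default, because no
summarize rule was defined for verbatim.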
  
6.  USING THE FULL POWER OF AsTeR
 
This section demonstrates some advanced features of AsTeR that
are useful when rendering complex documents. AsTeR recognizes
cross-references and allows the listener to traverse these as
hypertext links. Cross-referenceable objects can be labelled
interactively and these labels used when referring to such
objects within renderings. The ability to switch between
rendering rules allows the listener to quickly locate portions of
interest in a document. By activating rendering rules, all
instances of a particular object can be floated to the end of the
containing hierarchical unit, or entirely skipped. This is
convenient when getting a quick overview of a document. AsTeR
also provides a simple bookmark facility for marking positions of
interest to be returned to later. Finally, AsTeR can be
interfaced with sources of structured information other than
electronic documents. We demonstrate this by interfacing AsTeR to
the Emacs calculator. 
  
 
6.1  Cross-References
 
Cross-reference tags occurring in the body of a document are
represented internally as instances of object cross-reference and
contain a link to the object being referenced. How such
cross-reference tags are rendered of course depends on the
currently active rule for object cross-reference. The default
rendering rule for cross-references presents the user with a
summary of the object being cross-referenced, e.g., the number
and title of a sectional unit. This is followed by a non-speech
audio prompt.
Pressing a key at this prompt results in the entire
cross-referenced object being rendered at this point. Reading
continues if no key is pressed within a certain time interval. In
addition, the listener can interrupt the rendering and move
through the cross-reference tags. This is useful in cases where
many such tags occur within the same sentence.
 
6.2  Labelling a cross-referenceable object

Consider a proof that reads:

        By theorem 2.1 and lemma 3.5 we get equation 8 and
        hence the result.
 
If the above looks abstruse in print, it sounds meaningless in
audio. This is in fact a serious drawback when listening to
mathematical books on cassette where it is practically impossible
to locate the cross-reference. AsTeR is more effective since
these cross-reference links can be traversed; but traversing each
link while listening to the above proof can be distracting.  
Typically, we only glance back at the cross-references to get
sufficient information about what theorem 2.1 is about. AsTeR
provides a convenient mechanism for building in such information
into the renderings. When a cross-referenceable object such as an
equation is rendered, the system verbalizes an automatically
generated label, i.e., the equation number, and then generates an
audible prompt. If the user presses a key at this prompt, he can
specify a more meaningful label which will be used in preference
to the system-generated label when rendering cross-reference
tags. 
 
To continue the current example, when listening to theorem 2.1,
the user could have specified the label "Fermat's theorem". Then
the proof shown earlier would be read as:
 
        By Fermat's theorem and lemma 3.5 we get equation 8
        and hence the result.
 
Of course, the user could have specified labels for the other
cross-referenced objects as well, in which case the rendering
produced almost obviates the need to look back at the cross-
referenced objects.
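
The preference for user-supplied labels over system-generated
ones can be sketched as a table lookup with a fallback. The code
below is a toy model only: set-label, reference-label,
system-label and render-proof are hypothetical names, and the
proof sentence is hard-coded rather than derived from a document.

```lisp
;;; Toy model of labels for cross-referenceable objects.
;;; All names are hypothetical; this is not AsTeR code.

(defvar *user-labels* (make-hash-table :test #'equal)
  "Maps (kind number) lists to labels supplied by the listener.")

(defun system-label (kind number)
  "Return the automatically generated label, e.g. \"theorem 2.1\"."
  (format nil "~A ~A" kind number))

(defun set-label (kind number label)
  "Record LABEL as the listener's name for the object KIND NUMBER."
  (setf (gethash (list kind number) *user-labels*) label))

(defun reference-label (kind number)
  "Prefer the listener's label; fall back to the system-generated one."
  (or (gethash (list kind number) *user-labels*)
      (system-label kind number)))

(defun render-proof ()
  "Return the example proof with each cross-reference tag resolved."
  (format nil "By ~A and ~A we get ~A and hence the result."
          (reference-label "theorem" "2.1")
          (reference-label "lemma" "3.5")
          (reference-label "equation" "8")))
```

After (set-label "theorem" "2.1" "Fermat's theorem"), the same
call to render-proof speaks the listener's label in place of
"theorem 2.1", while the unlabelled lemma and equation keep their
numbers.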
 
6.3  Locating portions of interest
  
Printed books allow the reader to skim through the text and
quickly locate portions of interest. Experienced readers use
several different techniques to achieve this. One of these is to
locate an equation or table of interest, and then read the text
surrounding this object. AsTeR provides this functionality to
some extent.
  
We explained in Section 5 that different rules can be activated
to change the type of renderings produced. Using this mechanism,
we can activate a rendering rule that only reads the equations
occurring in a document. Once an equation of interest is located,
rendering can be interrupted and the rendering rule changed.
Using the browser, the listener can now move the current
selection to the enclosing hierarchical unit and then read the
surrounding text.
 
6.4  Getting an overview of a document
  
Rendering rules can be activated to obtain different views of a
document. For instance, activating rendering rule quiet for
object paragraph provides a thumb-nail view of a document.
Activating rendering rule quiet is a convenient way of
temporarily skipping over all occurrences of a specific object.
We often do this when perusing printed documents; we skip over
complex material at the first reading and return to it later.
We may skip instances of some objects entirely, e.g., source code;
in other cases we may merely defer the reading. This notion of
delaying the reading of an object is aptly captured by the
concept of floating an object to the end of the enclosing unit.
Typesetting systems like La)TeX permit the author to float all
figures and tables to the end of the containing section or
chapter. However, only specific objects can be floated, and this
is exclusively under the control of the author, not the consumer
of the document.
 
AsTeR provides a much more general framework for floating
objects. Any object can be floated to the end of any enclosing
hierarchical unit, e.g., instances of object footnote can be
floated to the end of the containing paragraph. The ability to
float objects is very useful when producing audio renderings.
This is because audio takes time, and it is advantageous to delay
the rendering of some objects when obtaining an overview. Printed
documents use footnotes and floating figures for precisely this
reason. The interactive nature of AsTeR allows us to extend this
functionality.
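
Floating and skipping can both be viewed as simple reorderings of
the children of a hierarchical unit. The sketch below is a toy
model under that assumption: a unit is represented as a list of
(type . content) pairs, and float-to-end and skip are hypothetical
helpers rather than AsTeR functions.

```lisp
;;; Toy model of floating objects to the end of the enclosing unit.
;;; FLOAT-TO-END and SKIP are hypothetical helpers.

(defun float-to-end (children type)
  "Return a new list with every object of TYPE moved to the end.
The relative order within each group is preserved."
  (append (remove type children :key #'car)
          (remove-if-not (lambda (child) (eq (car child) type))
                         children)))

(defun skip (children type)
  "Return a new list without any object of TYPE (a quiet rendering)."
  (remove type children :key #'car))

;; A paragraph containing one footnote between two pieces of text.
(defparameter *paragraph*
  '((text . "Audio takes time,")
    (footnote . "unlike a glance at the page")
    (text . "so some objects are best deferred.")))
```

Because both helpers return new lists, the document itself is
left untouched and the listener can switch between the floated,
skipped and original orderings at will.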
 
6.5   Bookmarks
 
The browser provides a simple bookmark facility for marking
positions of interest to be returned to later. Browser command
mark-read-pointer, bound to C-b m, prompts for a bookmark name and
marks the current selection. The listener can later read the
object at this marked position, or move the current selection to
the marked position by executing browser command follow-bookmark
and specifying the appropriate bookmark name.
 
6.6    Reading using variable substitution
 
When reading complex mathematics in print, we often get a
high-level view of an equation first, and read the leaves of an
expression once we have understood the top-level structure. Thus,
when presented with a complex equation, an experienced reader of
mathematics might view it as an equation with a double summation
on the left-hand side and a double integral on the right-hand
side, and only then attempt to read the equation in full detail.
In an audio rendering that simply produces a linear rendering,
the temporal nature of audio prevents a listener from getting
such high-level views. We compensate by providing a variable
substitution rendering style. When active, this results in AsTeR
replacing sub-expressions in complex mathematics with meaningful
phrases. Having thus provided a top-level view, AsTeR then reads,
upon request, the sub-expressions that were replaced.
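
The two-pass idea behind variable substitution can be sketched on
expressions written as Lisp lists. This is a toy model, not the
algorithm AsTeR uses: the functions weight and
substitute-variables are hypothetical, generated names such as X1
stand in for the meaningful phrases mentioned above, and the
"replace anything heavier than a threshold" criterion is an
assumption chosen for simplicity.

```lisp
;;; Toy model of variable substitution on prefix expressions.
;;; WEIGHT, SUBSTITUTE-VARIABLES and the threshold are hypothetical.

(defun weight (expr)
  "Count the atoms in EXPR as a crude measure of complexity."
  (if (atom expr)
      1
      (reduce #'+ (mapcar #'weight expr))))

(defun substitute-variables (expr &optional (threshold 3))
  "Replace each top-level argument of EXPR heavier than THRESHOLD by a name.
Return two values: the abbreviated expression, and an alist of
(name . sub-expression) pairs in order of appearance."
  (let ((bindings '())
        (counter 0))
    (flet ((abbreviate (arg)
             (if (> (weight arg) threshold)
                 (let ((name (intern (format nil "X~D" (incf counter)))))
                   (push (cons name arg) bindings)
                   name)
                 arg)))
      (let ((top (cons (first expr) (mapcar #'abbreviate (rest expr)))))
        (values top (reverse bindings))))))
```

For the expression (= (+ (* a b) (* c d)) (/ x y)), the sum on
the left has weight 7 and is replaced by a name, while the
quotient on the right has weight 3 and is kept. A renderer would
speak the abbreviated expression first and then, on request, each
saved sub-expression.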
 
6.7    Interfacing AsTeR with other information sources
  
AsTeR has been presented as a system for reading documents. More
generally, AsTeR is a system for presenting structured
information in audio. This fact is amply demonstrated by the
following example where we interface AsTeR to the Emacs
calculator, a full-fledged symbolic algebra system.
  
The Emacs calculator is a freely available symbolic algebra
system distributed under the terms of the GNU General Public
License. It provided an excellent source of examples for trying
out the variable substitution rendering style for mathematical
expressions.
Providing an audio interface to a symbolic algebra system is 
challenging since the expressions produced are quite complex. The
flexible design of AsTeR and the power of Emacs make this
interface easy. AsTeR can render any information present in an
Emacs buffer. The output of the Emacs calculator satisfies this
requirement. A collection of Emacs Lisp functions arranges for
the output from the calculator to be sent to AsTeR.
 
A user of the Emacs calculator can now perform a computation and
execute command read-previous-calc-answer to have the output
rendered by AsTeR. The expression can be browsed, summarized,
transformed by applying variable substitution, and the rendering
manipulated in any of the ways described so far in the context of
documents.
    
NOTES
 
(1) Standard Generalized Markup Language (SGML) captures
information in a layout independent form; LaTeX, designed by
Leslie Lamport, is a document preparation  system based on the
TeX typesetting system developed by Donald Knuth.
  
(2)  In real life, AsTeR is the name of the author's guide-dog, a
big friendly black Labrador.
 
(3) To most people, "markup" means an increase in the price of an
article.  Here, "markup" is a term from the publishing and
printing business, where it means the instructions for the
typesetter, written on a typescript or manuscript copy by an
editor. Typesetting systems like LaTeX  have these commands
embedded in the electronic source.  A markup language is a set of
means (constructs) to express how text (i.e., that which is not
markup) should be processed, or handled in other ways.
  
(4) CLOS (Common Lisp Object System) is an object oriented
extension of Common Lisp.
 
(5) In this article, the notation La)TeX represents the entire
"family" of markup languages including TeX, LaTeX, and AMSTeX.
 
(6) This is an Emacs Lisp command, and in the author's setup, it
is bound to C-z d.
 
(7) reader-quit-reading is bound to C-b q.
 
(8) ANGE-FTP, an Emacs utility written by Andy Norman,  allows
seamless access to such files.  In addition, Emacs clients are
available for networked information retrieval systems like
GOPHER, WWW and WAIS.  
 
(9) Macros permit an author to define new language constructs in
TeX and specify how these constructs should be rendered on paper.