T. V. Raman
Cambridge Research Lab
Digital Equipment Corp.
Bldg 650, One Kendall Square
Cambridge MA 02139
E-mail: <raman@crl.dec.com>
Voice-mail: 1 (617) 692-7637
Fax: 1 (617) 692-6650
Abstract
Screen-readers, computer software that enables a visually impaired user to read the contents of a visual display, have been available for more than a decade. Screen-readers are separate from the user application; consequently, they have little or no contextual information about the contents of the display. The author has used traditional screen-reading applications for the last five years. The speech-enabling approach described here has been implemented in Emacspeak to overcome many of the shortcomings he has encountered with traditional screen-readers.
The approach used by Emacspeak is very different from that of traditional screen-readers. Screen-readers allow the user to listen to the contents appearing in different parts of the display, but the user is entirely responsible for building a mental model of the visual display in order to interpret what an application is trying to convey. Emacspeak, on the other hand, does not speak the screen. Instead, applications provide both visual and speech feedback, and the speech feedback is designed to be sufficient by itself.
This approach reduces cognitive load on the user and is relevant to providing general spoken access to information. Producing spoken output from within the application, rather than speaking the visually displayed information, vastly improves the quality of the spoken feedback. Thus, an application can display its results in a visually pleasing manner; the speech-enabling component renders the same results in an aurally pleasing way.
Keywords: Speech Interface, Direct Access, Spoken Feedback, Audio Formatting, Speech as a first-class I/O medium.
Introduction
A screen-reader is a computer application designed to provide spoken feedback to a visually impaired user. Screen-readers have been available since the mid-80's. During the 80's, such applications relied on the character representation of the contents of the screen to produce the spoken feedback. The advent of bitmap displays led to a complete breakdown of this approach, since the contents of the screen were now light and dark pixels. A significant amount of research and development has been carried out to overcome this problem and provide speech access to the Graphical User Interface (GUI).
The best and perhaps the most complete speech access system for the GUI is Screen Reader/2 (Screen Reader for OS/2), developed by Dr. Jim Thatcher at the IBM Watson Research Center [Tha94]. This package provides robust spoken access to applications under the OS/2 Presentation Manager and Windows 3.1. Commercial packages for Microsoft Windows 3.1 provide varying levels of spoken access to the GUI. The Mercator project [ME92, WKES94, MW94, Myn94] has focused on providing spoken access to the X Window System.
A common feature of traditional DOS-based screen-readers and speech access packages for the GUI is their attempt to convey the contents of the visual display via speech. In fact, a significant amount of the development effort required to design speech access packages for the GUI has concentrated on building robust off-screen models: data structures that represent the contents of the GUI's visual display. Construction of such an off-screen model helps screen-readers regain the ground they lost with the advent of graphical displays. However, the nature of the spoken feedback does not change.
Shortcomings Of Reading The Screen
Screen-readers have helped open up the world of computing to visually impaired users (the author has used them for the last five years). However, the spoken interface they provide leaves a lot to be desired.

The primary shortcoming of such interfaces is their inability to convey the structure present in visually displayed information. Since the screen-reading application has only the contents of the visual display to examine, it conveys little or no contextual information about what is being displayed. Put another way:

    A screen-reader speaks what is on the screen without conveying why it is there.

As a consequence, accessing applications that display highly structured output in a visually pleasing manner is cumbersome with screen-readers.
Here is a simple example to illustrate this point. A typical calendar display is made up of a table showing the days of the week. This information is visually laid out to allow the eye to quickly see what day a particular date of the month falls on. Thus, given the display shown in Fig. 1, it is easy to answer the question "What day is it today?".

    Jan 1995
     S  M  T  W Th  F Sa
     1  2  3  4  5  6  7
     8  9 10 11 12 13 14
    15 16 17 18 19 20 21
    22 23 24 25 26 27 28
    29 30 31

    Figure 1: A Typical Calendar Application

When this same display is accessed with a screen-reader, the user hears the entire contents of the table spoken aloud. This results in the following set of meaningless utterances:

    pipe pipe 1 pipe 2 pipe 3 pipe 4 pipe 5 pipe 6 pipe 7 pipe pipe ...
    pipe pipe 29 pipe 30 pipe 31 pipe
    pipe pipe pipe pipe pipe
Alternatively, the characters under the application cursor can be spoken. In the case of Fig. 1, the listener would hear the system say "one". To answer the question "What day is it today?" the user has to first build a mental representation of the visual display, and then navigate around the screen, examining the contents that appear in the same screen column as the 1, in order to infer that the date is Sunday, January 1, 1995.
Screen-readers for both character-cell and graphical displays suffer from this shortcoming. This is a consequence of trying to read the screen instead of providing true spoken feedback. The rest of this paper describes Emacspeak, an interface that treats speech as a first-class output medium. Screen-readers speak the screen contents after the application has displayed its results; Emacspeak integrates spoken feedback into the application itself. This tight integration between the spoken output and the user application enables Emacspeak to provide rich, context-sensitive spoken feedback. As a case in point, when using the calendar application, the user hears the current date spoken as "Sunday, January 1, 1995". For related work on integrating speech as a first-class I/O medium into general user applications, see [YLM95].
We conclude this introduction by pointing out that visual layout plays an important role in cuing the reader to information structure. Such visual cues reduce cognitive load by allowing the perceptual system to perceive the inherent structure present in the information, thereby freeing the cognitive system to process the information itself. Spoken feedback produced from the visual layout proves difficult to understand because many of the structural cues are lost; to make things worse, other structural cues turn into noise (the "pipe pipe ..." above is a case in point). The listener must therefore spend a large number of cognitive cycles parsing the spoken utterance, which makes understanding the information considerably harder. Speaking the information in an aurally pleasing manner alleviates this burden, leading to better aural comprehension.
A Different Approach
We tightly integrate spoken output with the user application. Such tight integration gives the functions providing spoken feedback direct access to the application context. Thus, in the case of the calendar example shown in Fig. 1, the speech feedback routines can access the runtime environment of the calendar application to find out that the current date is Sunday, January 1, 1995, instead of trying to guess this from the visual presentation of the calendar.

In this sense, using speech as a first-class output medium provides direct access to the information displayed by an application; traditional screen-readers provide what can at best be described as indirect access.
Motivation
Every computer application, big or small, can be characterized as having the following structure:

- Accept user input.
- Compute on the data.
- Display the results of the computation.

Human-computer interaction focuses on the first and third of these stages. Traditional WIMP (Windows, Icons, Menus, and Pointer) interfaces have assumed a purely visual interaction; applications designed for such interfaces naturally optimize their displays for this mode of interaction.

However, visual layout is not optimal for spoken interaction, as evinced by the calendar application described earlier. By having the user interface (UI) components of the application communicate directly with the speech subsystem, Emacspeak produces more usable output. Contrast this with the screen-reading paradigm, where spoken output is produced by a program that is unaware of, and separate from, the user application.
Implementation
We have motivated the design of Emacspeak with the help of the calendar example. However, Emacspeak is much more than a simple talking calendar; it extends all of GNU Emacs to provide full spoken feedback. The author uses Emacspeak on his Alpha AXP workstation running Digital UNIX and on his laptop running Linux; for the first time in five years, he can sit in front of a workstation rather than in front of a DOS PC functioning as a terminal. Emacspeak has been made available on the Internet at URL http://www.research.digital.com/CRL and is currently being used by an increasing number of Digital's customers. (UNIX is a trademark of Unix Systems Laboratories. Alpha AXP, DEC, DECstation and DECtalk are trademarks of Digital Equipment Corporation.)

This paper will not go into implementation details; our purpose is to highlight the novel interface provided by Emacspeak. For the sake of completeness, here is a brief sketch of how the system is implemented.
Emacspeak consists of a core speech module that provides basic speech services to the rest of the system, e.g., functions that speak characters, words and lines. The advice facility of Emacs Lisp is used to integrate the speech feedback provided by these functions into Emacs. This facility allows us to specify program fragments that are to be run either before, after, or around any function. Since the user interface level of GNU Emacs is implemented entirely in Emacs Lisp, the functions making up this interface can be advised to speak. The primary advantage of this approach is that we have been able to speech-enable all of GNU Emacs, a large system, without modifying a single line of source code from the original Emacs distribution.
We conclude this sketch with an example. The function next-line implements movement of the editing cursor to the next line in GNU Emacs. Emacspeak provides the following advice to this function:

(defadvice next-line (after emacspeak)
  "Speak the line you moved to."
  (when (interactive-p)
    (emacspeak-speak-line)))

This advice specifies that if the function next-line is called interactively (as the result of the user pressing a key), then the function emacspeak-speak-line should be called after next-line has done its work.
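The same mechanism yields the context-sensitive calendar feedback described in the introduction. The following is a minimal sketch, not Emacspeak's actual code: it advises the standard calendar motion command calendar-forward-day to speak the date under the cursor, obtained from the calendar library's own calendar-cursor-to-date and calendar-date-string; dtk-speak stands in here for a primitive from the core speech module.

(require 'advice)
(require 'calendar)

;; A minimal sketch, not Emacspeak's actual code: after an
;; interactive motion in the calendar buffer, query the calendar's
;; runtime state for the date under the cursor and speak it.
(defadvice calendar-forward-day (after speech-sketch activate)
  "Speak the date moved to, e.g., Sunday, January 1, 1995."
  (when (interactive-p)
    ;; calendar-cursor-to-date reads the date under the cursor;
    ;; calendar-date-string renders it as a full English date.
    (dtk-speak (calendar-date-string (calendar-cursor-to-date)))))

Because the advice consults the application's runtime state rather than the screen, the feedback is the date itself, not the layout of the calendar grid.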
The next section gives examples of the spoken interaction provided by Emacspeak when performing several day-to-day computing tasks. All of the facilities described are implemented using the model sketched above.
Examples Of Common Computing Tasks
This section describes the user interface provided by Emacspeak when performing commonplace computing tasks such as editing and proof-reading, surfing the WWW, and reading and replying to electronic mail and Usenet news. A printed description suffers from the natural shortcoming of elucidating in print what is essentially aural. Here are some features of the spoken feedback that are common to the different interaction scenarios:
- Speech output is always interruptible. Actions causing new information to be spoken first interrupt any ongoing output.

- Emacspeak provides a voice-lock facility that permits association of syntactic units of text with different voices. This is a powerful method of conveying structure succinctly and was first described in [Ram94]. Audio formatting is used to aurally set apart different syntactic units, for example, highlighted regions of text. (A sketch of this mechanism appears after this list.)

- Emacspeak uses auditory icons [SMG90, BGB88, Gav93, BGP93, JSBG86], short snippets of sound (under 0.25--0.5 seconds) that cue common events such as selecting, opening and closing an object. Used consistently throughout the interface, these cues speed up user interaction: an experienced user can often continue to the next task when an aural cue is heard, without waiting for the spoken confirmation. (Emacspeak will still produce the spoken confirmation, but continuing to the next task will interrupt this speech.)
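To make voice-lock and auditory icons concrete, here is a minimal sketch under stated assumptions, not Emacspeak's actual code. It assumes the speech layer honors a personality text property when rendering text, that voice-bolden names one of the available voices, and it substitutes a hypothetical message-based stand-in for the real auditory-icon player.

(defun sketch-aural-highlight (beg end)
  "Mark the text between BEG and END to be spoken in a bolder voice.
Assumes the speech layer interprets the `personality' text property."
  (put-text-property beg end 'personality 'voice-bolden))

(defun sketch-auditory-icon (event)
  "Hypothetical stand-in for the auditory-icon player: cue EVENT.
A real implementation would play a short sound named after EVENT,
e.g., `open', `close' or `select'."
  (message "auditory icon: %s" event))

Attaching the voice to the text itself, rather than to the command that spoke it, means any function that later speaks that region renders it in the associated voice.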
Editing Documents
Emacspeak speaks each character as it is typed. Press-
ing the space-bar causes the previous word to be spoken.
Cursoring through a file speaks each line; speech is in-
terrupted if the cursor is moved while a line is being
spoken. This allows the user to efficiently browse files.
All of the standard Emacs navigation commands, e.g.,
move to the next paragraph, skip this S-expression, give
appropriate auditory feedback.
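As a rough illustration of how this character echo can be wired in, here is a minimal sketch, not Emacspeak's actual code: it advises self-insert-command, the command behind ordinary typing; emacspeak-speak-this-char names the kind of character-speaking primitive the core speech module would provide, and is an assumption here.

(require 'advice)

(defadvice self-insert-command (after echo-sketch activate)
  "Speak the character that was just typed."
  (when (interactive-p)
    ;; The character before point is the one just inserted.
    (emacspeak-speak-this-char (preceding-char))))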
Emacs' knowledge of the syntax of what is being edited is used to advantage in enabling sophisticated navigation. For instance, the user can move across statements when browsing program source code. When navigating through a file of C code, the user gets relevant spoken feedback that conveys the structure of the program. Different syntactic units are spoken in different voices to increase the bandwidth of aural communication. In addition, the user can have the semantics of a line of source code spoken upon request. Thus, when the editing cursor is on the closing brace that ends a function block, Emacspeak says "brace that closes function" and then speaks the opening line of that function. This provides the listener the same kind of feedback that users of traditional visual interfaces have come to expect.
Spell Checking
Emacspeak provides a fluent aural interface to ispell, a powerful interactive spell checker. Here is a brief description of the visual interface provided by the spell checker, for those unfamiliar with this system.

Typically, a file opened with Emacs can be spell-checked by invoking ispell. Errors are visually highlighted, with a separate window showing a list of possible corrections. The user can type a number to pick a choice from the list of corrections; alternatively, a replacement can be typed in directly.

Using this interface with a traditional screen-reader is painful to say the least (believe me, I've done it!). A user of a screen-reader needs to query the position of the cursor to find out the erroneous word, then locate the window of corrections on the screen before continuing.
With Emacspeak, the fact that the list of possible corrections appears in a separate window is completely hidden from the listener. When running the spell checker, Emacspeak speaks the line containing the erroneous text with the incorrect word aurally highlighted. Next, the list of possible corrections is spoken; the user can pick a choice at any time. Based on the user's action, the spell checker inserts the appropriate correction and continues to the next error.

A similar approach is used to provide aural feedback for the common editing task of interactively replacing one string with another. Emacspeak speaks the line containing the instance of the text being replaced, with the instance that will be replaced aurally highlighted. This allows the listener to respond correctly when there are multiple occurrences of the text being replaced within a line. Thus, the task of replacing the first occurrence of foo with bar while leaving the second instance of foo intact in the example

    Change this food, but do not touch this fool.

is trivial; the same task using a screen-reader is much harder.
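The aural highlighting used in both scenarios can be sketched with the personality property introduced after the feature list above; this is a hedged illustration rather than ispell's or Emacspeak's actual code, reusing sketch-aural-highlight and the emacspeak-speak-line primitive named in the implementation sketch.

(defun sketch-speak-line-with-highlight (beg end)
  "Speak the current line, aurally setting apart the text BEG..END.
Assumes the speech layer honors the `personality' text property."
  (sketch-aural-highlight beg end)
  (emacspeak-speak-line))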
Electronic Mail
Emacspeak provides a fluent spoken interface to electronic mail. Instead of having to listen to verbose utterances consisting of email headers, the listener hears a succinct summary of the form "sender name on subject". Emacspeak also infers the dialogue structure present in electronic mail messages, based on the standard conventions used to cite the contents of previous messages in a conversation thread. When such dialogue structure is detected, the different parts of the dialogue are spoken using different voice characteristics. Hitting any key while a part of the dialogue is being spoken results in the system skipping to the next portion of the dialogue.
Usenet News
Emacspeak provides a fluent spoken extension to GNUS, the GNU Emacs news-reader. The interface permits the user to browse news using the four arrow keys.

We present the user with a simple metaphor of opening and closing objects. The up and down arrows navigate through objects at the current level; the right arrow opens the current object, while the left arrow closes it.

To begin with, the user opens up Usenet news. The up and down arrows navigate through the list of newsgroups, providing a succinct verbal description of the current group and the number of unread articles. Opening a group with the right arrow results in the up and down arrow keys moving through the list of unread articles; again, each article is succinctly summarized using utterances of the form "Sender on topic, 33 lines". Opening an article by pressing the right arrow speaks it; the listener can move to the next article merely by pressing the down arrow, which interrupts the reading of the current article and summarizes the next one. The auditory icons described earlier are especially useful when browsing news; the aural cues for opening, closing and selecting objects allow the listener to quickly move to the next task in the interface.

All of the features described in the section on reading email are available when reading news; Emacspeak presents the dialogue structure present in news articles using the voice-lock feature described above.
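The open/close metaphor amounts to a thin layer of keybindings over GNUS. As a minimal sketch under stated assumptions, not Emacspeak's actual code, the arrow keys can be mapped onto standard GNUS commands; the spoken summaries and auditory icons would then be attached by advising these commands, exactly as for next-line.

(require 'gnus)

;; A minimal sketch, not Emacspeak's actual code: map the four
;; arrow keys onto the open/close metaphor in the two main GNUS
;; buffers by delegating to standard GNUS commands.
(define-key gnus-group-mode-map [up] 'gnus-group-prev-group)
(define-key gnus-group-mode-map [down] 'gnus-group-next-group)
(define-key gnus-group-mode-map [right] 'gnus-group-select-group)
(define-key gnus-summary-mode-map [up] 'gnus-summary-prev-article)
(define-key gnus-summary-mode-map [down] 'gnus-summary-next-article)
(define-key gnus-summary-mode-map [right] 'gnus-summary-show-article)
(define-key gnus-summary-mode-map [left] 'gnus-summary-exit)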
Surfing The WWW
The WWW presents two interesting challenges to a spoken interface:

- The presence of hypertext links.
- The presence of interactive elements, e.g., fill-out forms consisting of UI elements such as input fields, check boxes and radio buttons.

Emacspeak provides a spoken extension to W3, the excellent Emacs-based WWW browser developed and maintained by William Perry.
Browsing A WWW Page. The listener can browse a WWW page just like any other document. Hyperlinks are spoken in a different voice. The listener can interrupt speech at any time and activate the link that was most recently spoken. The listener can also move the application focus between the various links on a page; jumping to a link results in the anchor text being spoken, along with an auditory cue indicating a large movement. Activating a link plays the auditory icon for opening an object, retrieves the document, and finally announces the title of the newly opened WWW document.
Interactive WWW Documents. The W3 browser parses a WWW document before displaying it. Emacspeak relies on this internal representation to provide the spoken rendering, rather than examining the visually displayed document. This fits well with the overall design of Emacspeak; it also enables Emacspeak to produce spoken feedback that would be impossible to generate by merely examining the screen.

A typical interaction with a form element consists of:

- Moving system focus to the element.
- Changing the state of the form element, e.g., pressing a button or entering a value.
- Obtaining confirmation from the system about the recently performed action.

We illustrate this with examples of what happens when the user interacts with different form elements found on WWW documents.
Text Field
- Emacspeak summarizes the element under the focus with an utterance of the form "text field field name set to value". The name of the text field, and its value if any, are retrieved from the internal representation.
- Pressing enter results in the spoken prompt "Enter value for field name".
- After the value has been input, Emacspeak confirms this with the announcement "text field field name set to value".

Check Box
- Emacspeak summarizes the check box with an utterance of the form "Check-box name is checked", assuming the box has been previously checked.
- Pressing enter produces a button click.
- Emacspeak says "unchecked check box name".

Radio Button
The interaction parallels that described above for check boxes. The utterance uses the phrase "is pressed" to distinguish radio buttons from check boxes.
Navigating The File System
Emacs' dired mode, which is used to navigate the file system and perform operations such as moving, copying and deleting files, is extended to provide succinct aural feedback. When navigating through the file listing, the user hears the name of the current file or directory spoken; different file types, e.g., directories, executables and symbolic links, are distinguished by speaking their names in different voices. Opening a file plays the auditory icon for opening an object, and then speaks the name of the file just opened. Marking a file for later processing, deleting a file, etc. all produce auditory icons.

The auditory icons are very useful in this context because an action such as deleting a file in dired typically both affects the current object and moves the focus. Visually, the file marked for deletion is set apart and the focus is moved. Combining the sound of a file being deleted with the speaking of the new current object introduces the same level of parallelism into the aural interaction.
When navigating the dired buffer for the directory containing this paper, a screen-reader would speak a typical line, shown below:

    -rw-r--r-- 1 raman users 11905 Aug 17 16:04 examples.tex

as "dash rw dash r dash dash r dash dash 1 raman users 11905 Aug 17 16:04 examples.tex", an utterance that is hard to parse and comprehend. In contrast, Emacspeak merely speaks the filename; the listener can repeatedly press the tab key to hear the various fields of the file listing. Figure 2 lists the utterances produced by each successive press of the tab key.
Permissions rw r r
Links 1
Owner raman
Group users
Size 11905
Last Modified Aug 17 16:04
Filename examples.tex
Figure 2: Tabbing through a file listing.
Notice that Emacspeak infers the meaning of each field
in the file listing. Pressing the tab key while a field
is being described interrupts speech immediately and
moves to the next field.
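As with next-line, this behavior follows from advising dired's motion commands. Here is a minimal sketch, not Emacspeak's actual code: it advises dired-next-line to speak just the filename, using the standard dired-get-filename; dtk-speak again stands in for a core speech primitive.

(require 'dired)
(require 'advice)

(defadvice dired-next-line (after speech-sketch activate)
  "Speak the name of the file on the line moved to."
  (when (interactive-p)
    ;; dired-get-filename with 'no-dir returns the bare file name;
    ;; the second argument suppresses errors on non-file lines such
    ;; as the directory header.
    (let ((file (dired-get-filename 'no-dir t)))
      (when file
        (dtk-speak file)))))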
Conclusion
We conclude with a summary of what we have learnt from the work on Emacspeak. Firstly, the design of Emacspeak as a speech interface, as opposed to a system that reads the screen, is radically different from what has been attempted in the past. The current implementation has achieved a remarkable level of success in providing fluent speech access to day-to-day computing tasks.

The convoluted interfaces provided by screen-readers proved moderately effective in the case of visually impaired users: there was no other choice, and consequently users had the motivation to learn and use these interfaces. However, general users who wish to use speech as an extra modality to enhance their interaction with the computer are unlikely to put up with such interfaces. The direct access provided by the speech-enabling approach is likely to produce more acceptable output and make deploying speech interfaces easier.
Finally, our implementation of Emacspeak has provided the first true speech access interface to UNIX workstations. To date, the only available solution for visually impaired users has been to access such workstations using PC's running screen-readers as talking terminals. Our work provides a viable alternative that opens up the power of UNIX and the wealth of communication and development tools that are commonplace in this environment.
Acknowledgements
We would like to thank the authors of the various Emacs subsystems, such as the WWW browser and the email and news readers. Without their work, Emacspeak would have remained a speech interface to a text editor, in itself not a very useful artifact. Special thanks go to Hans Chalupsky, author of the advice package, without which the implementation of Emacspeak would have been difficult, if not impossible. I would also like to thank Win Treese for drawing my attention to the power of the advice facility, and Dave Wecker for goading me into writing Emacspeak (he got tired of listening to my complaints about how inadequate screen-readers were).
REFERENCES
[BGB88] W. Buxton, W. Gaver, and S. Bly. The use of nonspeech audio at the interface. Tutorial Notes, CHI '88, 1988.

[BGP93] Meera M. Blattner, Ephraim P. Glinert, and Albert L. Papp. Sonic Enhancements for 2-D Graphic Displays, and Auditory Displays. To be published by Addison-Wesley in the Santa Fe Institute Series. IEEE, 1993.

[Gav93] William Gaver. Synthesizing auditory icons. Proceedings of INTERCHI 1993, pages 228--235, April 1993.

[JSBG86] K. I. Joy, D. A. Sumikawa, M. M. Blattner, and R. M. Greenberg. Guidelines for the syntactic design of audio cues in computer interfaces. Nineteenth Annual Hawaii International Conference on System Sciences, 1986.

[ME92] Elizabeth D. Mynatt and W. Keith Edwards. Mapping GUIs to auditory interfaces. Proceedings of ACM UIST '92, pages 61--70, 1992.

[MW94] E. D. Mynatt and G. Weber. Nonvisual presentation of graphical user interfaces: Contrasting two approaches. Proceedings of the 1994 ACM Conference on Human Factors in Computing Systems (CHI '94), April 1994.

[Myn94] E. D. Mynatt. Auditory Presentation of Graphical User Interfaces. Santa Fe Institute Series. Addison-Wesley: Reading, MA, 1994.

[Ram94] T. V. Raman. Audio System for Technical Readings. PhD thesis, Cornell University, May 1994. URL http://www.research.digital.com/CRL/personal/raman/raman.html.

[SMG90] D. A. Sumikawa, M. M. Blattner, and R. M. Greenberg. Earcons and icons: Their structure and common design principles. Visual Programming Environments, 1990.

[Tha94] James Thatcher. Screen Reader/2: Access to OS/2 and the graphical user interface. Proc. of the First Annual ACM Conference on Assistive Technologies (ASSETS '94), pages 39--47, Nov 1994.

[WKES94] W. K. Edwards, E. D. Mynatt, and K. Stockton. Providing access to graphical user interfaces, not graphical screens. Proc. of the First Annual ACM Conference on Assistive Technologies (ASSETS '94), pages 47--54, Nov 1994.

[YLM95] Nicole Yankelovich, Gina Anne Levow, and Matt Marx. Designing SpeechActs: Issues in speech user interfaces. In Proceedings of CHI '95, Human Factors in Computing Systems, pages 369--376, May 1995.