A desktop is a workspace that one uses to organize the tools of one's trade. Graphical desktops provide rich visual interaction for performing day-to-day computing tasks; the goal of the audio desktop is to enable similar efficiencies in an eyes-free environment. Thus, the primary goal of an audio desktop is to use the expressiveness of auditory output (both verbal and nonverbal) to enable the end user to perform a full range of computing tasks:
Communication through the full range of electronic messaging services
Ready access to local documents on the client and global documents on the Web
Ability to develop software effectively in an eyes-free environment
The Emacspeak audio desktop was motivated by the following insight: to provide effective auditory renderings of information, one needs to start from the actual information being presented, rather than a visual presentation of that information. This had earlier led me to develop AsTeR, Audio System For Technical Readings (http://emacspeak.sf.net/raman/aster/aster-toplevel.html). The primary motivation then was to apply the lessons learned in the context of aural documents to user interfaces—after all, the document is the interface.
The primary goal was not to merely carry the visual interface over to the auditory modality, but rather to create an eyes-free user interface that is both pleasant and productive to use.
Contrast this with the traditional screen-reader approach where GUI widgets such as sliders and tree controls are directly translated to spoken output. Though such direct translation can give the appearance of providing full eyes-free access, the resulting auditory user interface can be inefficient to use.
These prerequisites meant that the environment selected for the audio desktop needed:
A core set of speech and nonspeech audio output services
A rich suite of pre-existing applications to speech-enable
Access to application context to produce contextual feedback
I started implementing Emacspeak in October 1994. The target environments were a Linux laptop and my office workstation. To produce speech output, I used a DECTalk Express (a hardware speech synthesizer) on the laptop and a software version of the DECTalk on the office workstation.
The most natural way to design the system to leverage both speech options was to first implement a speech server that abstracted away the distinction between the two output solutions. The speech server abstraction has withstood the test of time well; I was able to add support for the IBM ViaVoice engine later, in 1999. Moreover, the simplicity of the client/server API has enabled open source programmers to implement speech servers for other speech engines.
Emacspeak speech servers are implemented in the TCL language. The speech server for the DECTalk Express communicated with the hardware synthesizer over a serial line. As an example, the command to speak a string of text was a proc that took a string argument and wrote it to the serial device. A simplified version of this looks like:
proc tts_say {text} {puts -nonewline $tts(write) "$text"}
The speech server for the software DECTalk implemented an equivalent, simplified tts_say version that looks like:
proc say {text} {_say "$text"}
where _say calls the underlying C implementation provided by the DECTalk software.
The net result of this design was to create separate speech servers for each available engine, where each speech server was a simple script that invoked TCL's default read-eval-print loop after loading in the relevant definitions. The client/server API therefore came down to the client (Emacspeak) launching the appropriate speech server, caching this connection, and invoking server commands by issuing appropriate procedure calls over this connection.
Notice that so far I have said nothing explicit about how this client/server connection was opened; this late binding proved beneficial later when it came to making Emacspeak network-aware. The initial implementation worked by the Emacspeak client communicating with the speech server using stdio. Later, making this client/server communication go over the network required the addition of a few lines of code that opened a server socket and connected stdin/stdout to the resulting connection.
Thus, designing a clean client/server abstraction, and relying on the power of Unix I/O, has made it trivial to later run Emacspeak on a remote machine and have it connect back to a speech server running on a local client. This enables me to run Emacspeak inside screen on my work machine, and access this running session from anywhere in the world. Upon connecting, I have the remote Emacspeak session connect to a speech server on my laptop, the audio equivalent of setting up X to use a remote display.
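To make the shape of this client/server API concrete, here is a minimal, hypothetical sketch in Emacs Lisp of the client side: it launches a speech server as a subprocess, caches the connection, and issues procedure calls over the subprocess's standard input. The function and variable names are illustrative, not Emacspeak's actual ones.
(defvar my-tts-process nil
  "Cached connection to the currently running speech server.")

(defun my-tts-open (server-program)
  "Launch SERVER-PROGRAM as a subprocess and cache the connection."
  (setq my-tts-process
        (start-process "speech-server" nil server-program)))

(defun my-tts-say (text)
  "Invoke the server's tts_say procedure by writing to its standard input."
  (process-send-string my-tts-process (format "tts_say {%s}\n" text)))
Because the client only ever writes procedure calls to a stream, switching that stream from stdio to a network socket leaves this code untouched, which is the late binding described above.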
The simplicity of the speech server abstraction described above meant that version 0 of the speech server was running within an hour after I started implementing the system. This meant that I could then move on to the more interesting part of the project: producing good quality spoken output. Version 0 of the speech server was by no means perfect; it was improved as I built the Emacspeak speech client.
A friend of mine had pointed me at the marvels of Emacs Lisp advice a few weeks earlier. So when I sat down to speech-enable Emacs, advice was the natural choice. The first task was to have Emacs automatically speak the line under the cursor whenever the user pressed the up/down arrow keys.
In Emacs, all user actions invoke appropriate Emacs Lisp functions. In standard editing modes, pressing the down arrow invokes function next-line, while pressing the up arrow invokes previous-line. To speech-enable these commands, version 0 of Emacspeak implemented the following rather simple advice fragment:
(defadvice next-line (after emacspeak)
  "Speak line after moving."
  (when (interactive-p)
    (emacspeak-speak-line)))
The emacspeak-speak-line function implemented the necessary logic to grab the text of the line under the cursor and send it to the speech server. With the previous definition in place, Emacspeak 0.0 was up and running; it provided the scaffolding for building the actual system.
The next iteration returned to the speech server to enhance it with a well-defined eventing loop. Rather than simply executing each speech command as it was received, the speech server queued client requests and provided a launch command that caused the server to execute queued requests.
The server used the select system call to check for newly arrived commands after sending each clause to the speech engine. This enabled immediate silencing of speech; with the somewhat naïve implementation described in version 0 of the speech server, the command to stop speech would not take immediate effect, since the speech server would first process previously issued speak commands to completion. With the speech queue in place, the client application could now queue up arbitrary amounts of text and still get a high degree of responsiveness when issuing higher-priority commands such as requests to stop speech.
Implementing an event queue inside the speech server also gave the client application finer control over how text was split into chunks before synthesis. This turns out to be crucial for producing good intonation structure. The rules by which text should be split up into clauses vary depending on the nature of the text being spoken. As an example, newline characters in programming languages such as Python are statement delimiters and determine clause boundaries, but newlines do not constitute clause delimiters in English text.
As an example, a clause boundary is inserted after each line when speaking the following Python code:
i=1
j=2
See the section "Augmenting Emacs to create aural display lists," later in this chapter, for details on how Python code is distinguished and its semantics are transferred to the speech layer.
With the speech server now capable of smart text handling, the Emacspeak client could become more sophisticated with respect to its handling of text. The emacspeak-speak-line function turned into a library of speech-generation functions that implemented the following steps (a rough sketch of this pipeline appears after the list):
Parse text to split it into a sequence of clauses.
Preprocess text—e.g., handle repeated strings of punctuation marks.
Carry out a number of other functions that got added over time.
Queue each clause to the speech server, and issue the launch command.
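A hedged sketch of these steps, using hypothetical helper names (tts-queue and tts-dispatch standing in for the server's queue and launch commands), might look like this:
(defun my-speak-region (start end)
  "Split the text between START and END into clauses, queue each, then dispatch."
  (let ((text (buffer-substring start end)))
    (dolist (clause (split-string text "[.,;:!?\n]+" t))
      (tts-queue clause))   ;; queue one clause at a time
    (tts-dispatch)))        ;; ask the server to speak everything queued
Real clause splitting is context dependent, as discussed above; a fixed regular expression is only a stand-in for the grammar-aware parsing Emacspeak actually performs.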
From here on, the rest of Emacspeak was implemented using Emacspeak as the development environment. This has been significant in how the code base has evolved. New features are tested immediately, and badly implemented features can render the entire system unusable. Lisp's incremental code development fits naturally with the former; to cover the latter, the Emacspeak code base has evolved to be "bushy"—i.e., most parts of the higher-level system are mutually independent and depend on a small core that is carefully maintained.
Lisp advice is key to the Emacspeak implementation, and this chapter would not be complete without a brief overview. The advice facility allows one to modify existing functions without changing the original implementation. What's more, once a function f has been modified by advice m, all calls to function f are affected by the advice.
advice comes in three flavors:
before: The advice body runs before the original function is invoked.
after: The advice body runs after the original function has completed.
around: The advice body runs instead of the original function, and can invoke the original from within its body.
All advice forms get access to the arguments of the adviced function; in addition, around and after advice get access to the return value computed by the original function. The Lisp implementation achieves this magic by:
Caching the original implementation of the function
Evaluating the advice form to generate a new function definition
Storing this definition as the adviced function
Thus, when the advice fragment shown in the earlier section "A Simple First-Cut Implementation" is evaluated, Emacs' original next-line function is replaced by a modified version that speaks the current line after the original next-line function has completed its work.
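For completeness, here is a small illustrative example (not taken from Emacspeak) of the around flavor; within an around advice, the special form ad-do-it runs the cached original implementation:
(defadvice next-line (around my-example)
  "Run the original command, but silence any error it signals."
  (condition-case nil
      ad-do-it
    (error (message "Could not move"))))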
At this point in its evolution, here is what the overall design looked like:
Emacs' interactive commands are speech-enabled or adviced to produce auditory output.
advice definitions are collected into modules, one each for every Emacs application being speech-enabled.
The advice forms forward text to core speech functions.
These functions extract the text to be spoken and forward it to the tts-speak function.
The tts-speak function produces auditory output by preprocessing its text argument and sending it to the speech server.
The speech server handles queued requests to produce perceptible output.
Text is preprocessed by placing the text in a special scratch buffer. Buffers acquire specialized behavior via buffer-specific syntax tables that define the grammar of buffer contents and buffer-local variables that affect behavior. When text is handed off to the Emacspeak core, all of these buffer-specific settings are propagated to the special scratch buffer where the text is preprocessed. This automatically ensures that text is meaningfully parsed into clauses based on its underlying grammar.
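A minimal sketch of this idea, with hypothetical names, might copy the originating buffer's syntax table and a few buffer-local variables into a temporary buffer before parsing:
(defun my-tts-preprocess (text syntax-table locals)
  "Preprocess TEXT in a scratch buffer using SYNTAX-TABLE and LOCALS.
LOCALS is an alist of (SYMBOL . VALUE) pairs to make buffer-local."
  (with-temp-buffer
    (set-syntax-table syntax-table)
    (dolist (pair locals)
      (set (make-local-variable (car pair)) (cdr pair)))
    (insert text)
    ;; clause parsing and punctuation handling would happen here
    (buffer-string)))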
Emacs uses font-lock to syntactically color text. For creating the visual presentation, Emacs adds a text property called face to text strings; the value of this face property specifies the font, color, and style to be used to display that text. Text strings with face properties can be thought of as a conceptual visual display list.
Emacspeak augments these visual display lists with personality text properties whose values specify the auditory properties to use when rendering a given piece of text; this is called voice-lock in Emacspeak. The value of the personality property is an Aural CSS (ACSS) setting that encodes various voice properties—e.g., the pitch of the speaking voice. Notice that such ACSS settings are not specific to any given TTS engine. Emacspeak implements ACSS-to-TTS mappings in engine-specific modules that take care of mapping high-level aural properties—e.g., pitch or pitch-range—to engine-specific control codes.
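As a hedged illustration of such an engine-specific module (the names and the control string are hypothetical, loosely modeled on DECTalk-style inline commands), the mapping might look like:
(defun my-acss-to-control-string (acss)
  "Return an engine control string for the ACSS plist ACSS.
ACSS is a plist such as (:average-pitch 90 :pitch-range 0)."
  (format "[:dv ap %s pr %s]"
          (or (plist-get acss :average-pitch) 100)
          (or (plist-get acss :pitch-range) 100)))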
The next few sections describe how Emacspeak augments Emacs to create aural display lists and to process these aural display lists to produce engine-specific output.
Emacs modules that implement font-lock call the Emacs built-in function put-text-property to attach the relevant face property. Emacspeak defines an advice fragment that advices the put-text-property function to add in the corresponding personality property when it is asked to add a face property. Note that the value of both display properties (face and personality) can be lists; values of these properties are thus designed to cascade to create the final (visual or auditory) presentation. This also means that different parts of an application can progressively add display property values.
The put-text-property function has the following signature:
(put-text-property START END PROPERTY VALUE &optional OBJECT)
The advice implementation is:
(defadvice put-text-property (after emacspeak-personality pre act)
  "Used by emacspeak to augment font lock."
  (let ((start (ad-get-arg 0))          ;; bind arguments
        (end (ad-get-arg 1))
        (prop (ad-get-arg 2))           ;; name of property being added
        (value (ad-get-arg 3))
        (object (ad-get-arg 4))
        (voice nil))                    ;; voice it maps to
    (when (and (eq prop 'face)          ;; avoid infinite recursion
               (not (= start end))      ;; non-nil text range
               emacspeak-personality-voiceify-faces)
      (condition-case nil               ;; safely look up face mapping
          (progn
            (cond
             ((symbolp value)
              (setq voice (voice-setup-get-voice-for-face value)))
             ((ems-plain-cons-p value)) ;; pass on plain cons
             ((listp value)
              (setq voice
                    (delq nil
                          (mapcar #'voice-setup-get-voice-for-face value))))
             (t (message "Got %s" value)))
            (when voice                 ;; voice holds list of personalities
              (funcall emacspeak-personality-voiceify-faces
                       start end voice object)))
        (error nil)))))
Here is a brief explanation of this advice definition:
First, the function uses the advice built-in ad-get-arg to locally bind a set of lexical variables to the arguments being passed to the adviced function.
The mapping of faces to personalities is controlled by the user-customizable variable emacspeak-personality-voiceify-faces. If non-nil, this variable specifies a function with the following signature:
(emacspeak-personality-put START END PERSONALITY OBJECT)
Emacspeak provides different implementations of this function that either append or prepend the new personality value to any existing personality properties.
Along with checking for a non-nil emacspeak-personality-voiceify-faces, the function performs additional checks to determine whether this advice definition should do anything. The function continues to act if:
The text range is non-nil.
The property being added is a face.
The first of these checks is required to avoid edge cases where put-text-property is called with a zero-length text range. The second ensures that we attempt to add the personality property only when the property being added is face. Notice that failure to include this second test would cause infinite recursion because the eventual put-text-property call that adds the personality property also triggers the advice definition.
Next, the function safely looks up the voice mapping of the face (or faces) being applied. If applying a single face, the function looks up the corresponding personality mapping; if applying a list of faces, it creates a corresponding list of personalities.
Finally, the function checks that it found a valid voice mapping and, if so, calls emacspeak-personality-voiceify-faces with the set of personalities saved in the voice variable.
With the advice definitions from the previous section in place, text fragments that are visually styled acquire a corresponding personality property that holds an ACSS setting for audio formatting the content. The result is to turn text in Emacs into rich aural display lists. This section describes how the output layer of Emacspeak is enhanced to convert these aural display lists into perceptible spoken output.
The Emacspeak tts-speak module handles text preprocessing before finally sending it to the speech server. As described earlier, this preprocessing comprises a number of steps, including parsing the text into clauses and handling repeated strings of punctuation marks. This section describes the tts-format-text-and-speak function, which handles the conversion of aural display lists into audio-formatted output. First, here is the code for the function tts-format-text-and-speak:
(defsubst tts-format-text-and-speak (start end)
  "Format and speak text between start and end."
  (when (and emacspeak-use-auditory-icons
             (get-text-property start 'auditory-icon)) ;; queue icon
    (emacspeak-queue-auditory-icon
     (get-text-property start 'auditory-icon)))
  (tts-interp-queue (format "%s\n" tts-voice-reset-code))
  (cond
   (voice-lock-mode ;; audio format only if voice-lock-mode is on
    (let ((last nil) ;; initialize
          (personality (get-text-property start 'personality)))
      (while (and (< start end) ;; chunk at personality changes
                  (setq last
                        (next-single-property-change
                         start 'personality (current-buffer) end)))
        (if personality ;; audio format chunk
            (tts-speak-using-voice personality
                                   (buffer-substring start last))
          (tts-interp-queue (buffer-substring start last)))
        (setq start last ;; prepare for next chunk
              personality (get-text-property last 'personality)))))
   ;; no voice-lock: just send the text
   (t (tts-interp-queue (buffer-substring start end)))))
The tts-format-text-and-speak function is called one clause at a time, with arguments start and end set to the start and end of the clause. If voice-lock-mode is turned on, this function further splits the clause into chunks at each point in the text where there is a change in value of the personality property. Once such a transition point has been determined, tts-format-text-and-speak calls the function tts-speak-using-voice, passing the personality to use and the text to be spoken. This function, described next, looks up the appropriate device-specific codes before dispatching the audio-formatted output to the speech server:
(defsubst tts-speak-using-voice (voice text)
  "Use voice VOICE to speak text TEXT."
  (unless (or (eq 'inaudible voice) ;; not spoken if voice inaudible
              (and (listp voice) (member 'inaudible voice)))
    (tts-interp-queue
     (format "%s%s %s \n"
             (cond
              ((symbolp voice)
               (tts-get-voice-command
                (if (boundp voice) (symbol-value voice) voice)))
              ((listp voice)
               (mapconcat
                #'(lambda (v)
                    (tts-get-voice-command
                     (if (boundp v) (symbol-value v) v)))
                voice " "))
              (t ""))
             text tts-voice-reset-code))))
The tts-speak-using-voice function returns immediately if the specified voice is inaudible. Here, inaudible is a special personality that Emacspeak uses to prevent pieces of text from being spoken. The inaudible personality can be used to advantage when selectively hiding portions of text to produce more succinct output.
If the specified voice (or list of voices) is not inaudible, the function looks up the speech codes for the voice and queues the result of wrapping the text to be spoken between voice-code and tts-reset-code to the speech server.
I first formalized audio formatting within AsTeR, where rendering rules were written in a specialized language called Audio Formatting Language (AFL). AFL structured the available parameters in auditory space—e.g., the pitch of the speaking voice—into a multidimensional space, and encapsulated the state of the rendering engine as a point in this multidimensional space.
AFL provided a block-structured language that encapsulated the current rendering state by a lexically scoped variable, and provided operators to move within this structured space. When these notions were later mapped to the declarative world of HTML and CSS, dimensions making up the AFL rendering state became Aural CSS parameters, provided as accessibility measures in CSS2 (http://www.w3.org/Press/1998/CSS2-REC).
Though designed for styling HTML (and, in general, XML) markup trees, Aural CSS turned out to be a good abstraction for building Emacspeak's audio formatting layer while keeping the implementation independent of any given TTS engine.
Here is the definition of the data structure that encapsulates ACSS settings:
(defstruct acss
  family gain left-volume right-volume
  average-pitch pitch-range stress richness punctuations)
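Because acss is defined with defstruct, the cl library generates a keyword constructor and slot accessors for free. A small illustrative sketch (the variable name is hypothetical) of building an overlay like the voice-monotone setting shown below:
(require 'cl)

(setq my-monotone
      (make-acss :average-pitch 0 :pitch-range 0 :punctuations 'all))

(acss-pitch-range my-monotone) ;; => 0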
Emacspeak provides a collection of predefined voice overlays for use within speech extensions. Voice overlays are designed to cascade in the spirit of Aural CSS. As an example, here is the ACSS setting that corresponds to voice-monotone:
[cl-struct-acss nil nil nil nil nil 0 0 nil all]
Notice that most fields of this acss structure are nil—that is, unset. The setting creates a voice overlay that:
Sets pitch to 0 to create a flat voice.
Sets pitch-range to 0 to create a monotone voice with no inflection.
Sets punctuations to all so that all punctuation marks are spoken.
This setting is used as the value of the personality property for audio formatting comments in all programming language modes. Because its value is an overlay, it can interact effectively with other aural display properties. As an example, if portions of a comment are displayed in a bold font, those portions can have the voice-bolden personality (another predefined overlay) added; this results in setting the personality property to a list of two values: (voice-bolden voice-monotone). The final effect is for the text to get spoken with a distinctive voice that conveys both aspects of the text: namely, a sequence of words that are emphasized within a comment.
Rich visual user interfaces contain both text and icons. Similarly, once Emacspeak had the ability to speak intelligently, the next step was to increase the bandwidth of aural communication by augmenting the output with auditory icons.
Auditory icons in Emacspeak are short sound snippets (no more than two seconds in duration) and are used to indicate frequently occurring events in the user interface. As an example, every time the user saves a file, the system plays a confirmatory sound. Similarly, opening or closing an object (anything from a file to a web site) produces a corresponding auditory icon. The set of auditory icons was arrived at iteratively and covers common events such as objects being opened, closed, or deleted. This section describes how these auditory icons are injected into Emacspeak's output stream.
Auditory icons are produced by the following user interactions:
Auditory icons that confirm user actions—e.g., a file being saved successfully—are produced by adding an after advice to the various Emacs built-ins. To provide a consistent sound and feel across the Emacspeak desktop, such extensions are attached to code that is called from many places in Emacs.
Here is an example of such an extension, implemented via an advice fragment:
(defadvice save-buffer (after emacspeak pre act)
  "Produce an auditory icon if possible."
  (when (interactive-p)
    (emacspeak-auditory-icon 'save-object)
    (or emacspeak-last-message
        (message "Wrote %s" (buffer-file-name)))))
Extensions can also be implemented via an Emacs-provided hook. As explained in the brief advice tutorial given earlier, advice allows the behavior of existing software to be extended or modified without having to modify the underlying source code. Emacs is itself an extensible system, and well-written Lisp code has a tradition of providing appropriate extension hooks for common use cases. As an example, Emacspeak attaches auditory feedback to Emacs' default prompting mechanism (the Emacs minibuffer) by adding the function emacspeak-minibuffer-setup-hook to Emacs' minibuffer-setup-hook:
(defun emacspeak-minibuffer-setup-hook ()
  "Actions to take when entering the minibuffer."
  (let ((inhibit-field-text-motion t))
    (when emacspeak-minibuffer-enter-auditory-icon
      (emacspeak-auditory-icon 'open-object))
    (tts-with-punctuations 'all (emacspeak-speak-buffer))))

(add-hook 'minibuffer-setup-hook 'emacspeak-minibuffer-setup-hook)
This is a good example of using built-in extensibility where available. However, Emacspeak uses advice in a lot of cases because the Emacspeak requirement of adding auditory feedback to all of Emacs was not originally envisioned when Emacs was implemented. Thus, the Emacspeak implementation demonstrates a powerful technique for discovering extension points.
Lack of an advice-like feature in a programming language often makes experimentation difficult, especially when it comes to discovering useful extension points. This is because software engineers are faced with the following trade-off:
Make the system arbitrarily extensible (and arbitrarily complex)
Guess at some reasonable extension points and hardcode these
Once extension points are implemented, experimenting with new ones requires rewriting existing code, and the resulting inertia often means that over time, such extension points remain mostly undiscovered. Lisp advice, and its Java counterpart Aspects, offer software engineers the opportunity to experiment without worrying about adversely affecting an existing body of source code.
In addition to using auditory icons to cue the results of user interaction, Emacspeak uses auditory icons to augment what is being spoken. One example, described next, is the auditory icon used to mark source lines that have a breakpoint set on them.
Auditory icons are implemented by attaching the text property emacspeak-auditory-icon, with a value equal to the name of the auditory icon to be played, to the relevant text.
As an example, commands to set breakpoints in the Grand Unified Debugger Emacs package (GUD) are adviced to add the property emacspeak-auditory-icon to the line containing the breakpoint. When the user moves across such a line, the function tts-format-text-and-speak queues the auditory icon at the right point in the output stream.
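A hedged sketch of such an extension (hypothetical advice and icon name; the real GUD extension differs in detail) would attach the property to the breakpoint line like this:
(defadvice gud-break (after my-emacspeak pre act)
  "Mark the breakpoint line so it carries an auditory icon when spoken."
  (when (interactive-p)
    (put-text-property (line-beginning-position) (line-end-position)
                       'emacspeak-auditory-icon 'mark-object)))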
To summarize the story so far, Emacspeak has the ability to:
Produce auditory output from within the context of an application
Audio-format output to increase the bandwidth of spoken communication
Augment spoken output with auditory icons
This section explains some of the enhancements that the design makes possible.
I started implementing Emacspeak in October 1994 as a quick means of developing a speech solution for Linux. It was when I speech-enabled the Emacs Calendar in the first week of November 1994 that I realized that in fact I had created something far better than any other speech-access solution I had used before.
A calendar is a good example of using a specific type of visual layout that is optimized both for the visual medium and for the information that is being conveyed. We intuitively think in terms of weeks and months when reasoning about dates; using a tabular layout that organizes dates in a grid with each week appearing on a row by itself matches this perfectly. With this form of layout, the human eye can rapidly move by days, weeks, or months through the calendar and easily answer such questions as "What day is it tomorrow?" and "Am I free on the third Wednesday of next month?"
Notice, however, that simply speaking this two-dimensional layout does not transfer the efficiencies achieved in the visual context to auditory interaction. This is a good example of where the right auditory feedback has to be generated directly from the underlying information being conveyed, rather than from its visual representation. When producing auditory output from visually formatted information, one has to rediscover the underlying semantics of the information before speaking it.
In contrast, when producing spoken feedback via advice definitions that extend the underlying application, one has full access to the application's runtime context. Thus, rather than guessing based on visual layout, one can essentially instruct the underlying application to speak the right thing!
The emacspeak-calendar module speech-enables the Emacs Calendar by defining utility functions that speak calendar information and advising all calendar navigation commands to call these functions. Thus, Emacs Calendar produces specialized behavior by binding the arrow keys to calendar navigation commands rather than the default cursor navigation found in regular editing modes. Emacspeak specializes this behavior by advising the calendar-specific commands to speak the relevant information in the context of the calendar.
The net effect is that from an end user's perspective, things just work. In regular editing modes, pressing up/down arrows speaks the current line; pressing up/down arrows in the calendar navigates by weeks and speaks the current date.
The emacspeak-calendar-speak-date function, defined in the emacspeak-calendar module, is shown here. Notice that it uses all of the facilities described so far to access and audio-format the relevant contextual information from the calendar:
(defsubst emacspeak-calendar-entry-marked-p ()
  (member 'diary
          (mapcar #'overlay-face (overlays-at (point)))))

(defun emacspeak-calendar-speak-date ()
  "Speak the date under point when called in Calendar Mode."
  (let ((date (calendar-date-string (calendar-cursor-to-date t))))
    (cond
     ((emacspeak-calendar-entry-marked-p)
      (tts-speak-using-voice mark-personality date))
     (t (tts-speak date)))))
Emacs marks dates that have a diary entry with a special overlay. In the previous definition, the helper function emacspeak-calendar-entry-marked-p checks this overlay to implement a predicate that can be used to test if a date has a diary entry. The emacspeak-calendar-speak-date function uses this predicate to decide whether the date needs to be rendered in a different voice; dates that have calendar entries are spoken using the mark-personality voice. Notice that the emacspeak-calendar-speak-date function accesses the calendar's runtime context in the call:
(calendar-date-string (calendar-cursor-to-date t))
The emacspeak-calendar-speak-date function is called from advice definitions attached to all calendar navigation functions. Here is the advice definition for the function calendar-forward-week:
(defadvice calendar-forward-week (after emacspeak pre act)
  "Speak the date."
  (when (interactive-p)
    (emacspeak-calendar-speak-date)
    (emacspeak-auditory-icon 'large-movement)))
This is an after advice, because we want the spoken feedback to be produced after the original navigation command has done its work.
The body of the advice definition first calls the function emacspeak-calendar-speak-date to speak the date under the cursor; next, it calls emacspeak-auditory-icon to produce a short sound indicating that we have successfully moved.
With all the necessary affordances to generate rich auditory output in place, speech-enabling Emacs applications using Emacs Lisp's advice facility requires surprisingly small amounts of specialized code. With the TTS layer and the Emacspeak core handling the complex details of producing good quality output, the speech-enabling extensions focus purely on the specialized semantics of individual applications; this leads to simple and consequently beautiful code. This section illustrates the concept with a few choice examples taken from Emacspeak's rich suite of information access tools.
Right around the time I started Emacspeak, a far more profound revolution was taking place in the world of computing: the World Wide Web went from being a tool for academic research to a mainstream forum for everyday tasks. This was 1994, when writing a browser was still a comparatively easy task. The complexity that has been progressively added to the Web in the subsequent 12 years often tends to obscure the fact that the Web is still a fundamentally simple design where:
Content creators publish web resources addressable via URIs.
URI-addressable content is retrievable via open protocols.
Retrieved content is in HTML, a well-understood markup language.
Notice that the basic architecture just sketched out says little to nothing about how the content is made available to the end user. The mid-1990s saw the Web move toward increasingly complex visual interaction. The commercial Web with its penchant for flashy visual interaction increasingly moved away from the simple data-oriented interaction that had characterized early web sites. By 1998, I found that the Web had a lot of useful interactive sites; to my dismay, I also found that I was using progressively fewer of these sites because of the time it took to complete tasks when using spoken output.
This led me to create a suite of web-oriented tools within Emacspeak that went back to the basics of web interaction. Emacs was already capable of rendering simple HTML into interactive hypertext documents. As the Web became complex, Emacspeak acquired a collection of interaction wizards built on top of Emacs' HTML rendering capability that progressively factored out the complexity of web interaction to create an auditory interface that allowed the user to quickly and painlessly listen to desired information.
Emacs W3 is a bare-bones web browser first implemented in the mid-1990s. Emacs W3 implemented CSS (Cascading Style Sheets) early on, and this was the basis of the first Aural CSS implementation, which was released at the time I wrote the Aural CSS draft in February 1996.
Emacspeak speech-enables Emacs W3 via the emacspeak-w3 module, which implements the following extensions:
An aural media section in the default stylesheet for Aural CSS.
advice added to all interactive commands to produce auditory feedback.
Special patterns to recognize and silence decorative images on web pages.
Aural rendering of HTML form fields along with the associated label, which underlay the design of the label element in HTML 4.
Context-sensitive rendering rules for HTML form controls. As an example, given a group of radio buttons for answering the question:
Do you accept?
Emacspeak extends Emacs W3 to produce a spoken message of the form:
Radio group Do you accept? has Yes pressed.
and:
Press this to change radio group Do you accept? from Yes to No.
A before advice defined for the Emacs W3 function w3-parse-buffer that applies user-requested XSLT transforms to HTML pages.
By 1997, interactive sites on the Web, ranging from Altavista for searching to Yahoo! Maps for online directions, required the user to go through a highly visual process that included:
Filling in a set of form fields
Submitting the resulting form
Spotting the results in the resulting complex HTML page
The first and third of these steps were the ones that took time when using spoken output. I needed to first locate the various form fields on a visually busy page and wade through a lot of complex boilerplate material on result pages before I found the answer.
Notice that from the software design point of view, these steps neatly map into pre-action and post-action hooks. Because web interaction follows a very simple architecture based on URIs, the pre-action step of prompting the user for the right pieces of input can be factored out of a web site and placed in a small piece of code that runs locally; this obviates the need for the user to open the initial launch page and seek out the various input fields.
Similarly, the post-action step of spotting the actual results amid the rest of the noise on the resulting page can also be delegated to software.
Finally, notice that even though these pre-action and post-action steps are each specific to particular web sites, the overall design pattern is one that can be generalized. This insight led to the emacspeak-websearch module, a collection of task-oriented web tools that:
Prompted the user
Constructed an appropriate URI and pulled the content at that URI
Filtered the result before rendering the relevant content via Emacs W3
Here is the emacspeak-websearch tool for accessing directions from Yahoo! Maps:
(defsubst emacspeak-websearch-yahoo-map-directions-get-locations ()
  "Convenience function for prompting and constructing the route component."
  (concat
   (format "&newaddr=%s"
           (emacspeak-url-encode
            (read-from-minibuffer "Start Address: ")))
   (format "&newcsz=%s"
           (emacspeak-url-encode
            (read-from-minibuffer "City/State or Zip:")))
   (format "&newtaddr=%s"
           (emacspeak-url-encode
            (read-from-minibuffer "Destination Address: ")))
   (format "&newtcsz=%s"
           (emacspeak-url-encode
            (read-from-minibuffer "City/State or Zip:")))))

(defun emacspeak-websearch-yahoo-map-directions-search (query)
  "Get driving directions from Yahoo."
  (interactive
   (list (emacspeak-websearch-yahoo-map-directions-get-locations)))
  (emacspeak-w3-extract-table-by-match
   "Start"
   (concat emacspeak-websearch-yahoo-maps-uri query)))
A brief explanation of the previous code follows:
The emacspeak-websearch-yahoo-map-directions-get-locations function prompts the user for the start and end locations. Notice that this function hardwires the names of the query parameters used by Yahoo! Maps. On the surface, this looks like a kluge that is guaranteed to break. In fact, this kluge has not broken since it was first defined in 1997. The reason is obvious: once a web application has published a set of query parameters, those parameters get hardcoded in a number of places, including within a large number of HTML pages on the originating web site. Depending on parameter names may feel brittle to the software architect used to structured, top-down APIs, but the use of such URL parameters to define bottom-up web services leads to the notion of RESTful web APIs.
The URL for retrieving directions is constructed by concatenating the user input to the base URI for Yahoo! Maps.
The resulting URI is passed to the function emacspeak-w3-extract-table-by-match along with the search pattern Start to:
Retrieve the content using Emacs W3.
Apply an XSLT transform to extract the table containing Start.
Render this table using Emacs W3's HTML formatter.
Unlike the query parameters, the layout of the results page does change about once a year, on average. But keeping this tool current with Yahoo! Maps comes down to maintaining the post-action portion of this utility. In over eight years of use, I have had to modify it about half a dozen times, and given that the underlying platform provides many of the tools for filtering the result page, the actual code that needs to be written for each layout change is minimal.
The emacspeak-w3-extract-table-by-match function uses an XSLT transformation that filters a document to return tables that contain a specified search pattern. For this example, the function constructs the following XPath expression:
(/descendant::table[contains(., 'Start')])[last()]
This effectively picks out the list of tables that contain the string Start and returns the last element of that list.
Seven years after this utility was written, Google launched Google Maps to great excitement in February 2005. Many blogs on the Web put Google Maps under the microscope and quickly discovered the query parameters used by that application. I used that to build a corresponding Google Maps tool in Emacspeak that provides similar functionality. The user experience is smoother with the Google Maps tool because the start and end locations can be specified within the same parameter. Here is the code for the Google Maps wizard:
(defun emacspeak-websearch-emaps-search (query &optional use-near)
  "Perform EmapSpeak search. Query is in plain English."
  (interactive
   (list
    (emacspeak-websearch-read-query
     (if current-prefix-arg
         (format "Find what near %s: "
                 emacspeak-websearch-emapspeak-my-location)
       "EMap Query: "))
    current-prefix-arg))
  (let ((near-p ;; determine query type
         (unless use-near
           (save-match-data
             (and (string-match "near" query)
                  (match-end 0)))))
        (near nil)
        (uri nil))
    (when near-p ;; determine location from query
      (setq near (substring query near-p))
      (setq emacspeak-websearch-emapspeak-my-location near))
    (setq uri
          (cond
           (use-near
            (format emacspeak-websearch-google-maps-uri
                    (emacspeak-url-encode (format "%s near %s" query near))))
           (t (format emacspeak-websearch-google-maps-uri
                      (emacspeak-url-encode query)))))
    (add-hook 'emacspeak-w3-post-process-hook 'emacspeak-speak-buffer)
    (add-hook 'emacspeak-w3-post-process-hook
              #'(lambda nil
                  (emacspeak-pronounce-add-buffer-local-dictionary-entry
                   "mi" " miles ")))
    (browse-url-of-buffer
     (emacspeak-xslt-xml-url
      (expand-file-name "kml2html.xsl" emacspeak-xslt-directory)
      uri))))
A brief explanation of the code follows:
Parse the input to decide whether it's a direction or a search query.
In case of search queries, cache the user's location for future use.
Construct a URI for retrieving results.
Browse the results of filtering the contents of the URI through the XSLT filter kml2html, which converts the retrieved content into a simple hypertext document.
Set up custom pronunciations in the results to pronounce mi as "miles."
Notice that, as before, most of the code focuses on application-specific tasks. Rich spoken output is produced by creating the results as a well-structured HTML document with the appropriate Aural CSS rules producing an audio-formatted presentation.
With more and more services becoming available on the Web, another useful pattern emerged by early 2000: web sites started creating smart client-side interaction via JavaScript. One typical use of such scripts was to construct URLs on the client side for accessing specific pieces of content based on user input. As examples, Major League Baseball constructs the URL for retrieving scores for a given game by piecing together the date and the names of the home and visiting teams, and NPR creates URLs by piecing together the date with the program code of a given NPR show.
To enable fast access to such services, I added an emacspeak-url-template module in late 2000. This module has become a powerful companion to the emacspeak-websearch module described in the previous section. Together, these modules turn the Emacs minibuffer into a powerful web command line that provides rapid access to web content.
Many web services require the user to specify a date. One can usefully default the date by using the user's calendar to provide the context. Thus, Emacspeak tools for playing an NPR program or retrieving MLB scores default to using the date under the cursor when invoked from within the Emacs calendar buffer.
URL templates in Emacspeak are implemented using the following data structure:
(defstruct (emacspeak-url-template
            (:constructor emacspeak-ut-constructor))
  name          ;; Human-readable name
  template      ;; template URL string
  generators    ;; list of param generators
  post-action   ;; action to perform after opening
  documentation ;; resource documentation
  fetcher)
Users invoke URL templates via the Emacspeak command emacspeak-url-template-fetch, which prompts for a URL template and then performs the following steps (a sketch of this dispatch appears after the list):
Looks up the named template.
Prompts the user by calling the specified generator.
Applies the Lisp function format to the template string and the collected arguments to create the final URI.
Sets up any post actions performed after the content has been rendered.
Applies the specified fetcher to render the content.
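A hedged sketch of that dispatch logic, using the accessors generated by the defstruct above (the table variable and the overall shape are hypothetical; the real command is richer and also handles post actions):
(defvar my-url-template-table (make-hash-table :test 'equal)
  "Named URL templates, keyed by their human-readable name.")

(defun my-url-template-fetch (name)
  "Look up the template NAME, collect its arguments, and render the result."
  (interactive "sTemplate: ")
  (let* ((template (gethash name my-url-template-table))
         (args (mapcar (lambda (g)
                         (if (functionp g) (funcall g) (read-from-minibuffer g)))
                       (emacspeak-url-template-generators template)))
         (url (apply #'format
                     (emacspeak-url-template-template template) args)))
    (funcall (or (emacspeak-url-template-fetcher template) #'browse-url)
             url)))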
The use of this structure is best explained with an example. The following is the URL template for playing NPR programs:
(emacspeak-url-template-define
 "NPR On Demand"
 "http://www.npr.org/dmg/dmg.php?prgCode=%s&showDate=%s&segNum=%s&mediaPref=RM"
 (list
  #'(lambda ()
      (upcase (read-from-minibuffer "Program code:")))
  #'(lambda ()
      (emacspeak-url-template-collect-date "Date:" "%d-%b-%Y"))
  "Segment:")
 nil ; no post actions
 "Play NPR shows on demand.
Program is specified as a program code:
ME        Morning Edition
ATC       All Things Considered
day       Day To Day
newsnotes News And Notes
totn      Talk Of The Nation
fa        Fresh Air
wesat     Weekend Edition Saturday
wesun     Weekend Edition Sunday
fool      The Motley Fool
Segment is specified as a two digit number --specifying a blank value
plays entire program."
 #'(lambda (url)
     (funcall emacspeak-media-player url 'play-list)
     (emacspeak-w3-browse-xml-url-with-style
      (expand-file-name "smil-anchors.xsl" emacspeak-xslt-directory)
      url)))
In this example, the custom fetcher performs two actions:
Launches a media player to start playing the audio stream.
Filters the associated SMIL document via the XSLT file smil-anchors.xsl.
When I implemented the emacspeak-websearch and emacspeak-url-template modules, Emacspeak needed to screen-scrape HTML pages to speak the relevant information. But as the Web grew in complexity, the need to readily get beyond the superficial presentation of pages to the real content took on a wider value than eyes-free access. Even users capable of working with complex visual interfaces found themselves under a serious information overload. This led to the advent of RSS and Atom feeds, and the concomitant arrival of feed reading software.
These developments have had a very positive effect on the Emacspeak code base. During the past few years, the code has become more beautiful as I have progressively deleted screen-scraping logic and replaced it with direct content access. As an example, here is the Emacspeak URL template for retrieving the weather for a given city/state:
(emacspeak-url-template-define
 "rss weather from wunderground"
 "http://www.wunderground.com/auto/rss_full/%s.xml?units=both"
 (list "State/City e.g.: MA/Boston")
 nil
 "Pull RSS weather feed for specified state/city."
 'emacspeak-rss-display)
And here is the URL template for Google News searches via Atom feeds:
(emacspeak-url-template-define
 "Google News Search"
 "http://news.google.com/news?hl=en&ned=tus&q=%s&btnG=Google+Search&output=atom"
 (list "Search news for: ")
 nil
 "Search Google news."
 'emacspeak-atom-display)
Both of these tools use all of the facilities provided by the emacspeak-url-template module and consequently need to do very little on their own. Finally, notice that by relying on standardized feed formats such as RSS and Atom, these templates now have very little in the way of site-specific kluges, in contrast to older tools like the Yahoo! Maps wizard, which hardwired specific patterns from the results page.
Emacspeak was conceived as a full-fledged, eyes-free user interface to everyday computing tasks. To be full-fledged, the system needed to provide direct access to every aspect of computing on desktop workstations. To enable fluent eyes-free interaction, the system needed to treat spoken output and the auditory medium as a first-class citizen—i.e., merely reading out information displayed on the screen was not sufficient.
To provide a complete audio desktop, the target environment needed to be an interaction framework that was both widely deployed and fully extensible. To be able to do more than just speak the screen, the system needed to build interactive speech capability into the various applications.
Finally, this had to be done without modifying the source code of any of the underlying applications; the project could not afford to fork a suite of applications in the name of adding eyes-free interaction, because I wanted to limit myself to the task of maintaining the speech extensions.
To meet all these design requirements, I picked Emacs as the user interaction environment. As an interaction framework, Emacs had the advantage of having a very large developer community. Unlike other popular interaction frameworks available in 1994 when I began the project, it had the significant advantage of being a free software environment. (Now, 12 years later, Firefox affords similar opportunities.)
The enormous flexibility afforded by Emacs Lisp as an extension language was an essential prerequisite in speech-enabling the various applications. The open source nature of the platform was just as crucial; even though I had made an explicit decision that I would modify no existing code, being able to study how various applications were implemented made speech-enabling them tractable. Finally, the availability of a high-quality advice implementation for Emacs Lisp (note that Lisp's advice facility was the prime motivator behind Aspect Oriented Programming) made it possible to speech-enable applications authored in Emacs Lisp without modifying the original source code.
Emacspeak is a direct consequence of the matching up of the needs previously outlined and the affordances provided by Emacs as a user interaction environment.
The Emacspeak code base has evolved over a period of 12 years. Except for the first six weeks of development, the code base has been developed and maintained using Emacspeak itself. This section summarizes some of the lessons learned with respect to managing code complexity over time.
Throughout its existence, Emacspeak has always remained a spare-time project. Looking at the code base across time, I believe this has had a significant impact on how it has evolved. When working on large, complex software systems as a full-time project, one has the luxury of focusing one's entire concentration on the code base for reasonable stretches of time—e.g., 6 to 12 weeks. This results in tightly implemented code that creates deep code bases.
Despite one's best intentions, this can also result in code that becomes hard to understand with the passage of time. Large software systems where a single engineer focuses exclusively on the project for a number of years are almost nonexistent; that form of single-minded focus usually leads to rapid burnout!
In contrast, Emacspeak is an example of a large software system that has had a single engineer focused on it over a period of 12 years, but only in his spare time. A consequence of developing the system single-handedly over a number of years is that the code base has tended to be naturally "bushy." Notice the distribution of files and lines of code summarized in Table 31-1, “Summary of Emacspeak codebase”.
Table 31-1. Summary of Emacspeak codebase
Layer                  Files   Lines   Percentage
TTS core                   6    3866          6.0
Emacspeak core            16   12174         18.9
Emacspeak extensions     160   48339         75.0
Total                    182   64379         99.9
Table 31-1, “Summary of Emacspeak codebase”, highlights the following points:
The TTS core responsible for high-quality speech output is isolated in 6 out of 182 files, and makes up six percent of the code base.
The Emacspeak core—which provides high-level speech services to Emacspeak extensions, in addition to speech-enabling all basic Emacs functionality—is isolated to 16 files, and makes up about 19 percent of the code base.
The rest of the system is split across 160 files, which can be independently improved (or broken) without affecting the rest of the system. Many modules, such as emacspeak-url-template, are themselves bushy—i.e., an individual URL template can be modified without affecting any of the other URL templates.
advice reduces code size. The Emacspeak code base, which has approximately 60,000 lines of Lisp code, is a fraction of the size of the underlying system being speech-enabled. A rough count at the end of December 2006 shows that Emacs 22 has over a million lines of Lisp code; in addition, Emacspeak speech-enables a large number of applications not bundled by default with Emacs.
Here is a brief summary of the insights gained from implementing and using Emacspeak:
Lisp advice, and its object-oriented equivalent Aspect Oriented Programming, are very effective means for implementing cross-cutting concerns—e.g., speech-enabling a visual interface.
advice is a powerful means for discovering potential points of extension in a complex software system.
Focusing on basic web architecture, and relying on a data-oriented web backed by standardized protocols and formats, leads to powerful spoken web access.
Focusing on the final user experience, as opposed to individual interaction widgets such as sliders and tree controls, leads to a highly efficient, eyes-free environment.
Visual interaction relies heavily on the human eye's ability to rapidly scan the visual display. Effective eyes-free interaction requires transferring some of this responsibility to the computer because listening to large amounts of information is time-consuming. Thus, search in every form is critical for delivering effective eyes-free interaction, on the continuum from the smallest scale (such as Emacs' incremental search to find the right item in a local document) to the largest (such as a Google search to quickly find the right document on the global Web).
Visual complexity, which may become merely an irritant for users capable of using complex visual interfaces, is a show-stopper for eyes-free interaction. Conversely, tools that emerge early in an eyes-free environment eventually show up in the mainstream when the nuisance value of complex visual interfaces crosses a certain threshold. Two examples of this from the Emacspeak experience are:
RSS and Atom feeds replacing the need for screen-scraping just to retrieve essential information such as the titles of articles.
Emacspeak's use of XSLT to filter content in 2000 parallels the advent of Greasemonkey for applying custom client-side JavaScript to web pages in 2005.
Emacspeak would not exist without Emacs and the ever-vibrant Emacs developer community that has made it possible to do everything from within Emacs. The Emacspeak implementation would not have been possible without Hans Chalupsky's excellent advice implementation for Emacs Lisp.
Project libxslt from the GNOME project has helped breathe fresh life into William Perry's Emacs W3 browser; Emacs W3 was one of the early HTML rendering engines, but the code has not been updated in over eight years. That the W3 code base is still usable and extensible bears testimony to the flexibility and power afforded by Lisp as the implementation language.
Andy Oram
Beautiful code surveys the range of human invention and ingenuity in one area of endeavor: the development of computer systems. The beauty in each chapter comes from the discovery of unique solutions, a discovery springing from the authors' power to look beyond set boundaries, to recognize needs overlooked by others, and to find surprising solutions to troubling problems.
Many of the authors confronted limitations—in the physical environment, in the resources available, or in the very definition of their requirements—that made it hard even to imagine solutions. Others entered domains where solutions already existed, but brought in a new vision and a conviction that something much better could be achieved.
All the authors in this book have drawn lessons from their projects. But we can also draw some broader lessons after making the long and eventful journey through the whole book.
First, there are times when tried and true rules really do work. So often, one encounters difficulties when trying to maintain standards for robustness, readability, or other tenets of good software engineering. In such situations, it is not always necessary to abandon the principles that hold such promise. Sometimes, getting up and taking a walk around the problem can reveal a new facet that allows one to meet the requirements without sacrificing good technique.
On the other hand, some chapters confirm the old cliché that one must know the rules before one can break them. Some of the authors built up decades of experience before taking a different path toward solving one thorny problem—and this experience gave them the confidence to break the rules in a constructive way.
At the same time, cross-disciplinary studies are also championed by the lessons in this book. Many authors came into new domains and had to fight their way in relative darkness. In these situations, a particularly pure form of creativity and intelligence triumphed.
Finally, we learn from this book that beautiful solutions don't last for all time. New circumstances always require a new look. So, if you read the book and thought, "I can't use these authors' solutions on any of my own projects," don't worry—next time these authors have projects, they will use different solutions, too.
For about two months I worked intensively on this book by helping authors hone their themes and express their points. This immersion in the work of superbly talented inventors proved to be inspiring and even uplifting. It gave me the impulse to try new things, and I hope this book does the same for its readers.