Beautiful Code


Chapter 31. Emacspeak: The Complete Audio Desktop

T. V. Raman

A desktop is a workspace that one uses to organize the tools of one's trade. Graphical desktops provide rich visual interaction for performing day-to-day computing tasks; the goal of the audio desktop is to enable similar efficiencies in an eyes-free environment. Thus, the primary goal of an audio desktop is to use the expressiveness of auditory output (both verbal and nonverbal) to enable the end user to perform a full range of computing tasks:

  • Communication through the full range of electronic messaging services

  • Ready access to local documents on the client and global documents on the Web

  • Ability to develop software effectively in an eyes-free environment

The Emacspeak audio desktop was motivated by the following insight: to provide effective auditory renderings of information, one needs to start from the actual information being presented, rather than a visual presentation of that information. This had earlier led me to develop AsTeR, Audio System For Technical Readings (http://emacspeak.sf.net/raman/aster/aster-toplevel.html). The primary motivation then was to apply the lessons learned in the context of aural documents to user interfaces—after all, the document is the interface.

The primary goal was not to merely carry the visual interface over to the auditory modality, but rather to create an eyes-free user interface that is both pleasant and productive to use.

Contrast this with the traditional screen-reader approach where GUI widgets such as sliders and tree controls are directly translated to spoken output. Though such direct translation can give the appearance of providing full eyes-free access, the resulting auditory user interface can be inefficient to use.

These prerequisites meant that the environment selected for the audio desktop needed:

  • A core set of speech and nonspeech audio output services

  • A rich suite of pre-existing applications to speech-enable

  • Access to application context to produce contextual feedback

31.1. Producing Spoken Output

I started implementing Emacspeak in October 1994. The target environments were a Linux laptop and my office workstation. To produce speech output, I used a DECTalk Express (a hardware speech synthesizer) on the laptop and a software version of the DECTalk on the office workstation.

The most natural way to design the system to leverage both speech options was to first implement a speech server that abstracted away the distinction between the two output solutions. The speech server abstraction has withstood the test of time well; I was able to add support for the IBM ViaVoice engine later, in 1999. Moreover, the simplicity of the client/server API has enabled open source programmers to implement speech servers for other speech engines.

Emacspeak speech servers are implemented in the TCL language. The speech server for the DECTalk Express communicated with the hardware synthesizer over a serial line. As an example, the command to speak a string of text was a proc that took a string argument and wrote it to the serial device. A simplified version of this looks like:

	proc tts_say {text} {puts -nonewline $tts(write) "$text"}

The speech server for the software DECTalk implemented an equivalent, simplified tts_say version that looks like:

	proc say {text} {_say "$text"}

where _say calls the underlying C implementation provided by the DECTalk software.

The net result of this design was to create separate speech servers for each available engine, where each speech server was a simple script that invoked TCL's default read-eval-print loop after loading in the relevant definitions. The client/server API therefore came down to the client (Emacspeak) launching the appropriate speech server, caching this connection, and invoking server commands by issuing appropriate procedure calls over this connection.

Notice that so far I have said nothing explicit about how this client/server connection was opened; this late binding proved beneficial later when it came to making Emacspeak network-aware. Thus, the initial implementation worked by the Emacspeak client communicating to the speech server using stdio. Later, making this client/server communication go over the network required the addition of a few lines of code that opened a server socket and connected stdin/stdout to the resulting connection.

Thus, designing a clean client/server abstraction, and relying on the power of Unix I/O, has made it trivial to later run Emacspeak on a remote machine and have it connect back to a speech server running on a local client. This enables me to run Emacspeak inside screen on my work machine, and access this running session from anywhere in the world. Upon connecting, I have the remote Emacspeak session connect to a speech server on my laptop, the audio equivalent of setting up X to use a remote display.
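
To make this concrete, here is a minimal sketch in Emacs Lisp of what such a client might look like. All of the my-* names, the tclsh invocation, and the speech-server.tcl script are assumptions made for illustration; the real Emacspeak code differs, but the shape is the same: launch (or connect to) the server once, cache the connection, and issue commands as procedure calls over it.

	;; Minimal sketch (illustrative names, not the Emacspeak API).
	(defvar my-tts-process nil
	  "Cached connection to the speech server.")

	(defun my-tts-open (&optional host port)
	  "Start a speech server over stdio, or connect to HOST:PORT over the network."
	  (setq my-tts-process
	        (if host
	            (open-network-stream "tts" nil host port)              ;; network case
	          (start-process "tts" nil "tclsh" "speech-server.tcl")))) ;; stdio case

	(defun my-tts-say (text)
	  "Send a speak command over the cached speech server connection."
	  (process-send-string my-tts-process (format "tts_say {%s}\n" text)))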

31.2. Speech-Enabling Emacs

The simplicity of the speech server abstraction described above meant that version 0 of the speech server was running within an hour after I started implementing the system. This meant that I could then move on to the more interesting part of the project: producing good quality spoken output. Version 0 of the speech server was by no means perfect; it was improved as I built the Emacspeak speech client.

A Simple First-Cut Implementation

A friend of mine had pointed me at the marvels of Emacs Lisp advice a few weeks earlier. So when I sat down to speech-enable Emacs, advice was the natural choice. The first task was to have Emacs automatically speak the line under the cursor whenever the user pressed the up/down arrow keys.

In Emacs, all user actions invoke appropriate Emacs Lisp functions. In standard editing modes, pressing the down arrow invokes function next-line, while pressing the up arrow invokes previous-line. To speech-enable these commands, version 0 of Emacspeak implemented the following rather simple advice fragment:

	(defadvice next-line (after emacspeak)
	  "Speak line after moving."
	  (when (interactive-p)
	    (emacspeak-speak-line)))

The emacspeak-speak-line function implemented the necessary logic to grab the text of the line under the cursor and send it to the speech server. With the previous definition in place, Emacspeak 0.0 was up and running; it provided the scaffolding for building the actual system.
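
For readers who want to see the shape of such a function, here is a bare-bones sketch (not the actual Emacspeak implementation); my-tts-say stands in for whatever routine hands text to the speech server, as sketched in the previous section:

	;; Illustrative only: grab the current line and hand it to the speech server.
	(defun my-speak-line ()
	  "Speak the text of the line under point."
	  (interactive)
	  (my-tts-say (buffer-substring-no-properties
	               (line-beginning-position) (line-end-position))))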

Iterating on the First-Cut Implementation

The next iteration returned to the speech server to enhance it with a well-defined eventing loop. Rather than simply executing each speech command as it was received, the speech server queued client requests and provided a launch command that caused the server to execute queued requests.

The server used the select system call to check for newly arrived commands after sending each clause to the speech engine. This enabled immediate silencing of speech; with the somewhat naïve implementation described in version 0 of the speech server, the command to stop speech would not take immediate effect since the speech server would first process previously issued speak commands to completion. With the speech queue in place, the client application could now queue up arbitrary amounts of text and still get a high degree of responsiveness when issuing higher-priority commands such as requests to stop speech.

Implementing an event queue inside the speech server also gave the client application finer control over how text was split into chunks before synthesis. This turns out to be crucial for producing good intonation structure. The rules by which text should be split up into clauses vary depending on the nature of the text being spoken. As an example, newline characters in programming languages such as Python are statement delimiters and determine clause boundaries, but newlines do not constitute clause delimiters in English text.

As an example, a clause boundary is inserted after each line when speaking the following Python code:

	i=1
	j=2

See the section "Augmenting Emacs to create aural display lists," later in this chapter, for details on how Python code is distinguished and its semantics are transferred to the speech layer.

With the speech server now capable of smart text handling, the Emacspeak client could become more sophisticated with respect to its handling of text. The emacspeak-speak-line function turned into a library of speech-generation functions that implemented the following steps (a minimal sketch appears after the list):

  • Parse text to split it into a sequence of clauses.

  • Preprocess text—e.g., handle repeated strings of punctuation marks.

  • Carry out a number of other functions that got added over time.

  • Queue each clause to the speech server, and issue the launch command.
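
Here is that minimal sketch. Every name in it is illustrative rather than the real Emacspeak API; the point is only to show the split, preprocess, queue, and launch sequence.

	;; Deliberately naive sketch of the pipeline; names are invented.
	(defun my-speak-region (start end)
	  "Split the region into clauses, queue each one, then launch speech."
	  (dolist (clause (split-string (buffer-substring start end)
	                                "[.!?\n]+" t "[ \t]+"))
	    ;; Preprocess, e.g. collapse repeated punctuation, before queuing.
	    (my-tts-queue (replace-regexp-in-string "\\([-*=]\\)\\1+" "\\1" clause)))
	  (my-tts-launch))

	(defun my-tts-queue (clause)
	  "Stand-in for sending a queue command to the speech server."
	  (message "queue: %s" clause))

	(defun my-tts-launch ()
	  "Stand-in for issuing the speech server's launch command."
	  (message "launch"))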

From here on, the rest of Emacspeak was implemented using Emacspeak as the development environment. This has been significant in how the code base has evolved. New features are tested immediately, and badly implemented features can render the entire system unusable. Lisp's incremental code development fits naturally with the former; to guard against the latter, the Emacspeak code base has evolved to be "bushy"—i.e., most parts of the higher-level system are mutually independent and depend on a small core that is carefully maintained.

A Brief advice Tutorial

Lisp advice is key to the Emacspeak implementation, and this chapter would not be complete without a brief overview. The advice facility allows one to modify existing functions without changing the original implementation. What's more, once a function f has been modified by advice m, all calls to function f are affected by advice.

advice comes in three flavors:

before

The advice body is run before the original function is invoked.

after

The advice body is run after the original function has completed.

around

The advice body is run instead of the original function. The around advice can call the original function if desired.

All advice forms get access to the arguments of the adviced function; in addition, around and after get access to the return value computed by the original function. The Lisp implementation achieves this magic by:

  1. Caching the original implementation of the function

  2. Evaluating the advice form to generate a new function definition

  3. Storing this definition as the adviced function

Thus, when the advice fragment shown in the earlier section "A Simple First-Cut Implementation" is evaluated, Emacs' original next-line function is replaced by a modified version that speaks the current line after the original next-line function has completed its work.
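
Because the chapter only uses after advice, here is a hedged sketch of the other two flavors applied to the same next-line function; the advice names and the timing example are purely illustrative.

	;; A before advice: runs before the original next-line.
	(defadvice next-line (before my-announce pre act)
	  "Announce the impending move."
	  (message "About to move down"))

	;; An around advice: runs instead of next-line; ad-do-it invokes the original.
	(defadvice next-line (around my-timer pre act)
	  "Report how long the original call took (illustrative only)."
	  (let ((started (current-time)))
	    ad-do-it
	    (message "next-line took %.3f ms"
	             (* 1000 (float-time (time-subtract (current-time) started))))))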

Generating Rich Auditory Output

At this point in its evolution, here is what the overall design looked like:

  1. Emacs' interactive commands are speech-enabled or adviced to produce auditory output.

  2. advice definitions are collected into modules, one each for every Emacs application being speech-enabled.

  3. The advice forms forward text to core speech functions.

  4. These functions extract the text to be spoken and forward it to the tts-speak function.

  5. The tts-speak function produces auditory output by preprocessing its text argument and sending it to the speech server.

  6. The speech server handles queued requests to produce perceptible output.

Text is preprocessed by placing the text in a special scratch buffer. Buffers acquire specialized behavior via buffer-specific syntax tables that define the grammar of buffer contents and buffer-local variables that affect behavior. When text is handed off to the Emacspeak core, all of these buffer-specific settings are propagated to the special scratch buffer where the text is preprocessed. This automatically ensures that text is meaningfully parsed into clauses based on its underlying grammar.
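
As a rough illustration of that idea (the names here are invented, and the real Emacspeak code is more involved), the preprocessing step can copy the originating buffer's syntax table into a scratch buffer before working on the text:

	;; Illustrative sketch: preprocess TEXT with SOURCE-BUFFER's grammar settings.
	(defun my-preprocess-with-context (text source-buffer)
	  "Preprocess TEXT in a scratch buffer that reuses SOURCE-BUFFER's syntax table."
	  (let ((table (with-current-buffer source-buffer (syntax-table))))
	    (with-temp-buffer
	      (with-syntax-table table
	        (insert text)
	        ;; Clause splitting and other preprocessing would happen here.
	        (buffer-string)))))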

Audio formatting using voice-lock

Emacs uses font-lock to syntactically color text. For creating the visual presentation, Emacs adds a text property called face to text strings; the value of this face property specifies the font, color, and style to be used to display that text. Text strings with face properties can be thought of as a conceptual visual display list.

Emacspeak augments these visual display lists with personality text properties whose values specify the auditory properties to use when rendering a given piece of text; this is called voice-lock in Emacspeak. The value of the personality property is an Aural CSS (ACSS) setting that encodes various voice properties—e.g., the pitch of the speaking voice. Notice that such ACSS settings are not specific to any given TTS engine. Emacspeak implements ACSS-to-TTS mappings in engine-specific modules that map high-level aural properties—e.g., pitch or pitch-range—to engine-specific control codes.
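
As a small illustration, a personality is attached with the same primitive that font-lock uses for faces; here a region is given the voice-monotone overlay described later in this chapter (the snippet is illustrative rather than taken from the Emacspeak sources):

	;; Attach an ACSS-backed personality to a region, just as font-lock adds a face.
	(put-text-property (point-min) (point-max) 'personality 'voice-monotone)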

The next few sections describe how Emacspeak augments Emacs to create aural display lists and to process these aural display lists to produce engine-specific output.

Augmenting Emacs to create aural display lists

Emacs modules that implement font-lock call the Emacs built-in function put-text-property to attach the relevant face property. Emacspeak defines an advice fragment that advices the put-text-property function to add in the corresponding personality property when it is asked to add a face property. Note that the value of both display properties ( face and personality ) can be lists; values of these properties are thus designed to cascade to create the final (visual or auditory) presentation. This also means that different parts of an application can progressively add display property values.

The put-text-property function has the following signature:

	(put-text-property START END PROPERTY VALUE &optional OBJECT)

The advice implementation is:

	(defadvice put-text-property (after emacspeak-personality pre act)
	  "Used by emacspeak to augment font lock."
	  (let ((start (ad-get-arg 0)) ;; Bind arguments
	        (end (ad-get-arg 1))
	        (prop (ad-get-arg 2)) ;; name of property being added
	        (value (ad-get-arg 3))
	        (object (ad-get-arg 4))
	        (voice nil)) ;; voice it maps to
	    (when (and (eq prop 'face) ;; avoid infinite recursion
	               (not (= start end)) ;; non-nil text range
	               emacspeak-personality-voiceify-faces)
	      (condition-case nil ;; safely look up face mapping
	          (progn
	            (cond
	             ((symbolp value)
	              (setq voice (voice-setup-get-voice-for-face value)))
	             ((ems-plain-cons-p value)) ;; pass on plain cons
	             ((listp value)
	              (setq voice
	                    (delq nil
	                          (mapcar #'voice-setup-get-voice-for-face value))))
	             (t (message "Got %s" value)))
	            (when voice ;; voice holds list of personalities
	              (funcall emacspeak-personality-voiceify-faces
	                       start end voice object)))
	        (error nil)))))

Here is a brief explanation of this advice definition:

Bind arguments

First, the function uses the advice built-in ad-get-arg to locally bind a set of lexical variables to the arguments being passed to the adviced function.

Personality setter

The mapping of faces to personalities is controlled by the user-customizable variable emacspeak-personality-voiceify-faces. If non-nil, this variable specifies a function with the following signature:

	(emacspeak-personality-put START END PERSONALITY OBJECT)

Emacspeak provides different implementations of this function that either append or prepend the new personality value to any existing personality properties.
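
A hedged sketch of an append-style setter follows; the real Emacspeak implementations are more careful (for instance, they handle existing personality values that vary across the region), and the function name here is invented:

	;; Illustrative append-style setter: add PERSONALITY after existing ones.
	(defun my-personality-append (start end personality &optional object)
	  "Add PERSONALITY after any personalities already present on the text."
	  (let ((existing (get-text-property start 'personality object)))
	    (put-text-property
	     start end 'personality
	     (cond ((null existing) personality)
	           ((listp existing) (append existing (list personality)))
	           (t (list existing personality)))
	     object)))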

Guard

Along with checking for a non-nil emacspeak-personality-voiceify-faces , the function performs additional checks to determine whether this advice definition should do anything. The function continues to act if:

  • The text range is non-nil.

  • The property being added is a face.

The first of these checks is required to avoid edge cases where put-text-property is called with a zero-length text range. The second ensures that we attempt to add the personality property only when the property being added is face. Notice that failure to include this second test would cause infinite recursion because the eventual put-text-property call that adds the personality property also triggers the advice definition.

Get mapping

Next, the function safely looks up the voice mapping of the face (or faces) being applied. If applying a single face , the function looks up the corresponding personality mapping; if applying a list of faces, it creates a corresponding list of personalities.

Apply personality

Finally, the function checks that it found a valid voice mapping and, if so, calls emacspeak-personality-voiceify-faces with the set of personalities saved in the voice variable.

Audio-formatted output from aural display lists

With the advice definitions from the previous section in place, text fragments that are visually styled acquire a corresponding personality property that holds an ACSS setting for audio formatting the content. The result is to turn text in Emacs into rich aural display lists. This section describes how the output layer of Emacspeak is enhanced to convert these aural display lists into perceptible spoken output.

The Emacspeak tts-speak module handles text preprocessing before finally sending it to the speech server. As described earlier, this preprocessing comprises a number of steps, including:

  1. Applying pronunciation rules

  2. Processing repeated strings of punctuation characters

  3. Splitting text into appropriate clauses based on context

  4. Converting the personality property into audio formatting codes

This section describes the tts-format-text-and-speak function, which handles the conversion of aural display lists into audio-formatted output. First, here is the code for the function tts-format-text-and-speak :

	(defsubst tts-format-text-and-speak (start end)
	  "Format and speak text between start and end."
	  (when (and emacspeak-use-auditory-icons
	             (get-text-property start 'auditory-icon)) ;; queue icon
	    (emacspeak-queue-auditory-icon (get-text-property start 'auditory-icon)))
	  (tts-interp-queue (format "%s\n" tts-voice-reset-code))
	  (cond
	   (voice-lock-mode ;; audio format only if voice-lock-mode is on
	    (let ((last nil) ;; initialize
	          (personality (get-text-property start 'personality)))
	      (while (and (< start end) ;; chunk at personality changes
	                  (setq last
	                        (next-single-property-change start 'personality
	                                                     (current-buffer) end)))
	        (if personality ;; audio format chunk
	            (tts-speak-using-voice personality (buffer-substring start last))
	          (tts-interp-queue (buffer-substring start last)))
	        (setq start last ;; prepare for next chunk
	              personality (get-text-property last 'personality)))))
	   ;; no voice-lock: just send the text
	   (t (tts-interp-queue (buffer-substring start end)))))

The tts-format-text-and-speak function is called one clause at a time, with arguments start and end set to the start and end of the clause. If voice-lock-mode is turned on, this function further splits the clause into chunks at each point in the text where there is a change in value of the personality property. Once such a transition point has been determined, tts-format-text-and-speak calls the function tts-speak-using-voice , passing the personality to use and the text to be spoken. This function, described next, looks up the appropriate device-specific codes before dispatching the audio-formatted output to the speech server:

	(defsubst tts-speak-using-voice (voice text)
	  "Use voice VOICE to speak text TEXT."
	  (unless (or (eq 'inaudible voice) ;; not spoken if voice inaudible
	              (and (listp voice) (member 'inaudible voice)))
	    (tts-interp-queue
	     (format
	      "%s%s %s \n"
	      (cond
	       ((symbolp voice)
	        (tts-get-voice-command
	         (if (boundp voice) (symbol-value voice) voice)))
	       ((listp voice)
	        (mapconcat #'(lambda (v)
	                       (tts-get-voice-command
	                        (if (boundp v) (symbol-value v) v)))
	                   voice
	                   " "))
	       (t ""))
	      text tts-voice-reset-code))))

The tts-speak-using-voice function returns immediately if the specified voice is inaudible. Here, inaudible is a special personality that Emacspeak uses to prevent pieces of text from being spoken. The inaudible personality can be used to advantage when selectively hiding portions of text to produce more succinct output.

If the specified voice (or list of voices) is not inaudible, the function looks up the speech codes for the voice and queues to the speech server the text to be spoken, wrapped between the voice code and tts-voice-reset-code.

Using Aural CSS (ACSS) for Styling Speech Output

I first formalized audio formatting within AsTeR, where rendering rules were written in a specialized language called Audio Formatting Language ( AFL). AFL structured the available parameters in auditory space—e.g., the pitch of the speaking voice—into a multidimensional space, and encapsulated the state of the rendering engine as a point in this multidimensional space.

AFL provided a block-structured language that encapsulated the current rendering state by a lexically scoped variable, and provided operators to move within this structured space. When these notions were later mapped to the declarative world of HTML and CSS, dimensions making up the AFL rendering state became Aural CSS parameters, provided as accessibility measures in CSS2 ( http://www.w3.org/Press/1998/CSS2-REC).

Though designed for styling HTML (and, in general, XML) markup trees, Aural CSS turned out to be a good abstraction for building Emacspeak's audio formatting layer while keeping the implementation independent of any given TTS engine.

Here is the definition of the data structure that encapsulates ACSS settings:

	(defstruct acss
	  family gain left-volume right-volume
	  average-pitch pitch-range stress richness punctuations)

Emacspeak provides a collection of predefined voice overlays for use within speech extensions. Voice overlays are designed to cascade in the spirit of Aural CSS. As an example, here is the ACSS setting that corresponds to voice-monotone :

	[cl-struct-acss nil nil nil nil nil 0 0 nil all]

Notice that most fields of this acss structure are nil —that is, unset. The setting creates a voice overlay that:

  1. Sets pitch to 0 to create a flat voice.

  2. Sets pitch-range to 0 to create a monotone voice with no inflection.

    This setting is used as the value of the personality property for audio formatting comments in all programming language modes. Because its value is an overlay, it can interact effectively with other aural display properties. As an example, if portions of a comment are displayed in a bold font, those portions can have the voice-bolden personality (another predefined overlay) added; this results in setting the personality property to a list of two values: (voice-bolden voice-monotone). The final effect is for the text to get spoken with a distinctive voice that conveys both aspects of the text: namely, a sequence of words that are emphasized within a comment.

  3. Sets punctuations to all so that all punctuation marks are spoken.

Adding Auditory Icons

Rich visual user interfaces contain both text and icons. Similarly, once Emacspeak had the ability to speak intelligently, the next step was to increase the bandwidth of aural communication by augmenting the output with auditory icons.

Auditory icons in Emacspeak are short sound snippets (no more than two seconds in duration) and are used to indicate frequently occurring events in the user interface. As an example, every time the user saves a file, the system plays a confirmatory sound. Similarly, opening or closing an object (anything from a file to a web site) produces a corresponding auditory icon. The set of auditory icons were arrived at iteratively and cover common events such as objects being opened, closed, or deleted. This section describes how these auditory icons are injected into Emacspeak's output stream.

Auditory icons are produced by the following user interactions:

  • To cue explicit user actions

  • To add additional cues to spoken output

Auditory icons that confirm user actions—e.g., a file being saved successfully—are produced by adding an after advice to the various Emacs built-ins. To provide a consistent sound and feel across the Emacspeak desktop, such extensions are attached to code that is called from many places in Emacs.

Here is an example of such an extension, implemented via an advice fragment:

	(defadvice save-buffer (after emacspeak pre act)
	  "Produce an auditory icon if possible."
	  (when (interactive-p)
	    (emacspeak-auditory-icon 'save-object)
	    (or emacspeak-last-message (message "Wrote %s" (buffer-file-name)))))

Extensions can also be implemented via an Emacs-provided hook. As explained in the brief advice tutorial given earlier, advice allows the behavior of existing software to be extended or modified without having to modify the underlying source code. Emacs is itself an extensible system, and well-written Lisp code has a tradition of providing appropriate extension hooks for common use cases. As an example, Emacspeak attaches auditory feedback to Emacs' default prompting mechanism (the Emacs minibuffer) by adding the function emacspeak-minibuffer-setup-hook to Emacs' minibuffer-setup-hook :

	(defun emacspeak-minibuffer-setup-hook ()
	  "Actions to take when entering the minibuffer."
	  (let ((inhibit-field-text-motion t))
	    (when emacspeak-minibuffer-enter-auditory-icon
	      (emacspeak-auditory-icon 'open-object))
	    (tts-with-punctuations 'all (emacspeak-speak-buffer))))
	(add-hook 'minibuffer-setup-hook 'emacspeak-minibuffer-setup-hook)

This is a good example of using built-in extensibility where available. However, Emacspeak uses advice in many cases because the Emacspeak requirement of adding auditory feedback to all of Emacs was not originally envisioned when Emacs was implemented. Thus, the Emacspeak implementation demonstrates a powerful technique for discovering extension points.

Lack of an advice-like feature in a programming language often makes experimentation difficult, especially when it comes to discovering useful extension points. This is because software engineers are faced with the following trade-off:

  • Make the system arbitrarily extensible (and arbitrarily complex)

  • Guess at some reasonable extension points and hardcode these

Once extension points are implemented, experimenting with new ones requires rewriting existing code, and the resulting inertia often means that over time, such extension points remain mostly undiscovered. Lisp advice, and its Java counterpart Aspects, offer software engineers the opportunity to experiment without worrying about adversely affecting an existing body of source code.

Producing Auditory Icons While Speaking Content

In addition to using auditory icons to cue the results of user interaction, Emacspeak uses auditory icons to augment what is being spoken. Examples of such auditory icons include:

  • A short icon at the beginning of paragraphs

  • The auditory icon mark-object when moving across source lines that have a breakpoint set on them

Auditory icons are implemented by attaching the text property emacspeak-auditory-icon with a value equal to the name of the auditory icon to be played on the relevant text.

As an example, commands to set breakpoints in the Grand Unified Debugger Emacs package (GUD) are adviced to add the property emacspeak-auditory-icon to the line containing the breakpoint. When the user moves across such a line, the function tts-format-text-and-speak queues the auditory icon at the right point in the output stream.
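
A minimal sketch of such an advice fragment might look like the following; the actual Emacspeak definitions differ, and gud-break is used here only as a representative breakpoint command:

	;; Illustrative: after a GUD breakpoint command runs, tag the current line.
	(defadvice gud-break (after emacspeak-sketch pre act)
	  "Mark the line holding the new breakpoint with an auditory icon property."
	  (put-text-property (line-beginning-position) (line-end-position)
	                     'emacspeak-auditory-icon 'mark-object))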

The Calendar: Enhancing Spoken Output with Context-Sensitive Semantics

To summarize the story so far, Emacspeak has the ability to:

  • Produce auditory output from within the context of an application

  • Audio-format output to increase the bandwidth of spoken communication

  • Augment spoken output with auditory icons

This section explains some of the enhancements that the design makes possible.

I started implementing Emacspeak in October 1994 as a quick means of developing a speech solution for Linux. It was when I speech-enabled the Emacs Calendar in the first week of November 1994 that I realized that in fact I had created something far better than any other speech-access solution I had used before.

A calendar is a good example of using a specific type of visual layout that is optimized both for the visual medium as well as for the information that is being conveyed. We intuitively think in terms of weeks and months when reasoning about dates; using a tabular layout that organizes dates in a grid with each week appearing on a row by itself matches this perfectly. With this form of layout, the human eye can rapidly move by days, weeks, or months through the calendar and easily answer such questions as "What day is it tomorrow?" and "Am I free on the third Wednesday of next month?"

Notice, however, that simply speaking this two-dimensional layout does not transfer the efficiencies achieved in the visual context to auditory interaction. This is a good example of where the right auditory feedback has to be generated directly from the underlying information being conveyed, rather than from its visual representation. When producing auditory output from visually formatted information, one has to rediscover the underlying semantics of the information before speaking it.

In contrast, when producing spoken feedback via advice definitions that extend the underlying application, one has full access to the application's runtime context. Thus, rather than guessing based on visual layout, one can essentially instruct the underlying application to speak the right thing!

The emacspeak-calendar module speech-enables the Emacs Calendar by defining utility functions that speak calendar information and advising all calendar navigation commands to call these functions. Thus, Emacs Calendar produces specialized behavior by binding the arrow keys to calendar navigation commands rather than the default cursor navigation found in regular editing modes. Emacspeak specializes this behavior by advising the calendar-specific commands to speak the relevant information in the context of the calendar.

The net effect is that from an end user's perspective, things just work. In regular editing modes, pressing up/down arrows speaks the current line; pressing up/down arrows in the calendar navigates by weeks and speaks the current date.

The emacspeak-calendar-speak-date function, defined in the emacspeak-calendar module, is shown here. Notice that it uses all of the facilities described so far to access and audio-format the relevant contextual information from the calendar:

	(defsubst emacspeak-calendar-entry-marked-p ()
	  (member 'diary (mapcar #'overlay-face (overlays-at (point)))))

	(defun emacspeak-calendar-speak-date ()
	  "Speak the date under point when called in Calendar Mode."
	  (let ((date (calendar-date-string (calendar-cursor-to-date t))))
	    (cond
	     ((emacspeak-calendar-entry-marked-p)
	      (tts-speak-using-voice mark-personality date))
	     (t (tts-speak date)))))

Emacs marks dates that have a diary entry with a special overlay. In the previous definition, the helper function emacspeak-calendar-entry-marked-p checks this overlay to implement a predicate that can be used to test if a date has a diary entry. The emacspeak-calendar-speak-date function uses this predicate to decide whether the date needs to be rendered in a different voice; dates that have calendar entries are spoken using the mark-personality voice. Notice that the emacspeak-calendar-speak-date function accesses the calendar's runtime context in the call:

	(calendar-date-string (calendar-cursor-to-date t))

The emacspeak-calendar-speak-date function is called from advice definitions attached to all calendar navigation functions. Here is the advice definition for function calendar-forward-week :

	(defadvice calendar-forward-week (after emacspeak pre act)
	  "Speak the date."
	  (when (interactive-p)
	    (emacspeak-calendar-speak-date)
	    (emacspeak-auditory-icon 'large-movement)))

This is an after advice, because we want the spoken feedback to be produced after the original navigation command has done its work.

The body of the advice definition first calls the function emacspeak-calendar-speak-date to speak the date under the cursor; next, it calls emacspeak-auditory-icon to produce a short sound indicating that we have successfully moved.

31.3. Painless Access to Online Information

With all the necessary affordances to generate rich auditory output in place, speech-enabling Emacs applications using Emacs Lisp's advice facility requires surprisingly small amounts of specialized code. With the TTS layer and the Emacspeak core handling the complex details of producing good quality output, the speech-enabling extensions focus purely on the specialized semantics of individual applications; this leads to simple and consequently beautiful code. This section illustrates the concept with a few choice examples taken from Emacspeak's rich suite of information access tools.

Right around the time I started Emacspeak, a far more profound revolution was taking place in the world of computing: the World Wide Web went from being a tool for academic research to a mainstream forum for everyday tasks. This was 1994, when writing a browser was still a comparatively easy task. The complexity that has been progressively added to the Web in the subsequent 12 years often tends to obscure the fact that the Web is still a fundamentally simple design where:

  • Content creators publish web resources addressable via URIs.

  • URI-addressable content is retrievable via open protocols.

  • Retrieved content is in HTML, a well-understood markup language.

Notice that the basic architecture just sketched out says little to nothing about how the content is made available to the end user. The mid-1990s saw the Web move toward increasingly complex visual interaction. The commercial Web with its penchant for flashy visual interaction increasingly moved away from the simple data-oriented interaction that had characterized early web sites. By 1998, I found that the Web had a lot of useful interactive sites; to my dismay, I also found that I was using progressively fewer of these sites because of the time it took to complete tasks when using spoken output.

This led me to create a suite of web-oriented tools within Emacspeak that went back to the basics of web interaction. Emacs was already capable of rendering simple HTML into interactive hypertext documents. As the Web became complex, Emacspeak acquired a collection of interaction wizards built on top of Emacs' HTML rendering capability that progressively factored out the complexity of web interaction to create an auditory interface that allowed the user to quickly and painlessly listen to desired information.

Basic HTML with Emacs W3 and Aural CSS

Emacs W3 is a bare-bones web browser first implemented in the mid-1990s. Emacs W3 implemented CSS (Cascading Style Sheets) early on, and this was the basis of the first Aural CSS implementation, which was released at the time I wrote the Aural CSS draft in February 1996. Emacspeak speech-enables Emacs W3 via the emacspeak-w3 module, which implements the following extensions:

  • An aural media section in the default stylesheet for Aural CSS.

  • advice added to all interactive commands to produce auditory feedback.

  • Special patterns to recognize and silence decorative images on web pages.

  • Aural rendering of HTML form fields along with the associated label, which underlay the design of the label element in HTML 4.

  • Context-sensitive rendering rules for HTML form controls. As an example, given a group of radio buttons for answering the question:

    Do you accept?

    Emacspeak extends Emacs W3 to produce a spoken message of the form:

    Radio group Do you accept? has Yes pressed.

    and:

    Press this to change radio group Do you accept? from Yes to No.
  • A before advice defined for the Emacs W3 function w3-parse-buffer that applies user-requested XSLT transforms to HTML pages.

The emacspeak-websearch Module for Task-Oriented Search

By 1997, interactive sites on the Web, ranging from Altavista for searching to Yahoo! Maps for online directions, required the user to go through a highly visual process that included:

  1. Filling in a set of form fields

  2. Submitting the resulting form

  3. Spotting the results in the resulting complex HTML page

The first and third of these steps were the ones that took time when using spoken output. I needed to first locate the various form fields on a visually busy page and wade through a lot of complex boilerplate material on result pages before I found the answer.

Notice that from the software design point of view, these steps neatly map into pre-action and post-action hooks. Because web interaction follows a very simple architecture based on URIs, the pre-action step of prompting the user for the right pieces of input can be factored out of a web site and placed in a small piece of code that runs locally; this obviates the need for the user to open the initial launch page and seek out the various input fields.

Similarly, the post-action step of spotting the actual results amid the rest of the noise on the resulting page can also be delegated to software.

Finally, notice that even though these pre-action and post-action steps are each specific to particular web sites, the overall design pattern is one that can be generalized. This insight led to the emacspeak-websearch module, a collection of task-oriented web tools that:

  1. Prompted the user

  2. Constructed an appropriate URI and pulled the content at that URI

  3. Filtered the result before rendering the relevant content via Emacs W3

Here is the emacspeak-websearch tool for accessing directions from Yahoo! Maps:

	(defsubst emacspeak-websearch-yahoo-map-directions-get-locations ()
	  "Convenience function for prompting and constructing the route component."
	  (concat
	   (format "&newaddr=%s"
	           (emacspeak-url-encode (read-from-minibuffer "Start Address: ")))
	   (format "&newcsz=%s"
	           (emacspeak-url-encode (read-from-minibuffer "City/State or Zip:")))
	   (format "&newtaddr=%s"
	           (emacspeak-url-encode (read-from-minibuffer "Destination Address: ")))
	   (format "&newtcsz=%s"
	           (emacspeak-url-encode (read-from-minibuffer "City/State or Zip:")))))

	(defun emacspeak-websearch-yahoo-map-directions-search (query)
	  "Get driving directions from Yahoo."
	  (interactive
	   (list (emacspeak-websearch-yahoo-map-directions-get-locations)))
	  (emacspeak-w3-extract-table-by-match
	   "Start"
	   (concat emacspeak-websearch-yahoo-maps-uri query)))

A brief explanation of the previous code follows:

Pre-action

The emacspeak-websearch-yahoo-map-directions-get-locations function prompts the user for the start and end locations. Notice that this function hardwires the names of the query parameters used by Yahoo! Maps. On the surface, this looks like a kluge that is guaranteed to break. In fact, this kluge has not broken since it was first defined in 1997. The reason is obvious: once a web application has published a set of query parameters, those parameters get hardcoded in a number of places, including within a large number of HTML pages on the originating web site. Depending on parameter names may feel brittle to the software architect used to structured, top-down APIs, but the use of such URL parameters to define bottom-up web services leads to the notion of RESTful web APIs.

Retrieve content

The URL for retrieving directions is constructed by concatenating the user input to the base URI for Yahoo! Maps.

Post-action

The resulting URI is passed to the function emacspeak-w3-extract-table-by-match along with a search pattern Start to:

  • Retrieve the content using Emacs W3.

  • Apply an XSLT transform to extract the table containing Start .

  • Render this table using Emacs W3's HTML formatter.

Unlike the query parameters, the layout of the results page does change about once a year, on average. But keeping this tool current with Yahoo! Maps comes down to maintaining the post-action portion of this utility. In over eight years of use, I have had to modify it about half a dozen times, and given that the underlying platform provides many of the tools for filtering the result page, the actual amount of code that needs to be written for each layout change is minimal.

The emacspeak-w3-extract-table-by-match function uses an XSLT transformation that filters a document to return tables that contain a specified search pattern. For this example, the function constructs the following XPath expression:

	(/descendant::table[contains(., "Start")])[last()]

This effectively picks out the list of tables that contain the string Start and returns the last element of that list.
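
The chapter does not show how emacspeak-w3-extract-table-by-match builds this expression, but it presumably amounts to formatting the search pattern into a template along these lines (an assumption, shown only for concreteness):

	;; Illustrative: build the XPath filter from a search pattern.
	(format "(/descendant::table[contains(., \"%s\")])[last()]" "Start")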

Seven years after this utility was written, Google launched Google Maps to great excitement in February 2005. Many blogs on the Web put Google Maps under the microscope and quickly discovered the query parameters used by that application. I used that to build a corresponding Google Maps tool in Emacspeak that provides similar functionality. The user experience is smoother with the Google Maps tool because the start and end locations can be specified within the same parameter. Here is the code for the Google Maps wizard:

	(defun emacspeak-websearch-emaps-search (query &optional use-near)
	  "Perform EmapSpeak search. Query is in plain English."
	  (interactive
	   (list
	    (emacspeak-websearch-read-query
	     (if current-prefix-arg
	         (format "Find what near %s: "
	                 emacspeak-websearch-emapspeak-my-location)
	       "EMap Query: "))
	    current-prefix-arg))
	  (let ((near-p ;; determine query type
	         (unless use-near
	           (save-match-data (and (string-match "near" query) (match-end 0)))))
	        (near nil)
	        (uri nil))
	    (when near-p ;; determine location from query
	      (setq near (substring query near-p))
	      (setq emacspeak-websearch-emapspeak-my-location near))
	    (setq uri
	          (cond
	           (use-near
	            (format emacspeak-websearch-google-maps-uri
	                    (emacspeak-url-encode
	                     (format "%s near %s" query near))))
	           (t (format emacspeak-websearch-google-maps-uri
	                      (emacspeak-url-encode query)))))
	    (add-hook 'emacspeak-w3-post-process-hook 'emacspeak-speak-buffer)
	    (add-hook 'emacspeak-w3-post-process-hook
	              #'(lambda nil
	                  (emacspeak-pronounce-add-buffer-local-dictionary-entry
	                   "mi" " miles ")))
	    (browse-url-of-buffer
	     (emacspeak-xslt-xml-url
	      (expand-file-name "kml2html.xsl" emacspeak-xslt-directory)
	      uri))))

A brief explanation of the code follows:

  1. Parse the input to decide whether it's a direction or a search query.

  2. In case of search queries, cache the user's location for future use.

  3. Construct a URI for retrieving results.

  4. Browse the results of filtering the contents of the URI through the XSLT filter kml2html , which converts the retrieved content into a simple hypertext document.

  5. Set up custom pronunciations in the results to pronounce mi as "miles."

Notice that, as before, most of the code focuses on application-specific tasks. Rich spoken output is produced by creating the results as a well-structured HTML document with the appropriate Aural CSS rules producing an audio-formatted presentation.

The Web Command Line and URL Templates

With more and more services becoming available on the Web, another useful pattern emerged by early 2000: web sites started creating smart client-side interaction via JavaScript. One typical use of such scripts was to construct URLs on the client side for accessing specific pieces of content based on user input. As examples, Major League Baseball constructs the URL for retrieving scores for a given game by piecing together the date and the names of the home and visiting teams, and NPR creates URLs by piecing together the date with the program code of a given NPR show.

To enable fast access to such services, I added an emacspeak-url-template module in late 2000. This module has become a powerful companion to the emacspeak-websearch module described in the previous section. Together, these modules turn the Emacs minibuffer into a powerful web command line that provides rapid access to web content.

Many web services require the user to specify a date. One can usefully default the date by using the user's calendar to provide the context. Thus, Emacspeak tools for playing an NPR program or retrieving MLB scores default to using the date under the cursor when invoked from within the Emacs calendar buffer.
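
A hedged sketch of that defaulting logic follows; the helper name is invented, and the real Emacspeak generators are richer:

	;; Illustrative: prefer the calendar date under point, else today's date.
	(require 'calendar)

	(defun my-default-date ()
	  "Return a date string, preferring the calendar date under point."
	  (if (eq major-mode 'calendar-mode)
	      (calendar-date-string (calendar-cursor-to-date t))
	    (format-time-string "%d %b %Y")))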

URL templates in Emacspeak are implemented using the following data structure:

	(defstruct (emacspeak-url-template (:constructor emacspeak-ut-constructor))
	  name          ;; human-readable name
	  template      ;; template URL string
	  generators    ;; list of param generators
	  post-action   ;; action to perform after opening
	  documentation ;; resource documentation
	  fetcher)

Users invoke URL templates via the Emacspeak command emacspeak-url-template-fetch, which prompts for the name of a URL template and:

  1. Looks up the named template.

  2. Prompts the user by calling the specified generator.

  3. Applies the Lisp function format to the template string and the collected arguments to create the final URI.

  4. Sets up any post actions performed after the content has been rendered.

  5. Applies the specified fetcher to render the content.

The use of this structure is best explained with an example. The following is the URL template for playing NPR programs:

	(emacspeak-url-template-define
	 "NPR On Demand"
	 "http://www.npr.org/dmg/dmg.php?prgCode=%s&showDate=%s&segNum=%s&mediaPref=RM"
	 (list
	  #'(lambda ( ) (upcase (read-from-minibuffer "Program code:")))
	  #'(lambda ( )
	      (emacspeak-url-template-collect-date "Date:" "%d-%b-%Y"))
	  "Segment:")
	 nil; no post actions
	 "Play NPR shows on demand.
	Program is specified as a program code:
	ME              Morning Edition
	ATC             All Things Considered
	day             Day To Day
	newsnotes       News And Notes
	totn            Talk Of The Nation
	fa              Fresh Air
	wesat           Weekend Edition Saturday
	wesun           Weekend Edition Sunday
	fool            The Motley Fool
	Segment is specified as a two digit number --specifying a blank value
	plays entire program."
	 #'(lambda (url)
	     (funcall emacspeak-media-player url 'play-list)
	     (emacspeak-w3-browse-xml-url-with-style
	      (expand-file-name "smil-anchors.xsl" emacspeak-xslt-directory)
	      url)))

In this example, the custom fetcher performs two actions:

  1. Launches a media player to start playing the audio stream.

  2. Filters the associated SMIL document via the XSLT file smil-anchors.xsl .

The Advent of Feed Readers

When I implemented the emacspeak-websearch and emacspeak-url-template modules, Emacspeak needed to screen-scrape HTML pages to speak the relevant information. But as the Web grew in complexity, the need to readily get beyond the superficial presentation of pages to the real content took on a wider value than eyes-free access. Even users capable of working with complex visual interfaces found themselves under a serious information overload. This led to the advent of RSS and Atom feeds, and the concomitant arrival of feed reading software.

These developments have had a very positive effect on the Emacspeak code base. During the past few years, the code has become more beautiful as I have progressively deleted screen-scraping logic and replaced it with direct content access. As an example, here is the Emacspeak URL template for retrieving the weather for a given city/state:

	(emacspeak-url-template-define
	 "rss weather from wunderground"
	 "http://www.wunderground.com/auto/rss_full/%s.xml?units=both"
	 (list "State/City e.g.: MA/Boston") nil
	 "Pull RSS weather feed for specified state/city."
	 'emacspeak-rss-display)

And here is the URL template for Google News searches via Atom feeds:

	(emacspeak-url-template-define
	 "Google News Search"
	 "http://news.google.com/news?hl=en&ned=tus&q=%s&btnG=Google+Search&output=atom"
	 (list "Search news for: ") nil "Search Google news."
	 'emacspeak-atom-display )

Both of these tools use all of the facilities provided by the emacspeak-url-template module and consequently need to do very little on their own. Finally, notice that by relying on standardized feed formats such as RSS and Atom, these templates now have very little in the way of site-specific kluges, in contrast to older tools like the Yahoo! Maps wizard, which hardwired specific patterns from the results page.

31.4. Summary

Emacspeak was conceived as a full-fledged, eyes-free user interface to everyday computing tasks. To be full-fledged, the system needed to provide direct access to every aspect of computing on desktop workstations. To enable fluent eyes-free interaction, the system needed to treat spoken output and the auditory medium as a first-class citizen—i.e., merely reading out information displayed on the screen was not sufficient.

To provide a complete audio desktop , the target environment needed to be an interaction framework that was both widely deployed and fully extensible. To be able to do more than just speak the screen, the system needed to build interactive speech capability into the various applications.

Finally, this had to be done without modifying the source code of any of the underlying applications; the project could not afford to fork a suite of applications in the name of adding eyes-free interaction, because I wanted to limit myself to the task of maintaining the speech extensions.

To meet all these design requirements, I picked Emacs as the user interaction environment. As an interaction framework, Emacs had the advantage of having a very large developer community. Unlike other popular interaction frameworks available in 1994 when I began the project, it had the significant advantage of being a free software environment. (Now, 12 years later, Firefox affords similar opportunities.)

The enormous flexibility afforded by Emacs Lisp as an extension language was an essential prerequisite in speech-enabling the various applications. The open source nature of the platform was just as crucial; even though I had made an explicit decision that I would modify no existing code, being able to study how various applications were implemented made speech-enabling them tractable. Finally, the availability of a high-quality advice implementation for Emacs Lisp (note that Lisp's advice facility was the prime motivator behind Aspect Oriented Programming) made it possible to speech-enable applications authored in Emacs Lisp without modifying the original source code.

Emacspeak is a direct consequence of the matching up of the needs previously outlined and the affordances provided by Emacs as a user interaction environment.

Managing Code Complexity Over Time

The Emacspeak code base has evolved over a period of 12 years. Except for the first six weeks of development, the code base has been developed and maintained using Emacspeak itself. This section summarizes some of the lessons learned with respect to managing code complexity over time.

Throughout its existence, Emacspeak has always remained a spare-time project. Looking at the code base across time, I believe this has had a significant impact on how it has evolved. When working on large, complex software systems as a full-time project, one has the luxury of focusing one's entire concentration on the code base for reasonable stretches of time—e.g., 6 to 12 weeks. This results in tightly implemented code that creates deep code bases.

Despite one's best intentions, this can also result in code that becomes hard to understand with the passage of time. Large software systems where a single engineer focuses exclusively on the project for a number of years are almost nonexistent; that form of single-minded focus usually leads to rapid burnout!

In contrast, Emacspeak is an example of a large software system that has had a single engineer focused on it over a period of 12 years, but only in his spare time. A consequence of developing the system single-handedly over a number of years is that the code base has tended to be naturally "bushy." Notice the distribution of files and lines of code summarized in Table 31-1, “Summary of Emacspeak codebase”.

Table 31-1. Summary of Emacspeak codebase

Layer                   Files    Lines    Percentage

TTS core                    6     3866           6.0

Emacspeak core             16    12174          18.9

Emacspeak extensions      160    48339          75.0

Total                     182    64379          99.9

Table 31-1, “Summary of Emacspeak codebase” highlights the following points:

  • The TTS core responsible for high-quality speech output is isolated in 6 out of 182 files, and makes up six percent of the code base.

  • The Emacspeak core—which provides high-level speech services to Emacspeak extensions, in addition to speech-enabling all basic Emacs functionality—is isolated to 16 files, and makes up about 19 percent of the code base.

  • The rest of the system is split across 160 files, which can be independently improved (or broken) without affecting the rest of the system. Many modules, such as emacspeak-url-template , are themselves bushy—i.e., an individual URL template can be modified without affecting any of the other URL templates.

  • advice reduces code size. The Emacspeak code base, which has approximately 60,000 lines of Lisp code, is a fraction of the size of the underlying system being speech-enabled. A rough count at the end of December 2006 shows that Emacs 22 has over a million lines of Lisp code; in addition, Emacspeak speech-enables a large number of applications not bundled by default with Emacs.

Conclusion

Here is a brief summary of the insights gained from implementing and using Emacspeak:

  • Lisp advice, and its object-oriented equivalent Aspect Oriented Programming, are very effective means for implementing cross-cutting concerns—e.g., speech-enabling a visual interface.

  • advice is a powerful means for discovering potential points of extension in a complex software system.

  • Focusing on basic web architecture, and relying on a data-oriented web backed by standardized protocols and formats, leads to powerful spoken web access.

  • Focusing on the final user experience, as opposed to individual interaction widgets such as sliders and tree controls, leads to a highly efficient, eyes-free environment.

  • Visual interaction relies heavily on the human eye's ability to rapidly scan the visual display. Effective eyes-free interaction requires transferring some of this responsibility to the computer because listening to large amounts of information is time-consuming. Thus, search in every form is critical for delivering effective eyes-free interaction, on the continuum from the smallest scale (such as Emacs' incremental search to find the right item in a local document) to the largest (such as a Google search to quickly find the right document on the global Web).

  • Visual complexity, which may become merely an irritant for users capable of using complex visual interfaces, is a show-stopper for eyes-free interaction. Conversely, tools that emerge early in an eyes-free environment eventually show up in the mainstream when the nuisance value of complex visual interfaces crosses a certain threshold. Two examples of this from the Emacspeak experience are:

—RSS and Atom feeds replacing the need for screen-scraping just to retrieve essential information such as the titles of articles.
—Emacspeak's use of XSLT to filter content in 2000 parallels the advent of Greasemonkey for applying custom client-side JavaScript to web pages in 2005.

31.5. Acknowledgments

Emacspeak would not exist without Emacs and the ever-vibrant Emacs developer community that has made it possible to do everything from within Emacs. The Emacspeak implementation would not have been possible without Hans Chalupsky's excellent advice implementation for Emacs Lisp.

Project libxslt from the GNOME project has helped breathe fresh life into William Perry's Emacs W3 browser; Emacs W3 was one of the early HTML rendering engines, but the code has not been updated in over eight years. That the W3 code base is still usable and extensible bears testimony to the flexibility and power afforded by Lisp as the implementation language.

Afterword

Andy Oram

Beautiful Code surveys the range of human invention and ingenuity in one area of endeavor: the development of computer systems. The beauty in each chapter comes from the discovery of unique solutions, a discovery springing from the authors' power to look beyond set boundaries, to recognize needs overlooked by others, and to find surprising solutions to troubling problems.

Many of the authors confronted limitations—in the physical environment, in the resources available, or in the very definition of their requirements—that made it hard even to imagine solutions. Others entered domains where solutions already existed, but brought in a new vision and a conviction that something much better could be achieved.

All the authors in this book have drawn lessons from their projects. But we can also draw some broader lessons after making the long and eventful journey through the whole book.

First, there are times when tried-and-true rules really do work. One often encounters difficulties when trying to maintain standards for robustness, readability, or other tenets of good software engineering. In such situations, it is not always necessary to abandon the principles that hold such promise. Sometimes, getting up and taking a walk around the problem can reveal a new facet that allows one to meet the requirements without sacrificing good technique.

On the other hand, some chapters confirm the old cliché that one must know the rules before one can break them. Some of the authors built up decades of experience before taking a different path toward solving one thorny problem—and this experience gave them the confidence to break the rules in a constructive way.

The lessons in this book also champion cross-disciplinary exploration. Many authors came into new domains and had to fight their way in relative darkness. In these situations, a particularly pure form of creativity and intelligence triumphed.

Finally, we learn from this book that beautiful solutions don't last for all time. New circumstances always require a new look. So, if you read the book and thought, "I can't use these authors' solutions on any of my own projects," don't worry—next time these authors have projects, they will use different solutions, too.

For about two months I worked intensively on this book by helping authors hone their themes and express their points. This immersion in the work of superbly talented inventors proved to be inspiring and even uplifting. It gave me the impulse to try new things, and I hope this book does the same for its readers.