Conversational Gestures For Direct Manipulation On The
Audio Desktop
T. V. Raman
Advanced Technology Group
Adobe Systems
E-mail: <raman@adobe.com>
WWW: http://cs.cornell.edu/home/raman
Abstract
We describe the speech-enabling approach to
building auditory interfaces that treat speech as
a first-class modality. The process of designing
effective auditory interfaces is decomposed into
identifying the atomic actions that make up the
user interaction and the conversational gestures
that enable these actions. The auditory interface
is then synthesized by mapping these
conversational gestures to appropriate
primitives in the auditory environment.
We illustrate this process with a concrete
example by developing an auditory interface to
the visually intensive task of playing Tetris.
Playing Tetris is a fun activity[1] that has many of
the same demands as day-to-day activities on
the electronic desktop. Speech-enabling Tetris
thus not only provides a fun way to exercise
one's geometric reasoning abilities --it also
provides useful lessons in speech-enabling
commonplace computing tasks.

[1] This paper was seriously delayed because the author
was too busy playing the game.
1 Introduction
The phrase desktop no longer conjures up the
image of a polished high-quality wooden
surface. The pervasiveness of computing in the
workplace during the last decade has led to the
concept of a virtual electronic desktop --a
logical workspace made up of the documents
one works with and the applications used to
operate on these documents. Progressive
innovations in the Graphical User Interface
(GUI) have helped strengthen this metaphor
--today, the typical desktop enables the user to
organize the tools of his trade by dragging and
dropping graphical icons into a visual
two-dimensional workspace represented on the
computer monitor. Given this tight association
between visual interaction and today's
electronic desktop, the phrase audio desktop is
likely to raise a few eyebrows (or should it be
ear lobes?).
This paper focuses on specific aspects of
auditory interaction with a view to enabling an
audio desktop. The audio desktop is defined in
detail in Chapter 3 of [Ram97a]. Using the
speech-enabling approach first introduced in
[Ram96a, Ram96b, Ram97b], we demonstrate
how the functionality of the electronic desktop
can be exposed through an auditory interface.
The attempt is not to speak the visual desktop;
rather, we identify the key user-level
functionality enabled by the modern electronic
desktop and briefly describe how this can be
translated to an auditory environment.
This paper specifically illustrates the auditory
analogue to gestures available on the visual
desktop by describing an auditory interface to
the popular game of Tetris. For a detailed
overview of a full implementation of an
auditory desktop, see Chapter 4 of [Ram97a].
Though speech-enabling a game like Tetris
might seem a somewhat light-hearted (and
perhaps even pointless) activity, there are
important lessons to be learned from
speech-enabling such a visually intensive task.
Many of the demands placed on the user by a
game like Tetris are closely paralleled by the
functional abilities demanded by today's
computer interfaces. Speech-enabling Tetris
thus provides a fun activity on the surface while
exposing deeper research ideas that have
wide-ranging applicability in the general design
of auditory interfaces to tomorrow's
information systems.
In visual interaction, the user actively browses
different portions of a relatively static
two-dimensional display to locate and
manipulate objects of interest. Such
manipulations are aided by hand-eye
coordination in visually intensive tasks like
playing Tetris. Contrast this with auditory
displays that are characterized by the temporal
nature of aural interaction; here, the display --a
one-dimensional stream of auditory output--
scrolls continuously past a passive listener. This
disparity between aural and visual interaction
influences the organizational paradigms that are
effective in auditory interaction. The purpose of
this paper is to systematically investigate the
design of an effective audio interaction to a
visually intensive task like playing Tetris. The
steps in evolving such an interface can be
enumerated as:
1. Identify the user functionality enabled by
the visual interface;

2. Exploit features of auditory displays to
enable equivalent functionality; and

3. Evolve navigational and organizational
paradigms for aural interaction that
compensate for the temporal,
one-dimensional nature of audio by
exploiting other features of aural
interaction.
2 Conversational Gestures
A user interface is a means of enabling
man-machine communication. This
man-machine dialogue takes place by means of
a set of simple conversational gestures designed
to overcome the impedance mismatch in the
abilities of man and machine. These gestures
are realized in today's Graphical User
Interfaces (GUIs) by user interface widgets
such as list boxes and scroll bars.
Conversational Gesture          GUI Realization

Natural language                Edit widgets, message widgets
Answering yes or no             Toggles, check boxes
Select from set                 Radio groups, list boxes
Select from range               Sliders, scroll bars
Traversing complex structures   Previous, next, parent, child;
                                left, right, up, down;
                                first, last, root, exit

Figure 1: Conversational gestures --the building
blocks for dialogues. User interface design
tries to bridge the impedance mismatch in
man-machine communication by inventing a basic
set of conversational gestures that can be
effectively generated and interpreted by both
man and machine.
We first enumerate the basic conversational
gestures that constitute today's user interfaces
in Figure 1. Separating a conversational gesture,
e.g., selecting an element from a list, from the
modality-specific realization of that gesture --a
list box in the case of the GUI-- is the first step
in evolving speech-centric man-machine
dialogues.
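To make this separation concrete, the sketch
below (a minimal Python illustration of our own;
the realization strings and the realize() helper
are assumptions, not part of any toolkit) keeps
each conversational gesture abstract and binds it
to a modality-specific realization only at
display time:

    # A minimal sketch of separating conversational gestures from
    # their modality-specific realizations. Gesture names follow
    # Figure 1; everything else is invented for illustration.
    GESTURE_REALIZATIONS = {
        "answer yes or no":  {"gui": "check box",
                              "audio": "spoken yes/no prompt"},
        "select from set":   {"gui": "list box",
                              "audio": "speak the choices, pick one by name"},
        "select from range": {"gui": "slider",
                              "audio": "speak current value, adjust in steps"},
    }

    def realize(gesture: str, modality: str) -> str:
        """Look up the modality-specific realization of a gesture."""
        return GESTURE_REALIZATIONS[gesture][modality]

    # The same dialogue step, realized once per modality:
    print(realize("select from set", "gui"))    # list box
    print(realize("select from set", "audio"))  # speak the choices, ...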
3 The Game Of Tetris
This section briefly describes the game of Tetris
and enumerates the conversational gestures
involved in playing the game. These gestures
are introduced with respect to the familiar
visual interface; later sections translate these to
appropriate gestures in an auditory interface.
The game involves forming rows by arranging
interlocking shapes. When complete, these rows
disappear from the board. Tetris shapes are the
seven possible arrangements of four square tiles
--see Figure 2. The shapes drop from the top of
the screen, and the user has to move and rotate
the shape before dropping it to fit in with those
at the bottom of the playing area. In this paper,
we consider an instance of the game where the
playing area is ten columns wide and twenty
rows high.
Playing Tetris involves geometric reasoning to
decide where best to fit the current tile. In the
visual interface, the user can use the
two-dimensional nature of the visual display
backed up by hand-eye coordination to line up
the current shape with the available openings on
the bottom row. The conversational gestures
involved are:
Indicate Current Shape The current shape is
indicated to the user by dropping it from
the top of the playing area.
Choose Location The user examines the
available openings on the bottom row to
mentally construct a set of available
positions that the shape can be placed in.
Choose Orientation The user selects a valid
orientation for the shape and its chosen location.
Fit Shape The user fits this shape at the chosen
location and is given the next shape.
Update State If fitting this shape completed a
row, the user is cued appropriately. The
playing area is redrawn to indicate the
available openings for the next shape.
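These five gestures compose into a single
interaction cycle. The following sketch is
hypothetical throughout: the lowest-column
placement policy and the bare-bones state model
are ours for illustration, not the game's actual
implementation.

    # Hypothetical rendering of the five gestures above as one cycle.
    import random

    WIDTH, HEIGHT = 10, 20  # the playing area considered in this paper
    SHAPES = ["Box", "Right L", "Left L", "Z", "S", "T", "Edge"]

    def play_one_cycle(heights):
        """heights[c] is the current stack height of column c."""
        shape = random.choice(SHAPES)
        print("current shape:", shape)         # Indicate Current Shape
        column = heights.index(min(heights))   # Choose Location (lowest column)
        rotation = 0                           # Choose Orientation (fixed here)
        heights[column] += 1                   # Fit Shape (crudely modeled)
        print("placed at column", column, "rotation", rotation)
        if max(heights) >= HEIGHT:             # Update State
            print("stack has reached the top")

    heights = [0] * WIDTH
    play_one_cycle(heights)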
[Figure 2 shows each shape drawn as a grid of
numbered tiles; the verbal descriptions below
convey the same information.]

Box: 2 × 2 square
Right L: 2 × 3 matrix with (2, 1), (2, 2) empty
Left L: 2 × 3 matrix with (2, 2), (2, 3) empty
Z: 2 × 3 matrix with (1, 3), (2, 1) empty
S: 2 × 3 matrix with (1, 1), (2, 3) empty
T: 2 × 3 matrix with (2, 1), (2, 3) empty
Edge: 1 × 4 matrix

Figure 2: The seven shapes of Tetris.
4 Direct Manipulation In A
Visual Environment
Playing Tetris exercises one of the basic
functionalities of the electronic desktop,
namely, the user's ability to directly manipulate
objects in the interface. Here, the user expresses
actions by selecting and moving the shape with
a pointing device or appropriate keyboard
events. The continuous visual feedback loop
that is a direct consequence of hand-eye
coordination enables the user to line up the
shape with the available openings on the bottom
of the playing area. Using different colors for
the various shapes helps the user quickly
identify each new shape as it arrives. The eye's
ability to quickly scan different portions of the
two-dimensional display helps the user identify
possible locations for the current shape.
5 Direct Manipulation In An
Auditory Environment
The temporal one-dimensional nature of aural
interaction can be a significant hurdle in
attempting a visually intensive task such as
playing Tetris. We compensate for these
shortcomings by enabling appropriate gestures
in the auditory interface that permit the user to
obtain the necessary information and express
appropriate actions in a timely and effective
manner. See Table 1 for a full list of
commands provided in the auditory interface.
The design of the auditory interface to Tetris is
predicated on the following:
1. The geometric reasoning required for
fitting interlocking shapes can be carried
out mentally once the user has been given
sufficient information.

2. It is possible to translate actions predicated
on visual geometric reasoning, such as
"move a little to the left", to functionally
precise actions such as "move two steps
left", given sufficient information.

5.1 Conveying The Shapes
The seven shapes of Tetris are shown in
Figure 2. Each shape is
given a mnemonic name based on its visual
appearance. These mnemonics are used when
announcing the current shape in the auditory
interface. Thus the user hears spoken utterances
of the form
Left Elbow at rotation 0 next is Right
Elbow.
We use the digits 1-7 as functional colors in
place of physical colors such as red and blue.
This helps the listener recall what shapes were
fitted when examining the state of the game.
Each shape in Figure 2
is accompanied by a detailed verbal
description to make the information readily
accessible when reading this paper in
alternative formats.
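For concreteness, the mnemonics, functional
colors, and matrix descriptions above can be
captured in a small data structure. The sketch
below is our own illustration: the occupancy
matrices transcribe the verbal descriptions of
Figure 2 (the 1-based (row, column) cells listed
as empty there are 0 here), and the announce()
helper is an assumption, not the actual
implementation.

    # Shapes keyed by mnemonic name: (functional color, occupancy matrix).
    SHAPES = {
        "Box":     (1, [[1, 1],
                        [1, 1]]),
        "Right L": (2, [[1, 1, 1],
                        [0, 0, 1]]),
        "Left L":  (3, [[1, 1, 1],
                        [1, 0, 0]]),
        "Z":       (4, [[1, 1, 0],
                        [0, 1, 1]]),
        "S":       (5, [[0, 1, 1],
                        [1, 1, 0]]),
        "T":       (6, [[1, 1, 1],
                        [0, 1, 0]]),
        "Edge":    (7, [[1, 1, 1, 1]]),
    }

    def announce(current, rotation, upcoming):
        """Build the spoken utterance announcing current and next shape."""
        return "%s at rotation %d next is %s" % (current, rotation, upcoming)

    print(announce("Left L", 0, "Right L"))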
5.2 Expressing Actions
The auditory interface enhances the available
gestures by providing keyboard commands for
moving the current shape to a given absolute
position. This is the single most important
enhancement that the auditory interaction needs
over visual interaction. Unlike in visual
interaction, a user of the auditory interface does
not have the continuous visual feedback loop
that allows the current shape to be lined up with
the available openings. In the case of the
auditory interface, the listener needs to mentally
track these openings in order to play the game
fluently. Having to then line up the current
shape using only relative translations becomes
an undue mental burden. The ability to position
the shape with absolute coordinates, e.g., "move
to column 3", compensates for this shortcoming
and allows the user to play the game effectively.
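A minimal sketch of this absolute-motion
command (our illustration; the board model and
function name are assumptions) shows why one
keystroke suffices where relative motion needs
an uncertain number of presses:

    WIDTH = 10  # columns in the playing area

    def move_to_column(target_col, shape_width):
        """Return the shape's new left-edge column after an
        absolute-motion command, clamped to the playing area."""
        return max(0, min(target_col, WIDTH - shape_width))

    # Pressing "3" places the shape at column 3 in one gesture,
    # regardless of where the shape currently is.
    print(move_to_column(target_col=3, shape_width=2))  # -> 3
    print(move_to_column(target_col=9, shape_width=4))  # -> 6 (clamped)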
5.3 Providing Feedback
Prompt and immediate feedback is
essential for effective interaction. The
auditory interface indicates the dropping of
each shape to the bottom with an auditory icon.
As the user drops each shape, the system
produces a distinctive click; when the piece
drops to form a complete row, this click is
replaced by a short chime. Use of such auditory
icons (each cue is about 0.5 seconds long) is
extremely effective in designing fluent aural
interaction.
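The feedback rule just described reduces to a
simple test. In the sketch below (ours; actual
audio playback is stubbed out with print, and
the cue names are assumptions), a completed row
turns the ordinary click into a chime:

    def completed_rows(grid):
        """Rows with no empty (0) cells are complete."""
        return [i for i, row in enumerate(grid) if all(row)]

    def drop_feedback(grid):
        """Play the auditory icon for a drop (each cue ~0.5 seconds)."""
        if completed_rows(grid):
            print("play cue: chime")  # the drop completed a row
        else:
            print("play cue: click")  # an ordinary drop

    drop_feedback([[0, 1, 1], [1, 0, 1]])  # -> play cue: click
    drop_feedback([[0, 1, 1], [1, 1, 1]])  # -> play cue: chime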
5.4 Communicating The State Of The
Game
The two-dimensional display allows a user of
the visual interface to implicitly query different
aspects of the state of the game such as
What does the top row look like?
What does the bottom row look like?
How high is the stack of shapes?
How well am I currently doing?
with simple eye movements. In the auditory
interface, we make these actions explicit by
providing keyboard actions that speak the
response to these queries --see Table 1 for a
complete list.
Key     Action

Relative Motion
h       Move left
l       Move right
j       Rotate counter-clockwise
k       Rotate clockwise

Absolute Motion
a       Move to left edge
e       Move to right edge
Digit   Move to column <Digit>

Examine State
b       Bottom row
t       Top row
c       Current row
m       Current row
r       Row number
.       Current shape
,       Next shape
RET     Score

Table 1: Complete list of commands in the auditory interface to Tetris.
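As a sketch of how the Examine State queries of
Table 1 might be computed (our illustration: the
grid model, speak() stub, and helper names are
assumptions, not the actual implementation):

    def speak(text):
        print(text)  # stand-in for speech output

    def describe_row(row):
        """Speak the occupied columns of one row."""
        cols = [str(c) for c, cell in enumerate(row) if cell]
        return "occupied columns " + " ".join(cols) if cols else "empty"

    def top_occupied_row(grid):
        """The highest row that contains any tile."""
        for row in grid:  # rows are stored top first
            if any(row):
                return row
        return grid[0]

    # A 10 x 20 grid, top row first; nonzero cells carry the
    # functional color of the shape fitted there.
    grid = [[0] * 10 for _ in range(18)]
    grid += [[0, 4, 4, 0, 0, 0, 0, 0, 0, 0],
             [1, 4, 4, 1, 0, 0, 0, 7, 7, 7]]

    queries = {  # the Examine State bindings of Table 1
        "b": lambda: speak("bottom row " + describe_row(grid[-1])),
        "t": lambda: speak("top row " + describe_row(top_occupied_row(grid))),
    }
    queries["b"]()  # bottom row occupied columns 0 1 2 3 7 8 9
    queries["t"]()  # top row occupied columns 1 2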
6 Conclusion
Systematically enumerating the atomic
actions that make up a visually intensive
activity like playing Tetris, and mapping these
to a basic set of conversational gestures, is the
first step in speech-enabling this game. Mapping
basic conversational gestures to appropriate
events in an auditory interface leads to a
speech-enabled version of Tetris. The process
of evolving this interface has important lessons
for designing auditory interfaces to day-to-day
tasks on the electronic desktop. Primary among
these are:
1. Enable the user to express intent precisely.

2. Provide sufficient feedback to enable the
user to maintain a mental model that is
synchronized with the state of the
computing system.

3. Use auditory cues (see
[RK92, SMG90, BGP93]) and audio
formatted output (see
[Ram94, RG94, Gib96, Hay96]) to
increase the bandwidth of aural
communication.
References

[BGP93] Meera M. Blattner, Ephraim P. Glinert,
and Albert L. Papp. Sonic enhancements for
2-D graphic displays, and auditory displays.
To be published by Addison-Wesley in the
Santa Fe Institute Series. IEEE, 1993.

[Gib96] W. Wayt Gibbs. Envisioning speech.
Scientific American, September 1996.

[Hay96] Brian Hayes. Speaking of mathematics.
American Scientist, 84(2), March-April 1996.

[Ram94] T. V. Raman. Audio System for
Technical Readings. PhD thesis, Cornell
University, May 1994. URL
http://cs.cornell.edu/home/raman.

[Ram96a] T. V. Raman. Emacspeak --direct
speech access. In Proceedings of the Second
Annual ACM Conference on Assistive
Technologies (ASSETS '96), April 1996.

[Ram96b] T. V. Raman. Emacspeak --a speech
interface. In Proceedings of CHI '96, April
1996.

[Ram97a] T. V. Raman. Auditory User Interfaces
--Toward the Speaking Computer. Kluwer
Academic Publishers, August 1997.

[Ram97b] T. V. Raman. Net surfing without a
monitor. Scientific American, March 1997.

[RG94] T. V. Raman and David Gries. Interactive
audio documents. In Proceedings of the First
Annual ACM/SIGCAPH Conference on
Assistive Technologies, November 1994.

[RK92] T. V. Raman and M. S. Krishnamoorthy.
Congrats: A system for converting graphics to
sound. In Proceedings of the IEEE Johns
Hopkins National Search for Computing
Applications to Assist Persons with
Disabilities, pages 170-172, February 1992.

[SMG90] D. A. Sumikawa, M. M. Blattner, and
R. M. Greenberg. Earcons and icons: Their
structure and common design principles. In
Visual Programming Environments, 1990.