Conversational Gestures For Direct Manipulation On The
Audio Desktop
T. V. Raman
Advanced Technology Group
Adobe Systems
E-mail: <raman@adobe.com>
WWW: http://cs.cornell.edu/home/raman
Abstract
We describe the speech-enabling approach to
building auditory interfaces that treat speech as
a first-class modality. The process of designing
effective auditory interfaces is decomposed into
identifying the atomic actions that make up the
user interaction and the conversational gestures
that enable these actions. The auditory interface
is then synthesized by mapping these
conversational gestures to appropriate
primitives in the auditory environment.
We illustrate this process with a concrete
example by developing an auditory interface to
the visually intensive task of playing Tetris.
Playing Tetris is a fun activity[1] that has many of
the same demands as day-to-day activities on
the electronic desktop. Speech-enabling Tetris
thus not only provides a fun way to exercise
one's geometric reasoning abilities --it also
provides useful lessons in speech-enabling
commonplace computing tasks.

[1] This paper was seriously delayed because the author
was too busy playing the game.
1 Introduction
The phrase desktop no longer conjures up the
image of a polished high-quality wooden
surface. The pervasiveness of computing in the
workplace during the last decade has led to the
concept of a virtual electronic desktop --a
logical workspace made up of the documents
one works with and the applications used to
operate on these documents. Progressive
innovations in the Graphical User Interface
(GUI) have helped strengthen this metaphor
--today, the typical desktop enables the user to
organize the tools of his trade by dragging and
dropping graphical icons into a visual
two-dimensional workspace represented on the
computer monitor. Given this tight association
between visual interaction and today's
electronic desktop, the phrase audio desktop is
likely to raise a few eyebrows (or should it be
ear lobes?).
This paper focuses on specific aspects of
auditory interaction with a view to enabling an
audio desktop. The audio desktop is defined in
detail in Chapter 3 of [Ram97a]. Using the
speech-enabling approach first introduced in
[Ram96a, Ram96b, Ram97b], we demonstrate
how the functionality of the electronic desktop
can be exposed through an auditory interface.
The attempt is not to speak the visual desktop;
rather, we identify the key user-level
functionality enabled by the modern electronic
desktop and briefly describe how this can be
translated to an auditory environment.
This paper specifically illustrates the auditory
analogue to gestures available on the visual
desktop by describing an auditory interface to
the popular game of Tetris. For a detailed
overview of a full implementation of an
auditory desktop, see Chapter 4 of [Ram97a].
Though speech-enabling a game like Tetris
might seem a somewhat light-hearted (and
perhaps even pointless) activity, there are
important lessons to be learned from
speech-enabling such a visually intensive task.
Many of the demands placed on the user by a
game like Tetris are closely paralleled by the
functional abilities demanded by today's
computer interfaces. Speech-enabling Tetris
thus provides a fun activity on the surface while
exposing deeper research ideas that have
wide-ranging applicability in the general design
of auditory interfaces to tomorrow's
information systems.
In visual interaction, the user actively browses
different portions of a relatively static
two-dimensional display to locate and
manipulate objects of interest. Such
manipulations are aided by hand-eye
coordination in visually intensive tasks like
playing Tetris. Contrast this with auditory
displays that are characterized by the temporal
nature of aural interaction; here, the display --a
one-dimensional stream of auditory output--
scrolls continuously past a passive listener. This
disparity between aural and visual interaction
influences the organizational paradigms that are
effective in auditory interaction. The purpose of
this paper is to systematically investigate the
design of an effective audio interaction to a
visually intensive task like playing Tetris. The
steps in evolving such an interface can be
enumerated as:
1. Identify the user functionality enabled by
the visual interface;

2. Exploit features of auditory displays to
enable equivalent functionality; and

3. Evolve navigational and organizational
paradigms for aural interaction that
compensate for the temporal,
one-dimensional nature of audio by
exploiting other features of aural
interaction.
2 Conversational Gestures
A user interface is a means of enabling
man-machine communication. This
man-machine dialogue takes place by means of
a set of simple conversational gestures designed
to overcome the impedance mismatch in the
abilities of man and machine. These gestures
are realized in today's Graphical User
Interfaces (GUIs) by user interface widgets
such as list boxes and scroll bars.
Conversational Gesture          GUI Realization

Natural language                Edit widgets, message widgets
Answering yes or no             Toggles, check boxes
Select from set                 Radio groups, list boxes
Select from range               Sliders, scroll bars
Traversing complex structures   Previous, next, parent, child;
                                left, right, up, down;
                                first, last, root, exit

Figure 1: Conversational gestures --the building
blocks for dialogues. User interface design
tries to bridge the impedance mismatch in
man-machine communication by inventing a basic
set of conversational gestures that can be
effectively generated and interpreted by both
man and machine.
We first enumerate the basic conversational
gestures that constitute today's user interfaces
in Figure 1. Separating a conversational gesture,
e.g., selecting an element from a list, from the
modality-specific realization of that gesture --a
list box in the case of the GUI-- is the first step
in evolving speech-centric man-machine
dialogues.
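To make this separation concrete, the sketch
below (a minimal Python illustration of our own;
the realization strings and the realize() helper
are assumptions, not part of any toolkit) keeps
each conversational gesture abstract and binds it
to a modality-specific realization only at
display time:

    # A minimal sketch of separating conversational gestures from
    # their modality-specific realizations. Gesture names follow
    # Figure 1; everything else is invented for illustration.
    GESTURE_REALIZATIONS = {
        "answer yes or no":  {"gui": "check box",
                              "audio": "spoken yes/no prompt"},
        "select from set":   {"gui": "list box",
                              "audio": "speak the choices, pick one by name"},
        "select from range": {"gui": "slider",
                              "audio": "speak current value, adjust in steps"},
    }

    def realize(gesture: str, modality: str) -> str:
        """Look up the modality-specific realization of a gesture."""
        return GESTURE_REALIZATIONS[gesture][modality]

    # The same dialogue step, realized once per modality:
    print(realize("select from set", "gui"))    # list box
    print(realize("select from set", "audio"))  # speak the choices, ...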
3 The Game Of Tetris
This section briefly describes the game of Tetris
and enumerates the conversational gestures
involved in playing the game. These gestures
are introduced with respect to the familiar
visual interface; later sections translate these to
appropriate gestures in an auditory interface.
The game involves forming rows by arranging
interlocking shapes. When complete, these rows
disappear from the board. Tetris shapes are the
seven possible arrangements of four square tiles
--see Figure 2. The shapes drop from the top of
the screen, and the user has to move and rotate
the shape before dropping it to fit in with those
at the bottom of the playing area. In this paper,
we consider an instance of the game where the
playing area is ten columns wide and twenty
rows high.
Playing Tetris involves geometric reasoning to
decide where best to fit the current tile. In the
visual interface, the user can use the
two-dimensional nature of the visual display
backed up by hand-eye coordination to line up
the current shape with the available openings on
the bottom row. The conversational gestures
involved are:
Indicate Current Shape The current shape is
indicated to the user by dropping it from
the top of the playing area.
Choose Location The user examines the
available openings on the bottom row to
mentally construct a set of available
positions that the shape can be placed in.
Choose Orientation The user selects a valid
orientation for the shape and its chosen location.
Fit Shape The user fits this shape at the chosen
location and is given the next shape.
Update State If fitting this shape completed a
row, the user is cued appropriately. The
playing area is redrawn to indicate the
available openings for the next shape.
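These five gestures compose into a single
interaction cycle. The following sketch is
hypothetical throughout: the lowest-column
placement policy and the bare-bones state model
are ours for illustration, not the game's actual
implementation.

    # Hypothetical rendering of the five gestures above as one cycle.
    import random

    WIDTH, HEIGHT = 10, 20  # the playing area considered in this paper
    SHAPES = ["Box", "Right L", "Left L", "Z", "S", "T", "Edge"]

    def play_one_cycle(heights):
        """heights[c] is the current stack height of column c."""
        shape = random.choice(SHAPES)
        print("current shape:", shape)         # Indicate Current Shape
        column = heights.index(min(heights))   # Choose Location (lowest column)
        rotation = 0                           # Choose Orientation (fixed here)
        heights[column] += 1                   # Fit Shape (crudely modeled)
        print("placed at column", column, "rotation", rotation)
        if max(heights) >= HEIGHT:             # Update State
            print("stack has reached the top")

    heights = [0] * WIDTH
    play_one_cycle(heights)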
[Figure 2 shows each shape drawn as a grid of
numbered tiles; the verbal descriptions below
convey the same information.]

Box: 2 × 2 square
Right L: 2 × 3 matrix with (2, 1), (2, 2) empty
Left L: 2 × 3 matrix with (2, 2), (2, 3) empty
Z: 2 × 3 matrix with (1, 3), (2, 1) empty
S: 2 × 3 matrix with (1, 1), (2, 3) empty
T: 2 × 3 matrix with (2, 1), (2, 3) empty
Edge: 1 × 4 matrix

Figure 2: The seven shapes of Tetris.
4 Direct Manipulation In A
Visual Environment
Playing Tetris exercises one of the basic
functionalities of the electronic desktop,
namely, the user's ability to directly manipulate
objects in the interface. Here, the user expresses
actions by selecting and moving the shape with
a pointing device or appropriate keyboard
events. The continuous visual feedback loop
that is a direct consequence of hand-eye
coordination enables the user to line up the
shape with the available openings on the bottom
of the playing area. Using different colors for
the various shapes helps the user quickly
identify each new shape as it arrives. The eye's
ability to quickly scan different portions of the
two-dimensional display helps the user identify
possible locations for the current shape.
5 Direct Manipulation In An
Auditory Environment
The temporal one-dimensional nature of aural
interaction can be a significant hurdle in
attempting a visually intensive task such as
playing Tetris. We compensate for these
shortcomings by enabling appropriate gestures
in the auditory interface that permit the user to
obtain the necessary information and express
appropriate actions in a timely and effective
manner. See Table 1 for a full list of
commands provided in the auditory interface.
The design of the auditory interface to Tetris is
predicated on the following:
1. The geometric reasoning required for
fitting interlocking shapes can be carried
out mentally once the user has been given
sufficient information.

2. It is possible to translate actions predicated
on visual geometric reasoning, such as
"move a little to the left", to functionally
precise actions such as "move two steps
left", given sufficient information.

5.1 Conveying The Shapes
The seven shapes of Tetris are shown in
Figure 2. Each shape is
given a mnemonic name based on its visual
appearance. These mnemonics are used when
announcing the current shape in the auditory
interface. Thus the user hears spoken utterances
of the form
Left Elbow at rotation 0 next is Right
Elbow.
We use the digits 1-7 as functional colors in
place of physical colors such as red and blue.
This helps the listener recall what shapes were
fitted when examining the state of the game.
Each shape in Figure 2
is accompanied by a detailed verbal
description to make the information readily
accessible when reading this paper in
alternative formats.
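For concreteness, the mnemonics, functional
colors, and matrix descriptions above can be
captured in a small data structure. The sketch
below is our own illustration: the occupancy
matrices transcribe the verbal descriptions of
Figure 2 (the 1-based (row, column) cells listed
as empty there are 0 here), and the announce()
helper is an assumption, not the actual
implementation.

    # Shapes keyed by mnemonic name: (functional color, occupancy matrix).
    SHAPES = {
        "Box":     (1, [[1, 1],
                        [1, 1]]),
        "Right L": (2, [[1, 1, 1],
                        [0, 0, 1]]),
        "Left L":  (3, [[1, 1, 1],
                        [1, 0, 0]]),
        "Z":       (4, [[1, 1, 0],
                        [0, 1, 1]]),
        "S":       (5, [[0, 1, 1],
                        [1, 1, 0]]),
        "T":       (6, [[1, 1, 1],
                        [0, 1, 0]]),
        "Edge":    (7, [[1, 1, 1, 1]]),
    }

    def announce(current, rotation, upcoming):
        """Build the spoken utterance announcing current and next shape."""
        return "%s at rotation %d next is %s" % (current, rotation, upcoming)

    print(announce("Left L", 0, "Right L"))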
5.2 Expressing Actions
The auditory interface enhances the available
gestures by providing keyboard commands for
moving the current shape to a given absolute
position. This is the single most important
enhancement that the auditory interaction needs
over visual interaction. Unlike in visual
interaction, a user of the auditory interface does
not have the continuous visual feedback loop
that allows the current shape to be lined up with
the available openings. In the case of the
auditory interface, the listener needs to mentally
track these openings in order to play the game
fluently. Having to then line up the current
shape using only relative translations becomes
an undue mental burden. The ability to position
the shape with absolute coordinates, e.g., "move
to column 3", compensates for this shortcoming
and allows the user to play the game effectively.
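A minimal sketch of this absolute-motion
command (our illustration; the board model and
function name are assumptions) shows why one
keystroke suffices where relative motion needs
an uncertain number of presses:

    WIDTH = 10  # columns in the playing area

    def move_to_column(target_col, shape_width):
        """Return the shape's new left-edge column after an
        absolute-motion command, clamped to the playing area."""
        return max(0, min(target_col, WIDTH - shape_width))

    # Pressing "3" places the shape at column 3 in one gesture,
    # regardless of where the shape currently is.
    print(move_to_column(target_col=3, shape_width=2))  # -> 3
    print(move_to_column(target_col=9, shape_width=4))  # -> 6 (clamped)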
5.3 Providing Feedback
Prompt and immediate feedback is
essential for effective interaction. The
auditory interface indicates the dropping of
each shape to the bottom with an auditory icon.
As the user drops each shape, the system
produces a distinctive click; when the piece
drops to form a complete row, this click is
replaced by a short chime. Use of such auditory
icons (each cue is about 0.5 seconds long) is
extremely effective in designing fluent aural
interaction.
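The feedback rule just described reduces to a
simple test. In the sketch below (ours; actual
audio playback is stubbed out with print, and
the cue names are assumptions), a completed row
turns the ordinary click into a chime:

    def completed_rows(grid):
        """Rows with no empty (0) cells are complete."""
        return [i for i, row in enumerate(grid) if all(row)]

    def drop_feedback(grid):
        """Play the auditory icon for a drop (each cue ~0.5 seconds)."""
        if completed_rows(grid):
            print("play cue: chime")  # the drop completed a row
        else:
            print("play cue: click")  # an ordinary drop

    drop_feedback([[0, 1, 1], [1, 0, 1]])  # -> play cue: click
    drop_feedback([[0, 1, 1], [1, 1, 1]])  # -> play cue: chime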
5.4 Communicating The State Of The
Game
The two-dimensional display allows a user of
the visual interface to implicitly query different
aspects of the state of the game such as
What does the top row look like?
What does the bottom row look like?
How high is the stack of shapes?
How well am I currently doing?
with simple eye movements. In the auditory
interface, we make these actions explicit by
providing keyboard actions that speak the
response to these queries --see Table 1 for a
complete list.
Key     Action

Relative Motion
h       Move left
l       Move right
j       Rotate counter-clockwise
k       Rotate clockwise

Absolute Motion
a       Move to left edge
e       Move to right edge
Digit   Move to column <Digit>

Examine State
b       Bottom row
t       Top row
c       Current row
m       Current row
r       Row number
.       Current shape
,       Next shape
RET     Score

Table 1: Complete list of commands in the auditory interface to Tetris.
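As a sketch of how the Examine State queries of
Table 1 might be computed (our illustration: the
grid model, speak() stub, and helper names are
assumptions, not the actual implementation):

    def speak(text):
        print(text)  # stand-in for speech output

    def describe_row(row):
        """Speak the occupied columns of one row."""
        cols = [str(c) for c, cell in enumerate(row) if cell]
        return "occupied columns " + " ".join(cols) if cols else "empty"

    def top_occupied_row(grid):
        """The highest row that contains any tile."""
        for row in grid:  # rows are stored top first
            if any(row):
                return row
        return grid[0]

    # A 10 x 20 grid, top row first; nonzero cells carry the
    # functional color of the shape fitted there.
    grid = [[0] * 10 for _ in range(18)]
    grid += [[0, 4, 4, 0, 0, 0, 0, 0, 0, 0],
             [1, 4, 4, 1, 0, 0, 0, 7, 7, 7]]

    queries = {  # the Examine State bindings of Table 1
        "b": lambda: speak("bottom row " + describe_row(grid[-1])),
        "t": lambda: speak("top row " + describe_row(top_occupied_row(grid))),
    }
    queries["b"]()  # bottom row occupied columns 0 1 2 3 7 8 9
    queries["t"]()  # top row occupied columns 1 2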
6 Conclusion
Systematically enumerating the atomic
actions that make up a visually intensive
activity like playing Tetris, and mapping these
to a basic set of conversational gestures, is the
first step in speech-enabling this game. Mapping
basic conversational gestures to appropriate
events in an auditory interface leads to a
speech-enabled version of Tetris. The process
of evolving this interface has important lessons
for designing auditory interfaces to day-to-day
tasks on the electronic desktop. Primary among
these are:
1. Enable the user to express intent precisely.

2. Provide sufficient feedback to enable the
user to maintain a mental model that is
synchronized with the state of the
computing system.

3. Use auditory cues (see
[RK92, SMG90, BGP93]) and audio
formatted output (see
[Ram94, RG94, Gib96, Hay96]) to
increase the bandwidth of aural
communication.
References

[BGP93] Meera M. Blattner, Ephraim P. Glinert,
and Albert L. Papp. Sonic enhancements for
2-D graphic displays, and auditory displays.
To be published by Addison-Wesley in the
Santa Fe Institute Series. IEEE, 1993.

[Gib96] W. Wayt Gibbs. Envisioning speech.
Scientific American, September 1996.

[Hay96] Brian Hayes. Speaking of mathematics.
American Scientist, 84(2), March-April 1996.

[Ram94] T. V. Raman. Audio System for
Technical Readings. PhD thesis, Cornell
University, May 1994. URL
http://cs.cornell.edu/home/raman.

[Ram96a] T. V. Raman. Emacspeak --direct
speech access. In Proceedings of the Second
Annual ACM Conference on Assistive
Technologies (ASSETS '96), April 1996.

[Ram96b] T. V. Raman. Emacspeak --a speech
interface. In Proceedings of CHI '96, April
1996.

[Ram97a] T. V. Raman. Auditory User Interfaces
--Toward the Speaking Computer. Kluwer
Academic Publishers, August 1997.

[Ram97b] T. V. Raman. Net surfing without a
monitor. Scientific American, March 1997.

[RG94] T. V. Raman and David Gries. Interactive
audio documents. In Proceedings of the First
Annual ACM/SIGCAPH Conference on
Assistive Technologies, November 1994.

[RK92] T. V. Raman and M. S. Krishnamoorthy.
Congrats: A system for converting graphics to
sound. In Proceedings of the IEEE Johns
Hopkins National Search for Computing
Applications to Assist Persons with
Disabilities, pages 170-172, February 1992.

[SMG90] D. A. Sumikawa, M. M. Blattner, and
R. M. Greenberg. Earcons and icons: Their
structure and common design principles. In
Visual Programming Environments, 1990.