
Specialized Browsers

T. V. Raman
http://emacspeak.sf.net/raman
raman@google.com
Google Research
Summary.

To most users of the Internet, the Web is epitomized by the Web browser. In its essence, however, the Web is both a lot more and a lot less than the browser: it rests on URLs as a universal means of identifying and addressing content, HTTP as a simple protocol for client/server communication, and HTML as a simple markup language for communicating hypertext. This decentralized architecture lets content producers and consumers come together without everyone using the same server and client, and it has always left room for specialized browsers alongside mainstream ones. This chapter traces the history of such specialized Web clients, outlines the implementation techniques used to build them, and highlights their role in accessibility, noting that all of us have special needs at one time or another. As the Web evolves from a purely presentational medium to a more data-oriented one, such specialized tools take center stage in providing optimal information access; the chapter concludes with a brief overview of where these technologies are headed and what this means for making Web content accessible to all users.

1.1 Introduction

To most users of the Internet, the Web is epitomized by the Web browser, the program on their machines that they use to “log on to the Web”. However, in its essence, the Web is both a lot more than, and a lot less than, the Web browser. The Web is built on:

URLs
A universal means for identifying and addressing content.
HTTP
A simple protocol for client/server communication.
HTML
A simple markup language for communicating hypertext content.

This decentralized architecture was designed from the outset to create an environment where content producers and consumers could come together without the need for everyone to use the same server and client. To participate in the Web revolution, one only needed to subscribe to the basic architecture of a Web of content delivered via HTTP and addressable via URLs.
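
To make these building blocks concrete, here is a minimal sketch: the URL names a resource, HTTP retrieves it, and HTML carries the content. The URL is illustrative, and the regular expression is a deliberately naive stand-in for a real HTML parser.

```typescript
// URL + HTTP + HTML in one function: name the resource, fetch it over
// HTTP, and pull one piece of information out of the HTML payload.
async function fetchTitle(url: string): Promise<string | null> {
  const response = await fetch(url);   // HTTP GET on the given URL
  const html = await response.text();  // the HTML representation
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1].trim() : null;
}

fetchTitle("https://example.com/").then((title) => console.log(title));
```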

Given this architecture, specialized browsers have always existed to a greater or lesser degree alongside mainstream Web browsers. These range from simple scripts for performing oft-repeated tasks, e.g., looking up the weather forecast for a given location, to specialized Web user agents that focus on providing an alternative view of Web content.

This chapter traces the history of such specialized Web clients and outlines various implementation techniques that have been used over the years. It highlights specialized browsers in the context of accessibility, especially for use by persons with special needs. However, notice that specialized browsers are not necessarily restricted to niche user communities —said differently, all of us have special needs at one time or another.

As we evolve from the purely presentational Web to a more data-oriented Web, such specialized tools take center stage in providing optimal information access to the end user. The chapter concludes with a brief overview of where such Web technologies are headed and what this means for the future of making Web content accessible to all users.

1.2 Overview

We start this section with a brief overview of the history of talking Web browsers —commonly referred to as self-voicing browsers. The goal is not to cover every specialized browser that was ever written; rather, this section attempts to broadly classify various solutions that have been built since 1994 as a function of the end-user experience they delivered.

1.2.1 Talking Browsers —1994 – 1998

The mainstreaming of the Web in 1994 coincided with the coming of age of GUI screenreaders. This meant that Web access issues for visually impaired users became intricately tangled up with the broader issue of providing good non-visual access to the GUI. At the time, generic platform-level accessibility APIs were non-existent, and screenreaders relied on constructing an off-screen model by watching low-level graphics calls. Thus, Web access presented an additional challenge to adaptive technologies of the time.

Specialized Web browsers that spoke Web content first emerged in early 1995. These were implemented as browser add-ons that relied on a visual browser to retrieve and display the content; the browser add-on accessed the retrieved HTML to produce spoken content. Notice that this was before the advent of standardized APIs such as the HTML Document Object Model (DOM). Despite this lack of standardized APIs, talking browsers of the time still had an advantage over available screenreader technologies; this was because specialized browsers were able to augment the user interface with additional commands that enabled the user to efficiently navigate the contents of a page. Contrast this with the screenreaders of the time that had to rely on the final visual presentation of the page —users of specialized talking browsers could navigate by paragraphs and sections, whereas screenreader users of the time were limited to line-oriented navigation.
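
As a rough illustration of this advantage, the sketch below (assuming a browser-style DOM) navigates a parsed document by structural units rather than visual lines; the element selection is illustrative.

```typescript
// Why access to HTML structure beats the visual rendering: a talking
// browser holding the parsed document can move by headings and
// paragraphs, instead of reading the screen line by line.
class StructuralNavigator {
  private nodes: Element[];
  private index = 0;

  constructor(doc: Document) {
    // Collect the structural units a listener navigates by.
    this.nodes = Array.from(doc.querySelectorAll("h1, h2, h3, p"));
  }

  next(): string | undefined {
    if (this.index < this.nodes.length - 1) this.index++;
    return this.speakable();
  }

  previous(): string | undefined {
    if (this.index > 0) this.index--;
    return this.speakable();
  }

  private speakable(): string | undefined {
    const node = this.nodes[this.index];
    // Prefix with the element type so the listener hears context.
    return node
      ? `${node.tagName.toLowerCase()}: ${node.textContent?.trim()}`
      : undefined;
  }
}
```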

Examples of such specialized Web browsers from 1995 include:

PWWebSpeak
This was implemented as an add-on to the Netscape browser in 1995, and the tool survived until the late 1990s. The browser was revolutionary for its time in providing direct spoken access to the Web document, rather than forcing the speech user to deal with a purely visual presentation.
Home Page Reader
IBM Home Page Reader was released a few months later. Built originally as a Netscape extension, it later evolved to become a plugin to Internet Explorer. Like PWWebSpeak before it, it relied on the mainstream Web browser (Netscape and later IE) to do the bulk of the work with respect to retrieving and displaying content. Home Page Reader hosted the Web browser —contrast this with PWWebSpeak which was hosted inside the browser. This reversal of roles enabled IBM Home Page Reader to provide a better end-user experience over time, since the program had greater flexibility with respect to adding or subtracting user interface elements from the browser’s chrome.
Emacs W3
This was one of the early Web browsers, and saw active development between 1993 and 1998. In conjunction with Emacspeak, this tool introduced many innovations, including the first implementation of Aural CSS (see Section 1.3.2).

1.2.2 Spoken Web Access —1998 – 2003

By the late 1990s, Windows screenreaders like JAWS For Windows (JFW) and Window-Eyes started looking at the HTML content in addition to using the visual presentation provided by the browser. Around the same time, platform-level accessibility APIs like Microsoft Active Accessibility (MSAA) enabled screenreaders to produce more reliable spoken output. Consequently, the combination of a screenreader and a mainstream browser began to provide the same level of end-user functionality that was seen earlier with specialized browsers like PWWebSpeak and IBM Home Page Reader. As an example, popular Windows screenreaders today implement browser support by placing Web pages in a virtual buffer that the user navigates using specialized commands to listen to the contents. Tools like IBM Home Page Reader therefore evolved into tools for content developers to check the usability of Web sites under spoken output.

1.2.3 Spoken Web Access —2003 – Present

Content feeds encoded as Really Simple Syndication (RSS) went mainstream in 2003. RSS feeds were the underpinnings of the blogging revolution. As a side-effect, content-rich Web sites started providing data feeds that could be viewed inside specialized tools such as feed aggregators. This marked the coming of the data-oriented Web, where content is designed to be more than just viewed in a Web browser. The coming of such data-oriented access mechanisms has had a significant impact on the role and effectiveness of specialized browsing tools.

In early 2000, Emacspeak acquired the ability to apply content transformations to Web pages before presenting them to the user. This meant that the content of a Web page could be rearranged and filtered to provide an optimal eyes-free experience. In combination with the availability of content feeds, this enabled the creation of a large number of task-oriented tools. All of these tools leveraged the basic HTML+CSS rendering capabilities of Emacs/W3. Each of these specialized tools exposed a task-oriented interface that prompted the user for relevant input, retrieved and transformed the Web content relevant to the user, and finally produced a speech-friendly presentation of the results.
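
A minimal sketch of such a pre-presentation transformation, again assuming a browser-style DOM; the selectors are hypothetical stand-ins for a site-specific filter.

```typescript
// Transform-before-presenting: strip visual clutter and surface the main
// content first, so the spoken rendering starts with what the listener
// actually wants to hear.
function transformForSpeech(doc: Document): Document {
  // Drop purely visual or repetitive page furniture.
  doc.querySelectorAll("script, style, nav, aside").forEach((n) => n.remove());
  // Move the main content region (if the page marks one) to the front.
  const main = doc.querySelector("main, #content");
  if (main) doc.body.prepend(main);
  return doc;
}
```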

The key difference with such task-oriented tools is that the user does not first launch a Web browser; for the most part, the user does not even think of the output from these tools as Web pages. The framework hosting these tools —Emacspeak —implemented the building blocks of basic Web Architecture, and the result was a set of mini-applications that the user could call up with a few keystrokes. Examples of such task-driven tools include:

Search
Prompt for a search query and speak the results.
Map Directions
Prompt for a start and end location and speak the directions.
Weather
Prompt for a location and speak the weather forecast.

Notice that the resulting collection of speech-enabled tools can each be thought of as an extremely specialized browsing tool; the framework for hosting such tools then becomes an alternative container on par with the traditional Web browser. We will return to the topic of specialized containers that host Web components in Section 1.4 where we cover the rapidly evolving space of Web gadgets.

1.2.4 Voice Browsers —2000 – 2007

In 1999, the W3C launched the Voice Browser activity, which led to the publication of VoiceXML 2.0 —an XML-based language for authoring dialog interaction. VoiceXML is designed for authoring interactive applications that use speech as the primary interaction modality; typical implementations consist of a specialized container that processes VoiceXML documents to carry out a spoken dialog with the user. Covering the design and use of VoiceXML is beyond the scope of this chapter —for details, see the VoiceXML 2.0 specification.

Later, XHTML+Voice enabled the integration of interactive spoken dialogs in Web pages. In this design, VoiceXML was used to author dialogs that were then attached as event handlers to visual Web controls. When hosted within a browser implementing DOM2 Events, this had the effect of turning visual user interface controls into multimodal dialogs. When a visual user interface control received focus, the VoiceXML dialog attached to that control produced appropriate spoken prompts, activated the specified speech recognition grammar, and returned the recognized result to the user interface control. This meant that users could fill in forms either via the keyboard or via spoken input —this technique was implemented in browsers like Opera 9.

The design of VoiceXML applications is significant from the perspective of specialized browsers and Web architecture. VoiceXML applications use URLs as the addressing mechanism for locating and retrieving resources via HTTP. Here, resources include both application data e.g., a train timetable, as well as application resources needed to carry out an effective spoken dialog with the user, e.g., spoken prompts and speech grammars. Thus, a VoiceXML application consists of the following:

Prompts
Spoken prompts, realized either as pre-recorded audio files or as text rendered via a text-to-speech engine.
Grammars
Speech Recognition Grammar Specification (SRGS) grammars for constraining the recognizer to the set of appropriate utterances.
VoiceXML
A sequence of VoiceXML dialogs consisting of form and field elements. A VoiceXML document can be viewed as a sequence of dialog elements that act as event handlers. Each VoiceXML dialog prompts the user, collects one or more values from the user and defines the appropriate event handler to fire based on the result of the recognition task. These event handlers are themselves other VoiceXML dialogs, thereby enabling VoiceXML to define a finite state machine that encapsulates dialog flow.
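
The finite-state-machine view of VoiceXML dialog flow can be sketched as follows; the interfaces and the `collect` stub are illustrative, standing in for real prompt playback and speech recognition.

```typescript
// A VoiceXML document behaves like a finite state machine: each dialog
// prompts the user, collects a value, and names the next dialog to run.
// This sketch models only that control flow.
interface Dialog {
  prompt: string;
  collect: () => Promise<string>;           // stand-in for the recognizer
  next: (result: string) => string | null;  // next dialog name, or null to end
}

async function runDialogs(dialogs: Record<string, Dialog>, start: string) {
  let current: string | null = start;
  while (current) {
    const dialog = dialogs[current];
    console.log(`[prompt] ${dialog.prompt}`);
    const result = await dialog.collect();
    current = dialog.next(result);          // event handlers are themselves dialogs
  }
}

// Example: a two-field form, phrased as two chained dialogs.
runDialogs(
  {
    from: { prompt: "Where from?", collect: async () => "Boston", next: () => "to" },
    to: { prompt: "Where to?", collect: async () => "Cambridge", next: () => null },
  },
  "from"
);
```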

VoiceXML applications can be viewed as attaching a purely spoken user interface to data available on the emerging service-oriented Web. Notice that the above design pattern of attaching spoken interaction to data on the service-oriented Web is still an ongoing evolutionary process. VoiceXML applications authored for today’s Web often end up needing to create a specialized back-end application from scratch —as opposed to merely attaching a spoken dialog interface to an existing data-oriented application. But this is a reflection of the fact that, until now, most applications have been authored for use via visual interaction.

As we move toward an increasingly diversified Web, characterized by users who demand ubiquitous access from a variety of devices ranging from desktop PCs to mobile devices, the Web is seeing a corresponding re-factoring of the programming technologies used to author, deploy and deliver end-user interaction. Over time, such re-factoring is beginning to lead to a data-oriented Web where open Web APIs based on URLs, and standardized feed structures based on Atom and RSS, increasingly enable programmatic access to useful services. As such access increases, specialized browsers that provide alternative access can focus on the details of user interaction in a given modality, without having to repeatedly program the modality-independent aspects of an application. Content creation guidelines and standards play an increasingly important role in this process of refactoring, as will be seen in the next section.

1.3 Access Guidelines

The extent to which Web content can be made perceivable to the widest possible audience is a function of the following:

Content
The nature of the content, and the extent to which the encoding of that content permits graceful degradation. As an example, a purely visual image is of little use to someone who cannot see (given the state of today’s automatic image recognition technologies). Notice that graceful degradation of content requires redundancy in the content encoding, and that such redundancy is an essential prerequisite when repurposing content to different modalities via specialized browsers.
User Agent
The software used to access content is primarily responsible for the quality of the user experience.
Adaptive Technology
Users’ needs and abilities vary, and where available user agents do not include the necessary augmentations needed by a specific group of users, this ability gap can often be bridged by using add-on adaptive technologies.

As can be seen from the above, the overall user experience —especially when considering users with special needs —is a function of the triple (Content, User Agent, Adaptive Technology). Other chapters in this book focus on access guidelines and adaptive technologies in far greater detail; this section focuses on the relevance of accessibility guidelines as viewed from the goal of designing specialized browsing applications.

1.3.1 Content Is King

The Web as we know it would not exist without content. For the Web to remain true to its original vision of a Web of content, where producers and consumers come together without explicit dependencies on a given set of technologies, content needs to be able to degrade gracefully, e.g., a Web site that has been created assuming color displays needs to remain usable when viewed on a monochrome display.

Returning to the topic of spoken access and specialized browsers, there is a deep relationship between access guidelines created to further the needs of graceful degradation and creating content that lends itself to being delivered via alternative modalities such as spoken output.

1.3.2 Separation Of Content From Style

Separating content from style on Web pages by using CSS is an example of good content practice that benefits accessibility in the broader sense.

More specifically, separation of style from content makes the resulting HTML better suited for delivery via alternative modalities such as spoken output. Work on CSS1 started in 1995 —CSS1 became a W3C Recommendation in late 1996. Aural CSS was first proposed in early 1996; it was later incorporated into CSS2 and retained as an appendix in CSS 2.1. Note that the next version of CSS, CSS3, is being created as a collection of modules —with one module (CSS3 Speech) focused on auditory output.

Aural CSS is a good example of talking browsers leveraging an underlying design principle —separation of content from presentation —and applying the benefits of such separation to an entirely different output modality.

Aural CSS was first implemented in Emacs/W3 in 1996; later, Opera implemented a subset of Aural CSS in Opera 9 in the context of speech-enabling the Opera browser using XHTML+Voice (X+V).

Aural CSS specifies a set of additional voice properties that can be used to annotate Web content. As with visual CSS properties, aural properties can originate from a number of sources: the content author, the end user, or the user agent itself.

Most uses of Aural CSS fall into the final category, i.e., specialized browsers use Aural CSS as a rule-based means of mapping visual style rules to appropriately designed aural styles, as sketched below.
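
A minimal sketch of such a rule-based mapping. The aural property names (voice-family, pitch, stress) come from Aural CSS; the particular visual-to-aural rules are invented for illustration.

```typescript
// Map structural/visual distinctions to Aural CSS voice properties, so a
// heading sounds different from emphasized text when spoken.
interface AuralStyle {
  "voice-family"?: string;
  pitch?: string;
  stress?: string;
}

// Hypothetical rule set: tag name -> aural rendering.
const visualToAural: Record<string, AuralStyle> = {
  h1: { "voice-family": "paul", pitch: "low", stress: "strong" },
  em: { pitch: "high" },
  code: { "voice-family": "monotone" },
};

function auralStyleFor(tagName: string): AuralStyle {
  return visualToAural[tagName.toLowerCase()] ?? {};
}

console.log(auralStyleFor("H1")); // { "voice-family": "paul", ... }
```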

1.3.3 Separation Of Content And Interaction

Today’s Web pages are no longer pure content —they come close to realizing the 30-year-old maxim “the document is the interface!”. User interfaces —and interactive documents as found on today’s Web —consist of content, style and interaction. Thus, today’s HTML pages consist of the following layers:

Content
Declarative HTML markup that represents document content.
Style
CSS style rules that are bound to the HTML via appropriate class attributes placed on the content.
Scripts
Event handlers implemented in the form of JavaScript functions that are invoked in response to user events.

Notice that as we add the next layer of complexity to Web documents, there is significant value in keeping the interaction layer well-separated from the content and style layers. Such separation is important for the broader needs of accessibility to the widest possible audience; but it is crucial with respect to creating Web applications that lend themselves to easy deployment across different end-user interaction scenarios.

In 1999, the W3C’s Forms WG set out to define the next generation of HTML forms, but in the process quickly discovered that form elements in HTML were not just about fill-out forms. Form elements collect user input, and are in fact the basic building blocks for creating user interaction within Web pages. With this realization, XForms evolved into a light-weight Web application framework with a well-defined Model View Controller (MVC) design. Thus, XForms consists of the following:

Model
An XML data model for encapsulating user input, along with validity and dependency constraints.
UI
A set of abstract user interface controls that capture the intent —rather than the presentation —underlying the user interface.
Binding
A generic binding mechanism for connecting the user interface layer to the underlying data model.

The above separation between content, presentation and interaction was introduced to ensure that Web applications created via XForms could be delivered to a multiplicity of end-user interaction contexts. As an example, a given XForms application can be hosted inside a Web page to provide visual interaction; the same XForms application can be processed by a different container to deliver a chat-like interface, where the user is progressively prompted for the requisite information using an instant messaging client.
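
As a sketch of this idea (with invented names rather than actual XForms syntax), a single abstract selection control can be rendered either visually or as a spoken prompt:

```typescript
// Intent-based UI: one abstract control, two renderings. The control
// captures *what* is being asked; each container chooses the concrete
// presentation.
interface SelectOne {
  label: string;
  choices: string[];
}

// A visual container might render the control as a radio group.
function renderVisual(control: SelectOne): string {
  const buttons = control.choices.map((c) => `( ) ${c}`).join("\n");
  return `${control.label}:\n${buttons}`;
}

// A spoken container renders the very same control as a prompt.
function renderSpoken(control: SelectOne): string {
  return `${control.label}. Say one of: ${control.choices.join(", ")}.`;
}

const travelClass: SelectOne = {
  label: "Travel class",
  choices: ["First", "Business", "Economy"],
};
console.log(renderVisual(travelClass));
console.log(renderSpoken(travelClass));
```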

As a case in point, see the work on FormsPlayer and multimodal applications, which describes how the various items of abstract metadata encapsulated by an XForms application, e.g., help and hint, can be leveraged to deliver a multimodal experience in which the relevant tips are spoken to the user.

1.4 Future Directions

Based on the trends seen so far, this section sketches future directions for light-weight Web applications and their impact on the area of specialized browsers and accessibility. Notice that most, if not all, evolution on the Web is incremental; this means that many of the solutions that will become commonplace in the future will typically trace their roots to early prototypes of today. With this in view, the discussion below is grounded in prototypes that have been built during the last few years. This is not to say that there will be no revolutionary changes; however, incremental improvements are far easier to predict —and in their aggregate often prove to be revolutionary in their impact.

1.4.1 Web Wizards And URL Templates

As described in Section 1.1, URLs play a central role in the architecture of the Web. As the Web evolved to include dynamic server-generated content, URLs became more than locators of static content —URLs came to include parameters that were processed on the server to generate customized content. The formalizing of the Common Gateway Interface (CGI) in 1994, and the advent of HTML forms for collecting user input, together led to the idea of RESTful URLs —see Representational State Transfer (REST). Such RESTful URLs naturally evolved into the forerunner of light-weight Web APIs; in fact, these still form the underpinnings of many of the data-oriented APIs deployed on the Web in 2007.

RESTful URLs led to the notion of URL templates and Web search wizards in Emacspeak around 1999. At the time, mainstream Web sites had become visually busy. As a result, useful services such as getting map directions were difficult to use in an eyes-free environment, where one needed to listen to the entire Web page. As an example, in 1998 one could get driving directions from Yahoo Maps for anywhere in the United States —a major step forward for the time, since before then one needed to use specialized mapping/atlas programs to obtain such information. The only drawback was that the input controls for providing start and end locations were buried deep inside a visually busy page. Worse, once one had located the input fields and filled in the requisite information, one suffered the obligatory World Wide Wait before receiving a heavy-weight HTML page with the directions swamped by a mass of additional content.

Fortunately, the underlying Web architecture based on RESTful URLs made building a specialized tool for this task relatively easy. The tool in question was implemented to:

Prompt
Collect the start and end location from the user,
Retrieve
Retrieve the content at the URL constructed by filling in the appropriate URL parameters,
Filter
Filter the resulting content to locate and speak the driving directions.

Eight years later and counting, the Emacspeak tool for accessing driving directions from Yahoo Maps still works. The only piece of the tool that has changed over the intervening period is the filter step, which needs to keep pace with changes to the layout of the HTML page containing the directions.
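
A sketch of the underlying (prompt, retrieve, filter) pattern; the endpoint, parameter names and page structure here are hypothetical, standing in for the actual Yahoo Maps URL and layout.

```typescript
// (prompt, retrieve, filter): collect parameters, fill a RESTful URL
// template, and keep only the fragment the listener cares about.
async function speakDirections(start: string, end: string): Promise<void> {
  // Retrieve: fill the URL template with the collected parameters.
  const url =
    "https://maps.example.com/directions?" +
    new URLSearchParams({ from: start, to: end }).toString();
  const html = await (await fetch(url)).text();

  // Filter: keep only the fragment containing the directions. This is the
  // one site-specific, fragile step that must track page-layout changes.
  const match = html.match(/<ol id="directions">([\s\S]*?)<\/ol>/);
  const directions = match ? match[1].replace(/<[^>]+>/g, " ") : "Not found";

  console.log(directions); // a real tool would hand this to a speech server
}
```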

The next step in this evolution was to convert the one-off tool above into a mini-application hosted in a framework. Notice that there is nothing very specific to map directions about the (prompt, retrieve, filter) sequence outlined above. Thus, within a few weeks of implementing the specialized talking map-directions tool, Emacspeak had evolved to contain a framework that allowed easy authoring of talking Web tools. All of these tools have the following in common:

Interaction
A common interaction model that consists of spoken prompts with auto-completion and automatic speaking of the relevant results.
Style
Aural CSS is used to consistently style all spoken output with changes in voice characteristic highlighting key portions of the result being spoken.
Code Isolation
Each specialized tool in the framework is specific to a given Web site’s idiosyncrasies. This means that at any given time, at least some of the available tools might be broken and need updating; however, such breakages are isolated to the particular tool.
Incremental Evolution
Tools can be added, removed or modified without affecting other tools.

1.4.2 Portals And Web Gadgets

In the late 1990s, the Web browser evolved into a universal client, thanks to the availability of a number of useful services on the Web. This movement started on corporate Intranets, where Web technologies proved far more cost-effective than traditional multi-tier client/server solutions. The trend extended itself to the global Internet as electronic commerce became prevalent on the Web. Thus, the Web browser became the user’s porthole onto the world of electronic information.

Portlets And Portal Servers

This evolution led naturally to the advent of Web portals —and consequently to the creation of portal servers. Web sites like Yahoo aggregated a number of useful services onto a single large Web page to provide a single point of access; on corporate Intranets, such portal sites were powered by portal servers that enabled the Web administrator to easily deploy new applications on the site, and let users configure their experience by choosing what appeared on their customized page.

The above process gave birth to a new form of specialized browser —the Web portlet. A portlet was a small Web application that was deployed on the server to carry out the following steps:

Back-end
Communicate with the back-end application —typically via HTTP —to retrieve, filter and format the requisite information.
Front-end
Render the formatted information as HTML for embedding within a larger Web page.
Configuration
Provide the user interface affordances that allow users to customize the final experience by configuring the look and feel of the portlet. Such configuration included adding, removing, expanding or collapsing the portlet.
Preferences
Manage user preferences across portlets hosted on a page.
Single sign-on
Delegate common tasks such as authentication to the portal container, so that users do not need to log in to each portlet application.

Portlets as described above can be viewed as specialized browsers optimized for a given task, e.g., working with an employee’s financial records on a corporate Intranet. Though hosted within a specialized application container on the server, such portlets are in fact no different from the specialized talking Web tools described in the previous section. In the case of looking up an employee’s financial records, the tasks that the user would typically need to perform:

Browse
Point the Web browser at the site for managing financial records,
Sign In
Sign in to the site with the appropriate credentials,
Query
Request the relevant information.

are performed on behalf of the user by the portlet. Thus, the portal server becomes an application container that provides a framework for portlet authors to create task-specific Web applications. The framework manages details such as single sign-on and creating a uniform look and feel with respect to customization. The resulting portal site provides a single point of entry for the user and obviates many repetitive tasks:

Single Sign-on
Users sign in once to access a number of related applications.
Defaults
Each application can be configured with a useful set of defaults for the current user.
Preferences
Users can manage their personal preferences with respect to look and feel across a set of applications.

In their heyday, portlets were not limited to desktop browsers with a visual interface. Using the underlying Web APIs, portlets were also created for deployment to mobile devices. Finally, a small number of portlets were created for hosting within a voice portal; such voice portlets emitted VoiceXML for aggregation into a larger VoiceXML application. Compared to their visual analog, VoiceXML portlets have not been very successful —primarily because integrating multiple spoken dialog applications into a coherent whole still remains an unsolved research problem.

Web Gadgets

Portal servers and portlets became all the rage in 2002. Mapping this concept onto the client led to Web gadgets —task-specific Web applications hosted within the browser. As with the task-specific Web application technologies described so far, gadgets also relied on the underlying Web architecture of URLs and HTTP to bring relevant data closer to the user. As a client-side technology, gadgets naturally adopted HTML and JavaScript as their implementation languages; early prototype examples include Opera Widgets for the Opera browser, among others.

Thus, client-side gadgets consisting of HTML, CSS and JavaScript were initially designed for placing within a Web page in a manner analogous to what was seen earlier with portlets. The next step in this evolution came with the realization that forcing the end user to launch a Web browser for every task was not always convenient —there are certain types of information, e.g., the current weather, that are better suited to being available on the user’s desktop. This led to Apple’s Dashboard Widgets for Mac OS X. Thus, the task-specific Web applications created thus far for aggregation into a Web page for viewing within a browser were finally freed from the shackles of having to live inside the browser —Web widgets could now materialize on the desktop.

Web gadgets are still evolving as this chapter is being written. In late 2005, Google introduced iGoogle modules for adding custom content to a user’s personalized Google page. Conceptually, these are similar to portlets, except that iGoogle modules can be authored by anyone on the Web and published to a directory of modules that helps users discover and add published modules to their personalized iGoogle page. In an interesting parallel to client-side Web widgets escaping the shackles of the browser to live on the desktop, iGoogle modules can now be hosted within Web pages outside Google; they can also be viewed as Google Gadgets and materialize on the Google Desktop.

Notice that we’ve now come full circle, with task-specific browsing technologies that started as a niche application becoming a mainstream feature. Notice further that though such widgets inhabit the user’s desktop outside the Web browser, they’re well-integrated with respect to Web architecture and use all of the Web’s basic building blocks to achieve their end. The impact on talking browsers of this progression from specialized Web applications to task-specific Web gadgets for the mainstream is profound: the very features needed by spoken Web access, such as presentation-independent data addressable via URLs and a clean separation of content, presentation and interaction, are all prerequisites to building a healthy environment for Web gadgets.

1.4.3 Web APIs And Mashups

RESTful Web APIs became common by late 2004. The simplicity afforded by parametrized URLs and the bottom-up nature of their development helped them overtake the much-vaunted Web Services. As a consequence, the number of useful Web services available via light-weight Web APIs reached critical mass, and Web 2.0 became a viable platform for building useful solutions.

Google Maps launched in early 2005, and provided the final link in the chain that led to Web mashups —light-weight Web applications that bring together data from different sources on the Web. Maps provide an ideal spatial canvas for visualizing information available on the Web. The availability of location-based information, e.g., available rentals in a given city or data about crime rates in different neighborhoods, when combined with Google Maps enabled the creation of map mashups that place location-oriented data on a map.

Notice that Web mashups represent a very special kind of task-oriented browsing; earlier, a user looking for apartments to rent would have had to perform the following discrete tasks:

Find
Browse to the relevant Web site to query for available apartments
Locate
For each available apartment, enter its address into the map to locate it.

The Web mashup plays the role of a specialized browser that performs these tasks on the user’s behalf to create the final result set. Web mashups like the one described here leverage the underlying Web architecture of URL-addressable data that is retrievable via HTTP.

The last 18 months have seen an explosion of useful Web mashups. Mashups have moved from being Web applications that brought together data from different sites to providing alternative views of available data. As an example, the Google Calendar API enables Web sites to embed a user’s Google Calendar within a Web page. In doing so, such mashups can customize the look and feel of the calendar; this leads naturally to mashups that provide alternative views of the Google Calendar. The ability to provide alternative views of the same data source is a key consequence of the separation of data from any given view, and was earlier identified as a key requirement for adaptive Web access. With Web APIs and mashups liberating Web developers and users from a one-size-fits-all Web, mashups are evolving into a flexible platform for delivering custom, task-oriented views of the data-oriented Web; a sketch of the pattern appears below.
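
A sketch of such a mashup-as-specialized-browser; both services and their response shapes are invented for illustration.

```typescript
// A mashup performs the user's discrete tasks itself: pull listings from
// one URL-addressable source, geocode each with another, and emit a
// combined result set ready for a map view.
interface Listing { address: string; rent: number; }
interface Point { lat: number; lng: number; }

async function rentalMap(city: string): Promise<Array<Listing & Point>> {
  const listings: Listing[] = await (
    await fetch(`https://rentals.example.com/api?city=${encodeURIComponent(city)}`)
  ).json();

  // Enrich each listing with coordinates from a second service.
  return Promise.all(
    listings.map(async (l) => {
      const point: Point = await (
        await fetch(`https://geo.example.com/geocode?q=${encodeURIComponent(l.address)}`)
      ).json();
      return { ...l, ...point };
    })
  );
}
```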

1.4.4 Putting It Together —Ubiquitous Access

The evolution of specialized Web tools, light-weight Web APIs and Web mashups has together led to the emergence of a component framework for the Web. This framework is characterized by:

Data Model
An emerging data model for representing, manipulating and communicating presentation-independent structured data. These manifest themselves as XML, JSON, or syndication feeds such as RSS and Atom.
UI
User interface controls authored as a mixture of declarative markup, style specifications and script-based event handlers to implement custom interaction.
Binding
A set of common technologies for binding user interface controls to underlying data. Such binding brings data to life by enabling users to manipulate and view structured data.

The separation of data, presentation and interaction that is manifest in this emerging architecture for Web components lends itself well toward making Web gadgets available on a variety of devices and interaction modalities. Thus, specialized browsing —a niche concept that was originally limited to special adaptive aids or software engineers building themselves efficient one-off solutions —has now evolved to take center stage. Specialized browsers that talk HTTP to retrieve information from the data-oriented Web and deliver a custom presentation that is optimized for the user’s special needs and abilities are now a mainstream technology. Such specialized gadgets range in complexity from the simple weather-lookup gadget to full-blown custom applications such as mobile-optimized email clients. All of these share the underlying Web fabric of HTTP and URLs, which means that specialized clients like GMail Mobile need only implement the user-facing aspects of a traditional mail client. This space is still evolving rapidly as this chapter is being written. Component technologies such as those described so far will likely evolve to become pervasive, i.e., a Web component, once created, is likely to be capable of manifesting itself in a variety of end-user environments ranging from the graphical desktop to the speech-enabled mobile phone.

Web Command Line

As the Web platform continues to evolve, we can expect today’s environment of Web mashups backed by RESTful APIs to further evolve to enable end-user composability of Web components. Functionally, the service-oriented Web is a collection of Web APIs that can be composed to create higher-level solutions. In this regard, APIs such as Google Maps are Web components analogous to UNIX shell tools such as ls and find —UNIX is exemplified by its command-line shell where small, task-specific tools are composed to create custom end-user shell scripts to automate common tasks. The next step in this evolution is likely to be the creation of a Web command-line that enables end-users to compose higher-level solutions from existing Web components.

In the UNIX shell, components were composed by piping the output of one component to the input of the next to create logical pipelines; once created, such user-defined pipelines could themselves be used as components. Equivalent concepts for the Web platform are still evolving as this chapter is being written. Below, we enumerate some of the design patterns that have emerged over the last few years, as an indicator of what is to come.

Data Feeds
Structured data feeds encoded as RSS, Atom or JSON are used to communicate between Web components.
XMLHttpRequest
XMLHttpRequest is used to make asynchronous requests for data within Web applications, making them more reactive.
Eventing
DOM eventing provides a standardized mechanism for reacting to user interaction events.
Greasemonkey
Content APIs like the DOM enable content transformation on the client —either via JavaScript as implemented by Greasemonkey, or via XSLT.
Composability
Web APIs enable composability. Composability can happen on the client e.g., AJAX APIs coming together in a mashup, or on the server as shown by solutions such as Yahoo Pipes.
Command-line
The address bar of the Web browser has for now turned into a poor man’s command-line while we evolve toward a truly programmable Web platform.
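
By analogy with UNIX pipes, composition over the design patterns above might look like the following sketch; the stages and feed shape are illustrative.

```typescript
// Pipeline-style composition for Web components: each stage consumes and
// produces plain data, so a user-defined pipeline is itself a reusable
// stage, just as a shell pipeline is itself a command.
type Stage<A, B> = (input: A) => Promise<B> | B;

function pipe<A, B, C>(f: Stage<A, B>, g: Stage<B, C>): Stage<A, C> {
  return async (input: A) => g(await f(input));
}

// Hypothetical stages: fetch a JSON feed, then keep matching item titles.
const fetchFeed: Stage<string, Array<{ title: string }>> = async (url) =>
  (await fetch(url)).json();

const grep = (term: string): Stage<Array<{ title: string }>, string[]> =>
  (items) => items.filter((i) => i.title.includes(term)).map((i) => i.title);

// Compose once, reuse anywhere a stage is expected.
const weatherTitles = pipe(fetchFeed, grep("weather"));
// weatherTitles("https://feeds.example.com/news.json").then(console.log);
```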

1.4.5 Web Access —A Personal View

The Web is an information platform, and the question “What is good Web access?” is better answered by rephrasing it as “How does one deliver effective information access?”.

My own work in this field started with AsTeR —a system for producing high-quality aural renderings. The primary insight underlying AsTeR was that electronic information is display-independent: to produce good aural renderings, one needs to start with the underlying information, as opposed to a specific visual presentation.

AsTeR introduced the notion of audio formatting and produced high-quality aural renderings from structured markup by applying rendering rules written in Audio Formatting Language (AFL). AsTeR included an interactive browser that allowed the listener to obtain multiple views of the content; re-ordering and filtering of content is an essential aspect of specialized browsing.

Extending these ideas from documents to user interfaces led to Emacspeak —a well-integrated computing environment that provides the auditory equivalent of the graphical desktop. Emacspeak extended the notion of audio formatting to interactive environments. In implementing rich auditory interaction for the Emacspeak audio desktop, it became clear that most of today’s user interfaces could be phrased in terms of a small number of abstract conversational gestures —see Table 1.1. The term conversational gestures was chosen intentionally —conversation implies speech; gestures implies pointing; the set of abstract conversational gestures identified by the work on Emacspeak is actually independent of both interaction modalities.

Exchanging Textual Information
Edit widgets, message widgets.
Answering Yes Or No
Toggles, check boxes.
Select Elements From Set
Radio groups, list boxes.
Traversing Complex Structures
Previous, next, parent, child; left, right, up, down; first, last, root, exit.

Table 1.1: Conversational gestures

Conversational gestures as enumerated in Table 1.1 enable the authoring of intent-based user interaction that can be mapped to different interaction modalities. This notion was further developed and implemented within XForms (XML-powered Web forms), where we defined user interface controls for each of the conversational gestures.

User interaction authored via such intent-based vocabularies lends itself well to delivery to different interaction modalities. Notice that a common set of abstract user interface controls gives enormous flexibility when determining how a particular piece of user interaction is delivered to the user. But to deliver such flexible interaction, one needs to have the freedom to experiment at the time the user interface is delivered.

An emerging pattern in this space is therefore to defer the final binding of the user interface until delivery time, when the user's device and modality are known. This leads naturally to the next step in this evolution: dynamic Web interaction delivered as a collection of declarative markup, prescriptive style sheets and imperative event handlers. Notice that this packaging of Web interaction once again reflects the oft-mentioned separation of content, presentation and interaction.

Speech-enabling Dynamic User Interfaces

User interfaces created using intent-based authoring, as embodied by technologies like XForms, enable flexible delivery and consequently make attaching spoken interaction tractable. However, there is a concomitant need to be able to speech-enable dynamic interaction delivered as a combination of declarative markup and imperative event handlers. Notice that the availability of declarative, intent-based representations for common interaction tasks does not eliminate the need for imperative script-based solutions; scripting will always remain a means to experiment with new interaction patterns. Thus, there is a need to identify the relevant pieces of information that must be added to the content layer to enable speech-enabling of dynamic Web interaction.

For Dynamic HTML (DHTML), such information consists of the following:

Role
A property that reflects the role played by a UI component. As an example, the role property might be used to indicate that an interactive element on a Web page is a menu.
State
Dynamic user interfaces are reactive —the state of user interface controls gets updated dynamically based on user interaction. Thus, a set of user interface controls that were originally disabled might become available to the user during the course of interaction. The dynamic state property can be used to encapsulate such changes in the state of user interface controls.
Monitors
In addition, dynamic visual interfaces rely on the eye’s ability to track changes in the presentation. To be able to effectively speech-enable such user interfaces, one needs to be able to establish an observer-observable relationship between the various interaction elements making up the user interface. The content layer of the application needs to enable the identification of such relationships and clearly mark up those regions of the interface that need to be presented to the user when updated.

The addition of these properties to the content layer of Web applications brings the interaction layer on par with the rest of the Web component framework with respect to empowering alternative modes of interaction. Note that these properties also form the underpinnings of the present work on access-enabling rich Internet applications (ARIA).
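
These three additions correspond directly to what, in today's vocabulary, ARIA provides: roles, states and live regions. A minimal sketch, assuming a browser DOM; the element contents are illustrative.

```typescript
// Role, state and monitor, expressed with standard ARIA attributes.
const menu = document.createElement("ul");
menu.setAttribute("role", "menu");            // Role: this list is a menu

const item = document.createElement("li");
item.setAttribute("role", "menuitem");
item.setAttribute("aria-disabled", "true");   // State: currently unavailable
menu.appendChild(item);

const status = document.createElement("div");
status.setAttribute("aria-live", "polite");   // Monitor: speak when updated
document.body.append(menu, status);

status.textContent = "3 results found";       // observers announce this change
```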

1.5 Summary

This section summarizes the key take-aways from this chapter.

Web Arch
The Web, based on HTTP and URLs, is bigger than any given type of browser. Browsers have evolved from being viewers for static HTML documents into a universal container for light-weight user interaction. Specialized browser technologies are a key component of the Web and are beginning to play a central role in enabling ubiquitous Web access.
Separation Of Concerns
Refactoring of Web applications to reflect the separation of content, presentation and interaction is an ongoing process that progressively enables flexible delivery of content. It is crucial for specialized browsers, and is central to the world of Web components.
Gadgets
Web gadgets capable of manifesting themselves in a variety of access contexts, ranging from the user’s personalized Web page to the traditional graphical desktop (outside the shackles of a Web browser) and mobile devices, are at the leading edge of today’s advances in Web interaction.
Web APIs
RESTful Web APIs have begun to deliver the original but unrealized promise of Web Services. With the arrival of mashups, we are finally beginning to see the emergence of a data-oriented Web.
Web Platform
The Web environment powered by content feeds and backed by data-oriented Web APIs and dynamic client-side interaction has been tagged with the Web 2.0 moniker. But more significant than the version number is the emergence of the Web as a viable platform for delivering ubiquitous information access backed by flexible user interaction.
Web Command-line
As the Web platform evolves further, we can expect many of the technologies underlying specialized browsers to morph into a Web command-line that allows end users to compose flexible custom solutions from the various building blocks provided by the service-oriented Web.