Summary.
To most users of the Internet, the Web is epitomized by the Web browser, the program on their machines that they use to “logon to the Web”. However, in its essence, the Web is both a lot more than, and a lot less than, the Web browser. The Web is built on:

- URLs for addressing content on the Web,
- HTTP as the protocol for retrieving that content, and
- HTML as the markup language for encoding it.
This decentralized architecture was designed from the outset to create an environment where content producers and consumers could come together without the need for everyone to use the same server and client. To participate in the Web revolution, one only needed to subscribe to the basic architecture of a Web of content delivered via HTTP and addressable via URLs.
Given this architecture, specialized browsers have always existed to a greater or lesser degree alongside mainstream Web browsers. These range from simple scripts for performing oft-repeated tasks, e.g., looking up the weather forecast for a given location, to specialized Web user agents that focus on providing an alternative view of Web content.
This chapter traces the history of such specialized Web clients and outlines various implementation techniques that have been used over the years. It highlights specialized browsers in the context of accessibility, especially for use by persons with special needs. However, notice that specialized browsers are not necessarily restricted to niche user communities; said differently, all of us have special needs at one time or another.
As we evolve from the purely presentational Web to a more data-oriented Web, such specialized tools become center-stage with respect to providing optimal information access to the end-user. The chapter concludes with a brief overview of where such Web technologies are headed and what this means to the future of making Web content accessible to all users.
We start this section with a brief overview of the history of talking Web browsers, commonly referred to as self-voicing browsers. The goal is not to cover every specialized browser that was ever written; rather, this section attempts to broadly classify various solutions that have been built since 1994 as a function of the end-user experience they delivered.
The mainstreaming of the Web in 1994 coincided with the coming of age of GUI screenreaders. This meant that Web access issues for visually impaired users became intricately tangled up with the broader issue of providing good non-visual access to the GUI. At the time, generic platform-level accessibility APIs were non-existent, and screenreaders relied on constructing an off-screen model by watching low-level graphics calls. Thus, Web access presented an additional challenge to adaptive technologies of the time.
Specialized Web browsers that spoke Web content first emerged in early 1995. These were implemented as browser add-ons that relied on a visual browser to retrieve and display the content; the browser add-on accessed the retrieved HTML to produce spoken content. Notice that this was before the advent of standardized APIs such as the HTML Document Object Model (DOM). Despite this lack of standardized APIs, talking browsers of the time still had an advantage over available screenreader technologies; this was because specialized browsers were able to augment the user interface with additional commands that enabled the user to efficiently navigate the contents of a page. Contrast this with the screenreaders of the time that had to rely on the final visual presentation of the page —users of specialized talking browsers could navigate by paragraphs and sections, whereas screenreader users of the time were limited to line-oriented navigation.
Examples of such specialized Web browsers from 1995 include:

- pwWebSpeak from The Productivity Works, and
- IBM Home Page Reader.

These browsers produced contextual spoken renderings of interactive controls; a yes/no radio group, for instance, might be spoken as: “Press this to change ‘Do you accept’ from yes to no.”
By the late 1990s, Windows screenreaders like JAWS for Windows (JFW) and Window-Eyes started looking at the HTML content in addition to using the visual presentation provided by the browser. Around the same time, platform-level accessibility APIs like Microsoft Active Accessibility (MSAA) enabled screenreaders to produce more reliable spoken output. Consequently, the combination of a screenreader and a mainstream browser began to provide the same level of end-user functionality seen earlier with specialized browsers like pwWebSpeak and IBM Home Page Reader. As an example, popular Windows screenreaders today implement browser support by placing Web pages in a virtual buffer that the user navigates with specialized commands to listen to the contents. Tools like IBM Home Page Reader therefore evolved into tools that content developers use to check the usability of Web sites with spoken output.
Content feeds encoded as Really Simple Syndication (RSS) went mainstream in 2003. RSS feeds were the underpinnings of the blogging revolution. As a side-effect, content-rich Web sites started providing data feeds that could be viewed inside specialized tools such as feed aggregators. This marked the coming of the data-oriented Web, where content is designed to be more than just viewed in a Web browser. The coming of such data-oriented access mechanisms has had a significant impact on the role and effectiveness of specialized browsing tools.
In early 2000, Emacspeak acquired the ability to apply content transformations to Web pages before presenting them to the user. This meant that the content of a Web page could be rearranged and filtered to provide an optimal eyes-free experience. In combination with the availability of content feeds, this enabled the creation of a large number of task-oriented tools. All of these tools leveraged the basic HTML+CSS rendering capabilities of Emacs/W3. Each of these specialized tools exposed a task-oriented interface that prompted the user for relevant input, retrieved and transformed the relevant Web content, and finally produced a speech-friendly presentation of the results.
The key difference with such task-oriented tools is that the user does not first launch a Web browser; for the most part, the user does not even think of the output from these tools as Web pages. The framework hosting these tools, Emacspeak, implemented the building blocks of basic Web architecture, and the result was a set of mini-applications that the user could call up with a few keystrokes. Examples of such task-driven tools include:

- looking up the weather forecast for a given city,
- retrieving driving directions from Yahoo Maps, a tool described in detail later in this chapter, and
- running a Web search and speaking just the results.
Notice that the resulting collection of speech-enabled tools can each be thought of as an extremely specialized browsing tool; the framework for hosting such tools then becomes an alternative container on par with the traditional Web browser. We will return to the topic of specialized containers that host Web components in Section 1.4 where we cover the rapidly evolving space of Web gadgets.
In 1999, the W3C launched the Voice Browser activity, which led to the publication of VoiceXML 2.0, an XML-based language for authoring dialog interaction. VoiceXML is designed for authoring interactive applications that use speech as the primary interaction modality; typical implementations consist of a specialized container that processes VoiceXML documents to carry out a spoken dialog with the user. Covering the design and use of VoiceXML is beyond the scope of this chapter; for details, see VoiceXML.
Later, XHTML+Voice enabled the integration of interactive spoken dialogs in Web pages. In this design, VoiceXML was used to author dialogs that were then attached as event handlers to visual Web controls. When hosted within a browser implementing DOM2 Events, this had the effect of turning visual user interface controls into multimodal dialogs. When a visual user interface control received focus, the VoiceXML dialog attached to that control produced appropriate spoken prompts, activated the specified speech recognition grammar, and returned the recognized result to the user interface control. This meant that users could fill in forms either via the keyboard or via spoken input; this technique was implemented in browsers like Opera 9.
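The fragment below is a hedged, schematic sketch of this design: the namespaces follow the XHTML+Voice profile, but the grammar URL and event wiring are illustrative, and the machinery that returns the recognized result to the control is elided.

```xml
<!-- Schematic X+V fragment: a VoiceXML dialog attached to a visual
     control via XML Events; the grammar URL is illustrative. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <vxml:form id="sayCity">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="cities.grxml" type="application/srgs+xml"/>
        <!-- the recognized value is returned to the focused control;
             the synchronization details are elided here -->
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- the dialog runs when the field gains focus -->
    <input type="text" id="city" ev:event="focus" ev:handler="#sayCity"/>
  </body>
</html>
```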
The design of VoiceXML applications is significant from the perspective of specialized browsers and Web architecture. VoiceXML applications use URLs as the addressing mechanism for locating and retrieving resources via HTTP. Here, resources include both application data, e.g., a train timetable, and application resources needed to carry out an effective spoken dialog with the user, e.g., spoken prompts and speech grammars. Thus, a VoiceXML application consists of the following (sketched below):

- VoiceXML documents that define the dialog flow,
- spoken prompts, whether recorded audio or text to be synthesized,
- speech grammars that define what the user can say at each point in the dialog, and
- application data, addressed via URLs and retrieved via HTTP.
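As a minimal illustration of these pieces, consider the VoiceXML 2.0 sketch below; the grammar file and submission URL are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="timetable">
    <field name="station">
      <!-- spoken prompt -->
      <prompt>Which station are you leaving from?</prompt>
      <!-- speech grammar defining what the caller may say -->
      <grammar src="stations.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <!-- application data lives behind a URL and is retrieved via HTTP -->
      <submit next="http://example.com/timetable" namelist="station"/>
    </block>
  </form>
</vxml>
```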
VoiceXML applications can be viewed as attaching a purely spoken user interface to data available on the emerging service-oriented Web. Notice that this design pattern of attaching spoken interaction to data on the service-oriented Web is still evolving. VoiceXML applications authored for today’s Web often end up needing to create a specialized back-end application from scratch, as opposed to merely attaching a spoken dialog interface to an existing data-oriented application. But this is a reflection of the fact that, until now, most applications have been authored for use via visual interaction. As we move toward an increasingly diversified Web, characterized by users who demand ubiquitous access from a variety of devices ranging from desktop PCs to mobile phones, the Web is seeing a corresponding re-factoring of the programming technologies used to author, deploy and deliver end-user interaction. Over time, such re-factoring is leading to a data-oriented Web where open Web APIs based on URLs, and standardized feed formats based on ATOM and RSS, increasingly enable programmatic access to useful services. As such access increases, specialized browsers that provide alternative access can focus on the details of user interaction in a given modality, without having to repeatedly program the modality-independent aspects of an application. Content creation guidelines and standards play an increasingly important role in this process of re-factoring, as will be seen in the next section.
The extent to which Web content can be made perceivable to the widest possible audience is a function of the following:

- the content (C) being published,
- the user agent (UA) used to access that content, and
- the adaptive technology (AT), if any, that mediates between the user and the user agent.
As can be seen from the above, the overall user experience, especially when considering users with special needs, is a function of the triple (C, UA, AT). Other chapters in this book focus on access guidelines and adaptive technologies in far greater detail; this section focuses on the relevance of accessibility guidelines as viewed from the goal of designing specialized browsing applications.
The Web as we know it would not exist without content. For the Web to remain true to its original vision of a Web of content where producers and consumers come together without explicit dependencies on a given set of technologies, content needs to degrade gracefully; e.g., a Web site that has been created assuming color displays needs to remain usable when viewed on a monochrome display.
Returning to the topic of spoken access and specialized browsers, there is a deep relationship between access guidelines created to further graceful degradation and the creation of content that lends itself to delivery via alternative modalities such as spoken output.
Separating content from style on Web pages by using CSS is an example of good content practice that benefits accessibility in the broader sense:

- content remains usable even when the author’s style sheet cannot be applied,
- users can override the author’s presentation with their own style sheets, and
- the same content can be re-styled for different devices and output media.
More specifically, separation of style from content makes the resulting HTML better suited for delivery via alternative modalities such as spoken output. Work on CSS1 started in 1995, and CSS1 became a W3C Recommendation in December 1996. Aural CSS was first drafted in February 1996 and was later incorporated into CSS2; in CSS 2.1, the aural properties were moved to an informative appendix. Note that the next version of CSS, CSS3, is being created as a collection of modules, with one module (CSS Speech) focused on auditory output.
Aural CSS is a good example of talking browsers leveraging an underlying design principle, the separation of content from presentation, and applying the benefits of such separation to an entirely different output modality.
Aural CSS was first implemented in Emacs/W3 in 1996; later, Opera implemented a subset of Aural CSS in Opera 9 in the context of speech-enabling the Opera browser using XHTML+Voice (X+V).
Aural CSS specifies a set of additional voice properties that can be used to annotate Web content. As with visual CSS properties, aural properties can originate from a number of sources:

- the content author, via author style sheets,
- the user, via user style sheets, and
- the user agent, e.g., a specialized browser that supplies default aural styles.
Most uses of Aural CSS fall into the final category above, i.e., specialized browsers use Aural CSS as a rule-based means of mapping visual style rules to appropriately designed aural styles.
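For example, a specialized browser might ship a rule-based aural style sheet along the following lines; the selectors and property values here are illustrative, using the aural properties defined in CSS2.

```css
/* Illustrative aural style sheet built from CSS2 aural properties */
@media aural {
  h1, h2, h3  { voice-family: paul; pitch: high; pause-after: 500ms }
  em          { richness: 90 }                /* emphasis maps to a richer voice */
  a:link      { cue-before: url("link.wav") } /* auditory icon announces links   */
  .decorative { speak: none }                 /* filter purely visual content    */
}
```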
Today’s Web pages are no longer pure content; they come close to realizing the 30-year-old maxim “the document is the interface!”. User interfaces, and the interactive documents found on today’s Web, consist of content, style and interaction. Thus, today’s HTML pages consist of the following layers:

- a content layer, encoded in HTML,
- a style layer, encoded in CSS, and
- an interaction layer, implemented via scripting and DOM events.
Notice that as we add this next layer of complexity to Web documents, there is significant value in keeping the interaction layer well separated from the content and style layers. Such separation is important for the broader goal of accessibility to the widest possible audience; it is crucial for creating Web applications that lend themselves to easy deployment across different end-user interaction scenarios.
In 1999, the W3C’s Forms WG set out to define the next generation of HTML forms, but in the process quickly discovered that form elements in HTML were not just about fill-out forms. Form elements collect user input, and are in fact the basic building blocks for creating user interaction within Web pages. With this realization, XForms evolved into a light-weight Web application framework with a well-defined Model View Controller (MVC) design. Thus, XForms consists of the following:

- the XForms model, which holds the instance data along with its constraints and calculations (the Model),
- abstract user interface controls that bind to the model (the View), and
- declarative handlers based on XML Events that wire up user interaction (the Controller).
The above separation between content, presentation and interaction was introduced to ensure that Web applications created via XForms could be delivered to a multiplicity of end-user interaction contexts. As an example, a given XForms application can be hosted inside a Web page to provide visual interaction; the same XForms application can be processed by a different container to deliver a chat-like interface, where the user is progressively prompted for the requisite information using an instant messaging client.
As a case in point, see FormsPlayer and multimodal applications, which describes how the various items of abstract metadata encapsulated by an XForms application, e.g., help and hint, can be leveraged to deliver a multimodal experience where the relevant tips are spoken to the user.
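A hedged sketch of this separation is shown below; the instance data, field names and submission URL are hypothetical, but the division into model, abstract controls and declarative metadata such as label and hint follows the XForms design.

```xml
<!-- Model: data and submission, kept separate from presentation.
     Assumes xmlns:xf="http://www.w3.org/2002/xforms" is in scope. -->
<xf:model>
  <xf:instance>
    <trip xmlns=""><from/><to/></trip>
  </xf:instance>
  <xf:submission id="go" method="get"
                 action="http://example.com/directions"/>
</xf:model>

<!-- View: abstract, intent-based controls; any modality can render
     the label and speak or display the hint -->
<xf:input ref="from">
  <xf:label>Starting address</xf:label>
  <xf:hint>Street, city, or ZIP code</xf:hint>
</xf:input>
<xf:submit submission="go">
  <xf:label>Get directions</xf:label>
</xf:submit>
```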
Based on the trends seen so far, this section sketches future directions for light-weight Web applications and their impact on specialized browsers and accessibility. Notice that most, if not all, evolution on the Web is incremental; this means that many of the solutions that become commonplace in the future will typically trace their past to early prototypes of today. With this in view, this section sketches some future directions based on prototypes built during the last few years. This is not to say that there will be no revolutionary changes; however, incremental improvements are far easier to predict, and in their aggregate they often prove revolutionary in their impact.
As described in Section 1.1, URLs play a central role in the architecture of the Web. As the Web evolved to include dynamic server-generated content, URLs became more than locators of static content: URLs came to include parameters that were processed on the server to generate customized content. The formalizing of the Common Gateway Interface (CGI) in 1994, and the advent of HTML forms for collecting user input, together led to the idea of RESTful URLs; see Representational State Transfer (REST). Such RESTful URLs naturally evolved into the forerunner of light-weight Web APIs; in fact, these still form the underpinnings of many of the data-oriented APIs deployed on the Web in 2007.
RESTful URLs led to the notion of URL templates and Websearch wizards in Emacspeak around 1999. At the time, mainstream Web sites had become visually busy. As a result, useful services such as getting map directions were difficult to use in an eyes-free environment where one needed to listen to the entire Web page. As an example, in 1998 one could get driving directions from Yahoo Maps for anywhere in the United States: a major step forward for the time, since before then one needed specialized mapping/atlas programs to obtain such information. The only drawback was that the input controls for providing start and end locations were buried deep inside a visually busy page. Worse, once one had located the input fields and filled in the requisite information, one suffered the obligatory World Wide Wait before receiving a heavy-weight HTML page with the directions swamped by a mass of additional content.
Fortunately, the underlying Web architecture based on RESTful URLs made building a specialized tool for this task relatively easy. The tool in question was implemented to:

- prompt the user for the start and end locations,
- construct the RESTful URL requesting directions between those locations,
- retrieve that URL via HTTP, and
- filter the returned HTML so that only the driving directions are spoken.

A sketch of this pattern follows the next paragraph.
Eight years later and counting, the Emacspeak tool for accessing driving directions from Yahoo Maps still works. The only piece of this tool that has changed over the intervening period is the filter step, which needs to keep pace with changes to the layout of the HTML page containing the directions.
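The JavaScript sketch below shows the shape of such a tool; the template URL and the page-specific selector are hypothetical stand-ins for the actual Yahoo Maps details.

```javascript
// Sketch of the (prompt, retrieve, filter) pattern behind URL templates;
// the template and the "#directions" selector are hypothetical.
const TEMPLATE = "https://maps.example.com/directions?from={from}&to={to}";

async function drivingDirections(from, to) {
  // Fill the RESTful URL template with user-supplied values.
  const url = TEMPLATE
    .replace("{from}", encodeURIComponent(from))
    .replace("{to}", encodeURIComponent(to));
  // Retrieve the page over HTTP.
  const html = await (await fetch(url)).text();
  // Filter: keep only the fragment carrying the directions; this is
  // the one piece that must track changes to the site's layout.
  const doc = new DOMParser().parseFromString(html, "text/html");
  const hit = doc.querySelector("#directions");
  return hit ? hit.textContent.trim() : html;
}
```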
The next step in this evolution was to convert the one-off tool above into a mini-application hosted in a framework. Notice that there is nothing specific to map directions in the (prompt, retrieve, filter) sequence outlined above. Thus, within a few weeks of implementing the specialized talking map-directions tool, Emacspeak had evolved to contain a framework that allowed easy authoring of talking Web tools. All of these tools have the following in common (see the sketch below):

- they prompt the user for task-relevant input,
- they construct a URL from a template and retrieve it via HTTP,
- they filter and transform the retrieved content, and
- they render the result as a speech-friendly presentation.
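A minimal sketch of such a hosting framework follows; the defineTool helper and its registry are hypothetical.

```javascript
// Hypothetical registry for task-oriented Web tools: each tool is a
// (prompt, retrieve, filter, render) quadruple run as a pipeline.
const tools = new Map();

function defineTool(name, { prompt, retrieve, filter, render }) {
  tools.set(name, async () => render(filter(await retrieve(prompt()))));
}

// Invoke a registered tool by name, e.g., from a keystroke binding.
async function runTool(name) {
  return tools.get(name)();
}
```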
In the late 1990s, the Web browser evolved into a universal client, thanks to the availability of a number of useful services on the Web. This movement started on corporate Intranets, where Web technologies proved far more cost-effective than traditional multi-tier client/server solutions. The trend extended itself to the global Internet as electronic commerce became prevalent on the Web. Thus, the Web browser became the user’s porthole onto the world of electronic information.
This evolution led naturally to the advent of Web portals —and consequently to the creation of portal servers. Web sites like Yahoo aggregated a number of useful services on to a single large Web page to provide a single point of access; on corporate Intranets, such portal sites were powered by portal servers that enabled the Web administrator to easily deploy new applications on the site and have users configure their user experience by determining what they saw on their customized page.
The above process gave birth to a new form of specialized browser: the Web portlet. A portlet was a small Web application deployed on the server to carry out the following steps:

- retrieve task-specific data from a back-end service,
- generate a presentation of that data as an HTML fragment, and
- hand the fragment to the portal server for aggregation into the user’s page.
Portlets as described above can be viewed as specialized browsers optimized for a given task, e.g., working with an employee’s financial records on a corporate Intranet. Though hosted within a specialized application container on the server, such portlets are in fact no different from the specialized talking Web tools described in the previous section. In the case of looking up an employee’s financial records, the tasks that the user would typically need to perform:

- signing on to the financial application,
- navigating to the relevant account, and
- extracting the records of interest
are performed on behalf of the user by the portlet. Thus, the portal server becomes an application container that provides a framework for portlet authors to create task-specific Web applications. The framework manages details such as single sign-on and a uniform look and feel with respect to customization. The resulting portal site provides a single point of entry for the user and obviates many repetitive tasks:

- signing on separately to each application,
- navigating to each application’s start page, and
- re-establishing personal preferences within each application.
In their heyday, portlets were not limited to desktop browsers with a visual interface. Using the underlying Web APIs, portlets were also created for deployment to mobile devices. Finally, a small number of portlets were created for hosting within a voice portal; such voice portlets emitted VoiceXML for aggregation into a larger VoiceXML application. Compared to their visual analog, VoiceXML portlets have not been very successful, primarily because integrating multiple spoken dialog applications into a coherent whole remains an unsolved research problem.
Portal servers and portlets became all the rage in 2002. Mapping this concept onto the client led to Web gadgets: task-specific Web applications hosted within the browser. As with the task-specific Web application technologies described so far, gadgets relied on the underlying Web architecture of URLs and HTTP to bring relevant data closer to the user. As a client-side technology, gadgets naturally chose HTML and JavaScript as their implementation vehicle; early prototypes include Opera Widgets for the Opera browser, among others.
Thus, client-side gadgets consisting of HTML, CSS and JavaScript were initially designed for placement within a Web page, in a manner analogous to what was seen earlier with portlets. The next step in this evolution came with the realization that forcing the end-user to launch a Web browser for every task was not always convenient; certain types of information, e.g., the current weather, are better suited to being available directly on the user’s desktop. This led to Apple’s Dashboard Widgets for Mac OS. Thus, the task-specific Web applications created thus far for aggregation into Web pages were finally freed from the shackles of having to live inside the browser: Web widgets could now materialize on the desktop.
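A client-side gadget can be as small as the hedged sketch below; the feed URL and the shape of the JSON response are hypothetical.

```html
<!-- Minimal client-side gadget: HTML plus JavaScript pulling data
     over HTTP; feed URL and JSON shape are hypothetical. -->
<div id="weather">Loading…</div>
<script>
  fetch("https://feeds.example.com/weather?city=Boston")
    .then(response => response.json())
    .then(data => {
      document.getElementById("weather").textContent =
        data.city + ": " + data.forecast;
    });
</script>
```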
Web gadgets are still evolving as this chapter is being written. In late 2005, Google introduced iGoogle modules for adding custom content to a user’s personalized Google page. Conceptually, these are similar to portlets, except that iGoogle modules can be authored by anyone on the Web and published to a directory of modules that helps users discover published modules and add them to their personalized iGoogle page. In an interesting parallel to client-side Web widgets escaping the shackles of the browser to live on the desktop, iGoogle modules can now be hosted within Web pages outside Google; they can also be viewed as Google Gadgets and materialize on the Google Desktop.
Notice that we’ve now come full circle, with task-specific browsing technologies that started as a niche application becoming a mainstream feature. Notice further that though such widgets inhabit the user’s desktop outside the Web browser, they are well-integrated with respect to Web architecture and use all of the Web’s basic building blocks to achieve their end. The impact on talking browsers of this progression from specialized Web applications to task-specific Web gadgets for the mainstream is profound, since the very features needed by spoken Web access:

- data addressable via URLs and retrievable via HTTP,
- content kept separate from presentation, and
- interfaces that can be re-targeted to different devices and modalities
are all prerequisites to building a healthy environment for Web gadgets.
RESTful Web APIs became common by late 2004. The simplicity afforded by parametrized URLs, and the bottom-up nature of their development, helped RESTful Web APIs overtake the much-vaunted Web Services stack. As a consequence, the number of useful services available via light-weight Web APIs reached critical mass, and Web 2.0 became a viable platform for building useful solutions.
Google Maps launched in early 2005 and provided the final link in the chain that led to Web mashups: light-weight Web applications that bring together data from different sources on the Web. Maps provide an ideal spatial canvas for visualizing information available on the Web. The availability of location-based information, e.g., available rentals in a given city or data about crime rates in different neighborhoods, when combined with Google Maps, enabled the creation of map mashups that let one place location-oriented data on a map.
Notice that Web mashups represent a very special kind of task-oriented browsing; earlier, a user looking for apartments to rent would have had to perform the following discrete tasks:

- look up available rentals on one or more listing sites,
- look up crime statistics for each candidate neighborhood, and
- locate each candidate address on a map.
The Web mashup plays the role of a specialized browser that performs these tasks on the user’s behalf to create the final result set. Web mashups like the one described here leverage the underlying Web architecture of URL-addressable data that is retrievable via HTTP. The last 18 months have seen an explosion of useful Web mashups. Mashups have moved from being Web applications that brought together data from different sites to providing alternative views of available data. As an example, the Google Calendar API enables Web sites to embed a user’s Google Calendar within a Web page. In doing so, such mashups can customize the look and feel of the calendar; this leads naturally to mashups that provide alternative views of the Google Calendar. The ability to provide alternative views of the same data source is a key consequence of the separation of data from any given view, and was earlier identified as a key requirement for adaptive Web access. With Web APIs and mashups liberating Web developers and users from a one-size-fits-all Web, mashups are evolving into a flexible platform that:

- brings together data from multiple sources,
- delivers alternative views of that data, and
- can be tailored to the user’s preferred mode of interaction.
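As a hedged sketch of such an alternative view, the snippet below re-presents an Atom feed (the feed URL is hypothetical) as a list of speakable entries, exactly the kind of re-presentation a talking client needs.

```javascript
// Alternative view of the same data source: fetch an Atom feed and
// reduce it to speakable (title, updated) pairs; feed URL hypothetical.
async function speakableEntries(feedUrl) {
  const xml = await (await fetch(feedUrl)).text();
  const doc = new DOMParser().parseFromString(xml, "application/xml");
  return Array.from(doc.getElementsByTagName("entry")).map(entry => ({
    title: entry.getElementsByTagName("title")[0]?.textContent ?? "",
    when: entry.getElementsByTagName("updated")[0]?.textContent ?? ""
  }));
}
```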
The evolution of specialized Web tools, light-weight Web APIs and Web mashups has together led to the emergence of a component framework for the Web. This framework is characterized by:

- data addressable via URLs and retrievable via HTTP,
- standardized feed formats such as RSS and ATOM,
- presentation expressed via HTML and CSS, and
- interaction implemented as scripting against open Web APIs.
The separation of data, presentation and interaction manifest in this emerging architecture for Web components lends itself well to making Web gadgets available on a variety of devices and interaction modalities. Thus, specialized browsing, a niche concept originally limited to special adaptive aids or to software engineers building themselves efficient one-off solutions, has now moved to center-stage. Specialized browsers that talk HTTP to retrieve information from the data-oriented Web and deliver a custom presentation optimized for the user’s special needs and abilities are now mainstream technology. Such specialized gadgets range in complexity from a simple weather-lookup gadget to full-blown custom applications such as mobile-optimized email clients. All of these share the underlying Web fabric of HTTP and URLs, which means that specialized clients like GMail Mobile need only implement the user-facing aspects of a traditional mail client. This space is still evolving rapidly as this chapter is being written. Component technologies such as those described so far will likely become pervasive, i.e., a Web component, once created, is likely to be capable of manifesting itself in a variety of end-user environments ranging from the graphical desktop to the speech-enabled mobile phone.
As the Web platform continues to evolve, we can expect today’s environment of Web mashups backed by RESTful APIs to further evolve to enable end-user composability of Web components. Functionally, the service-oriented Web is a collection of Web APIs that can be composed to create higher-level solutions. In this regard, APIs such as Google Maps are Web components analogous to UNIX shell tools such as ls and find; UNIX is exemplified by its command-line shell, where small, task-specific tools are composed to create custom end-user shell scripts to automate common tasks. The next step in this evolution is likely to be the creation of a Web command-line that enables end-users to compose higher-level solutions from existing Web components.
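The analogy can be made concrete with a small sketch: two hypothetical RESTful APIs composed the way UNIX pipes compose tools, the output of one feeding the input of the next.

```javascript
// Composing Web components like a UNIX pipeline; both API URLs are
// hypothetical.
const getJSON = url => fetch(url).then(response => response.json());

async function rentalsOnMap(city) {
  const rentals = await getJSON(
    `https://rentals.example.com/api?city=${encodeURIComponent(city)}`);
  // "Pipe" each listing's address into a geocoding API.
  return Promise.all(rentals.map(async listing => ({
    ...listing,
    location: await getJSON(
      `https://geo.example.com/code?q=${encodeURIComponent(listing.address)}`)
  })));
}
```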
In the UNIX shell, components are composed by piping the output of one component to the input of the next to create logical pipelines; once created, such user-defined pipelines can themselves be used as components. Equivalent concepts for the Web platform are still evolving as this chapter is being written. Below, we enumerate some of the design patterns that have emerged over the last few years, as an indicator of what is to come.
The Web is an information platform, and the question “What is good Web access?” is better answered by rephrasing it as “How does one deliver effective information access?”.
My own work in this field started with the work on AsTeR, a system for producing high-quality aural renderings. The primary insight underlying AsTeR was that electronic information is display-independent: to produce good aural renderings, one needs to start with the underlying information, as opposed to a specific visual presentation.
AsTeR introduced the notion of audio formatting, producing high-quality aural renderings from structured markup by applying rendering rules written in Audio Formatting Language (AFL). AsTeR included an interactive browser that allowed the listener to obtain multiple views of a document; re-ordering and filtering of content is an essential aspect of specialized browsing.
Extending these ideas from documents to user interfaces led to Emacspeak, a well-integrated computing environment that provides the auditory equivalent of the graphical desktop. Emacspeak extended the notion of audio formatting to interactive environments. In implementing rich auditory interaction for the Emacspeak audio desktop, it became clear that most of today’s user interfaces could be phrased in terms of a small number of abstract conversational gestures; see Table 1.1. The term conversational gestures was chosen intentionally: conversation implies speech, and gesture implies pointing, yet the set of abstract conversational gestures identified by the work on Emacspeak is independent of both interaction modalities.
[Table 1.1: Conversational gestures]
Conversational gestures, as enumerated in Table 1.1, enable the authoring of intent-based user interaction that can be mapped to different interaction modalities. This notion was further developed and implemented within XForms (XML-powered Web forms), where we defined user interface controls for each of the conversational gestures.
User interaction authored via such intent-based vocabularies lends itself well to delivery in different interaction modalities. Notice that a common set of abstract user interface controls gives enormous flexibility in determining how a particular piece of user interaction is delivered. But to deliver such flexible interaction, one needs the freedom to experiment at the time the user interface is delivered.
An emerging pattern in this space is therefore to:

- author user interaction in terms of abstract, intent-based controls,
- deliver that interaction as declarative markup accompanied by style rules and event handlers, and
- bind the abstract controls to concrete widgets suited to the target device and modality at delivery time.
This leads naturally to the next step in this evolution: dynamic Web interaction delivered as a collection of declarative markup, prescriptive style-sheets and imperative event handlers. Notice that this packaging of Web interaction once again reflects the oft-mentioned separation of content, presentation and interaction.
User interfaces created using intent-based authoring, as embodied by technologies like XForms, enable flexible delivery and consequently make attaching spoken interaction tractable. However, there is a concomitant need to be able to speech-enable dynamic interaction delivered as a combination of declarative markup and imperative event handlers. Notice that the availability of declarative, intent-based representations for common interaction tasks does not eliminate the need for imperative script-based solutions; scripting will always remain a means to experiment with new interaction patterns. Thus, there is a need to identify the relevant pieces of information that must be added to the content layer so that dynamic Web interaction can be speech-enabled.
For Dynamic HTML (DHTML), such information consists of the following (a sketch follows the next paragraph):

- the role of each user interface widget, e.g., slider or menu,
- the state of each widget as it changes, e.g., checked or expanded, and
- notification of which parts of the page update dynamically, so that changes can be spoken as they happen.
Addition of these properties to the content layer of Web applications brings the interaction layer on par with the rest of the Web component framework with respect to empowering alternative modes of interaction. Note that these properties also form the underpinnings of the present work on access-enabling rich Internet applications (ARIA).
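A minimal sketch of such annotations follows, using role, state and live-region attributes from the ARIA work; the widget and its values are illustrative.

```html
<!-- Role and state metadata make a script-driven widget speakable -->
<div role="slider" tabindex="0" aria-label="Volume"
     aria-valuemin="0" aria-valuemax="100" aria-valuenow="42">
</div>

<!-- A live region announces dynamic updates without moving focus -->
<div id="status" aria-live="polite">3 new messages</div>
```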
This section summarizes the key take-aways from this chapter.