Electronic document systems are based on:
Standard Generalized Markup Language (SGML) marks up the logical structure of a document in a layout-independent manner [SGM86, Org90, HPR92, Gol90]. A Document Type Definition (DTD) encapsulates the logical structure of a specific class of documents. Thus, SGML provides a notation both for describing classes of structured documents and for encoding documents that belong to the described classes. An advantage of SGML and other grammar-based document representations is that multiple applications can process a single document source file. The International Committee on Accessible Documents (ICAD) has been working on defining an accessible DTD, but at present their work does not encompass mathematical content.
Though SGML is now used by many government agencies to mark up a variety of documents, it still has very little support for marking up technical content, e.g., mathematics. There is ongoing work to remedy this situation. In the last year, the SGML-Math committee has been working on a math DTD for SGML. This work is not yet complete, but it has raised a few interesting issues. The main point of discussion has been whether it is possible to design a math DTD that captures semantic information about the mathematical constructs being marked up. Though it would be nice to have all of a mathematical construct's semantic content available when processing the document, e.g., in our case when producing audio renderings, this seems almost unattainable. There is as yet no firm agreement on this point, but the trend seems to be towards a math DTD that captures the layout as embodied by TeX. Defining a DTD that captures full mathematical semantics would make it difficult to invent new notation. TeX, by capturing only the layout constructs used to build up written mathematics, side-steps this issue, and the resulting system makes it easy to invent new notation. However, this also makes recognition more difficult. Some of the problems present in (LA)TeX are being addressed by ongoing work on the LaTeX3 project.
Significant work has been carried out in the context of structure-sensitive editors for documents. This work has focused on the design of appropriate document encodings that capture high-level structure unambiguously. Another topic of interest has been the capture of hypertext links within the context of structured documents. The logical structure of documents is typically captured using a tree-like representation consisting of hierarchical units. The challenge of integrating this model with the notion of hypertext links has been successfully addressed by the design of HyperText Markup Language (HTML), an SGML-based markup system for encoding structured hypertext documents. Finally, the aim of achieving the best of both worlds, i.e., the power afforded by a grammar-based markup system and the user interface provided by WYSIWYG (what you see is what you get) systems, has led to work on providing multiple synchronized views of a document [Har88]. See [QV92, LG90, BB90, KLMN90, PI88, SF88, SF90, FBN+90, Lev88, SFR92, FS89, Kat87, Ass86, KS84, CJ90, BG90, Ver90, PS88, QNA90, Bro88] for details on relevant work in this area.
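To make the combination of a hierarchical document model with hypertext links concrete, the following is a minimal sketch, not the representation used by any of the systems cited above. The names Node, anchor, link_to, index_anchors, and resolve are all assumptions introduced for illustration: each node carries its children (the tree), and a link is stored as the name of an anchor that may sit anywhere else in the tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One hierarchical unit of a document (all field names are illustrative)."""
    kind: str                        # e.g. "document", "section", "para"
    text: str = ""
    anchor: Optional[str] = None     # hypothetical name other nodes can link to
    link_to: Optional[str] = None    # hypothetical name of the anchor this node points at
    children: List["Node"] = field(default_factory=list)


def index_anchors(root: Node) -> Dict[str, Node]:
    """Walk the tree once and map each anchor name to its node."""
    anchors: Dict[str, Node] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.anchor is not None:
            anchors[node.anchor] = node
        stack.extend(node.children)
    return anchors


def resolve(root: Node, node: Node) -> Optional[Node]:
    """Follow a node's link; return None if it has no link or the target is missing."""
    if node.link_to is None:
        return None
    return index_anchors(root).get(node.link_to)


# A two-section document in which a paragraph links back to the first section.
intro = Node("section", "Introduction", anchor="intro")
see_intro = Node("para", "See the introduction.", link_to="intro")
doc = Node("document", children=[intro, Node("section", "Results", children=[see_intro])])
```

Storing links by name rather than by direct reference keeps the hierarchy a true tree, so structure-oriented processing can ignore links entirely, while link traversal becomes a lookup over the same nodes.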
There has been some work towards building automatic translators for converting electronic documents from one markup language to another. The need for such systems is apparent: even though most of today's documents are written electronically, it is still practically impossible to exchange electronic documents generated on disparate computer systems. This means that the only way information can be exchanged is by first printing a hardcopy.
There are two approaches to solving this problem:
ICA (Integrated Chameleon Architecture) was developed in the Computer Science Department at Ohio State University [MOB90]. The system produces translators between different document encodings. Users specify an abstract syntax for the class of documents they wish to translate. Typically, this abstract syntax would be similar to a DTD used by SGML. Users then specify the conversion rules for mapping this abstract syntax to and from the concrete syntax used by different markup systems. Using this specification, the system generates translators that can convert documents from a specific concrete syntax to the abstract syntax and vice versa.
The advantage of this approach is that it requires only O(n) translators to convert between documents encoded in n different markup languages. Directly translating between the n markup systems would require O(n^2) translators. The difficulty is that not all markup systems use the same model for the same class of documents. This means that the abstract syntax can capture only those features that are common to all n markup systems. To give an example, one of the target systems might explicitly capture section numbers in the markup, while another might compute them upon request.
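The translator counts can be illustrated with a small sketch of translation through a shared abstract syntax. This is not the ICA implementation: the two toy concrete syntaxes ("latex_like" and "sgml_like"), the Abstract type, the Codec class, and every function name below are assumptions made for illustration only. Each markup system contributes one parser into the abstract syntax and one emitter out of it, so n systems need 2n translators, whereas direct pairwise translation needs n(n-1).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical abstract syntax: a document is a list of (kind, text) units.
Abstract = List[Tuple[str, str]]


@dataclass
class Codec:
    """The two translators one markup system contributes: to and from the abstract syntax."""
    parse: Callable[[str], Abstract]
    emit: Callable[[Abstract], str]


def parse_latex_like(src: str) -> Abstract:
    units: Abstract = []
    prefix = "\\section{"
    for line in src.splitlines():
        if line.startswith(prefix) and line.endswith("}"):
            units.append(("section", line[len(prefix):-1]))
        elif line.strip():
            units.append(("para", line))
    return units


def emit_latex_like(doc: Abstract) -> str:
    return "\n".join("\\section{%s}" % t if k == "section" else t for k, t in doc)


def parse_sgml_like(src: str) -> Abstract:
    units: Abstract = []
    for line in src.splitlines():
        if line.startswith("<h1>") and line.endswith("</h1>"):
            units.append(("section", line[4:-5]))
        elif line.startswith("<p>") and line.endswith("</p>"):
            units.append(("para", line[3:-4]))
    return units


def emit_sgml_like(doc: Abstract) -> str:
    return "\n".join(("<h1>%s</h1>" if k == "section" else "<p>%s</p>") % t for k, t in doc)


CODECS: Dict[str, Codec] = {
    "latex_like": Codec(parse_latex_like, emit_latex_like),
    "sgml_like": Codec(parse_sgml_like, emit_sgml_like),
}


def translate(src: str, source: str, target: str) -> str:
    """Translate by going through the abstract syntax instead of a direct converter."""
    return CODECS[target].emit(CODECS[source].parse(src))


def translators_needed(n: int) -> Tuple[int, int]:
    """Return (via abstract syntax, direct pairwise) translator counts for n systems."""
    return 2 * n, n * (n - 1)
```

For example, translate("\\section{Intro}\nHello", "latex_like", "sgml_like") yields "<h1>Intro</h1>" followed by "<p>Hello</p>" on the next line, and translators_needed(5) returns (10, 20). The toy abstract syntax records only sections and paragraphs, which mirrors the limitation noted above: anything one system encodes that the abstract syntax does not model, such as explicit section numbers, is dropped in translation.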
Given documents marked up in n different markup languages, an alternative solution is to convert all of them to a form that is the lowest common denominator of the various document encodings. This can be done by converting the documents either to plain ASCII or to a display-specific format, such as PostScript. Both of these alternatives have shortcomings, as outlined below.
Converting to ASCII loses layout structure. Since layout is the only cue to logical structure in a formatted document, this form of conversion discards the logical structure as well.
An alternative solution is adopted by systems like Adobe Acrobat. Portable Document Format (PDF), a portable form of PostScript, is used by Adobe Acrobat as a common currency between different computing platforms. The encoded document can be displayed with its original layout on disparate computing platforms without the software used to produce the original document. This solution does allow users to exchange documents without losing any layout information. However, it is only one step better than exchanging printed paper: exchanging PDF files is like exchanging electronic paper! For example, the information present in the document cannot be manipulated electronically. This also means that the information and its inherent structure can be accessed in only one way: by a human looking at it. The principal advantage of having information online, namely the ability to process it, is lost. In addition, this approach has the serious disadvantage of making electronic information inaccessible to persons with special needs.
Most recognition work has focused on extracting logical structure from document layout. Significant research has been carried out in the context of OCR-based document recognition. For a complete bibliography on work in this area, we refer the reader to the online bibliography on document understanding available on the Internet.
See [PR92] for details on recognizing logical structure from the layout information present in a PostScript file. This is a difficult problem, and it reinforces the earlier comment on PDF and the shortcomings of storing electronic documents in a purely layout-oriented form.
Relatively little work has been done on recognizing document structure from electronic markup. The work on Chameleon [MOB90] and related work in the area of attribute grammars [Yel88] could be used to extract logical structure from electronic documents. Tools such as the Cornell Synthesizer Generator [RT84, RT88a, RT88b] and the Centaur system [Bor88] can also be used to build such recognizers. The key to building such recognizers successfully is the robustness and applicability of the high-level models used. For details on other attempts at recognizing structure from markup, see [Arn91, AM91, AW91, Arn92].
One of our principal aims when designing AsTeR was to produce clear and succinct audio renderings for mathematics. In designing a concise audio notation for mathematics, it is interesting to note that the written mathematical notation that we have come to accept is relatively new. For an in-depth survey of the evolution of written mathematics, see [Caj30].
There is little similarity between developing a written notation and developing its audio counterpart. However, the evolution of written notation shows the following: any notational system is a combination of conventions and an intuitive use of the various dimensions provided by the perceptual modality and by the means available to produce output appropriate for that modality. In the case of visual notation, these dimensions are font size, changes in baseline, use of different delimiters, stacking of sub-expressions to build up layout, and the use of characters from different scripts. This insight enabled us to develop a concise audio notation for spoken mathematics that exploits the various audio dimensions that are currently available; see Section 4.3 for details. It is conceivable that the number of audio dimensions will increase as audio hardware improves, leading to a more sophisticated audio notation.