2.1 Document models

All information has high-level structure, and any physical rendering of a document is a projection of this structure onto a particular medium, e.g., printed paper. The structure itself is independent of any particular mode of display. As a first step toward rendering structured information in audio, we have developed high-level models to represent document structure. For an electronic source, the amount of structural information that can be extracted depends entirely on how the logical structure is marked up. In OCR-based document recognition, it is a function of the quality of the visual rendering being recognized. In both markup-based and OCR-based recognition, the type of structure that can be extracted varies widely.

Intuitively, there is a hierarchy of document types ordered by the amount of structural information captured and by the ease with which that structure can be recognized. The amount of structural information ranges from plain paragraphs and sentences marked up with normal punctuation to highly technical documents with footnotes, equations and references. The ease of extraction ranges from the bitmap of a low-resolution fax, through a PostScript file, up to a richly marked-up SGML file. Given a document instance, the amount of markup information determines which of these logical structures we can extract. In a plain ASCII document, structural information has to be inferred from the layout of the text, e.g., spacing, vertical alignment and centering. In encodings in markup languages like (La)TeX, much of the logical structure is explicitly present in the markup. Structure-based document encoding systems like SGML offer the potential for extracting the richest possible logical structure, since they separate the layout process from the encoding of the document structure.

Our recognizer captures logical structure present in electronic documents encoded in the TeX family of languages. An important feature of this recognizer is that it works on the entire gamut of encodings, ranging from plain ASCII documents, i.e., no markup, up to documents containing completely unambiguous encodings of the logical structure. Recognition of document structure is an important step in producing audio renderings, since the quality of such renderings is directly determined by the richness of the available structural information.

Our basic document model is the attributed tree. Each hierarchical level of the document is modeled as a node in this tree. Each node can have content, children and attributes. In this respect, our document model is no different from the ones used by SGML. We now introduce the hierarchy of objects used to model documents belonging to the article style of LaTeX. Since our recognizer is implemented in CLOS, an object-oriented language, we use object-oriented terminology throughout this chapter. Thus, the term object typically refers to a CLOS object. Further, the terms subclass and subtype are used synonymously.
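The attributed tree can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the CLOS implementation; the class name Node, its field names and the sample titles are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """One hierarchical level of a document (hypothetical name and fields)."""
    kind: str                                        # e.g. "article", "section"
    content: Optional[str] = None                    # text held directly by the node
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Number of hierarchical levels at and below this node."""
        return 1 + max((c.depth() for c in self.children), default=0)


# A three-level tree: article -> section -> paragraph.
article = Node("article", attributes={"title": "A sample article"})
section = article.add(Node("section", attributes={"title": "Introduction"}))
section.add(Node("paragraph", content="All information has structure."))
```

Here every level of the hierarchy is the same kind of node, and the three ingredients named above (content, children, attributes) are the only state it carries.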

Class Article

An object of class article has attributes such as title, author, abstract and date. The children of an article object represent its hierarchical structure, e.g., sectional units. The prologue of an article is its initial body, i.e., any text occurring before the first sectional unit. Though it would be cleaner to model such initial text as the first child, it is more convenient to handle it as an attribute. This is because (La)TeX does not specify a complete document type definition (DTD) for articles, and as a result many of the objects are not well-defined. All objects that capture document content follow the same basic model described above for articles. Note also that LaTeX provides separate book and report styles. These differ from the article style mostly in layout; the only structural difference is that books and reports can have chapters, while articles cannot. Chapters, sections and subsections all capture hierarchical document content and are modeled as sectional units. The article class of documents defined here therefore encompasses books and reports.
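As a rough sketch of this class, again in Python rather than CLOS, an article might carry the prologue as an attribute while its children hold the sectional units. The names Article, SectionalUnit, level, body and outline are assumptions made for the example, not names taken from the recognizer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SectionalUnit:
    """A chapter, section or subsection; all share one model (assumed fields)."""
    level: str                      # "chapter", "section", "subsection", ...
    title: str
    body: str = ""                  # assumed: initial text before any nested unit
    children: List["SectionalUnit"] = field(default_factory=list)


@dataclass
class Article:
    """Covers articles, books and reports; only the top-level units differ."""
    title: Optional[str] = None
    author: Optional[str] = None
    abstract: Optional[str] = None
    date: Optional[str] = None
    prologue: str = ""              # initial body, kept as an attribute, not a child
    children: List[SectionalUnit] = field(default_factory=list)


def outline(units: List[SectionalUnit], indent: int = 0) -> List[str]:
    """Return one indented line per sectional unit, depth first."""
    lines = []
    for unit in units:
        lines.append("  " * indent + f"{unit.level}: {unit.title}")
        lines.extend(outline(unit.children, indent + 1))
    return lines


# A report differs from an article only in having chapters at the top level.
report = Article(title="Sample report", prologue="Text before the first chapter.")
chapter = SectionalUnit("chapter", "Background")
chapter.children.append(SectionalUnit("section", "Document models"))
report.children.append(chapter)
```

Because chapters reuse the sectional-unit model, no separate book or report class is needed in this sketch.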

The leaves in the tree structure for documents represent actual content. Plain text is represented as a list of word objects, and inline mathematics is represented by object inline math. Each node in the document model is linked to its parent and siblings, enabling sophisticated browsing. These links are provided by the document base class.
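The linking described here could look roughly as follows. This is a Python sketch, not the CLOS code: the attribute names parent, previous and next, the append helper, and the Paragraph class are assumptions chosen for illustration.

```python
class Document:
    """Base class supplying parent and sibling links (attribute names assumed)."""

    def __init__(self):
        self.parent = None
        self.previous = None
        self.next = None
        self.children = []

    def append(self, child):
        """Add a child and wire up its parent and sibling links."""
        child.parent = self
        if self.children:
            last = self.children[-1]
            last.next = child
            child.previous = last
        self.children.append(child)
        return child


class Word(Document):
    """Leaf holding one word of plain text."""

    def __init__(self, text):
        super().__init__()
        self.text = text


class InlineMath(Document):
    """Leaf holding inline mathematics; here just its source string."""

    def __init__(self, source):
        super().__init__()
        self.source = source


class Paragraph(Document):
    """Hypothetical container for a run of leaves."""


para = Paragraph()
for token in ["Let", "x", "be"]:
    para.append(Word(token))
math = para.append(InlineMath("x^2"))
```

From any leaf, a browser can step to a neighbor through previous and next or climb to the enclosing unit through parent, without searching from the root.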

Thus, class document provides the following slots:

The following is a brief overview of some of the document objects in our model. All of the following objects inherit from base class document.

Extending Document Logical Structure

LaTeX allows the basic model described above to be extended in two ways:

User-defined macros and environments add new object types that extend the basic model. This is covered in detail in Section 2.4.

Cross references add links that cut across the attributed tree. A cross reference is represented by an object of class cross reference that contains a pointer to the object being cross-referenced, and this link can be used to traverse the model. The label of a cross-referenceable object is represented as an attribute of that object.
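One way to picture the cross-reference link is the Python sketch below. It is not the recognizer's CLOS code; the names DocObject, CrossReference, ref_label, target and resolve, as well as the two-pass resolution strategy, are assumptions made for the example.

```python
class DocObject:
    """A node of the document tree; label is None unless it can be referenced."""

    def __init__(self, kind, label=None):
        self.kind = kind
        self.label = label          # the label is an attribute of the object
        self.children = []

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


class CrossReference(DocObject):
    """Holds a pointer to the object being cross-referenced."""

    def __init__(self, ref_label):
        super().__init__("cross-reference")
        self.ref_label = ref_label  # label text as written in the source
        self.target = None          # filled in by resolve()


def resolve(root):
    """Point every cross reference at the object carrying its label."""
    by_label = {n.label: n for n in root.walk() if n.label is not None}
    for node in root.walk():
        if isinstance(node, CrossReference):
            node.target = by_label.get(node.ref_label)


root = DocObject("article")
section = DocObject("section", label="sec:models")
root.children.append(section)
ref = CrossReference("sec:models")
root.children.append(ref)
dangling = CrossReference("sec:missing")
root.children.append(dangling)
resolve(root)
```

After resolution, following ref.target jumps straight to the labeled section, while a reference whose label is not found keeps a target of None.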