
Document models

  All information has high-level structure, and any physical rendering of a document is a projection of this structure onto a particular medium, e.g., printed paper. This high-level structure is itself independent of any particular mode of displaying the information. We have developed high-level models to represent document structure as a first step in producing audio renderings of such structured information. The amount of structural information that can be extracted from an electronic source depends on how the logical structure is marked up. In OCR-based document recognition, it is instead a function of the quality of the visual rendering being recognized. In both markup-based and OCR-based document recognition, the type of structure that can be extracted varies widely.

Intuitively, there is a hierarchy of document types ordered by the amount of structural information captured and by the ease with which such structure can be recognized. The amount of structural information varies from plain paragraphs and sentences marked up with normal punctuation, all the way up to highly technical documents with footnotes, equations and references. The ease with which the structure can be extracted ranges from the bitmap of a low-resolution fax, through a PostScript file, on upward to a highly marked-up SGML file. Given a document instance, the amount of markup information determines which of these logical structures we can extract. Given a plain ASCII document, structural information has to be inferred from the layout of the text, e.g., spacing, vertical alignment and centering. In the case of encodings in markup languages like (La)TeX, much of the logical structure is explicitly present in the markup. Structure-based document encoding systems like SGML provide the potential for extracting the richest possible logical structure, since they separate the layout process from the encoding of the document structure.
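To make the plain ASCII case concrete, the following is a minimal sketch of inferring structure from layout alone. It is written in Python purely for illustration and is not the recognizer described here; the function names (`split_blocks`, `is_centered`, `guess_structure`), the assumed line width and the tolerance are all hypothetical. Blank lines are taken as block separators, and a single line with roughly equal left and right margins is guessed to be a heading.

```python
def split_blocks(text):
    """Group consecutive non-blank lines; blank lines act as separators."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def is_centered(line, width=72, tolerance=2):
    """True if the left and right margins are roughly equal and nonzero.

    `width` is the assumed width of the text column (an assumption,
    since a plain ASCII file does not record it).
    """
    stripped = line.strip()
    left = len(line) - len(line.lstrip())
    right = width - left - len(stripped)
    return left > 0 and abs(left - right) <= tolerance


def guess_structure(text, width=72):
    """Label each block 'heading' (one centered line) or 'paragraph'."""
    labelled = []
    for block in split_blocks(text):
        if len(block) == 1 and is_centered(block[0], width):
            kind = "heading"
        else:
            kind = "paragraph"
        labelled.append((kind, " ".join(l.strip() for l in block)))
    return labelled


# A tiny sample: a centered title followed by a two-line paragraph.
sample = (
    "Document models".center(40).rstrip()
    + "\n\nAll information has\nhigh-level structure.\n"
)
result = guess_structure(sample, width=40)
```

A heuristic of this kind is fragile: it depends on guessing the column width and mislabels short indented lines, which illustrates why explicit markup yields richer and more reliable structure than layout.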

Our recognizer captures logical structure present in electronic documents encoded in the TeX family of languages. An important feature of this recognizer is that it works on the entire gamut of encodings, ranging from plain ASCII documents, i.e., no markup, up to documents containing completely unambiguous encodings of the logical structure. Recognition of document structure is an important step in producing audio renderings, since the quality of such renderings is directly determined by the richness of the available structural information.

Our basic document model is the attributed tree. Each hierarchical level of the document is modeled as a node in this tree. Each node can have content, children and attributes. In this respect, our document model is no different from the ones used by SGML. We now introduce the hierarchy of objects used to model documents belonging to the article style of LaTeX. Since our recognizer is implemented in CLOS, an object-oriented language, we will use object-oriented terminology throughout this chapter. Thus, the term object typically refers to a CLOS object. Further, the terms subclass and subtype are used synonymously.
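The attributed tree can be sketched as follows. The recognizer itself is implemented in CLOS; this illustration uses Python instead, and the class name `Node`, its field names and the node kinds shown are assumptions made for the example rather than the classes actually used. Each node carries optional content, a set of attributes and an ordered list of children.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """One hierarchical level of a document (hypothetical class)."""
    kind: str                          # e.g. "article", "section"
    content: Optional[str] = None      # text held directly by this node
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Append a child and return it, so trees can be built top-down."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, node) pairs in document order (pre-order)."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Build a small tree: an article with one section of two paragraphs.
doc = Node("article", attributes={"title": "Document models"})
sec = doc.add(Node("section", attributes={"number": 1}))
sec.add(Node("paragraph", content="All information has high-level structure."))
sec.add(Node("paragraph", content="Our basic document model is the attributed tree."))

# Flatten the tree into an outline of (depth, kind) pairs.
outline = [(depth, node.kind) for depth, node in doc.walk()]
```

Because the tree records structure rather than layout, the same traversal could drive a visual or an audio rendering. In an object-oriented implementation, each node kind would more naturally be its own subclass than a string tag.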








TV Raman
Thu Mar 9 20:10:41 EST 1995