
Ordering the Possible Representations

The amount of structural information that can be extracted from an electronic source depends on how the logical structure is marked up. In OCR-based document recognition, it also depends on the quality of the visual rendering being recognized. In both markup-based and OCR-based document recognition, the type of structure that can be extracted varies widely.

Intuitively, there is a hierarchy (lattice) of document types, ordered by the amount of structural information captured and by the ease with which that structure can be recognized. The amount of structural information ranges from plain paragraphs and sentences marked up with normal punctuation all the way up to highly technical documents with footnotes, equations and references. The ease with which the structure can be extracted ranges from the bitmap of a low-resolution fax, through a PostScript or PDF file, up to a highly marked-up LaTeX or SGML file. Given a document instance, the amount of structural information it encodes determines which of these logical structures we can extract. Given a plain ASCII document, structural information has to be inferred from the layout of the text, e.g., spacing, vertical alignment and centering. This is also true of pure visual layout encodings like PostScript and PDF. In the case of encodings in markup languages like (La)TeX, much of the logical structure is explicitly present in the electronic source. Structure-based document encoding systems like SGML provide the potential for extracting the richest possible logical structure, since they separate the layout process from the encoding of the document structure.
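To make the plain ASCII case concrete, the following is a minimal sketch of inferring structure from layout alone. It is illustrative only and is not the recognizer described in this work: the function names, the default line width of 72 columns and the centering tolerance are all assumptions. It treats blank lines as block separators and guesses that a single indented line with roughly equal left and right margins is a heading.

```python
def is_centered(line, width, tolerance=2):
    """Guess whether a line is centered within `width` columns (assumed heuristic)."""
    line = line.expandtabs().rstrip()
    stripped = line.strip()
    left = len(line) - len(line.lstrip())
    right = width - left - len(stripped)
    # A centered line is indented and has nearly equal margins on both sides.
    return left > 0 and abs(left - right) <= tolerance


def guess_role(block, width):
    """Label a block of lines as 'heading' or 'paragraph' from layout cues only."""
    if len(block) == 1 and is_centered(block[0], width):
        return "heading"
    return "paragraph"


def classify_blocks(text, width=72):
    """Split text on blank lines and return (role, joined text) for each block."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return [
        (guess_role(block, width), " ".join(l.strip() for l in block))
        for block in blocks
    ]
```

Calling `classify_blocks(text, width=40)` on a file whose title was centered in 40 columns would label that title as a heading and the remaining blocks as paragraphs. Such heuristics are brittle, which illustrates why explicit markup makes the same structure far easier to recover.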

TV Raman
Fri Mar 10 08:30:23 EST 1995