Recognizing High-Level Document Structure

This chapter describes high-level models for document structure and the extraction of such structure from electronic markup. Our recognizer, a recursive descent parser written in Lisp, handles documents encoded in the (LA)TEX family of markup languages: TEX, LATEX and AMS-TEX.

The chapter is organized as follows. Section 2.1 describes the high-level models used to capture general document content. Section 2.2 presents the models used to capture written mathematics. Section 2.3 gives a brief overview of the techniques used to extract structure from documents conforming to our model. Section 2.4 turns to user-defined macros: (LA)TEX allows the author of a document to extend the markup language by defining macros, and we model each such macro as introducing a new object type into the logical structure. Using this model, we describe a flexible method for extending the recognizer to handle (LA)TEX macros. Section 2.5 formulates a few guidelines for unambiguous document encodings, based on our experience in extracting structure from present-day marked-up documents. Appendix A.2 documents the external interface to the recognizer.
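To make the idea of macros as new object types concrete before the detailed sections, here is a minimal sketch in Python of a recursive descent recognizer over brace-delimited markup. It is an illustration only, not the Lisp recognizer described in this chapter: the names `Node`, `register_macro`, `tokenize`, `Parser` and `parse`, as well as the example macro `\kronecker`, are all hypothetical, and the sketch assumes that macro arguments are always written as brace groups with no intervening whitespace.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                      # object type of this node
    children: list = field(default_factory=list)   # sub-objects, if any
    text: str = ""                                 # literal text or macro name


# Hypothetical registry: macro name -> (object type, number of brace arguments).
MACROS = {}


def register_macro(name, object_type, n_args):
    """Declare that \\name introduces a new object type with n_args arguments."""
    MACROS[name] = (object_type, n_args)


def tokenize(src):
    """Split source into braces, control sequences, and runs of plain text."""
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch in "{}":
            tokens.append(ch)
            i += 1
        elif ch == "\\":
            j = i + 1
            while j < len(src) and src[j].isalpha():
                j += 1
            tokens.append(src[i:j])
            i = j
        else:
            j = i
            while j < len(src) and src[j] not in "{}\\":
                j += 1
            tokens.append(src[i:j])
            i = j
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_sequence(self, stop=None):
        """Parse items until end of input or the stop token (not consumed)."""
        items = []
        while self.peek() is not None and self.peek() != stop:
            items.append(self.parse_item())
        return items

    def parse_group(self):
        """Parse '{' sequence '}' into a group node."""
        if self.next() != "{":
            raise SyntaxError("expected '{'")
        children = self.parse_sequence(stop="}")
        if self.next() != "}":
            raise SyntaxError("unbalanced braces")
        return Node("group", children)

    def parse_item(self):
        tok = self.peek()
        if tok == "{":
            return self.parse_group()
        if tok == "}":
            raise SyntaxError("unexpected '}'")
        tok = self.next()
        if tok.startswith("\\"):
            name = tok[1:]
            if name in MACROS:
                # A registered macro yields a node of its own object type,
                # with one child group per declared argument.
                object_type, n_args = MACROS[name]
                args = [self.parse_group() for _ in range(n_args)]
                return Node(object_type, args)
            return Node("unknown-macro", text=name)
        return Node("text", text=tok)


def parse(src):
    return Parser(tokenize(src)).parse_sequence()


# Usage: teach the recognizer a two-argument macro, then parse a fragment.
register_macro("kronecker", "kronecker-delta", 2)
tree = parse(r"see \kronecker{i}{j} here")
print([node.kind for node in tree])
```

The design point the sketch tries to capture is that the parser itself stays fixed: handling a new macro means adding one registry entry, after which instances of that macro appear in the tree as objects of their own type rather than as uninterpreted markup.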

2.1 Document Models

Class Article

Extending Document Logical Structure

2.2 Representing Mathematical Content

Math Object Encapsulates Quasi-Prefix Form

Refining the Quasi-Prefix Form

2.3 Constructing High-Level Representations

2.3.1 Lexical Analysis and Recognition

2.3.2 Constructing the Quasi-Prefix Form

2.4 Macros Introduce New Object Types

2.4.1 How define-text-object Works

2.4.2 Rendering Instances of User-Defined Macros

Defining New Environments in LATEX

2.5 Unambiguous Document Encodings