Chapter 2
Recognizing High-Level Document Structure

This chapter describes high-level models for document structure and the extraction of such structure from electronic markup. Our recognizer, a recursive descent parser written in Lisp, handles documents encoded in the (La)TeX family of markup languages: TeX, LaTeX and AMS-TeX.

We present the recognizer as follows. Section 2.1 describes the high-level models used to capture general document content. Section 2.2 presents the models used to capture written mathematics. Section 2.3 gives a brief overview of the techniques used to extract structure from documents conforming to our model. (La)TeX allows the author of a document to extend the markup language by introducing user-defined macros; we model each such macro as introducing a new object type into the logical structure. Using this model, Section 2.4 describes a flexible method for extending the recognizer to handle (La)TeX macros. Section 2.5 formulates a few guidelines for unambiguous document encodings, based on our experience in extracting structure from present-day marked-up documents. Appendix A.2 documents the external interface to the recognizer.


 2.1 Document Models
  Class Article
  Extending Document Logical Structure
 2.2 Representing Mathematical Content
  Math Object Encapsulates Quasi-Prefix Form
  Refining the Quasi-Prefix Form
 2.3 Constructing High-Level Representations
  2.3.1 Lexical Analysis and Recognition
  2.3.2 Constructing the Quasi-Prefix Form
 2.4 Macros Introduce New Object Types
  2.4.1 How define-text-object Works
  2.4.2 Rendering Instances of User-Defined Macros
  Defining New Environments in LaTeX
 2.5 Unambiguous Document Encodings