This section describes the techniques used to extract high-level models from the (LA)TEX source. A recursive descent parsing algorithm is used to construct the tree structure for document content conforming to the model described in Section 2.1. This algorithm is modified to construct the quasi-prefix form. These refinements enable our recognizer to correctly handle ambiguous mathematical notation, as in the expression sin2x = 2sinxcosx. We use a modified version of the conventional operator-precedence approach for constructing the quasi-prefix form. With the refinements and heuristics outlined in this section, our algorithm successfully recognizes written mathematical notation from a wide variety of sources.
Lisp-CLOS was chosen to implement AS TE Rbecause of its powerful development environment and object-oriented features. However, Lisp-CLOS lacks tools such as lexical analyzers and parser generators, e.g., LEX and YACC. As a convenient way of getting the best of both worlds, we designed a lexical analyzer called lispify in LEX that outputs the input (LA)TEX source in a canonical list representation. This list is then read in by a recursive descent parser written in Lisp. The general form of this list is (token <body>), where <token> identifies the type of content encapsulated by the list and <body> represents the content. The recognizer returns a document object that encapsulates the document instance being recognized. For example, given the (LA)TEX input
\begin{center}
This is a sample document.
\end{center}
LISPIFYproduces
(center "This" "is" "a" "sample" "document" ".")
LISPIFYhandles all of (LA)TEX concrete syntax.
The recursive descent parser examines the token at the front of the input list and calls a token-specific processing function on the rest of the list. Thus, given the input (token <body>), the recognizer executes
The technique described so far is sufficient for handling sections, enumerated lists and other textual content.
The recognizer processes the mathematical content to construct the quasi-prefix form described in Section 2.2. For example, given the input $a+b$, LISPIFY produces
(inline-math "a" "+" "b" )
Converting a list as shown above to prefix form is a simple exercise and can be found in most programming language texts. Our implementation is based on the infix to prefix converter in the text on Common Lisp by Winston and Horn5 [HW89].
Function inf-to-pre performs the infix-to-prefix conversion. The input to this function is a list of math objects that have been processed using the classification given in Section 2.2. Each element of this list is a math object with content and attributes but no children. Note that the contents of the attributes are first converted to quasi-prefix form. For example, when recognizing xk−1 + xk + xk+1, the input is first converted to a list of five math objects containing the quasi-prefix representation for xk−1, +, xk, + and xk+1 respectively. This is achieved by collecting the attributes that appear on each math object and processing their content recursively. Converting such a list to prefix form is now no different than processing a + b.
We now extend this algorithm to handle ambiguous mathematical notation. Conventional parsing techniques fail, since written mathematics does not adhere to a rigorous set of precedence rules. For example, the expression sin2nπ means sin(2nπ) rather than sin(2) ∗ nπ, even though function application is normally assigned the highest precedence. Moreover, sinacosb means sina ∗ cosb rather than sin(acosb). We have taken many such anomalies into account.
The precedence table for operators Table 2.1 on page 43 lists operators in ascending order of precedence. Only one operator is shown at each level.
|
Functions define-precedence and remove-precedence allow the user to modify the precedence table. These, however, are not for use by a casual user of AS TE R, since changes to the precedence table without a clear understanding of the recognition algorithm can cause unexpected behavior.
As pointed out earlier, precedence rules alone are not sufficient to handle written mathematics. We adapt the algorithm by using the following heuristics:
everything up to the = sign is treated as the summand. This technique is particularly useful in recognizing expressions like x + ∑ iai = 0. By our heuristic, the summation is correctly recognized as the second argument to the + sign. Further, the summand is terminated by the = sign. The expression is now equivalent to recognizing a + c = 0, which can be handled by the standard algorithm.