2.3 Constructing High-Level Representations

This section describes the techniques used to extract high-level models from the (La)TeX source. A recursive descent parsing algorithm is used to construct the tree structure for document content conforming to the model described in Section 2.1. This algorithm is extended to construct the quasi-prefix form for mathematical content, using a modified version of the conventional operator-precedence approach. These refinements enable our recognizer to correctly handle ambiguous mathematical notation, as in the expression sin2x = 2sinxcosx. With the refinements and heuristics outlined in this section, our algorithm recognizes written mathematical notation from a wide variety of sources.

2.3.1 Lexical Analysis and Recognition

Lisp-CLOS was chosen to implement AsTeR because of its powerful development environment and object-oriented features. However, Lisp-CLOS lacks lexical-analyzer and parser generators such as LEX and YACC. To get the best of both worlds, we designed a lexical analyzer called LISPIFY in LEX that outputs the input (La)TeX source in a canonical list representation. This list is then read in by a recursive descent parser written in Lisp, which returns a document object that encapsulates the document instance being recognized. The general form of the list is (token <body>), where token identifies the type of content encapsulated by the list and <body> represents the content. For example, given the (La)TeX input

\begin{center}

This is a sample document.

\end{center}

LISPIFY produces

(center "This" "is" "a" "sample" "document" ".")

LISPIFY handles all of the (La)TeX concrete syntax.

The recursive descent parser examines the token at the front of the input list and calls a token-specific processing function on the rest of the list. Thus, given the input (token <body>), the recognizer executes

(funcall (get-parser token) <body>)

The technique described so far is sufficient for handling sections, enumerated lists and other textual content.
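
The dispatch step can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Lisp of the implementation, and it assumes a plain dictionary as the parser registry; `define_parser`, `parse`, and the dictionary-shaped result are hypothetical names and shapes chosen for illustration, and only `get_parser` mirrors a name used above.

```python
# Hypothetical registry mapping a token to its processing function.
PARSERS = {}


def define_parser(token):
    """Register a processing function for one token type (assumed helper)."""
    def register(fn):
        PARSERS[token] = fn
        return fn
    return register


def get_parser(token):
    """Look up the processing function for a token, mirroring get-parser."""
    return PARSERS[token]


def parse(item):
    """Strings are leaf content; lists are dispatched on their first element."""
    if isinstance(item, str):
        return item
    token, *body = item
    # Equivalent in spirit to (funcall (get-parser token) <body>).
    return get_parser(token)(body)


@define_parser("center")
def parse_center(body):
    # The real recognizer builds a document object; a dict stands in for it here.
    return {"type": "center", "children": [parse(child) for child in body]}


doc = parse(["center", "This", "is", "a", "sample", "document", "."])
print(doc["type"], len(doc["children"]))  # -> center 6
```

Supporting a new construct then amounts to registering one more function; the dispatcher itself does not change.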

2.3.2 Constructing the Quasi-Prefix Form

The recognizer processes the mathematical content to construct the quasi-prefix form described in Section 2.2. For example, given the input $a+b$, LISPIFY produces

(inline-math "a" "+" "b")

Converting a list as shown above to prefix form is a simple exercise and can be found in most programming language texts. Our implementation is based on the infix to prefix converter in the text on Common Lisp by Winston and Horn⁵ [HW89].

Function inf-to-pre performs the infix-to-prefix conversion. The input to this function is a list of math objects that have been processed using the classification given in Section 2.2. Each element of this list is a math object with content and attributes but no children. Note that the contents of the attributes are first converted to quasi-prefix form. For example, when recognizing $x_{k-1} + x_k + x_{k+1}$, the input is first converted to a list of five math objects containing the quasi-prefix representation for $x_{k-1}$, +, $x_k$, + and $x_{k+1}$ respectively. This is achieved by collecting the attributes that appear on each math object and processing their content recursively. Converting such a list to prefix form is now no different from processing a + b.
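
As an illustration of the conversion step only, here is a minimal precedence-climbing sketch in Python (not the Lisp of the implementation, and not the converter from [HW89]). It assumes that operands are already-processed math objects, represented here as strings, that all operators are binary and left associative, and that repeated operators nest pairwise; the levels are taken from Table 2.1, with "-" placed at the addition level as a further assumption.

```python
# Levels follow Table 2.1: relational 4, addition 9, multiplication 10.
PRECEDENCE = {"=": 4, "+": 9, "-": 9, "*": 10}


def inf_to_pre(tokens):
    """Convert a flat infix token list into nested prefix lists."""
    pos = 0

    def parse_expr(min_level):
        nonlocal pos
        left = tokens[pos]  # an operand: a math object with no children yet
        pos += 1
        # Absorb operators that bind at least as tightly as min_level.
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], -1) >= min_level:
            op = tokens[pos]
            pos += 1
            # Requiring a strictly higher level on the right gives left associativity.
            right = parse_expr(PRECEDENCE[op] + 1)
            left = [op, left, right]
        return left

    result = parse_expr(0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return result


print(inf_to_pre(["a", "+", "b"]))            # -> ['+', 'a', 'b']
print(inf_to_pre(["a", "+", "b", "*", "c"]))  # -> ['+', 'a', ['*', 'b', 'c']]
```

Because subscripts are folded into the operands beforehand, `inf_to_pre(["x_{k-1}", "+", "x_k", "+", "x_{k+1}"])` follows the same path as the a + b case.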

We now extend this algorithm to handle ambiguous mathematical notation. Conventional parsing techniques fail, since written mathematics does not adhere to a rigorous set of precedence rules. For example, the expression sin2nπ means sin(2nπ) rather than sin(2) ∗ nπ, even though function application is normally assigned the highest precedence. Moreover, sinacosb means sina ∗ cosb rather than sin(acosb). We have taken many such anomalies into account.

The precedence table (Table 2.1 on page 43) lists operators in ascending order of precedence. Only one operator is shown at each level.

Table 2.1: Precedence table for mathematical operators.

Level	Description	Examples
0	tex-infix-operator	a b
1	math-list-operator	a,b
2	conditional-operator	a : b
3	quantifier	∀a
4	relational-operator	a = b
5	arrow-operator	a→b
6	big-operator	∑_a b
7	logical-or	a ∨ b
8	logical-and	a ∧ b
9	addition	a + b
10	multiplication	a ∗ b
11	mathematical-function	sina
12	juxtaposition	ab
13	unary-minus	¬a

Functions define-precedence and remove-precedence allow the user to modify the precedence table. These, however, are not for use by a casual user of AsTeR, since changes to the precedence table without a clear understanding of the recognition algorithm can cause unexpected behavior.
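
One way to picture the table is as a mapping from operator class to level. The sketch below is in Python rather than the Lisp of the implementation. The class names and levels come from Table 2.1, but the argument lists of `define_precedence` and `remove_precedence` are assumptions, since only the function names are given here, and `binds_tighter` is a hypothetical helper.

```python
# Operator classes in ascending order of precedence, as listed in Table 2.1.
OPERATOR_CLASSES = [
    "tex-infix-operator", "math-list-operator", "conditional-operator",
    "quantifier", "relational-operator", "arrow-operator", "big-operator",
    "logical-or", "logical-and", "addition", "multiplication",
    "mathematical-function", "juxtaposition", "unary-minus",
]

# The list index doubles as the precedence level (0 is the lowest).
precedence = {name: level for level, name in enumerate(OPERATOR_CLASSES)}


def define_precedence(name, level):
    """Add or change the level of an operator class (assumed signature)."""
    precedence[name] = level


def remove_precedence(name):
    """Drop an operator class from the table (assumed signature)."""
    precedence.pop(name, None)


def binds_tighter(first, second):
    """True when `first` has strictly higher precedence than `second`."""
    return precedence[first] > precedence[second]


print(binds_tighter("juxtaposition", "mathematical-function"))  # -> True
print(binds_tighter("addition", "multiplication"))              # -> False
```

Since every comparison in the recognizer would read from this one table, a single careless call to `define_precedence` changes how all later expressions are grouped, which is why these functions are not meant for casual use.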

As pointed out earlier, precedence rules alone are not sufficient to handle written mathematics. We adapt the algorithm by using the following heuristics:

The big operators, e.g., ∫ and ∑, are treated as unary. Everything up to the next operator of lower precedence than the operator in question is considered part of the operand of the big operator. Thus, in the expression
$\sum_{\substack{1\le i\le p \\ 1\le j\le q \\ 1\le k\le r}} a_{ij} b_{jk} c_{ki} = 1$
everything up to the = sign is treated as the summand. This technique is particularly useful in recognizing expressions like $x + \sum_i a_i = 0$. By our heuristic, the summation is correctly recognized as the second argument to the + sign. Further, the summand is terminated by the = sign. The expression is now equivalent to recognizing a + c = 0, which can be handled by the standard algorithm.
The integral operator can have an optional closing delimiter, as in $\int_1^\infty f\,dx$. If the dx is present and recognizable, i.e., it has been marked up as \d{x} as opposed to dx, it is recognized as the closing delimiter, and the variable of integration⁶ is inferred. However, this closing delimiter may not always be available: it may be encoded ambiguously, as in $\int f dx$, or the integral itself may not require a closing dx, as in ∫ f. In the former case, our recognizer treats the juxtaposition fdx as the integrand. Though this may seem incorrect, it is in fact exactly what the typeset output means. In the latter case, the earlier rule (treating the operand of a big operator as everything up to the first operator of lower precedence) applies. Hence, we can correctly recognize x + ∫ f = 0.
The closing delimiter dx is treated as such only if it occurs at the top level. Thus, in $\frac{\dx}{x}$, the \dx does not end the integrand. This allows us to recognize such integrals correctly, but we cannot now infer the variable of integration. There seems to be no clean solution for this problem. Written mathematical notation relies on the fact that dx means 1 ⋅ dx and the integrand is therefore $\frac{1}{x}$.
Function application is treated as right associative. This results in sinacosb being interpreted correctly. Since juxtaposition has been assigned a higher precedence than function application, sinacosb continues to be recognized correctly. The following equation is a good example of such ambiguous notation (note the complete absence of parentheses): $2\sin 2n\pi \cos 2n\pi = \sin 4n\pi$
In written mathematics, delimiters do not always match. For example, (0,1] denotes a semi-open interval. There are also cases where there is no matching closing delimiter. The recognizer is aware of such anomalies and handles them correctly. When it sees an open delimiter, it scans forward to the end of the math expression for the first matching close delimiter of the same kind. If one is found, then all of the input up to this point is treated as the delimited expression. If no matching close delimiter of the same kind is found, then the first unmatched close delimiter delimits the input. Otherwise, the occurrence is treated as an unmatched delimiter.
The ! is one of the few postfix operators used in written mathematics. This is treated as a special case, and we confirm that the ! is indeed a factorial sign by making sure that it does not have any attributes. Thus, !_k is not a factorial symbol.
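
Two of the heuristics above, the extent of a big operator's operand and the search for a closing delimiter, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the recognizer's Lisp code: tokens are plain strings, the token names "sum" and "int" are hypothetical, the levels are taken from Table 2.1, and nested delimiters are deliberately not tracked.

```python
# Levels from Table 2.1; tokens missing from this map are treated as operands.
LEVEL = {",": 1, "=": 4, "sum": 6, "int": 6, "+": 9, "*": 10}

CLOSE_FOR = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(CLOSE_FOR.values())


def big_operator_operand(tokens, big_index):
    """Return the tokens that form the operand of the big operator at big_index."""
    level = LEVEL[tokens[big_index]]
    end = big_index + 1
    while end < len(tokens):
        token_level = LEVEL.get(tokens[end])
        # Stop at the first operator of lower precedence than the big operator.
        if token_level is not None and token_level < level:
            break
        end += 1
    return tokens[big_index + 1:end]


def find_close_delimiter(tokens, open_index):
    """Return the index of the token closing the delimiter at open_index, or None."""
    same_kind = CLOSE_FOR[tokens[open_index]]
    rest = range(open_index + 1, len(tokens))
    # Prefer the first close delimiter of the same kind as the opener.
    for i in rest:
        if tokens[i] == same_kind:
            return i
    # Otherwise accept the first close delimiter of any kind, as in (0,1].
    for i in rest:
        if tokens[i] in CLOSERS:
            return i
    # No close delimiter at all: the opener is unmatched.
    return None


print(big_operator_operand(["x", "+", "sum", "a_i", "=", "0"], 2))  # -> ['a_i']
print(find_close_delimiter(["(", "0", ",", "1", "]"], 0))           # -> 4
```

In the first call the summand stops at the = sign, so what remains has the shape a + c = 0. In the second, the absence of a ")" makes the "]" close the interval.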
