Based on our experience in recognizing structure from electronic markup, we formulate a few guidelines for encoding document content unambiguously. A document that adheres to these guidelines is easier to recognize. Our recognizer can still handle documents that do not conform, but it extracts more information from encodings that do. More generally, we believe that electronic encodings conforming to these guidelines are easier to maintain and enable multiple uses of the electronic source.
(La)TeX macros provide an excellent solution to the problem of capturing context-specific information in the document encoding. The same visual layout may be used to display disparate concepts. Encoding instances of such ambiguous notation with well-designed macros abstracts the layout details out of the document encoding and allows a recognizer to identify the different concepts correctly. We illustrate this with a concrete example.
The visual layout of stacking one mathematical object above another, separated by a horizontal line (horizontal rule), could be used in several contexts.
Using the encoding \frac{object-1}{object-2} in both cases makes it impossible to distinguish between the interpretations. When an author wishes to use the same layout to mean different things, the different occurrences should be marked up distinctly. For instance, in LaTeX, the author could extend the markup language by defining two new macros:
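As a minimal sketch of this idea, and not a reproduction of the original definitions, the following assumes that the two meanings are an ordinary fraction and an inference rule. The macro names \fraction and \inference are hypothetical and chosen only for illustration. Both expand to the same \frac layout, so the printed output is unchanged, but the source now records which concept each occurrence denotes.

```latex
\documentclass{article}

% Hypothetical macro names, assumed for illustration only.
% Both produce the same visual layout (one object stacked above
% another, separated by a horizontal rule) but name different concepts.
\newcommand{\fraction}[2]{\frac{#1}{#2}}   % quotient: numerator over denominator
\newcommand{\inference}[2]{\frac{#1}{#2}}  % rule: premises over conclusion

\begin{document}

% An arithmetic quotient.
\[ \fraction{a + b}{c} \]

% An inference rule (modus ponens), typeset with the identical layout.
\[ \inference{P \qquad P \Rightarrow Q}{Q} \]

\end{document}
```

A recognizer that sees \fraction or \inference in the source can then treat the two occurrences differently, even though a reader of the printed page sees the same layout in both places.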
Visual math notation is inherently ambiguous and derives most of its expressiveness from freely overloading standard visual-layout operators. (La)TeX gives an author complete flexibility in producing mathematical notation by providing the primitives needed to produce it. It would be too restrictive to insist that the complete semantics of a mathematical object appear explicitly in the markup, since this would make inventing new notation cumbersome, if not impossible. So, to an extent, we will never be able to attach semantic meaning to every object in the document. However, we insist that document encodings should not use identical markup to represent objects that have the same visual layout but different meanings. This allows a recognizer to process the objects in the document according to context and later permits context-specific renderings.
In addition, an electronic encoding should not use (La)TeX layout operators within the body of the document. This principle is in fact well accepted within the electronic-documents community, but it is not adhered to as often as one would like. Document encodings that violate this requirement will become fewer with the widespread use of editors that allow an author to encode document structure easily.
What we are saying is no different from the well-accepted programming standard that function names should reflect the computation they perform. A well-designed macro library is like a well-designed subroutine library. To cite a quotation from the TeXbook:
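One way to keep layout operators out of the document body, sketched here under assumed names, is to confine them to macro definitions in the preamble. The macro \keyterm below is hypothetical and is not a standard LaTeX command; the body uses it to say what the text is, and only the definition says how it looks.

```latex
\documentclass{article}

% Hypothetical structural macro, assumed for illustration only.
% The layout decision (bold face) lives here, not in the body.
\newcommand{\keyterm}[1]{\textbf{#1}}

\begin{document}

% Discouraged in the body: a bare layout operator such as
%   A {\bf recognizer} extracts structure from markup.
% Preferred: markup that names the role of the text.
A \keyterm{recognizer} extracts structure from markup.

\end{document}
```

Changing how key terms are rendered then requires editing one definition rather than every occurrence in the body.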
It is much easier to use macros than to define them. … The use of macro libraries, in fact, mirrors almost exactly the use of subroutine libraries for programming languages. There are the same levels of specialization, from publicly shared subroutines to special subroutines within a single program, and there is the same need for a programmer with particular skills to define the subroutines.
PETER BROWN, Macro Processors (1974)
To summarize, here are some guidelines for unambiguous document encoding:
Electronic encodings have not always followed these rules, since the markup was viewed purely as a means of producing the visual rendering. Our work shows that the same encoding can be put to multiple uses; it is therefore important to apply the principles of good software design to produce well-structured document encodings.