Annotate math expressions

From LuaTeXWiki
Revision as of 11:48, 12 June 2012 by Esteis (talk | contribs) (One very long line of code was breaking horizontal scroll.)

Now that the scripting language Lua has access to the TeX internals, it is quite easy to generate PDF annotations automatically. One interesting question I started to explore is whether Content MathML expressions can be generated from the low-level TeX math list node representation that LuaTeX exposes via the mlist_to_hlist callback.

In short, Content MathML can be generated from simple math formulas. However, my initial approach, which uses context-free grammar parsers (i.e. lpeg), is severely limited by the fact that the interpretation of LaTeX math expressions is rather context-sensitive.

A much simpler topic is how Lua(La)TeX could be used to automatically generate math expression bounding boxes in PDF documents, such that extraction programs can reliably identify text areas in the PDF document that pertain to math formulas.

The entire code snippets can be downloaded from:

We use two callbacks: mlist_to_hlist to insert a PDF annotation node, and pre_output_filter to determine the bounding box of each math formula.

-- doesn't yield anything interesting
function convertToMathML(head)
	return {tag="not implemented"}
end

-- create content MathML for every math formula
-- (registered with luatexbase.add_to_callback, whose third argument is a description)
luatexbase.add_to_callback("mlist_to_hlist",
function(head, display, penalty)
	texio.write_nl('NEW mathlist')

	local result = convertToMathML(head)
	if result ~= nil then
		local et = etree.ElementTree({tag = "math", result}, {decl = false})
		-- the annotation travels with the formula as a whatsit node
		local pdf = node.new("whatsit", "pdf_annot")
		local buffer = etree.StringBuffer()
		et:write(buffer)
		pdf.data = '/Subtype /MathML /Contents (' .. tostring(buffer) .. ')'
		head = node.insert_before(head, head, pdf)
	end
	return node.mlist_to_hlist(head, display, penalty)
end,
	"content MathML generator")

In the callback above, the function convertToMathML is only a stub and does not yield any interesting result:

-- element tree, (is a bit buggy for the {decl = false} option)
local etree = require "etree"

function convertToMathML(head)
	return {tag="not implemented"}
end
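The callback above places the serialized MathML inside a PDF literal string, i.e. between parentheses. In PDF literal strings a backslash or an unbalanced parenthesis has to be escaped, so content such as f(x) is safer when escaped first. The following is a minimal sketch in plain Lua; the helper names pdf_escape and make_annot_data are hypothetical and not part of the original code.

```lua
-- Hypothetical helper (not in the original code): escape the characters
-- that are special inside a PDF literal string: backslash, "(" and ")".
local function pdf_escape(s)
	-- the outer parentheses drop gsub's second return value (the match count)
	return (s:gsub("[\\()]", "\\%0"))
end

-- Hypothetical helper: build the annotation data string in the same
-- layout as the callback above, but with escaped contents.
local function make_annot_data(mathml)
	return '/Subtype /MathML /Contents (' .. pdf_escape(mathml) .. ')'
end

local data = make_annot_data('<math><ci>f(x)</ci></math>')
print(data)

-- The bounding-box pass recognises these annotations by their first 16 characters.
print(string.sub(data, 1, 16) == '/Subtype /MathML')
```

In the callback, the assignment to pdf.data could then call make_annot_data(tostring(buffer)) instead of concatenating the raw buffer.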

Having tagged the math formula, we still need a bounding box. Luckily, the pre-output phase can be intercepted with the pre_output_filter callback, which can be used to accomplish exactly that!

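local vpack_counter = 1

luatexbase.add_to_callback("pre_output_filter",
function(head)
	-- walk the node list and give every MathML annotation its size
	add_size_to_annot(head, nil)
	-- viz.nodelist_visualize(head, "vpack"..vpack_counter..".gv")
	vpack_counter = vpack_counter + 1
	return head
end
	,"find math bounding box")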

Of course, we need the add_size_to_annot function:

local whatsit   = node.id('whatsit')
local hlist     = node.id('hlist')
local vlist     = node.id('vlist')
local math_node = node.id('math')

-- hbox holds the dimensions of the innermost enclosing hlist (or nil)
local function add_size_to_annot(head, hbox)
	while head do
		local typ = head.id
		if typ == vlist then
			add_size_to_annot(head.head, hbox)
		elseif typ == hlist then
			add_size_to_annot(head.head, {width=head.width,height=head.height,depth=head.depth})
		-- subtype 15 is taken to be the pdf_annot whatsit, as in the original code
		elseif typ == whatsit and head.subtype == 15 and
			string.sub(head.data, 1, 16) == '/Subtype /MathML' then
			-- inline math: measure from the math-on node (subtype 0) to the math-off node (subtype 1)
			if head.prev ~= nil and head.prev.id == math_node and head.prev.subtype == 0 then
				local tail = head
				for test_node in node.traverse_id(math_node, head) do
					if test_node.subtype == 1 then
						tail = test_node
						break -- stop at the first math-off node
					end
				end
				local w, h, d = node.dimensions(head.prev, tail)
				hbox = {width=w,height=h,depth=d}
			end
			--[[ texio.write_nl(string.format("add height %gpt, width %gpt, depth %gpt",
                                 hbox.height / 2^16, 
                                 hbox.width / 2^16, 
                                 hbox.depth / 2^16))
			]]
			head.width  = hbox.width
			head.height = hbox.height
			head.depth  = hbox.depth
			-- texio.write_nl('found node ' .. node.type(head.id))
		end
		head = head.next
	end
end
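The recursive walk can be tried outside LuaTeX with plain tables standing in for nodes. The sketch below is only an illustration under stated assumptions: the string ids, the "annot" kind, and the helpers link, sp_to_pt, and annotate are made up for this example, and it covers only the case in which an annotation inherits its size from the enclosing hlist, not the measurement between the math-on and math-off nodes.

```lua
-- Mock nodes: plain tables with the fields the walker reads (id, next, head).
-- Real LuaTeX nodes are userdata with numeric ids; strings are used here for readability.
local function link(nodes)
	for i = 1, #nodes - 1 do
		nodes[i].next = nodes[i + 1]
	end
	return nodes[1]
end

-- TeX stores dimensions in scaled points: 65536 sp = 1pt.
local function sp_to_pt(sp)
	return sp / 2^16
end

-- Same recursion scheme as add_size_to_annot: vlists pass the current box
-- through, hlists replace it with their own dimensions.
local function annotate(head, box)
	while head do
		if head.id == "vlist" then
			annotate(head.head, box)
		elseif head.id == "hlist" then
			annotate(head.head, {width = head.width, height = head.height, depth = head.depth})
		elseif head.id == "annot" and box then
			head.width, head.height, head.depth = box.width, box.height, box.depth
		end
		head = head.next
	end
end

-- A page (vlist) containing one line (hlist) that holds an annotation and a glyph.
local annot = {id = "annot"}
local line = {id = "hlist", width = 10 * 2^16, height = 7 * 2^16, depth = 2 * 2^16,
              head = link({annot, {id = "glyph"}})}
local page = {id = "vlist", head = link({line})}

annotate(page, nil)
print(sp_to_pt(annot.width), sp_to_pt(annot.height), sp_to_pt(annot.depth))
```

Unlike the listing above, the mock walker skips an annotation when no enclosing box is known instead of indexing a nil value.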

To use the code snippets above in your LaTeX documents, save them in a file named mathml.lua and load that Lua code from the LaTeX document:

\pdfcompresslevel=0 % to make everything visible in the pdf


Interestingly, in the Preview application on Mac OS X one can hover over the formula areas and the corresponding text content pops up.

Alternatively, the Apache Java project PDFBox can be quickly extended to extract the previously tagged math areas, see here