Annotate math expressions
Latest revision as of 10:48, 12 June 2012
Now that the scripting language Lua has access to the TeX internals, it is quite easy to generate PDF annotations automatically. An interesting example I started to explore is whether Content MathML expressions can be generated from the low-level TeX math list node representation exposed via the LuaTeX callback mlist_to_hlist.
In short, it is possible to generate Content MathML from simple math formulas. However, my initial approach using context-free grammar parsers (i.e. lpeg) is severely limited by the fact that the interpretation of LaTeX math expressions is rather context-sensitive.
A much simpler topic is how Lua(La)TeX could be used to automatically generate math expression bounding boxes in PDF documents, such that extraction programs can reliably identify text areas in the PDF document that pertain to math formulas.
The complete code can be downloaded from: https://gist.github.com/2018232
We use two callbacks: mlist_to_hlist to insert a PDF annotation node, and pre_output_filter to determine the bounding box of a math formula.
-- doesn't yield anything interesting
function convertToMathML(head)
  return {tag="not implemented"}
end

-- create content MathML for every math formula
luatexbase.add_to_callback('mlist_to_hlist',
  function(head, display, penalty)
    texio.write_nl('NEW mathlist')
    local result = convertToMathML(head)
    if result ~= nil then
      -- serialize the result as XML and store it in a PDF annotation node
      local et = etree.ElementTree({tag = "math", result}, {decl = false})
      local pdf = node.new("whatsit", "pdf_annot")
      local buffer = etree.StringBuffer()
      et:write(buffer)
      pdf.data = '/Subtype /MathML /Contents (' .. tostring(buffer) .. ')'
      -- put the annotation in front of the math list
      head = node.insert_before(head, head, pdf)
    end
    return node.mlist_to_hlist(head, display, penalty)
  end, "content MathML generator")
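The callback above places the serialized MathML between parentheses, i.e. in a PDF literal string. In that syntax a backslash or an unbalanced parenthesis inside the content has to be escaped with a backslash. A minimal sketch in plain Lua, assuming a helper named pdf_escape_string (a hypothetical name, not part of the original gist):

```lua
-- Hypothetical helper: escape the characters that are special inside a
-- PDF literal string "( ... )": backslash and both parentheses.
local function pdf_escape_string(s)
  -- In a Lua pattern the backslash is an ordinary character; "%0" in the
  -- replacement stands for the matched character itself.
  -- The outer parentheses drop gsub's second return value (the count).
  return (s:gsub("[\\()]", "\\%0"))
end

print(pdf_escape_string("<ci>f(x)</ci>"))
```

With such a helper, the assignment could be written as pdf.data = '/Subtype /MathML /Contents (' .. pdf_escape_string(tostring(buffer)) .. ')'. Escaping balanced parentheses is not strictly required, but escaping all of them is simpler and always valid.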
In the callback above the function convertToMathML does not yield any interesting result:
-- element tree, http://etree.luaforge.net/ (is a bit buggy for the {decl = false} option)
-- bound to the name "etree", which is the name the callback uses
local etree = require "etree"

function convertToMathML(head)
  return {tag="not implemented"}
end
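To give an idea of what a real converter would have to produce for a simple formula, here is a sketch in plain Lua that serializes a small expression tree to Content MathML. It is not the author's converter: the tree layout ({op = ...}, {ci = ...}, {cn = ...}) and the name to_content_mathml are assumptions made for illustration, and the hard part, deriving such a tree from the TeX math list, is left out entirely.

```lua
-- Hypothetical expression tree:
--   {ci = "a"}            an identifier
--   {cn = 2}              a number
--   {op = "plus", x, y}   an operator applied to the array entries
local function to_content_mathml(expr)
  if expr.ci then
    return "<ci>" .. expr.ci .. "</ci>"
  elseif expr.cn then
    return "<cn>" .. tostring(expr.cn) .. "</cn>"
  end
  -- operator application: <apply><op/> arg1 arg2 ... </apply>
  local parts = { "<apply><" .. expr.op .. "/>" }
  for _, arg in ipairs(expr) do
    parts[#parts + 1] = to_content_mathml(arg)
  end
  parts[#parts + 1] = "</apply>"
  return table.concat(parts)
end

-- a^2 + 1
local tree = { op = "plus", { op = "power", { ci = "a" }, { cn = 2 } }, { cn = 1 } }
print(to_content_mathml(tree))
```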
Now that the math formula is tagged, we still need a bounding box. Luckily, the pre-output phase can be intercepted with the pre_output_filter callback, which can be used to accomplish exactly that!
local vpack_counter = 1
luatexbase.add_to_callback('pre_output_filter',
  function(head)
    add_size_to_annot(head, {width=0, height=0, depth=0})
    -- viz.nodelist_visualize(head, "vpack"..vpack_counter..".gv")
    vpack_counter = vpack_counter + 1
    return head
  end, "find math bounding box")
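Of course, we need the add_size_to_annot function. Because it is declared local, it has to be defined in mathml.lua before the pre_output_filter callback that calls it: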
local whatsit = node.id('whatsit')
local hlist = node.id('hlist')
local vlist = node.id('vlist')
local math_node = node.id('math')

local function add_size_to_annot(head, hbox)
  while head do
    local typ = head.id
    if typ == vlist then
      add_size_to_annot(head.head, hbox)
    elseif typ == hlist then
      -- default box for annotations inside this hlist: the hlist itself
      add_size_to_annot(head.head, {width=head.width,height=head.height,depth=head.depth})
    -- subtype 15: the pdf_annot whatsit created in mlist_to_hlist
    -- (the number may differ in other LuaTeX versions)
    elseif typ == whatsit and head.subtype == 15
        and string.sub(head.data, 1, 16) == '/Subtype /MathML' then
      -- annotation directly after a math-on node (subtype 0):
      -- measure up to the matching math-off node (subtype 1)
      if head.prev ~= nil and head.prev.id == math_node and head.prev.subtype == 0 then
        local tail = head
        for test_node in node.traverse_id(math_node, head.next) do
          if test_node.subtype == 1 then
            tail = test_node
            break
          end
        end
        local w, h, d = node.dimensions(head.prev, tail)
        hbox = {width=w,height=h,depth=d}
      end
      --[[ texio.write_nl(string.format("add height %gpt, width %gpt, depth %gpt",
                                        hbox.height / 2^16,
                                        hbox.width / 2^16,
                                        hbox.depth / 2^16))
      --]]
      head.width = hbox.width
      head.height = hbox.height
      head.depth = hbox.depth
    else
      -- texio.write_nl('found node ' .. node.type(head.id))
    end
    head = head.next
  end
end
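The dimensions stored in the nodes are in TeX scaled points, and the commented-out debug line divides by 2^16 because one TeX point equals 65536 scaled points. The conversion can be tried in plain Lua, outside LuaTeX; the helper names sp_to_pt and format_box and the sample values are made up for illustration:

```lua
local SP_PER_PT = 2^16  -- 65536 scaled points (sp) per TeX point (pt)

-- Hypothetical helper: convert a dimension from scaled points to points.
local function sp_to_pt(sp)
  return sp / SP_PER_PT
end

-- Hypothetical helper: format a {width, height, depth} table in points.
local function format_box(box)
  return string.format("width %gpt, height %gpt, depth %gpt",
                       sp_to_pt(box.width),
                       sp_to_pt(box.height),
                       sp_to_pt(box.depth))
end

-- sample values in sp, not taken from a real document
print(format_box({ width = 655360, height = 458752, depth = 131072 }))
```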
To use the code snippets above in your LaTeX documents, save them in a file named mathml.lua and load it from the preamble of the document:
\pdfcompresslevel=0 % to make everything visible in the pdf
\documentclass{article}
\usepackage{amssymb}
\usepackage{luacode}
\directlua{dofile("mathml.lua")}
Interestingly, in Preview on Mac OS X one can hover over the formula areas and the corresponding text content pops up.
Alternatively, the Apache Java project PDFBox (http://pdfbox.apache.org/index.html) can be extended quickly to extract the previously tagged math areas; see https://gist.github.com/2018466