Writing Lua in TeX

From LuaTeXWiki
Revision as of 20:04, 27 December 2010 by Patrick

Embedding Lua code in a TeX document

Although it is simpler to put Lua code in Lua files, from time to time one may want or need to switch to Lua in the middle of a document. To this end, LuaTeX has two commands: \directlua and \latelua. They work the same way, except that \latelua is processed when the page on which it appears is shipped out, whereas \directlua is processed at once; the distinction is immaterial here, and what is said of \directlua also applies to \latelua.
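
As a rough illustration of the timing difference, the following sketch (plain LuaTeX assumed; the messages are arbitrary) writes two lines to the log. The first should appear as soon as the \directlua call is read, the second only when the page holding the \latelua call is shipped out:

```tex
% Executed immediately, while the file is being read.
\directlua{texio.write_nl("directlua: run at once")}
% Stored in the page and executed only at shipout.
\latelua{texio.write_nl("latelua: run at shipout")}
Some text, so that a page is eventually shipped out.
\bye
```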

\directlua can be called in three ways:

\directlua {<lua code>}
\directlua name {<name>} {<lua code>}
\directlua <number> {<lua code>}

Those three ways are equivalent when it comes to processing <lua code>, but in the second case processing occurs in a chunk named <name>, and in the third it occurs in a chunk whose name is entry <number> of the table lua.name. The difference shows only when an error occurs, in which case the name of the chunk, if any, is reported.
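
A minimal sketch of the two named forms follows. The names mychunk and setup are arbitrary, and both calls raise a Lua error on purpose, so that the chunk name can be looked for in the error report:

```tex
% Register a name for chunk number 1 (the name "setup" is arbitrary).
\directlua{lua.name[1] = "setup"}

% Both calls fail deliberately; only the reported chunk name should differ.
\directlua name {mychunk} {error("raised in a named chunk")}
\directlua 1 {error("raised in chunk number 1")}
```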

Each call to \directlua, named or not, is processed in a separate chunk. That means that any local variable is defined for this call only and is lost afterward. Hence:

\directlua{
  one = 1
  local two = 2
}
\directlua{
  texio.write_nl(type(one))
  texio.write_nl(type(two))
}

will report number and nil (texio.write_nl writes to the log file and, unless TeX runs in batch mode, to the terminal). On the other hand, Lua code is completely insensitive to TeX's grouping mechanism. In other words, calling \directlua between \bgroup and \egroup doesn't affect the code to be processed: a global Lua variable set inside the group is still set after \egroup.
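
The contrast with TeX's own assignments can be sketched as follows; the variable name inside is arbitrary, and \count1 is used only as a convenient scratch register. The log line should show that the Lua variable survives the group whereas the register is restored:

```tex
\count1=0
\bgroup
  \count1=5
  \directlua{inside = "set inside a group"}
\egroup
% \egroup undid the TeX assignment but not the Lua one.
\directlua{texio.write_nl(inside .. ", count1 = " .. tex.count[1])}
```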

TeX catcodes in Lua

By default, the code passed to \directlua is treated as normal TeX input and only then sent to the Lua interpreter. This may lead to unwanted results and must be kept in mind.

Expansion

As with the argument of \special, the code in \directlua (and the <name>, if specified) is fully expanded. This means that macros can safely be passed to \directlua if one wants their values, but also that they must be properly escaped when their values are not wanted. For instance:

\def\macro{1}
\directlua{
  myvar = \macro
}

defines myvar as the number 1. To store the control sequence \macro instead, another course of action is needed: see the section on backslash below.

Line ends

When TeX reads a file, it normally turns line ends into spaces. That means that what looks like several lines is actually fed to the Lua interpreter as one big line. For instance:

\directlua{
  myvar = 1
  anothervar = 2
  onelastvar = 3
}

amounts to the following, if it were written in a separate Lua file:

myvar = 1 anothervar = 2 onelastvar = 3

That is perfectly legitimate, but strange things might happen. First, TeX skips spaces after a control word as usual, and that includes the space that stands for a line end. Hence:

\def\macro{1}
\directlua{
  myvar = \macro
  anothervar = 2
}

will be fed to the interpreter as

myvar = 1anothervar = 2

which is not legitimate at all. Second, the Lua comment -- will affect everything to the end of the \directlua call. That is:

\directlua{
  myvar = 1
--  anothervar = 2
  onelastvar = 3
}

will be processed as

myvar = 1 -- anothervar = 2 onelastvar = 3

which works but only defines myvar. Third, when reporting an error, the Lua interpreter always gives the line number as 1, since it processes one big line only; that isn't very useful when the code is large.

The solution is to set \endlinechar=10 or \catcode`\^^M=12 before the code is read. In both cases, line ends are preserved and the code is processed as it was input.
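
Here is a minimal sketch of the first setting, assuming the character ^^J has catcode 12 as in plain TeX. The setting is confined to a group, and the three % signs keep stray line ends from reaching Lua or the page:

```tex
\begingroup
\endlinechar=10 %
\directlua{%
  myvar = 1
  -- anothervar = 2
  onelastvar = 3
  texio.write_nl(tostring(onelastvar))
}%
\endgroup%
```

Since each line now ends with a real line end, the Lua comment stops at the end of its own line, and the call should log 3.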

Special characters

In TeX, some characters have a special behavior. That must be taken into account when writing Lua code: one must change their catcodes beforehand if one wants to handle them as Lua would, as has just been done for line ends. That means that \directlua, as such, is clearly insufficient to write any extended chunk of code. It is thus better to devise a special macro that sets the catcodes to the appropriate values, reads the Lua code, feeds it to \directlua, and restores the catcodes. The following code does the job:

\def\luacode{%
  \bgroup
  \catcode`\{=12
  \catcode`\}=12
  \catcode`\^^M=12
  \catcode`\#=12
  \catcode`\~=12
  \catcode`\%=12
  \doluacode
}

\bgroup
\catcode`\^^M=12 %
\long\gdef\doluacode#1^^M#2\endluacode{\directlua{#2}\egroup}%
\egroup

Note that not all special characters are set to normal (catcode 12) characters; the reason is given for each of them below. Note also that \doluacode, called internally by \luacode, is defined to get rid of anything up to the line end, and then to pass anything up to \endluacode to \directlua. Discarding what follows \luacode is important, otherwise code as simple as

\luacode
myvar = 1
\endluacode

would actually create two lines, the first being empty; that is annoying because errors are then reported with the wrong line number (i.e. any error in this one-line code would be reported as occurring on line 2).

However, the rest of the line after \luacode could also be processed, instead of discarded, to manage special effects (e.g. specifying a chunk's name, storing the code in a control sequence, or even setting which catcodes should be changed or not).
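
As a sketch of the first of those effects, the hypothetical variant below (the names \namedluacode, \donamedluacode and \endnamedluacode are invented for the illustration) uses the rest of the first line as the chunk's name instead of discarding it:

```tex
% Hypothetical variant of \luacode: same catcode changes, different worker macro.
\def\namedluacode{%
  \bgroup
  \catcode`\{=12
  \catcode`\}=12
  \catcode`\^^M=12
  \catcode`\#=12
  \catcode`\~=12
  \catcode`\%=12
  \donamedluacode
}

\bgroup
\catcode`\^^M=12 %
% #1 is the rest of the first line (the chunk name), #2 is the Lua code.
\long\gdef\donamedluacode#1^^M#2\endnamedluacode{\directlua name {#1} {#2}\egroup}%
\egroup

% The error is deliberate: it should be reported for the chunk mychunk.
\namedluacode mychunk
error("raised in mychunk")
\endnamedluacode
```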

Backslash

The backslash in TeX is used to form control sequences. In the definition of \luacode above, it isn't changed and thus behaves as usual. This allows commands to be passed to the Lua code and expanded. However, a backslash in Lua is also an escape character in strings. Hence, if one wants to store the name of a macro in Lua code, the following won't work:

\luacode
myvar = "\noexpand\macro"
\endluacode

because to the Lua interpreter the string is made of \m followed by acro; since \m is not a valid escape sequence in Lua, Lua 5.1 reads the string as macro (Lua 5.2 and later raise an error instead), and in other circumstances strange things might happen: for instance, \n is a newline. The proper way to pass a macro verbatim is:

\luacode
myvar = "\noexpand\\macro"
\endluacode

which Lua will correctly read as

myvar = "\\macro"

with the backslash escaped to represent itself. Another solution is:

myvar = [[\noexpand\macro]]

because the double brackets signal a string in Lua where no escape sequence occurs (and the string can also run over several lines). Note however that in the second case myvar will be defined with a trailing space, i.e. as "\macro ", because of TeX's habit of appending a space to unexpanded (or unexpandable) control sequences.
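
One possible way around the trailing space, offered here as a suggestion rather than as part of the original recipe, is to use \string instead of \noexpand: \string turns the control sequence into plain character tokens, to which no space is appended. The sketch assumes the default \escapechar (a backslash):

```tex
\directlua{
  myvar = [[\string\macro]]
  texio.write_nl("[" .. myvar .. "]")
}
```

The brackets in the log message are there only to make a trailing space visible, if there is one.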

Braces

One may want to define a string in Lua which contains unbalanced braces, i.e.:

\luacode
myvar = "{"
\endluacode

If the braces' catcodes hadn't been changed beforehand, that would be impossible. Note, however, that this means that one can't feed arguments to commands in the usual way. I.e. the following will produce nothing good:

\luacode
myvar = "\dosomething{\macro}"
\endluacode

\dosomething will be expanded with the left brace (devoid of its usual delimiter-ness) as its argument, and the rest of the line might produce chaos. Thus, one may also choose not to change the catcodes of braces, depending on how \luacode is most likely to be used. Note that strings with unbalanced braces can still be defined, even if braces have their usual catcodes, thanks to the following trick:

\luacode
myvar = "{" -- }
\endluacode

When the code is passed to \directlua, braces are balanced because the Lua comment means nothing to TeX; when passed to the Lua interpreter, on the other hand, the right brace is ignored.

Hash and comment

The hash sign # in Lua is the length operator: prefixed to a string or table variable, it returns its length. If its catcode weren't taken care of, LuaTeX would pass to \directlua a double hash for each hash, i.e. each # would be turned into ##. That is normal TeX behavior, but unwanted here.

As for the comment sign %, it is useful in Lua when manipulating strings. If its catcode weren't changed, it would discard parts of the code when TeX reads it, and a mutilated version of the input would be passed to the Lua interpreter. In turn, discarding a line by commenting it out in \luacode must be done with the Lua comment --.
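
With \luacode defined as above, both characters can be used as in a Lua file. A small sketch (the table t is arbitrary):

```tex
\luacode
local t = {"a", "b", "c"}
-- # is the length operator and %d a format directive; TeX touches neither.
texio.write_nl(string.format("%d entries", #t))
\endluacode
```

This should log 3 entries.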

Active characters

The ~ character is generally active and used as a no-break space in TeX. If it were passed as is to \directlua, it would expand to uninterpretable control sequences, whereas in Lua it is used to form the inequality operator ~=.

Other possible active characters should be taken care of, but which characters are active is unpredictable; punctuation marks might be so to accommodate special spacing, as with LaTeX's babel package, but such tricks are unlikely to survive in LuaTeX (cleaner methods exist that add a space before punctuation marks when necessary).
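
With the catcode of ~ set to 12 by \luacode as defined above, the operator reaches Lua untouched. A minimal sketch:

```tex
\luacode
local n = 3
if n ~= 2 then
  texio.write_nl("n differs from 2")
end
\endluacode
```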

Other characters

When processing verbatim text in TeX, one generally also changes the catcodes of $, &, ^, _ and the space character, because they too are special. When passed to the Lua interpreter, though, their usual catcodes won't do any harm, which is why they are left unmodified here.

\luaescapestring

Although it can't do all of what has been explained, the \luaescapestring command might be useful in some cases: it expands its argument (which must be enclosed in real braces) fully, then modifies it so that dangerous characters are escaped: backslashes, hashes, quotes and line ends. For instance:

\def\macro{"\noexpand\foo"}
\luacode
myvar = "\luaescapestring{\macro}"
\endluacode

will be passed to Lua as

myvar = "\"\\foo \""

so that myvar is defined as "\foo ", with the quotes as parts of it. Note that the trailing space after \foo still happens.

From Lua to TeX

Inside Lua code, one can pass strings to be processed by TeX with the functions tex.print(), tex.sprint() and tex.tprint(). All such calls are processed at the end of a \directlua call, even though they might happen in the middle of the code. This behavior is worth noting because it might be surprising in some cases, although it is generally harmless.
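
The deferred processing can be observed with a sketch like the following, where \count1 serves as a scratch register and \string keeps the control sequences from being executed before Lua sees them:

```tex
\count1=0
\directlua{
  tex.print([[\string\count1=7\string\relax]])
  texio.write_nl("inside: " .. tex.count[1])
}
\message{after: \the\count1}
```

Although tex.print() is called first, the assignment it hands to TeX is carried out only once the Lua code has finished, so the first message should report 0 and the second 7.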

tex.print()

This function receives as its argument(s) either one or more strings or an array of strings. Each string is processed as an input line: an end-of-line character is appended (except to the last string), and TeX is in state newline when processing it (i.e. leading spaces are skipped). Hence the two equivalent calls:

tex.print("a", "b")
tex.print({"a", "b"})

are both interpreted by TeX as the following two lines would be:

a
b

Thus `a b' is produced, since line ends normally produce a space.

The function can also take an optional number as its first argument; it is interpreted as referring to a catcode table (as defined by \initcatcodetable and \savecatcodetable), and each line is processed by TeX with that catcode regime. For instance (note that with such a minimal catcode table, braces don't even have their usual values):

\bgroup
\initcatcodetable1
\catcode`\_=0
\savecatcodetable1
\egroup

\directlua{tex.print(1, "_TeX")}

The string will be read with _ as an escape character, and thus interpreted as the command commonly known as \TeX. The catcode regime holds only for the strings passed to tex.print() and the rest of the document isn't affected.

If the optional number is -1, or points to an invalid (i.e. undefined) catcode table, then the strings are processed with the current catcodes, as if there were no optional argument. If it is -2, then the strings are read as if they were the result of \detokenize: all characters have catcode 12 (i.e. `other', characters that have no function besides representing themselves), except space, which has catcode 10 (as usual).
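
The effect of -2 can be sketched by printing the same string twice, once with the current catcodes (plain TeX's are assumed) and once verbatim:

```tex
% Read with the current catcodes: typeset as a math formula.
\directlua{tex.print("$a^2$")}

% Read with catcode regime -2: the same characters are typeset as they are.
\directlua{tex.print(-2, "$a^2$")}
```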

tex.sprint()

Like tex.print(), this function can receive either one or more strings or an array of strings, with an optional number as its first argument pointing to a catcode table. Unlike tex.print(), however, each string is processed as if TeX were in the middle of a line and not at the beginning of a new one: spaces aren't skipped, no end-of-line character is added and trailing spaces aren't ignored. Thus:

tex.sprint("a", "b")

is interpreted by TeX as

ab

without any space in between.

tex.tprint()

This function takes any number of tables as its arguments; each table must be an array of strings, with the first entry optionally being a number pointing to a catcode table. Each table is then processed as if passed to tex.sprint(). Thus:

tex.tprint({1, "a", "b"}, {"c", "d"})

is equivalent to

tex.sprint(1, "a", "b")
tex.sprint("c", "d")

The expansion of \directlua

A call to \directlua is fully expandable; i.e. it can be used in contexts where full expansion is required, as in:

\csname\directlua{tex.print("TeX")}\endcsname

which is a somewhat convoluted way of saying \TeX. Besides, since Lua code is processed at once, things that were previously unthinkable can now be done easily. For instance, it is impossible to perform an assignment in an \edef by TeX's traditional means. I.e. the following:

\edef\macro{\count1=5}

defines \macro as \count1=5 but doesn't perform the assignment (the \edef does nothing more than a simple \def). After the definition, the value of \count1 hasn't changed. The same is not true, though, if such an assignment is made with Lua code. The following:

\edef\macro{\directlua{tex.count[1] = 5}}

defines \macro as empty (since nothing remains after \directlua has been processed) and sets count 1 to 5. Since such behavior is totally unexpected in normal TeX, one should be wary when using \directlua in such contexts.
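
Both effects can be checked with a short sketch, in which \count1 is reset first so that the change is visible:

```tex
\count1=0
\edef\macro{\directlua{tex.count[1] = 5}}
% \macro is empty, yet \count1 was changed while the \edef was being read.
\message{\meaning\macro, count1 = \the\count1}
```

The message should read macro:->, count1 = 5.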