Process input buffer

From LuaTeXWiki


This callback is called whenever TeX needs a new input line from the file it is currently reading. The argument passed to the function registered there is a string representing the original input line; the function should return another string, to be processed by TeX, or nil; in the latter case, the original string is processed. So the function should be defined according to the following blueprint:

function (<string> original_line)
  return <string> adjusted_line or nil
end

In case TeX isn't reading the main document but a file that is \input in it, the callback is called after the reader function possibly defined in the open_read_file callback.
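As an illustration of the blueprint, here is a minimal sketch of a filter that either returns a new string or returns nil to leave the line alone. The function name expand_tabs and the choice of two spaces per tab are assumptions made for the example, not part of LuaTeX; the guard around callback.register() is only there so that the code can also be tried in a stand-alone Lua interpreter.

```lua
-- Hypothetical filter: replace tab characters, leave other lines untouched.
local function expand_tabs (line)
  if not line:find("\t", 1, true) then
    return nil                      -- nil: TeX processes the original line
  end
  -- The parentheses keep only gsub's first result (the new string).
  return (line:gsub("\t", "  "))
end

-- The callback table exists only inside LuaTeX.
if callback then
  callback.register("process_input_buffer", expand_tabs)
end
```

Returning nil for untouched lines is optional; returning the unchanged string gives the same result.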

The following examples aren't mutually exclusive; however, if one wanted to register them all in the same callback, special care would be needed, as explained in the main page on callbacks. Here each call to callback.register() removes whatever there might have been in the callback beforehand.
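One possible way to combine several line filters in the single slot offered by the callback is to register a small dispatcher that runs them in turn. This is only a sketch: filters, add_filter and run_filters are hypothetical names, and the main page on callbacks may recommend a different approach.

```lua
local filters = {}            -- hypothetical list of filters, run in order

local function add_filter (f)
  filters[#filters + 1] = f
end

local function run_filters (line)
  for _, f in ipairs(filters) do
    -- A filter returning nil leaves the line as it was.
    line = f(line) or line
  end
  return line
end

-- The callback table exists only inside LuaTeX.
if callback then
  callback.register("process_input_buffer", run_filters)
end
```

Each filter receives the output of the previous one, so the order in which add_filter() is called matters.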



LuaTeX understands UTF-8 and UTF-8 only (ASCII is a subset of it, however). A line containing invalid characters will produce an error message: String contains an invalid utf-8 sequence. Thus documents in a different encoding must be converted, and this can be done in process_input_buffer.

Note that such a conversion might not produce anything meaningful when the document is compiled, because the fonts used must have the proper glyphs in the right place; for instance, the Computer Modern fonts don't have characters with diacritics, and Latin Modern should be used instead. An alternative solution (quite archaic with LuaTeX) is to define the special characters as active, and to make them create a diacritic. For instance:

\catcode`\é = 13
\def é{\'e}


Reading a document written in Latin-1 (ISO-8859-1) is relatively straightforward, because the slnunicode library embedded in LuaTeX does the conversion at once.

local utf8_char, byte, gsub = unicode.utf8.char, string.byte, string.gsub

local function convert_char (char)
  return utf8_char(byte(char))
end

local function convert_line (line)
  return gsub(line, ".", convert_char)
end

callback.register("process_input_buffer", convert_line)

The code reads as follows (bottom up): the convert_line function is registered in the process_input_buffer callback; it is passed the original input line as its sole argument, and returns another string, namely that line processed with the string.gsub() function. The latter replaces in a string (the first argument) all the occurrences of a pattern (the second argument) with the outcome of its third argument (for further information on string.gsub() see the Lua reference manual).

Here the third argument is a function, convert_char(), which receives whatever matches the pattern. The pattern is a single dot, a magic character that matches any character, so the function is called once per character (that is, once per byte, since the line is still raw Latin-1). In convert_char(), the character is first converted to a numerical code (its code point in Latin-1, see the description of string.byte()); the code is passed to unicode.utf8.char which, given a number, returns the associated character in UTF-8, the number being interpreted as a code point in Unicode; since the characters in Latin-1 have the same code points as in Unicode, the conversion is automatic here.

The use of local variables ensures speed, and above all that those variables aren't defined outside the current chunk, for instance, the current \directlua call or the current Lua file; actually, the code could even be embedded between do and end and leave absolutely no trace whatsoever.
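To make the byte-level effect of the conversion visible, the following sketch replaces unicode.utf8.char with a hand-written encoder that is valid only for code points 0 to 255, so that it runs in any Lua interpreter. The name latin1_to_utf8 is an assumption for the example; inside LuaTeX the library function shown above would normally be used instead.

```lua
local byte, char, gsub, floor = string.byte, string.char, string.gsub, math.floor

-- Stand-in for unicode.utf8.char, restricted to code points 0..255.
local function convert_char (c)
  local n = byte(c)
  if n < 128 then
    return c                                      -- ASCII stays one byte
  end
  -- Code points 128..255 become a two-byte UTF-8 sequence.
  return char(192 + floor(n / 64), 128 + n % 64)
end

local function latin1_to_utf8 (line)
  -- The parentheses keep only gsub's first result (the new string).
  return (gsub(line, ".", convert_char))
end

-- "café" in Latin-1: é is the single byte 233.
print(byte(latin1_to_utf8("caf\233"), 1, -1))     -- 99 97 102 195 169
```

The single byte 233 becomes the two bytes 195 and 169, which is the UTF-8 encoding of U+00E9.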

Other 8-bit encodings

With other 8-bit encodings the previous code won't work, because it relies on the Latin-1 code points coinciding with those of Unicode. One must then convert the characters one by one, by setting up a table that maps each input byte to its Unicode code point; that value can be passed to unicode.utf8.char to yield the desired character.

For instance, here's the code needed to process a document encoded in Latin/Greek (ISO-8859-7):

local utf8_char, byte, gsub = unicode.utf8.char, string.byte, string.gsub

local LatinGreek_table = {
  [0] = 0x0000, 0x0001, 0x0002, 0x0003, 0x0004,
  -- ... 240+ other values...
  0x03C9, 0x03CA, 0x03CB, 0x03CC, 0x03CD, 0x03CE
}

local function convert_char (char)
  return utf8_char(LatinGreek_table[byte(char)])
end

local function convert_line (line)
  return gsub(line, ".", convert_char)
end

callback.register("process_input_buffer", convert_line)

The convert_line() function and call to callback.register() are the same as above. What has changed is convert_char(); the numerical value of the original character is now used as an index in the LatinGreek_table, and the value returned is the corresponding Unicode code point; that number is passed to unicode.utf8.char to produce the character in UTF-8.

The values in the LatinGreek_table are assigned to the right indices because they are declared in a row (here in hexadecimal form by prefixing them with 0x). The only index that needs to be specified is 0, because indexing of tables in Lua begins at 1 by default. The table length is 254 plus the value at index 0 (not 255, because there is no such character in Latin/Greek). Each index is a code point in Latin/Greek, and the value is the code point of the same character in Unicode. For instance, lowercase omega in Latin/Greek is 249, at which index one finds 0x03C9, lowercase omega in Unicode.
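Since the table has no entry for byte 255, a lookup for that byte yields nil. A defensive variant might substitute a fallback code point in that case. The sketch below is an assumption-laden illustration: the table is deliberately partial (only the ASCII range and the six entries quoted above), code_point is a hypothetical helper, and U+FFFD (the Unicode replacement character) is one possible fallback among others.

```lua
local byte = string.byte
-- slnunicode inside LuaTeX, the utf8 library of Lua 5.3+ elsewhere.
local utf8_char = (unicode and unicode.utf8 and unicode.utf8.char)
                  or (utf8 and utf8.char)

-- Partial table for illustration only; a real one needs every entry.
local LatinGreek_table = {}
for i = 0, 127 do
  LatinGreek_table[i] = i                     -- ASCII range, same as Unicode
end
for i = 249, 254 do
  LatinGreek_table[i] = 0x03C9 + (i - 249)    -- 0x03C9 to 0x03CE, as above
end

-- Hypothetical helper: Unicode code point for one input byte.
local function code_point (char)
  return LatinGreek_table[byte(char)] or 0xFFFD   -- assumed fallback
end

local function convert_char (char)
  return utf8_char(code_point(char))
end
```

With this variant a stray byte shows up as a visible replacement character in the output, provided the font has that glyph.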

TeX as a lightweight markup language

The process_input_buffer callback can be put to an entirely different use, namely to preprocess input strings using some kind of lightweight markup and turn them into proper TeX.

Let's suppose one has two control sequences, \italic and \bold, taking a single argument; let's suppose furthermore that one wants to write a sentence like

This is in /italic/ and this is in *bold*.

to be processed by TeX as

This is in \italic{italic} and this is in \bold{bold}.

so as to produce: `This is in italic and this is in bold.' The following code does exactly that (the use of the percent sign and of \\ without \noexpand means that this code is either in a Lua file or in the second version of \luacode as defined in Writing Lua in TeX; also, this code illustrates the registering of an anonymous function instead of a function variable as in the previous examples):

local gsub = string.gsub

callback.register("process_input_buffer",
  function (str)
    str = gsub(str, "/(.-)/", "\\italic{%1}")
    str = gsub(str, "%*(.-)%*", "\\bold{%1}")
    return str
  end)

What happens is that the original string is transformed by successive calls to string.gsub (see the Lua reference manual), in which the captures in the patterns are replaced with themselves as arguments to the TeX control sequence (the non-capture parts of the patterns, here the delimiters, are discarded). For instance, /a word/ yields \italic{a word}. Note that with \bold, the asterisks in the pattern must be escaped with %, otherwise they would be interpreted as magic characters. The line can then be processed by TeX as usual.

Only pairs of slashes or asterisks in the same line will be interpreted as markup, because lines are processed one by one and nothing is remembered from one line to the next (that can be implemented, but is a bit more complicated and dangerous). Hence, nothing will be in italics in the following example:

This will /not
be/ in italic.

Instead, the slashes are passed to TeX unchanged and typeset as such.

One can add any number of replacements to the input line. For instance, here's a way to mark sections by rows of #'s (the longest row is tested first, because the pattern for a single # would otherwise also match lines beginning with ## or ###):

  str = gsub(str, "^%s*###%s*(.*)", "\\subsubsection{%1}")
  str = gsub(str, "^%s*##%s*(.*)", "\\subsection{%1}")
  str = gsub(str, "^%s*#%s*(.*)", "\\section{%1}")

What is recognized as a section header is any line beginning with one, two or three hashes, ignoring space before and after. So one can write:

# A section
  ## A subsection
    ### A subsubsection
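An alternative sketch handles all heading levels with a single pattern by capturing the row of hashes and counting it. The names levels and mark_heading are assumptions for the example; the function would be called from whatever is registered in process_input_buffer.

```lua
local gsub = string.gsub

-- Control sequence for each heading depth (number of hashes).
local levels = { "\\section", "\\subsection", "\\subsubsection" }

local function mark_heading (str)
  -- The outer parentheses keep only gsub's first result (the new string).
  return (gsub(str, "^%s*(#+)%s*(.*)", function (hashes, title)
    local cs = levels[#hashes]
    if cs then
      return cs .. "{" .. title .. "}"
    end
    -- Returning nothing makes gsub keep the original text,
    -- which happens here for four or more hashes.
  end))
end
```

Because the number of hashes is looked up in a table, the order of the replacements no longer matters, and adding a level only means adding a table entry.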

More complicated things are possible. However, using the open_read_file callback (only with an \input file) is preferable if subtlety is required, because it can read an entire file (or any set of lines) before changing anything. Also, the powerful LPeg library (embedded in LuaTeX) can be used instead of Lua's built-in string patterns.

Finally, one must note that such manipulations can be dangerous or lead to unwanted results. For instance, a line such as

... the file should be contained in user/directory/myfiles...

will be processed without slashes and with directory in italics. Thus, such markup should either be used with files specially tailored for it, or provide ways to be overridden or to protect some text.
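One possible way to protect such text is to accept a delimiter only when it is not glued to a word, using Lua's frontier pattern %f (documented from Lua 5.2 onwards). The following is a heuristic sketch, not a complete solution: mark_emphasis is a hypothetical name, and a stand-alone segment such as /tmp/ surrounded by spaces would still be turned into italics.

```lua
local gsub = string.gsub

local function mark_emphasis (str)
  -- The opening delimiter must not follow a letter, digit or another
  -- delimiter; the closing one must not be followed by one.
  str = gsub(str, "%f[%w/]/([^/]+)/%f[^%w/]", "\\italic{%1}")
  str = gsub(str, "%f[%w%*]%*([^%*]+)%*%f[^%w%*]", "\\bold{%1}")
  return str
end
```

With these patterns user/directory/myfiles is left alone, because its first slash follows a letter, whereas /italic/ between spaces or punctuation is still converted.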