LaTeX, ligatures and grep

Having finally finished a long overdue paper, I thought I’d share a little knowledge, well, semi-knowledge/-ugly hack actually, that I have found useful while working on this paper.

I like justified text, I think it make the content look sharp. LaTeX seem to agree with me on that point, at least in the style I used (report). Justified text in LaTeX has one drawback however. Sometimes the letter spacing between certain letters become too small, resulting in what I surmise typographers call “broken ligatures”. The term “ligature” seem to simply  refer to a specific part of a letter. A broken ligature, then, would happen when the ligature in a preceding letter “floats into” the next one.

Justified text is sharp, justified text with broken ligatures… not so much. And LaTeX doesn’t seem to be fully able to handle this on its own, so manual intervention seem necessary. (It could of course just be that the version I use (texlive) is silly, but I recall having similar problems back in Uni while I used tetex)

In any case, ugly-hacking tiem!

SEARCH

First priority: find all occurrences of potential broken ligatures.

One could visually (using the ole trusty eyeball mk.1) scan the generated document for imperfections. That takes a lot of time and there is a large risk that some occurrences “slip through”. Also, in some places the ligatures won’t be broken, because the text has a good fit on the row at present time. But then someone adds a word, a sentence, or just fix a grammatical bug, whatever, and the fit is not so good anymore.

Of course, it is wholly unnecessary to run this procedure until the document is “frozen” and won’t accept any more addition to it in terms of text. I ran it three times, one time before each “beta”/”release candidate” which I sent to some friends for critique/proof-reading/sanity checking, and then once more after having incorporated the input from my friends.

To identify potential trouble, grep is called in to find every instance of the character combinations which can break. In my experience, these combinations are “ff”, “fi” and “fl”.

\$ grep -rn f[fil] chapters/*.tex

Only lower-case letters seem to cause trouble, but that is an assumption I make. I could well see problems stemming from having an initial lower-case f, followed by an upper-case letter. I have never encountered this, so I don’t search for it, but as usual, ymmw.

Now I have a nifty little list with all occurrences of the letter sequences “ff”, “fi” and “fl”, nice! Now what?

DESTROY

The solution should, preferably, be applied to nearly all instances of these sequences, so that a present “good fit” line, if modified, would just automagically work later on as well. This means that the solution should not screw up the formatting of the “good fit” cases, while kicking into action, iff the good fit turn bad.

The solution I use is “\hbox{}”. This is inserted between the characters (f\hbox{}f, f\hbox{}i, f\hbox{}l) What makes this ugly is of course that your LaTeX code is now littered with this… well umm… shit. This method will of course give your spell checker a nervous breakdown.

Now you are probably thinking that this is a non-issue, just create a small shell-script to use sed, and produce new files with the modified content, copy these files into a build directory and have the make script invoke that shell-script before invoking the build command.

There is a potential pitfall in that solution. My paper linked to a couple of websites, as in clickable hyperlinks inside the pdf. Imagine the fun that would be derived when sed would hit upon \url{http://www.openoffice.org/} and transform that into \url{http://www.openof\hbox{}f\hbox{}ice.org/}.

Making sed aware of the \url{} tag, and verbatim quotes (probably all of the quoting systems), and making it leave the content inside well enough alone is probably doable, but having my favorite text-editor to an interactive search/replace was the method I opted for.

Tags: , , , , , ,