Posts Tagged ‘PDF’

LaTeX, ligatures and grep

Thursday, April 16th, 2009

Having finally finished a long overdue paper, I thought I’d share a little knowledge, well, semi-knowledge/ugly hack actually, that I found useful while working on it.

I like justified text; I think it makes the content look sharp. LaTeX seems to agree with me on that point, at least in the style I used (report). Justified text in LaTeX has one drawback, however. Sometimes the spacing between certain letters becomes too small, resulting in what I surmise typographers call “broken ligatures”. As far as I understand it, a ligature is two or more letters joined into a single glyph (like “fi” or “fl”). A broken ligature, then, would be when a preceding letter “floats into” the next one.

Justified text is sharp, justified text with broken ligatures… not so much. And LaTeX doesn’t seem to be fully able to handle this on its own, so manual intervention seems necessary. (It could of course just be that the version I use (texlive) is silly, but I recall having similar problems back in uni when I used tetex.)

In any case, ugly-hacking tiem!

SEARCH

First priority: find all occurrences of potential broken ligatures.

One could visually (using the ole trusty eyeball mk.1) scan the generated document for imperfections. That takes a lot of time, and there is a large risk that some occurrences “slip through”. Also, in some places the ligatures won’t be broken, because the text happens to fit the row well at present. But then someone adds a word, a sentence, or just fixes a grammatical bug, whatever, and the fit is not so good anymore.

Of course, it is wholly unnecessary to run this procedure until the document is “frozen” and won’t accept any more additions to it in terms of text. I ran it three times: once before each “beta”/“release candidate” which I sent to some friends for critique/proof-reading/sanity checking, and then once more after having incorporated their input.

To identify potential trouble, grep is called in to find every instance of the character combinations which can break. In my experience, these combinations are “ff”, “fi” and “fl”.

$ grep -rn 'f[fil]' chapters/*.tex

Only lower-case letters seem to cause trouble, but that is an assumption on my part. I could well see problems stemming from an initial lower-case f followed by an upper-case letter. I have never encountered it, so I don’t search for it, but as usual, ymmv.
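Should you want to cover that case as well, a pattern along these lines ought to do it (untested on my part, so take it with a grain of salt):

$ grep -rnE 'f[fil]|f[A-Z]' chapters/*.tex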

Now I have a nifty little list with all occurrences of the letter sequences “ff”, “fi” and “fl”, nice! Now what?

DESTROY

The solution should, preferably, be applied to nearly all instances of these sequences, so that a present “good fit” line, if modified, would just automagically work later on as well. This means that the solution should not screw up the formatting of the “good fit” cases, while kicking into action iff the good fit turns bad.

The solution I use is “\hbox{}”. This is inserted between the characters (f\hbox{}f, f\hbox{}i, f\hbox{}l). What makes this ugly is of course that your LaTeX code is now littered with this… well umm… shit. This method will of course also give your spell checker a nervous breakdown.
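For illustration, a made-up before/after (the words are just examples):

% before: LaTeX is free to form the ff/fi ligatures here
The office printer is difficult to configure.
% after: the empty \hbox{} suppresses the ligature (and kerning) between the
% letters, while the box itself takes up no visible space
The of\hbox{}f\hbox{}ice printer is dif\hbox{}f\hbox{}icult to conf\hbox{}igure.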

Now you are probably thinking that this is a non-issue: just create a small shell-script that uses sed to produce new files with the modified content, copy these files into a build directory and have the make script invoke that shell-script before invoking the build command.
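A naive sketch of what such a script might run, with hypothetical file names, could look something like this:

$ sed -e 's/ff/f\\hbox{}f/g' -e 's/fi/f\\hbox{}i/g' -e 's/fl/f\\hbox{}l/g' chapter1.tex > build/chapter1.tex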

There is a potential pitfall in that solution. My paper linked to a couple of websites, as in clickable hyperlinks inside the pdf. Imagine the fun when sed hits upon \url{http://www.openoffice.org/} and transforms it into \url{http://www.openof\hbox{}f\hbox{}ice.org/}.

Making sed aware of the \url{} tag, and verbatim quotes (probably all of the quoting systems), and making it leave the content inside well enough alone, is probably doable, but having my favorite text-editor do an interactive search/replace was the method I opted for.

Fun with LaTeX

Thursday, March 26th, 2009

So I have finally gotten my shit together and seriously started putting my ideas for the FS/OS course into writing. $DEITY knows cultivating those ideas has taken long enough…

I started out, as I usually do, with my trusty LaTeX template:

\documentclass[english,a4paper,utf8]{report}
\usepackage[utf8]{inputenc}
\usepackage{verbatim}
\usepackage[dvips,bookmarks=false]{hyperref}
\hypersetup{
    colorlinks=true,
    citecolor=black,
    filecolor=black,
    linkcolor=black,
    urlcolor=blue
}
\author{}
\title{}

\begin{document}
    \maketitle
    \tableofcontents
    \input{./00_chapters}
    \bibliographystyle{unsrt}
    \bibliography{./bibtex/ref}
\end{document}

I then proceeded to copy the old build-system which mra rigged for us while we were doing our bachelor thesis, and all seemed good and well, until I realized that the hyperrefs (i.e. supposedly clickable URLs) weren’t all that clickable. I was baffled. What had gone wrong?

Had I forgotten to install a required package? Why then had rubber (which the build-system uses) not died with an error? No, packages seemed fine.

Had I found a feature which Adobe Acrobat Reader possessed, but Evince didn’t? Nope, opening the pdf-file in acroread didn’t yield a better result (only a slower one… jeebuz acroread is bloated…).

I knew that I had gotten clickable links to work in LaTeX-generated pdfs before, so what was different? Ah! It might be that back then I had used mra’s old build-script, the one he wrote before learning about rubber. Ok, $ less bin/makedoci.sh told me all I needed to know. The relevant procedure in that file was:

  1. call latex
  2. call bibtex
  3. call latex
  4. call latex
  5. call dvips
  6. call ps2pdf
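In shell terms, that procedure boils down to something like the following (the file names are hypothetical; the actual script naturally looked a bit different):

$ latex report.tex      # first pass, references still unresolved
$ bibtex report         # build the bibliography from the .aux file
$ latex report.tex      # pull the bibliography in
$ latex report.tex      # settle cross-references
$ dvips report.dvi -o report.ps
$ ps2pdf report.ps report.pdf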

As it turns out, the new “rubberized” build-system called rubber with the flags -d and -f (i.e. produce pdf output, and force compilation). At the same time I was following up another lead, trying to make sense of the documentation for the hyperref package on CTAN. I may have spent too little time reading the actual content in there, but when I came across a list of drivers and \special commands, I started seeing some patterns.

rubber -d calls pdftex, and it might have been easier to just swap out “dvips” in the hyperref options in the template, but then I’d have had to check, and possibly dig even deeper, to find out what the actual string to put in the configuration should be.
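For the record, I believe that route would have amounted to something like the following line in the template, though I never verified that “pdftex” is the right string to use:

\usepackage[pdftex,bookmarks=false]{hyperref}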

That was less attractive, since I knew that the current template had worked before (using dvips). Keeping dvips, however, would involve finding out whether rubber could pass through DVI, to PS, and then to PDF. Coincidentally, this is just what rubber’s -p flag (combined with -d) does.

Which sort of creates a really cute little circumstance: to create a pdf file, you call rubber with the flags -p, -d and -f.
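So the full invocation ends up being something like this (file name hypothetical):

$ rubber -p -d -f report.tex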

PDF, rubber -p -d -f, get it? XD

Putting technologies to use in peculiar ways

Wednesday, March 4th, 2009

I just read a Daily WTF and, I can’t be sure why, possibly because they were generating invoices, an activity which my mind for some reason has linked to PDFs, I had a flashback to term 5 at ITU, where our project group collected a bunch of data through a web-based questionnaire and stored it in a database.

Then there was the question of retrieving the information and presenting it in our document (a PDF, generated by LaTeX), which, if I remember correctly, was done by me ugly-hacking together a PHP-script which, depending on what script you called from the webserver, presented you with either a csv file or a LaTeX formatted file. To be completely honest, I guess “stream” would be the better description, which the browser interpreted as a file and rendered.

In any case, I have a little suspicion that this wasn’t one of the intended domains for PHP, but it did the job well nonetheless.

Splitting a PDF

Tuesday, January 20th, 2009

A friend just called me up and asked if I knew of any means to split a multi-page pdf-file into several single-page pdf files. My immediate answer was no, as I did not know of such a tool. A requirement was that it would work under Windows as well.

My instincts told me that there would probably exist a web-service which could do what he asked for; however, the problem with online services is trust. How can I know what they do with “my” files after having let them operate on them? More specifically, what if it is sensitive information which they store, without my consent or knowledge?

So web-services were probably also out of the question (although I have no idea about the level of sensitivity surrounding his document). In any case, I hit Google with the keywords “free”, “pdf”, “authoring”, “software”. The top search result was a Wikipedia page. Scanning the page for the functionality I wanted, I quickly zeroed in on Pdftk (the PDF Toolkit) – can merge, split, en-/decrypt, watermark/stamp and manipulate PDF files.

The Wikipedia page, redirecting to iText, didn’t amount to much, but now armed with the knowledge that at least a library existed, I could hit Google again: “pdftk”, “split”, “file”.

Opening up a host of tabs from the search results, I stopped dead in my tracks upon finding AngusJ. To quote the site: “PDFTK Builder & other PDF Resources for Windows”. I smiled for a bit, giving myself a mental pat on the back for my awesome Google-Fu, and then I relayed the search terms to my friend (still on the phone, mind you) and directed him to the angusj.com search result.

For some reason or another, he couldn’t download the software. It cut out half-way through the download. I didn’t try it myself; instead my mind went into “Plan B” mode, i.e. investigating whether or not I could split the pdf for him using the command-line pdftk which, for some reason, was already installed on my machine, a fact apt-get promptly informed me of when I tried to install it.

Just then, it seems, he got a call from his boss, and splitting the pdf was no longer an issue. But as I had already started thinking about it, I simply continued my thought process and started experimenting.

Et voila:

pdftk [input_file] [action [arguments_for_action]] output [output_file]

or, more readably:

$ pdftk a_file.pdf cat 1 output page1.pdf
$ pdftk a_file.pdf cat 2 output page2.pdf
$ pdftk a_file.pdf cat 3-4 output pages3and4.pdf
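And if, as in my friend’s case, you want every single page as its own file, pdftk’s burst operation does it in one go (the output pattern here is just my own choice):

$ pdftk a_file.pdf burst output page_%02d.pdf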

in short… Awesome!

Of course, there are more uses for this toolkit, as stated on the Wikipedia page, and again on the official page, but since my friend only needed splitting, that is what I cover.

Also, for the record, when I later tried to download the AngusJ pdftk-builder, the download worked like a charm.

Benefits of the Portable Document Format

Monday, June 11th, 2007

As my summer holiday started, I found myself with the urge to promote the document preparation system LaTeX. But while drafting a post about it (which soon grew in my mind into a whole series of posts), I realized that all this work would be almost completely uninteresting to the majority of people on the Internet, since the majority runs Windows, and although LaTeX can produce formats other than pdf, those other formats are mostly irrelevant to the average Windows user.

The pdf format on the other hand, although not overly liked by the average Windows user, is at least a format that is well-known and available even to the greenest of computer users.

So before I set out on my LaTeX promotion crusade I have to promote the portable document format.

I know why I didn’t like pdf before, and I am sure that the very same reason, or a permutation of it, is why people still don’t like pdf.

  • Why another frickin’ format? We already have Microsoft Word’s .doc, isn’t that enough?!
  • Building on the first point, this means one more program (Adobe Reader or whatever other pdf viewer you use) to install and keep up to date
  • It is sooo fun having your browser freeze up for a second or two every time you click on a link on a homepage which turns out to be a pdf file

I felt all of this before, so what has changed?

First of all, I started building web-pages. I wanted to be good at it, so I did my research, and I got to the bottom of this fixation on linking to pdf files.

My first question was “Why not just link to another web page?”
The answer to this question was rather simple. Different web-browsers render the content in their own way. There is no way to achieve consistency, not to mention that it all, to some extent, also depends on whatever features the user has activated in his/her browser (Javascript and/or CSS may be turned off, as well as images).
Also, the very same thing applies should a user want to print the content. Since the browsers render the pages in their own way, the printed copy will reflect this.

“Great”, I thought, “then why not link to a Word document?”
This is where different systems enter the picture. I admit, many of these examples are more or less moot today, but back then they were very much a factor to consider. GNU/Linux does not have the Office suite, and as such has (had) no way of reading Word documents. Today this has been somewhat mitigated by the advent of projects like OpenOffice.org, which can more or less reliably open and correctly render Microsoft Word documents.

But back then, and also today, not all systems can reliably read Word documents. Also, opening Word documents found on the Internet (at least if you are running Windows) is almost akin to begging to have your computer infected by viruses.

So clearly neither of these two seemingly good alternatives would work. But there was still that small matter of creating pdf files. For that you needed Adobe Acrobat (the full program, not the Reader), and that cost money.
Of course, back then I didn’t have anything to put in a potential pdf file that I couldn’t readily present with html anyway, so it mattered not.

Another thing that changed recently, but all the same made my views towards pdf even more positive, was that I myself changed operating system, from Windows to GNU/Linux.

This brought two things into my life. The lesser of the two was that I could no longer rely on the Word .doc format.

And the more important aspect: I began appreciating free and open source software (FOSS). This second aspect made me forcefully reject the .doc format from my life as best I could (not easy when the majority of the world uses it), but luckily I am still studying, and in the world of academia students come from different situations, have different backgrounds and experiences, and run different systems. Most run Windows, some run GNU/Linux, and one or two run Mac.

It makes for a rather diverse environment. Which in turn means that the teachers/examiners must either be able to read every single format that each of these systems can produce (or at least a subset based on the most popular document format for each individual system), or require the students to hand in each assignment and report in a standard format which exists across all (or at least most) systems.

Our teachers don’t do any of this, or rather, they adjust for the Windows crowd and feel safe in the fact that the Mac and GNU/Linux crowds will behave. And of course we do. Since we can only reliably say that the teachers will be able to read documents from Windows users, we either have to produce Windows documents ourselves, or some document which we can produce and which can be read on a Windows system. PDF solves that.

In fact, I need not worry what system the teacher is on, because I can rest assured that my pdf will be readable. I simply don’t need to care what system they are running. Windows? Pdf works! Mac? Pdf works! GNU/Linux? Pdf works! You start to see a trend here? ;)

And not only can they open and view the reports, they will actually look to them exactly as they looked to me when I reviewed the document before submission. Whatever settings they have, the margins won’t be changed (in turn changing the entire flow of the document and re-arranging graphics etc.).

Also, since pdf nowadays is an open standard, aspiring to become ISO certified, this is also great from a libertarian perspective. Sure, Adobe is the “owner” of the technology, but it is still less of a vendor lock-in than, say, using Microsoft Word and their proprietary formats.

The one drawback with pdf files is that you need some sort of editing software to have any hope of modifying them. These usually cost money.

But then again, this is where LaTeX comes into play…