Last Saturday… sorry, early Sunday, way past any reasonable bedtime, I was tossing and turning, finding it impossible to fall asleep. Reading a magazine didn’t work; in fact, it might have had the opposite effect. It got my brain working, and all of a sudden an idea entered my mind.
I can’t understand it myself, so don’t bother asking; there will be no coherent or reasonable answer. But I got the idea to pull my Pidgin log files, all 2900 of them, dating back to 2008.01.01, and have a program go through all of them, cataloging and counting the outgoing words.
Maybe it was some urge to code, maybe my subconscious has a plan for the code which it has yet to reveal to me, I couldn’t tell you, but the more I thought about it, the more the idea appealed to me. Within half an hour I knew roughly how I wanted to do it.
The premise was: 2900 HTML-formatted log files describing interactions between me and one or more external parties. Pidgin stores each sent message on a separate line, so aside from some metadata at the top of the file about when the conversation took place, there was one line per message.
I wanted the code to be, well, I hesitate to call the resulting code “modular”, so “dynamic” might be a better word: no hard-coded values about what alias to look for. This worked out fine, as I soon realized I also needed a file name for the SQLite database which would store the data.
The script is called with two parameters: an alias and a path to the directory in which the logs can be found. This is also where I cheated. I should have made the script recursively walk into any subdirectory of that path, looking for HTML files, but I opted instead to move all the files from their separate subdirectories into one large directory. Nautilus gives me an angry stare every time I even hint at wanting to open that directory, but handling subdirectories will come in a later revision.
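That later revision would amount to a recursive walk over the path. A minimal sketch, where the function and argument handling are my own illustration rather than the script's actual code:

```python
import os
import sys

def find_logs(root):
    """Recursively collect all HTML log files found under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Mirrors the two-parameter invocation: an alias and a log directory.
if __name__ == "__main__" and len(sys.argv) == 3:
    alias, log_dir = sys.argv[1], sys.argv[2]
    print("%d log files found" % len(find_logs(log_dir)))
```

With something like this, the one-big-directory hack (and Nautilus's angry stare) would no longer be needed.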
So, given an alias (the unique identifier which every line that shall be consumed should have) and a path, list all HTML files found in that path. Once this list has been compiled, begin working through it, opening one file at a time, and for each line in that file, determine whether the line should be consumed or discarded.
Since each line contains HTML formatting, as well as the alias and a timestamp, it would be prudent to scrape those away: regular expressions to the rescue. Notice the plural; simple code is better than complex code, and a couple of fairly readable regular expressions beat one monster of an expression. So away go the HTML formatting, smileys, timestamps and the alias. What should now, theoretically, be left is a string of words.
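The scrubbing step might look something like this. The sample line and the exact patterns are assumptions on my part, since Pidgin's markup varies between versions, but it shows the several-small-expressions idea:

```python
import re

# Hypothetical Pidgin-style log line; real markup differs between versions.
sample = ('<font color="#16569E"><font size="2">(23:14:05)</font>'
          ' <b>myalias:</b></font> Hello there :)<br/>')

# Several small, readable expressions instead of one monster:
tag_re = re.compile(r"<[^>]+>")               # any HTML tag
timestamp_re = re.compile(r"\(\d{2}:\d{2}:\d{2}\)")  # (HH:MM:SS)
smiley_re = re.compile(r"[:;]-?[()PD]")        # a few common text smileys

def scrub(line, alias):
    line = tag_re.sub("", line)            # strip HTML formatting
    line = timestamp_re.sub("", line)      # strip the timestamp
    line = smiley_re.sub("", line)         # strip smileys
    line = line.replace(alias + ":", "")   # strip the sender alias
    return line.strip()

print(scrub(sample, "myalias"))  # → "Hello there"
```

What remains can then be split on whitespace into the string of words mentioned above.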
So that string is split up into words and fed into the SQLite database.
I was happy; this was my first attempt at working with SQLite, and thus my first attempt at getting Python to talk to SQLite. It worked like a charm. Three separate queries were used: one to select the word being stored; if the select returned a result, the returned value was incremented by one and updated; if no result was returned, a simple insert was called.
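The select-then-update-or-insert pattern described above could be sketched like this; the table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real script derives a file name
conn.execute(
    "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, count INTEGER)"
)

def store(word):
    """Naive approach: one SELECT, then either an UPDATE or an INSERT."""
    row = conn.execute(
        "SELECT count FROM words WHERE word = ?", (word,)
    ).fetchone()
    if row is not None:
        conn.execute(
            "UPDATE words SET count = ? WHERE word = ?", (row[0] + 1, word)
        )
    else:
        conn.execute("INSERT INTO words (word, count) VALUES (?, 1)", (word,))

for w in "to be or not to be".split():
    store(w)
conn.commit()
print(conn.execute("SELECT count FROM words WHERE word = 'to'").fetchone()[0])  # → 2
```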
This is of course the naive and sub-optimal way to do it, but right then I was just so thrilled about coding something that I didn’t want to risk leaving “the zone”. Needless to say, doing two queries per word means hitting the hard drive twice per word, EVERY word, for EVERY matching line, in EVERY one of the 2900 files. Yeah, I think there is room for some improvement here.
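For the record, one way to claw back those disk hits would be SQLite's upsert (the `ON CONFLICT … DO UPDATE` clause, available since SQLite 3.24) combined with a single transaction per batch, rather than a commit-per-word. A hedged sketch, again with made-up table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, count INTEGER)"
)

def store_all(words):
    """One upsert per word, one commit for the whole batch."""
    with conn:  # the connection as context manager commits once at the end
        conn.executemany(
            "INSERT INTO words (word, count) VALUES (?, 1) "
            "ON CONFLICT(word) DO UPDATE SET count = count + 1",
            ((w,) for w in words),
        )

store_all("to be or not to be".split())
print(conn.execute("SELECT count FROM words WHERE word = 'be'").fetchone()[0])  # → 2
```

This removes the per-word SELECT round trip entirely and lets SQLite batch the writes.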
But I have to admit, I am impressed: in roughly four hours, give or take half an hour, I managed to put together a program which worked surprisingly well. The one regret I have right now is that I didn’t write anything in the way of tests. No unit tests, no performance tests, no nothing. Of course, had I done that, I probably would have gotten bored halfway through and fallen asleep. And tests can be written after the fact.
Well, I wrote one simple progress measurement. The loop going through the files is wrapped in the function enumerate, so I got hold of an index indicating which file was being processed, and for each file having been closed (processed and done) I printed the message “File %d done!”. From this I was able to clock the script at finishing roughly 20 files a minute (the measurements were taken at ten-minute intervals), but this is rather imprecise, as no file equals another in line or word length.
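The progress trick amounts to little more than this; I've restructured it as a generator here so it can be exercised on its own, whereas the original just printed inside the loop:

```python
def progress_messages(files):
    """Yield a 'File %d done!' message after each file is processed."""
    for index, _path in enumerate(files, start=1):
        # ... open, scrub and store each line of _path here ...
        yield "File %d done!" % index

for msg in progress_messages(["2008-01-01.html", "2008-01-02.html"]):
    print(msg)
```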
It was truly inspiring to realize how much can be done in so little time. The next step, besides the obvious room for improvement and optimization, is to use this little project as a real-life exercise to test how much I have learned by reading Martin Fowler’s Refactoring: Improving the Design of Existing Code.
The ability to walk down into subdirectories should of course be added, but the most interesting thing at the moment is going to be finding a way to detect typos. The regexp rule for how to detect and split up words is a little… “stupid” at the moment.
Initially (after having slept through the day following that session) I thought about the typos, how to detect them, and how one might be able to use something like Levenshtein distance, but again, this would entail I/O-heavy operations and also start impacting the processor. There is probably some Python binding for Aspell one could use; I will have to look into that.
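For reference, the classic dynamic-programming edit distance fits in a dozen lines of pure Python, though as noted it would hammer the CPU if run pairwise over a whole vocabulary; a proper spell-checker binding would be better informed:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("typoes", "typos"))  # → 1
```

A distance of one or two against a dictionary word is the usual heuristic for "probable typo".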
So, finally, why? What’s the reason? The motivation?
Well, somewhere in the back of my mind I remember having read an article which discussed the ability to use writing to identify people. So if I publish enough text on this blog, and then, on another, more anonymous blog, I publish something else, the words I use, or the frequency with which I use them, should give me away. To prove or disprove that hypothesis, a signature would need to be identified, in the form of a database containing words and their frequencies (in such a case it might even be beneficial NOT to attempt to correct spelling errors, as they might indeed also be a “tell”), and then a program written which attempts to determine whether or not it is probable that a given text was written by me.
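One simple way such a comparison might work, purely as a sketch of the idea rather than anything from the article, is cosine similarity between two word-frequency signatures:

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Compare two word-frequency signatures; 1.0 means identical profiles."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up signatures: this blog's profile versus an "anonymous" text.
mine = {"hello": 10, "there": 4, "cheers": 7}
anon = {"hello": 5, "there": 2, "cheers": 3}
print(cosine_similarity(mine, anon))
```

A high score would suggest the anonymous text shares my profile; whether that actually holds is exactly the hypothesis to test.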
While talking about the idea with a friend, she asked me about privacy concerns (I can only assume that she didn’t feel entirely satisfied with the thought of me sitting on a database of her words and their frequencies), and that is a valid concern. Besides the ethical ramifications of deriving data from and about people I call my friends, there is a potential flaw in trying to generate signatures for my friends from the partial data feed I am privy to: I know that my relationship, experience and history with my various friends make for a diversified use of my vocabulary. In short, what language I use is determined by the person I speak with.
And now I have spewed out too many words again… *doh*