Adventures in JavaScript land

Tuesday, May 5th, 2009

Last night an old classmate of mine approached me with a programming problem, this time in JavaScript. This is not a language I have spent much time with, for various reasons, and my previous attempts at grokking it have all been unsuccessful.

But times seem to have changed. Either I am a better programmer now or JavaScript has matured (I haven’t been keeping tabs on what is going on in the JavaScript camp, so I am still a bit fuzzy on the whole DOM thingy, and on whether you can nowadays write JavaScript that both Internet Explorer and all the other browsers will understand without jumping through at least a couple of hoops), but in any case I was able to help him.

The problem he was having revolved around string manipulation: given a pipe-separated string of “key=value” pairs, extract the values and store them in variables of their own.

I am guessing he was going to extract all the values, but the example code he sent me only extracted a single one. His attempt was not bad, combining indexOf() and substring(), and I would like to think he had a pretty smart idea going, but JavaScript was not intelligent enough for it.

The “mistake” he made was assuming that indexOf() would save its “state” from the previous search and find the next occurrence once it had been called again, much like a file or array iterator would behave.
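For what it’s worth, indexOf() is completely stateless, but it does take an optional second argument, a start offset, which lets you carry the “state” along yourself:

```javascript
var arg = "key1=value1|key2=value2";

// indexOf() always searches from the beginning by default...
var first = arg.indexOf("=");           // 4
var again = arg.indexOf("=");           // still 4, no state is kept

// ...but the optional fromIndex argument lets you resume manually:
var next = arg.indexOf("=", first + 1); // 16, the second "="
```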

He also had some rather “funny” constraints, or rather reasons behind them. There could be no magic numbers in the code (which is of course a good thing in and of itself), because the given string could be subject to change in the future.

I have no objection to this, except that he used hard-coded strings as arguments to indexOf(). But sure, having to update one place instead of two is better, because otherwise you’ll just miss the second place when you update the first. (The magic numbers constraint surfaced when I suggested adding an offset to the index returned by indexOf() so that he would get the appropriate part of the string.)

The initial code (obscured for his benefit) looked something like this:

var arg = "key1=value1|key2=value2|key3=value3|key4=value4";
var index = arg.indexOf("key2");
var key2 = arg.substring(index, indexOf("|"));

There were some initial mistakes in the code (the second indexOf() is called on nothing) as well as the assumption described earlier, that indexOf() kept state. After fixing that, this code would of course have tried to return a negative-length substring. Now, I haven’t checked, but somehow I doubt that JavaScript would actually return a reversed substring of the indicated length and position.
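(A quick check settles it: substring() neither reverses nor fails. When the first index is larger than the second, it simply swaps the two arguments, so you silently get the forward substring between the two positions:)

```javascript
// substring() swaps its arguments if start > end:
var mid = "abcdef".substring(4, 2); // "cd", same as substring(2, 4)
```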

My suggestion was along the lines of:

var arg = "key1=value1|key2=value2|key3=value3|key4=value4";
var index = arg.indexOf("key2") + 5; // skip past "key2="
var key2 = arg.substring(index, arg.indexOf("|", index));

This brought up the magic numbers constraint, so I then suggested replacing line 2 with:

var index = arg.indexOf("key2") + "key2=".length;

But that was not satisfactory either. I believe he ended up with a solution of his own: calculate a second index by finding the first “=” after “key2” (using the first index as offset), and then use that second index, incremented by one, as the start of the substring:

var arg = "key1=value1|key2=value2|key3=value3|key4=value4";
var index = arg.indexOf("key2");
var index2 = arg.indexOf("=", index);
var key2 = arg.substring(index2 + 1, arg.indexOf("|", index));

He was happy with that solution, but I was less than enthusiastic about any of that code, so I started digging on my own. One option was regular expressions (I don’t know if it is a good or a bad thing that nowadays I almost always think “regexp” when I have to find and/or break out data from a string…), and I found a pretty neat solution thanks to the fact that the JavaScript string type has a built-in method match() which takes a regexp pattern as its argument.

var arg = "key1=value1|key2=value2|key3=value3|key4=value4";
var key2_pattern = /key2=([^|]+)/;
var key2 = arg.match(key2_pattern)[1];
// [0] == entire matched string, [1] == matched group (i.e. [^|]+)
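The pattern can also be built at runtime so the key isn’t hard-coded. A sketch (the getValue name is my own invention, and it assumes keys don’t contain regexp metacharacters):

```javascript
// Hypothetical helper: build the pattern from the key at runtime.
function getValue(str, key) {
    var match = str.match(new RegExp(key + "=([^|]+)"));
    return match ? match[1] : null; // null if the key isn't present
}

var arg = "key1=value1|key2=value2|key3=value3|key4=value4";
getValue(arg, "key3"); // "value3"
```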

Still I was not entirely satisfied. Regular expressions can be great if kept uncomplicated, but really, how often do things stay uncomplicated? I was also wondering whether it would be possible to just store it all, dynamically, in a dictionary type of structure. That’s when I realized that JavaScript objects can be used as associative arrays. :D

var arg = "key1=value1|key2=value2|key3=value3|key4=value4";
var tmp_array = arg.split("|"); // array with "key=value" elements
var dictionary = {}; // a plain object doubles as a dictionary
var e, tmp;
while (e = tmp_array.pop()) {
    tmp = e.split("=");
    dictionary[tmp[0]] = tmp[1]; // dictionary["key"] = "value";
}

I haven’t fully tried out this last suggestion. (It works, but I haven’t yet checked how JavaScript deals with “2” as opposed to 2, and how this might screw things up. I am thinking that it probably will. On the other hand, so should substring(), since by definition it also returns a string.)
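(A quick experiment suggests the worry is unfounded, at least for lookups: JavaScript object keys are always strings, so a numeric index is converted to a string before the lookup happens:)

```javascript
var dictionary = {};
dictionary["2"] = "two"; // store under the string key "2"
dictionary[2];           // "two" — the number 2 is converted to the key "2"
```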

None of this really matters: my friend came to a working solution and I got to play with JavaScript. Win-win :)

Midnight hacking, part 2

Saturday, February 28th, 2009

I have since the last post come up with a name for this little project: “vocabulary fingerprinting”. This post should be part 3 or 4, but I obviously forgot to write the second post after my second midnight session. The first refactorings I made brought down execution time a bit, though not as much as I had hoped for. When I then added more regular expressions to improve the accuracy of what gets stored in the database, execution time took a hit again. It didn’t quite go back to the initial ~120 minutes, but is now holding steady at around 113 minutes. I will have to profile the code to get closer to the root cause, but my suspicions are a) the regular expressions (I believe I now have in the vicinity of 15 different regexes, each executed once on every line I have written, and some on every word) and b) SQLite (disk I/O).

I do have some ideas about how to speed it up. For the regexes, I’d have to rearrange the order in which they are executed and make some of them more generic (having 3 or 4 regexes just to sort out all the various smileys… no, not good). As for the probable disk I/O bottleneck: I can’t really get around writing to disk, but I could store a bunch of words in memory and then write them all to disk at once. I don’t know why that would affect anything, but it just feels like it would help.

I have to confess that it has actually been a couple of days since I last worked, or thought about working, on this project, but today when I woke up I got a new idea. I haven’t yet decided if the idea is any good.

Since I am parsing instant-messaging logs, each line represents one message. Since messages differ in length, both in number of words and in number of characters, it could be interesting to store an “average line length” value in the database as well, since that could make identification easier (think ballpark estimates: “chatty” or “quiet”). But then the code would become very domain-specific. You couldn’t try to identify a person by feeding the code an article, since articles come as a great mass of text, with each line formatted to be roughly of equal length (on or around 80 characters per line, probably).

One could of course craft two modules, one article-specific and one instant-messaging-specific, and have them share a database (i.e. the “vocabulary” table would be shared, but the meta-data about how the writer composes his texts would be stored in two different tables, depending on what we want to compare with). This could work, since it is possible to identify “chattiness” in articles as well; the only difference is that the word-count ranges for determining “chattiness” are larger (150–2000, 2000–6000, etc.). (I have a feeling that I could easily end up in the latter range…)

Finally, I tried to implement a measurement facility in the code. Something along the lines of:

Given ten measurement points, evenly distributed across the files being parsed (in this case 2900 files / 10 points = one point every 290 files), record the current time (timestamp) at each point.

In the end, what I want is an output for every 290 files, telling me the estimated time to completion, based on the average time it has taken the code to accomplish its task so far.

I thought that it would be pretty straightforward: determine how many measurement points are left, create time-deltas (mp1 - starttime, mp2 - mp1, mp3 - mp2, …), add those together, divide by the number of deltas, and multiply by the number of measurement points left.
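In code, the estimate I had in mind would look something like this sketch (the project itself isn’t written in JavaScript, and the names are made up for illustration):

```javascript
// Hypothetical sketch of the ETA calculation: average the time-deltas
// recorded so far and multiply by the number of measurement points left.
function estimateRemaining(deltas, pointsLeft) {
    var sum = 0;
    for (var i = 0; i < deltas.length; i++)
        sum += deltas[i];
    return (sum / deltas.length) * pointsLeft;
}

// E.g. three points done with deltas of 3, 4 and 5 minutes, seven left:
var estimate = estimateRemaining([3, 4, 5], 7); // 28 (minutes)
```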

I checked it with my brother (who is way more math-smart than me) and he thinks it seems legit… but the code doesn’t work. It reports about 30 minutes to completion at every measurement point, up until the last one, where the estimate drops to 8 minutes…

His thought was that maybe the initial files are so small compared to the files at the end that the estimates get frakked up. I guess I will have to either randomize the order of the list containing the filenames, or just reverse it, to see if it makes any difference.

That’s all for now.