awk, filtering and counting

Suppose that you have a file containing some structured data, something perhaps along the lines of this highly fictitious, yet remarkably common, syntax:

<id><separator><somestring><separator><integer>

Now, let’s say that there were 99,999 lines of this to go through, the file was unsorted, and you wanted to find all the lines where SOMESTRING is foo and then sum up the INTEGER field of those lines.
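To make that concrete: with a comma as the separator, a few made-up lines might look like this:

1, foo, 42
2, bar, 7
3, foo, 13

Summing the INTEGER field of the foo lines here would give 55.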

I almost had this problem at work, except my file probably didn’t contain more than a hundred or so lines.

For this I wrote a Perl script, which worked well, with the small inconvenience that I’d have to move that script onto each system where I’d want to use it.

Pontus, never one to berate anyone’s efforts but always finding room for improvement, pointed out two things: that my approach, the script, carried that inconvenience, and that it was very verbose compared to the solution he ultimately suggested. He showed me a better way: the awk way.

$ awk -F<separatorGoesHere> 'BEGIN { SUM = 0 } /<someStringGoesHere>/ { SUM += $3 } END { print SUM }' <fileToBeParsedGoesHere>

I said before that my real file, at work, was small, so awk crunched through it at lightning speed. I also suggested a file containing 99,999 lines, and I did that to prove a point, namely:

Using this script:

#!/usr/bin/env python2

import random

# Generate 99,999 lines of "<id>, <somestring>, <integer>" test data.
filename = "awk.example.txt"
iterations = 100000
choices = ['foo', 'bar', 'baz']
fh = open(filename, 'w')

for index in range(1, iterations):
    fh.write("%d, %s, %d\n" % (index,
                               random.choice(choices),
                               random.randint(0, 100)))
fh.close()

I generated a file (~1.5 MB) with a couple of lines ;) and let awk loose on it:

$ time awk -F, 'BEGIN { SUM = 0 } /foo/ { SUM += $3 } END { print SUM }' awk.example.txt

On my netbook, this took 0.241 seconds to complete:

real	0m0.241s
user	0m0.237s
sys	0m0.000s

Or in other words: awk is pretty frakking fast!

Now, let’s break it down:

awk

obviously, is the command, and it rocks, ‘nuf said.

-F,

means “change the field separator from the default (whitespace) to a comma”
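A quick, made-up sanity check of what that splitting actually gives you:

$ echo "1, foo, 42" | awk -F, '{ print $2 }'
 foo

Note the leading space: the separator is just the comma, so the space after it stays in the field. Luckily awk ignores leading whitespace when converting a field to a number, which is why summing $3 still works.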

And then it gets tricky, but not as tricky as I, at least, was led to believe.

There are two single-quotes, and between these we place all the things we want awk to do for us.

One good thing to note is that the syntax for awk is quite simple, something I didn’t grasp at first. It goes like this:

<somePattern> { <someAction> }

And that’s it. You can chain several such <pattern> { <action> } pairs one after another.
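For instance (a made-up variant, not part of Pontus’ command), two pairs can tally foo and bar lines in a single pass over the file:

$ awk -F, '/foo/ { F++ } /bar/ { B++ } END { print F, B }' awk.example.txt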

In my, well Pontus’, command above, there are three such pairs:

BEGIN { SUM = 0 }

which is just another way of saying “before we start executing, create a variable SUM and set its value to 0”

/foo/ { SUM += $3 }

If you’re familiar with regular expressions you may have come across the convention of enclosing an expression between two slashes; such a pattern is used to search (or match) the contents of lines or files. That’s what we’re doing here. So we’re basically saying “find the lines containing foo, and from each of these lines take column number three ($3) and add its value to the variable SUM.”
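One caveat, though: /foo/ matches foo anywhere on the line, not just in the second column. If the string could also show up in some other field, a stricter (hypothetical) variant compares the field itself; here the separator is set to comma-plus-space so that $2 is exactly foo:

$ awk -F', ' '$2 == "foo" { SUM += $3 } END { print SUM }' awk.example.txt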

If, instead, you had wanted to count all the lines containing foo, SUM += 1 would have done that job.
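That counting variant would look something like this:

$ awk -F, 'BEGIN { SUM = 0 } /foo/ { SUM += 1 } END { print SUM }' awk.example.txt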

Finally:

END { print SUM }

which should be pretty obvious: “When all is said and done, print whatever is stored in the variable SUM”

And last but not least, outside the single-quotes, we give awk the name of the file we wish it to process.
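As a side note, if you leave the filename out, awk reads standard input instead, so the same pattern/action pairs work just as well at the end of a pipeline, for example:

$ grep foo awk.example.txt | awk -F, '{ SUM += $3 } END { print SUM }'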

This is just a fantastic tool, which I regret not having taken the time to learn the basics of earlier. Thank you Pontus for making me see the light (again) ;)

:wq


5 Responses to “awk, filtering and counting”

  1. For comparison, in sed:
    time sed -n 's/^.*, foo,//p' awk.example.txt|tr '\n' '+'|sed 's/+$/\n/'|bc

    The awk version is faster. But there it is in sed. I had to throw in ‘tr’ to strip out newlines.

  2. Patrik says:

    Hey :)

    This actually illustrates rather well why I was so awestruck by awk: the fact that it is one process doing all of it.

    I do however really like your solution turning “\n” into “+” for later use with bc. Not something I would have intuitively thought of, so that was a treat. Thank you :)

  3. You’re welcome. I do agree that awk is very impressive.

  4. Rikard says:

    For comparison, in bash using cut and grep:

    time (n=0;for i in $(grep foo awk.example.txt|cut -d',' -f3);do n=$((n + i));done;echo $n)
    1684808
    

    About twice the time compared to AWK :) (awk: 0.167s, bash: 0.315s)

  5. Patrik says:

    So, Rikard, the first thing you do back on the job from the holidays is to go read my blog.

    I’m flattered ;D

    It’s also interesting to see that awk on my netbook isn’t a whole lot slower than on what I suppose is your T-61?

    Cheers,
    Patrik