Posts Tagged ‘awk’

2012w38

Sunday, September 23rd, 2012

Patent / Copyright madness

Automated copyright enforcement really seems to work well… NOT!

And Apple is up to no good as usual…

Patent trolls trolling around, but it would seem not without a fight :)

Oh, and if you were thinking of setting up a service which required a login, and you thought you’d protect people’s accounts well, then that might be patented…

Programming

Through this reddit thread (referencing it both as the source and because the comments in there are relevant) I was led to this post sometime this week or last.

And this weekend, while doing the weekly write-up, I re-read the post and started thinking, because I kind of feel that my own hobby projects very easily fall victim to this. They get shot down because I start thinking of how much stuff I would have to rewrite (things I’ve already solved in previous projects, but never put down the time to make generic enough to reuse) or figure out, and it just takes the edge off my wanting to sit down and do it.

But then it struck me what helps me get reinvigorated, what helps me get over at least that hurdle: a moderately quiet place, time enough for a conversation, a good (programmer) friend, and optionally a beer.

Broken gets fixed. Shoddy lasts forever — I wonder, does this mean that, if something is shoddy and you want it fixed, the correct action is to break it? ;)

git

I found myself wanting a way to have a central repository react to pushes based on what branch was pushed (I am working on a script at work, which some of my colleagues are beta-testing for me).

Whilst I develop a new feature I need a way to push that potentially buggy version of the script to a path where the testers can find it, while using a completely different path for the stable version which everyone else could use without any big risk of it messing anything up.

What I ended up with was this:

#!/bin/sh

# hook reading "<old-rev> <new-rev> <ref>" from stdin (one line per pushed ref)
read _from _to _branchPath
_branch=`basename "$_branchPath"`

if [ "$_branch" = "develop" ];   # plain sh wants =, not ==
then
    cd /path/to/local/repository/on/server || exit 1
    unset GIT_DIR   # let git discover the work tree rather than the hook's repo
    git pull
    git checkout develop
    cp -f ./scriptname /path/to/beta/test/directory/
fi

Two questions on Stack Overflow helped me out tremendously: this and this (and, as always, pesa was a big help too).

And since I agree with this post (namely that vimdiff would be a great diff viewer for git) I went ahead and followed the instructions of that post :)

vim

Using vimwiki to track time sounds brilliant. It’s almost enticing enough for me to look into vimscripting to help out. Yet another project I’d like to spend time on :S

Being the paranoid soul I am, and now that I can also call myself a tester, I realize I’ve become even more aware of how many different entry points there are which need to be secured, such as vim’s modelines (protip: use secure modelines).

From this post I learnt about license-loader, which I need to look into.

awk

I found this post to be an excellent intro to awk, I am going to spread this around whenever I need to show anyone the basics :)

Misc

This post about 52Hz made me kind of sad :/

On the other hand, this post filled me with some hope.

I think it could be beneficial if this site was more widely distributed, so here’s me doing my part.

Finally, this post was pretty cool, and I immediately thought of at least two people I know who would get a kick out of reading this. :)

:wq

AWK hackery++

Thursday, September 13th, 2012

While I cannot take credit for the solution which follows (the real star is Pontus) what I can do is document it for posterity:

The scenario is as follows: You have a couple of log files, foo.X.log for X in [1,2,..,N-1,N] and you want to watch all of these for a certain pattern, P.

And all of these matching results you want to write to a condensed log, in the format <originatingLogFile>: <lineThatMatchedPattern>

You have figured out that tail -f foo.*.log will yield you all the lines, and separate output from the various files using ==> foo.X.log <==

You might have tried grep to get the correct lines, and perhaps even got it to write to the file using --line-buffered, but you are not out of the woods: that pesky filename at the beginning of the line is still nowhere to be seen.
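In other words, something along these lines (a sketch, with the pattern stored in $_P), which gets the matching lines into the condensed log, but without the originating filenames:

tail -f foo.*.log | grep --line-buffered "$_P" > condensedLog.log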

grep? where we’re going, we don’t need grep

Every time tail outputs a line from a different file than the one last indicated by ==> ... <==, it prints that marker again on the line before the output.

E.g.:

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <==

==> foo.3.log <==

Now, from another terminal, while the tail is still following all three of those logs, let us insert some content:

$ echo "PATTERN, some more stuff" >> foo.2.log

Which results in the first terminal now displaying:

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <== 

==> foo.3.log <==

==> foo.2.log <==
PATTERN, some more stuff

Ok, so whenever new content is pushed onto any one of the tailed files, we get notified of it, along with which file it was pushed to. Note, though, that if we push more content into the same file as before (foo.2.log), it will NOT generate another ==> foo.2.log <== line; it will simply append to the ongoing tail:

$ echo "PATTERN, even more stuff" >> foo.2.log

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <==

==> foo.3.log <==

==> foo.2.log <==
PATTERN, some more stuff
PATTERN, even more stuff

This doesn’t matter much, since our solution adapts to it, but it is a gotcha worth noting (although it should be instantly obvious to anyone reproducing this on their own anyway).

Anyway, with that out of the way, let’s get down to business: AWK hackery!

First order of business: we have a pattern we wish to match. This may or may not change at a later date, and sure, we could practice YAGNI (You Ain’t Gonna Need It), but let us prepare for it anyway by passing the pattern in the invocation of AWK (this assumes that we have stored the pattern to be matched inside the variable “$_P”):

awk -v pattern="$_P"

Since AWK operates on lines (a simplification, since one could redefine what constitutes a line), and it won’t remember what was in the previously processed line, we need to store the latest file(name) into which a line was added.

To achieve this we need to detect whenever new content has been added to a new file (i.e. whenever a new line consisting of ==> foo.*.log <== appears).

BEGIN { filename = "" }
/==> .* <==/ { filename = $2; next }

Since BEGIN triggers before AWK starts any processing, it is the perfect place to define our placeholder variable filename, in which we are going to temporarily store the name of the file into which data is presently being entered.

The next line matches any line containing ==> followed by anything really, followed by <==. Upon encountering one of those lines, we store the stuff between the arrows in the filename variable.

$0 signifies the entire line, while everything inside that line, split on whitespace, is assigned to the numbered variables $1 .. $x, for x being the number of whitespace-separated words in the line. (Again, we could reconfigure AWK to not split things on whitespace, but in this case there would be no sense in doing that.)

And just to ensure that we won’t ever match the ==> .* <== line with whatever we put in $_P we tell AWK that once we have stored the filename, we should move on to the next line.

Next on the agenda is to match the pattern:

$0 ~ pattern { printf "%s: %s\n", filename, $0 }

Essentially this tells AWK to look inside $0 (the whole (current) line) for the stuff we have stored in the variable pattern (i.e. $_P) and if there is a match, print it out, first the value stored in the filename variable, then a colon and a space, followed by the entire line that matched ($0) and finally a newline.

Rinse, repeat, until we abort the command with Ctrl-C.

Now, this is not the end of the story though. We wanted to write this to a file, but with stdout redirected to a file, AWK buffers its output, so matches would only reach the condensed log in bursts, whenever the buffer fills (and whatever remains in the buffer when we kill the tail risks being lost). pesa points out that I originally got this explanation wrong; his comment below has the real deal.

Doh!

What we need is a way to tell AWK to flush its output to the file as soon as it has found a match and printed it. fflush() is the function that makes this happen for us:

$0 ~ pattern { printf "%s: %s\n", filename, $0; fflush() }

Now all we have to do is wrap the whole thing up and redirect to the filename of the condensed log we want to establish:

tail -f foo.*.log | awk -v pattern="$_P" '
    BEGIN { filename = "" }
    /==> .* <==/ { filename = $2; next }
    $0 ~ pattern { printf "%s: %s\n", filename, $0; fflush() }
' > condensedLog.log

There is one big gotcha here though:

If you start a tail on log files which are already populated, tail will first output everything from the first log (which file gets read first, I suppose, depends on system settings; locale comes to mind) and then move on to the next file, etc. So regardless of the chronological order of matching lines in the different files, the first file’s matching lines will always be presented first.

All matching lines entered AFTER this one-liner has started running, will appear in the correct chronological order.

This simply means that you have to be prepared for it, and possibly do some processing on the condensed log file as well (hopefully the logs you are tailing all have timestamps in them, so that sort has something easy to operate on once the lines are in the condensed log).
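For instance, if every log line happens to begin with a sortable timestamp (an assumption about your log format; adjust the sort key accordingly), restoring chronological order could be as simple as:

sort -k2 condensedLog.log > sortedLog.log

(-k2 makes sort skip the foo.X.log: prefix and compare from the second field onwards.)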

:wq

2012w25

Sunday, June 24th, 2012

Quite a while since I wrote a post now; I’ve not been sick or anything, but there has been a lot of work, and outside work I prioritized sleeping over writing. But now I’m back for the moment, so let’s get down to business :)

Since last time I’ve come up with new ways of abusing awk, such as having it find the highest value from a command outputting in the following syntax:

\t<characters, integers, fullstop>: <integer>\n

To make it a little more different, the command also spits out a header, as well as an additional newline after the end of output.

I just now, while writing this, came up with a different solution, which doesn’t use awk:

theCommand | grep -v '^[^ \t]\+' | tr -d ' ' | cut -d':' -f2 | sort -rn | head -n 1

but what I ended up using was:

theCommand | awk 'BEGIN { highest = 0 } $0 ~ /^[ \t]/ { if ( $2 > highest ) { highest = $2 } } END { print highest }'

In this case, from what I can gather, awk is the more efficient solution. One process versus five.

Update: As Werner points out, the if statement isn’t really necessary (which also makes it possible to cut out the BEGIN statement as well):

theCommand | awk '/^[ \t]/ && $2 > highest { highest = $2 } END { printf "%d\n", highest }'

Utilities

  • ditaa (a.k.a DIagrams Through Ascii Art) makes it easy to generate nice-looking diagram images from… rather nice-looking ASCII diagrams
  • docopt, a command-line interface description language, which also seems to support generating the parser for the CLI being described
  • Peity for generating different types of charts using jQuery and <canvas>
  • Ghost.py, for interacting with web pages programmatically

As of late I have been thinking a great deal about backups and the project which seems the most interesting to me is Duplicity.

Random tech stuff

Other random not-so-techy stuff

What I pass for humour

:wq

2012w18

Sunday, May 6th, 2012

The scripty stuff

This week I finally managed to crack a problem I’d been trying to solve for a couple of weeks, namely how to print only the foobar errors, and the ensuing stack traces of those errors, from a logfile:

awk 'BEGIN { section = 0 } /foobar/ { section = 1; print; next } /^[A-Z]/ && section == 1 { section = 0; next } section == 1 { print; next }' logfile 

Looking at the solution, I am kind of ashamed that it took me that long to get something workable…

I also found this neat little one-liner in a comment on reddit: echo "something long and space separated of which you want the last word" | rev | cut -d ' ' -f 1 | rev. Then again, I’m sure that awk could have done this with a little $NF magic or something like that.
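For the record, the awk spelling of that trick would be ($NF being the last field of the line):

echo "something long and space separated of which you want the last word" | awk '{ print $NF }'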

The headache-inducing stuff

Ever since my netbook broke down, I’ve been thinking about two things: restoring or replacing the netbook, and how to create some form of backup infrastructure that is better than what I have in place today.

As for the backups, the “system” I have today is a couple of USB disks which I plug in and sync files to from time to time. That, and most of my projects and config files live in various git repositories, all synced to the laptop/server-in-the-wardrobe, which I made sure to back up after the netbook died, especially since the laptop/server disk is much older than the netbook disk was.

Another thing which bothers me with the current solution is that I have no off-site storage. And that would be nice to have. Belt AND suspenders of course, and off-site storage comes with its own set of problems such as trust in the offsite storage maintainer.

I think the solution will take the shape of a GNU+Linux box and Unison and possibly aided by incron. Not sure yet, will have to think more about it.

There are some other requirements which I have just barely scratched the surface of, or not even begun thinking about yet; for instance, it would be nice to be able to back up my parents’ stuff on a regular basis as well, to keep their things safer too.

And as for the netbook, although it was a nice little machine, the keyboard was getting a bit worn out, and at times it was rather underpowered with its single-core 1.6GHz Atom processor, so the direction I am looking in now is towards something like this.

The stuff screwing over society

Now there’s truly no way in hell I’ll ever use Skype again.

Nothing new under the sun I guess, but it lends credibility to the Skype quip above.

This sure is some level-A grade retarded society we are constructing for ourselves…

Samsung Galaxy S3: The first smartphone designed entirely by lawyers, a great read about a truly depressing matter which probably is closer to the truth than we imagine. On the other hand, my personal opinion is that the midnight blue version looks pretty damn sweet.

SaaS and other crap where someone else is in control sure is a honking good idea, isn’t it? Well, I guess it is if you’re the one in control, but you won’t ever get my business…

The cool stuff

And I also managed to find some posts which touched the hacker in me, such as this post about how one could go about generating pseudo-random numbers (don’t use the algorithms, just be inspired by them) or how this guy started shaving bytes off of his “hello, world!” binary.

I immediately thought about FSCONS when I read this, and I didn’t feel at all worried about people thinking the same about our conf :)

Until the other day, when I read about its inclusion into git, I’d never even heard about git subtree, but this post makes a compelling case for looking into it.

I also came across a data structure that was new to me: the XOR linked list. Now, it has a couple of drawbacks, and I don’t think I’ll ever find much use for it, but as a concept it is a very interesting idea, and it just goes to show that XOR is frakking awesome.
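The whole trick hinges on XOR being its own inverse, a property which is easy to demonstrate right in the shell (values made up):

$ a=42 b=1337
$ link=$(( a ^ b ))    # one "pointer" field storing prev XOR next
$ echo $(( link ^ b )) # XOR with one neighbour recovers the other
42
$ echo $(( link ^ a ))
1337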

I thought this was a pretty cool thing.

While I don’t have any problems with my ISP hijacking DNS requests right now, it is nice to know for posterity that there are ways around it ;)

If you are going to use JSON, and need comments, this seems like a reasonable way to go about it.

While I haven’t decided what I think about Go I really liked this blog post on how to create a grass mowing agent which derives the most optimal way to cut the digital grass in a simulated world.

Hopefully I ain’t the only one who finds this hilarious ;)

This is actually quite neat: Instead of adding “lorem ipsum” paragraphs all over your design, tweak the word list in the script, include it in the mockup, and markup all places which need filler content. Done.

In the latest issue of DatorMagaZin there was an article about FUSE which caught my eye, and having read the article my interest was piqued, so I just had to go look at the list myself, and truly, have you seen all the cool filesystems people have come up with? Frakkin’ awesome!

The food for thought stuff

Oh yeah, finally: remember to treat everyone the way you’d like people to treat your own mother.

:wq

2012w15

Sunday, April 15th, 2012

This has been a pretty rough week, but I guess there is nothing less to expect when deadlines are drawing near.

This week I found myself wanting to count all the occurrences of “foo”, but ONLY if they occurred BEFORE “bar”:

awk 'BEGIN { fooCount=0; stopCounting=0 } /bar/ { stopCounting=1 } /foo/ && stopCounting == 0 { fooCount = fooCount + 1 } END { print fooCount }' <myfile>

And despite the quite hectic schedule, I did manage to help a colleague with a little scripting, and those are two things which almost always put me in a better mood: scripting (problem solving) and helping others. (Of course, if I don’t manage to be of any help, that kind of defeats any positive mood change I get from the scripting, but in this particular case it all worked out really well in the end.) :)

And now for the mandatory collection of links from this week:

  • This must be a joke, right? The US can’t really, for real, be irritated with Australia for preferring national service providers over American ones, right? Especially when it could come down to storing data about Australian citizens, or data otherwise vital to the government. This has to be a joke, right?
  • I wonder if this is something most programmers can relate to or if it’s just me
  • This post could have been written by me… well, not as articulate, but the spirit of it. What’s even more interesting is the response this triggered on HackerNews.
  • QArt Codes is where QR codes, Reed-Solomon error correction, some extra calculations and your imagination mix together ;)
  • Hilarious post making fun of certain governments and their desire for even more snooping laws, especially about conducting surveillance in in-game chats…

:wq

2012w09

Sunday, March 4th, 2012

Ohai!

This week has been rather productive. I’ve both gotten work done AND learnt a crapload of stuff AND gotten to hack away on some scripts, leading to some personal programming revelations :D

When it comes to shell scripting, printf has become a new friend (leaving echo pretty much out in the cold). It is a continuation from last week’s post about shell tricks, and I actually got to use it to help a colleague at work better format the output of a script.

Something along the lines of:

printf "There were %s connections from %s\n" `some-counting-command` $tmpIP

I also wrote a small demonstrator for another colleague:

for i in `seq 1 21`; do printf "obase=16; $i\n" | bc; done

(Yes, I know about printf’s ability to convert/print hexadecimal on the fly)

for i in `seq 1 21`; do printf "%0.2x\n" $i; done

The for loop piping to bc was mostly for fun and to spark ideas about loops and pipes.

In another script I found myself needing two things: a reliable way to shut the script down (it was running a loop which would only stop when certain things appeared in a log) and a way to debug certain parts of the loop.

I know there is nothing special at all about it, but coming up with the solution instead of trying to google myself to a solution left me feeling like a rocket-scientist ;D

If you have a loop, and you want the ability to controllably get out of said loop, do something along the lines of this in your script:

touch /tmp/someUniqueName
while [ ... ] && [ -f /tmp/someUniqueName ]; do   # plain [ doesn't understand &&, so chain two tests
    ...
done

My first thought was to use $$ or $! to have a unique name but since I wouldn’t (couldn’t) be running more than one instance of this script at a time, I didn’t need to worry about that, and it would have made it a tiny bit harder to stop the script, so I finally (thanks razor) opted for a static, known, filename.

While that file exists, and your other loop conditions hold, the loop will … loop on, but the second either condition becomes false, like someone removing the file ;), the loop doesn’t do another iteration.

Problem two was that I wanted a quick way to switch between running the script live, or in debug mode. Since running it live calls on some other stuff which then takes a while to reset, debugging the script using these calls would have been painfully slow, but I found a neat way around that:

DEBUG="echo" # (should be "" (debug off) or "echo" (debug on)
...
$DEBUG some-slow-command
...

With debug “on” it will print the command and any parameters instead of executing it. It doesn’t look all that impressive in this shortened example, but imagine instead that you had more than ten of those places you wanted to debug.

What would you rather do? Edit the code in ten+ places, perhaps missing one, or just change in one place, and have it applied in all the places at once?
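To make it a bit more concrete, here is roughly what a session looks like (some-slow-command being a stand-in for whatever expensive call your script makes):

$ DEBUG="echo"
$ $DEBUG some-slow-command --with --args
some-slow-command --with --args

With DEBUG set to the empty string, that very same line executes some-slow-command for real instead.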

This script, once in place and running, did however bring with it another effect, namely a whole lot of cleanup. Cleanup which could only be performed by running a command and giving it some parameters, which could be found in the output of yet another command.

To make matters worse, not all lines of that output were things I wanted to remove. The format of that output was along the lines of:

<headline1>,<date>
<subheader1-1>,<date>
<subheader1-2>,<date>
<subheader1-3>,<date>
<subheader1-4>,<date>
...

<headline2>,<date>
<subheader2-1>,<date>
<subheader2-2>,<date>
<subheader2-3>,<date>
<subheader2-4>,<date>
...

<headline3>,<date>
<subheader3-1>,<date>
<subheader3-2>,<date>
<subheader3-3>,<date>
<subheader3-4>,<date>
...

Again, these seem like small amounts, but under the “headline” I needed to clean up there were about 70 subheaders, of which I wanted to remove all but one. Thankfully, the one subheader I wanted to preserve was not named in the same way as the other 69 (which had been created programmatically using the loop above).

Also, it was rather important not to delete any other headlines or subheaders. awk to the rescue! But first, here are some facts going into this problem:

  • the subheaders I wanted removed all shared a common part of their name
  • between each section there is an empty line
  • I knew the name of the section heading containing the subheaders to remove
  • To remove one such subheader I’d need to execute a command giving the subheader as an argument

And this is what I did:

list-generating-command | awk -F, '
    /headline2/ { m = 1 }
    /subheader/ && m == 1 { print $1 }
    /^$/ { m = 0 }' | while read name;
do
    delete-subheader-command "$name"
done

Basically, what is going on here is: first we set the field separator to “,”; then, once we come across the unique headline string, we set a little flag telling awk that we are within the matching area, and while that flag is set, any line matching the subheader pattern gets its first column printed. Finally, when we reach a line containing nothing but a newline, we unset the flag, so that there will be no more printouts.

And another thing I’ve stumbled upon, and for which I already know a use, is this post, and more specifically:

diff -u .bashrc <(ssh remote cat .bashrc)

(although it is not .bashrc files I am going to compare).

And finally, some links on assorted topics:

2012w08

Sunday, February 26th, 2012

Hacks

A capture the flag game where the objective is to break into a computer system.

Commandline

I found myself needing to remove a couple (three) columns from a file containing about 15 columns per line. And sure, I could have done something like awk '{ print $1 " " $2 " " $3 }' extended to the 12 columns I wanted, but that would have been tedious.

There just had to be a better way. And of course there was ;)
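One way that fits the bill (a sketch with made-up column numbers, and not necessarily the way from the linked post; --complement is a GNU cut extension) is to name the three columns you don’t want instead of the twelve you do:

cut --complement -d' ' -f4,7,9 somefile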

* * * * * *

I’ve been entertaining an idea which would need version-controlled updates, and they’d also need to be trusted. So I’d need signed commits, and since I’m mostly using git nowadays, I needed to find out whether this was possible. It is.
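For reference, the moving parts look something like this (assuming a GnuPG key is already set up):

git tag -s v1.0 -m "signed release tag"   # sign a tag with your GnuPG key
git commit -S -m "some trusted change"    # sign an individual commit
git verify-tag v1.0                       # check the signature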

* * * * * *

Since starting my new job I’ve realized just how important it can be to write portable scripts (especially echo has bitten me in the ass a couple of times already) so this post was pretty useful to me.

Society

Now this was a pretty inspiring post.

* * * * * *

A pretty funny post about how truly sorry a state the TV is in.

2012w01

Sunday, January 8th, 2012

column

The other day I wanted some prettier (tabularized) output, and of course someone has already wanted this, and of course there are tools for that :)
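The tool in question being column; for example, turning comma-separated data into neatly aligned columns:

printf "a,bb,ccc\ndddd,e,ff\n" | column -t -s,

(-s, makes column split its input on commas, and -t tells it to tabulate.)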

bash_completion

This is so frakking cool! I’ve built this little shell script “vault.sh”, which is a simple wrapper for mounting and unmounting encfs mounts.

It takes two parameters: operation and target, where operation can be one of “lock” and “unlock”, and target—at present—resolves to “thunderbird” (signifying my .thunderbird directory).

Since I intend to expand this with more encrypted directories as I see fit, I don’t want to hard-code that.

What I did want, however, was to be able to auto-complete operation and target. So I looked around and found this post, and although I couldn’t derive enough knowledge from it to solve my particular problem (having multiple levels of completion), the author was gracious enough to provide references to where s/he had found the knowledge (here, here and here). That second link was what did it for me.

My /etc/bash_completion.d/vault.sh now looks like this:

_vault()
{
    local cur prev first second
    COMPREPLY=()
    cur="${COMP_WORDS[COMP_CWORD]}"
    prev="${COMP_WORDS[COMP_CWORD-1]}"
    first="lock unlock"
    second="thunderbird"

    # completing the second argument: suggest targets
    if [[ ${cur} == * && ${COMP_CWORD} -eq 2 ]] ; then
        COMPREPLY=( $(compgen -W "${second}" -- ${cur}) )
        return 0
    fi

    # completing the first argument: suggest operations
    if [[ ${cur} == * && ${COMP_CWORD} -eq 1 ]] ; then
        COMPREPLY=( $(compgen -W "${first}" -- ${cur}) )
        return 0
    fi
}
complete -F _vault vault.sh

And all the magic is happening in the two if-statements. Essentially: if the current word (presently half-typed and tabbed) is whatever, and this is the second argument to the command, respond with suggestions taken from the variable $second.

Otherwise, if the current word is whatever, and this is the first parameter, take suggestions from the variable $first.

Awsum!

awk for great good

Another great use for awk: viewing selected portions of source code. For instance, in Perl, if you just want to view a specific subroutine without getting distracted by all the other crud, you could do: $ awk '/sub SomeSubName/,/^}/' somePerlModule.pm (anchoring the closing brace to the start of the line keeps braces inside the sub from ending the match early)

Links

If PHP were British: perhaps it’s just me, but I find it hilarious.

PayPal just keeps working their charm…

Belarus just… wait what?

Why we need version control

Preserving space, neat!

Fuzzy string matching in Python

If you aren’t embarrassed by v1.0 you didn’t release it early enough

The makers schedule, oldie but goldie

CSS Media Queries are pretty cool

Static site generator using the shell and awk

A netstat companion

Reducing code nesting

Comparing images using perceptual hashes

Microsoft’s GPS “avoid ghetto” routing algorithm patent…

My Software Stack 2011 edition

Saturday, December 31st, 2011

I realize that I haven’t written my customary “software stack” post for this year yet. But hey, from where I’m sitting, I still have … 36 minutes to spare ;)

I’ll be using the same categories as last year; system, communications, web, development, office suite, server, organization, and entertainment.

System

The OS of choice is still Archlinux, my window manager is still wmii, my terminal emulator is rxvt-unicode, upgraded by also installing urxvt-tabbedex.

My shell is still bash, my cron daemon is still fcron, and my network manager is wicd.

To this configuration I’ve added the terminal multiplexer tmux, and have lately found out just how useful mc can be. Oh, and qmv from the renameutils package is now a given part of the stack.

Communications

Not much change here, Thunderbird for email, Pidgin for instant messaging, irssi for IRC.

Heybuddy has been replaced by identicurse as my micro-blogging (identi.ca) client. Heybuddy is very nice, but I can use identicurse from the commandline, and it has vim-like bindings.

For Pidgin I use OTR to encrypt conversations. For Thunderbird I use the enigmail addon along with GnuPG.

This means that Thunderbird still hasn’t been replaced by the “mutt-stack” (mutt, msmtp, offlineimap and mairix) and this is mostly due to me not having the energy to learn how to configure mutt.

I also considered trying to replace Pidgin with irssi and bitlbee, but Pidgin + OTR works so well, and I have no idea how well OTR works with bitlbee/irssi (well, actually, I’ve found irssi + OTR to be flaky at best).

Web

Not much changed here either, Firefox dominates, and I haven’t looked further into uzbl although that is still on the TODO list, for some day.

I do sometimes also use w3m, elinks, wget, curl and perl-libwww.

My Firefox is customized with NoScript, RequestPolicy, some other stuff, and Pentadactyl.

Privoxy is nowadays also part of the loadout, to filter out ads and other undesirable web “resources”.

Development

In this category there has actually been some changes:

  • gvim has been completely dropped
  • eclipse has been dropped, using vim instead
  • mercurial has been replaced by git

Thanks in no small part to my job, I have gotten more intimate knowledge of awk and expect, as well as beginning to learn Perl.

I still do some Python hacking, a whole lot of shell scripting, and for many of these hacks, SQLite is a faithful companion.

Doh! I completely forgot that I’ve been dabbling around with Erlang as well, and that mscgen has been immensely helpful in visualizing communication paths between various modules.

“Office suite”

I still use LaTeX for PDF creation (sorry hook, still haven’t gotten around to checking out ConTeXt), I haven’t really used sc at all, it was just too hard to learn the controls, and I had too few spreadsheets in need of creating. I use qalculate almost on a weekly basis, but for shell scripts I’ve started using bc instead.

A potential replacement for sc could be teapot, but again, I usually don’t create spreadsheets…

Server

Since I’ve dropped mercurial, when the mercurial-server package suddenly stopped working after a system update I couldn’t be bothered to fix it, so that is now gone as well.

screen and irssi is of course always a winning combination.

nginx and uwsgi have not been used to any extent, and I haven’t tried setting up a VPN service, but I have a couple of ideas for the coming year (mumble, some VPN service, some nginx + Python/Perl thingies, bitlbee), and maybe I’ll replace the Ubuntu installation with Debian.

Organization

I still use both vimwiki and vim outliner, and my Important Dates Notifier script.

Still no TaskJuggler, and I haven’t gotten much use out of abook.

remind has completely replaced when, while I haven’t gotten any use whatsoever out of wyrd.

Entertainment

For consuming stuff I use evince (PDF) and mplayer (video), while for music, moc has had to step down from the throne to make room for mpd and ncmpcpp.

eog along with gthumb (replacing geeqie) handles viewing images.

For manipulation/creation needs I use LaTeX, or possibly Scribus, ffmpeg, audacity, imagemagick, inkscape, and gimp.

Bonus: Security

I thought I’d add another category, security, since I finally have something worthwhile to report here.

I’ve begun encrypting selected parts of my hard drive (mostly my email directory) using EncFS, and I use my passtore script for password management.

And sometimes, when I have a sensitive file which I need to store on the hard drive in clear text for a session (this was mostly relevant when debugging passtore after having begun actively using it), I use quixand to create an encrypted directory with a session key stored only in RAM. Once the session has ended, there is little chance of retrieving the key and decrypting the encrypted directory.

Ending notes

That’s about it. Some new stuff, mostly old stuff, only a few things getting kicked off the list. My stack is pretty stable for now. I wonder what cool stuff I will find in 2012 :D

:wq

awk, filtering and counting

Monday, December 5th, 2011

Suppose that you have a file containing some structured data, something perhaps along the lines of this, highly fictive but yet remarkably common, syntax:

<id><separator><somestring><separator><integer>

Now, let’s say that there were 99,999 lines of this to go through, that the file is unsorted, and that you wanted to find all the lines where SOMESTRING is foo, and then sum up the INTEGER field of those lines.

I almost had this problem at work, except my file probably didn’t contain more than a hundred or so lines.

For this I wrote a Perl script, which worked well, with the small inconvenience that I’d have to move that script onto each system where I’d want to use it.

Pontus, never one to berate anyone’s efforts but always finding room for improvement, noted both that my approach, the script, carried that inconvenience, and that it was very verbose compared to the solution he ultimately suggested. He showed me a better way: the awk way.

$ awk -F<separatorGoesHere> 'BEGIN { SUM = 0 } /<someStringGoesHere>/ { SUM += $3 } END { print SUM }' <fileToBeParsedGoesHere>

I said before that my real file, at work, was small, so awk crunched through it at lightning speed. I also suggested a file containing 99,999 lines, and I did that to prove a point, namely:

Using this script:

#!/usr/bin/env python2

import random

filename = "awk.example.txt"
index = 0
iterations = 100000
choices = ['foo', 'bar', 'baz']
fh = open(filename, 'w')

for index in range(1, iterations):
    fh.write("%d, %s, %d\n" % (index,
                               random.choice(choices),
                               random.randint(0, 100)))
fh.close()

I generated a file (~1.5MB) with a couple of lines ;) and let awk loose on it:

$ time awk -F, 'BEGIN { SUM = 0 } /foo/ { SUM += $3 } END { print SUM }' awk.example.txt

Which on my netbook took 0.241 seconds to complete.

real	0m0.241s
user	0m0.237s
sys	0m0.000s

Or in other words: awk is pretty frakking fast!

Now, let’s break it down:

awk

obviously, is the command, and it rocks, ‘nuf said.

-F,

means “change the field separator (from whitespace) to commas”

And then it gets tricky, but not as tricky as I, at least, was led to believe.

There are two single-quotes, and between these we place all the things we want awk to do for us.

One good thing to note is that the syntax for awk is quite simple, something I didn’t grasp at first. It goes like this:

<somePattern> { <someAction> }

And that’s it. You can chain several <pattern>{<actions>} after each other.

In my, well Pontus’, command above, there are three such pairs:

BEGIN { SUM = 0 }

which is just another way of saying “before we start executing, create a variable SUM and set its value to 0”

/foo/ { SUM += $3 }

If you’re familiar with regular expressions you might have stumbled upon the pattern in which you enclose an expression between two slashes, and that pattern is used to search (or match) contents of lines or files. That’s what we’re doing here. So we’re basically saying “find lines containing foo, and from these lines extract column number three ($3), and increment the variable SUM by the value stored in column three.”

If instead you’d wanted to count all the lines containing foo, SUM += 1 would have done the job.
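That is, the counting variant of the same one-liner would be:

$ awk -F, 'BEGIN { SUM = 0 } /foo/ { SUM += 1 } END { print SUM }' awk.example.txt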

Finally:

END { print SUM }

which should be pretty obvious: “When all is said and done, print whatever is stored in the variable SUM”

And last but not least, outside the single-quotes, we give awk the name of the file we wish it to process.

This is just a fantastic tool, which I regret not having taken the time to learn the basics of earlier. Thank you Pontus for making me see the light (again) ;)

:wq