Posts Tagged ‘grep’

AWK hackery++

Thursday, September 13th, 2012

While I cannot take credit for the solution which follows (the real star is Pontus) what I can do is document it for posterity:

The scenario is as follows: You have a couple of log files, foo.X.log for X in [1,2,..,N-1,N] and you want to watch all of these for a certain pattern, P.

And all of these matching results you want to write to a condensed log, in the format <originatingLogFile>: <lineThatMatchedPattern>

You have figured out that tail -f foo.*.log will yield you all the lines, and separate output from the various files using ==> foo.X.log <==

You might have tried with grep, to get the correct lines, and perhaps even got it to output to the file using --line-buffered but you are not out of the woods, that pesky filename at the beginning of the line is still nowhere to be seen.

grep? where we’re going, we don’t need grep

Every time tail outputs a line from a file other than the one most recently indicated by ==> ... <==, it first prints a new header line of that form on the line before the output.


$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <==

==> foo.3.log <==

Now, from another terminal, while the tail is still following all three of those logs, let us insert some content:

$ echo "PATTERN, some more stuff" >> foo.2.log

Which results in the first terminal now displaying:

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <== 

==> foo.3.log <==

==> foo.2.log <==
PATTERN, some more stuff

Ok, so whenever new content is pushed onto any one of the tailed files, we will get notified of it, along with what file it was pushed to. Note though, that if we were to push more content into the same file as was previously pushed to (foo.2.log) it would NOT generate another ==> foo.2.log <== line, it would simply append to the ongoing tail:

$ echo "PATTERN, even more stuff" >> foo.2.log

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <==

==> foo.3.log <==

==> foo.2.log <==
PATTERN, some more stuff
PATTERN, even more stuff

This doesn’t matter much, our solution will adapt to this, but it is a gotcha worth noting (although it should be instantly obvious to anyone reproducing this on their own anyway)

Anyway, with that out of the way, let’s get down to business: AWK hackery!

First order of business: We have a pattern we wish to match. This may or may not change at a later date, and sure, we could practice YAGNI (You Ain’t Gonna Need It) but let us prepare for it anyway by assigning the pattern to be matched in the invocation of AWK (this will assume that we have stored the pattern to be matched inside the variable “$_P”):

awk -v pattern="$_P"

Since AWK operates on lines (that is a simplification since one could redefine what constitutes a line) and it won’t remember whatever was in the line it processed in the previous iteration (previous line) we need to store the latest file(name) into which a line was added.

To achieve this we need to detect whenever new content has been added to a new file (i.e. whenever a new line consisting of ==> foo.*.log <== appears).

BEGIN { filename = "" }
/==> .* <==/ { filename = $2; next }

Since BEGIN triggers before AWK starts any processing, it is the perfect place to define our placeholder variable filename in which we are going to temporarily store the name of the file into which data is presently being entered.

The next line matches any line containing ==> followed by anything really, followed by <==. Upon encountering one of those lines, we store the stuff between the arrows in the filename variable.

$0 signifies the entire line, and everything inside that line, separated by newlines (start and end of line) or whitespace, is assigned a numbered variable $1 .. $x for x in number of whitespace-separated words in line. (Again, we could reconfigure AWK to not split things on whitespace, but in this case there would be no sense in doing that.)

And just to ensure that we won’t ever match the ==> .* <== line with whatever we put in $_P we tell AWK that once we have stored the filename, we should move on to the next line.

Next on the agenda is to match the pattern:

$0 ~ pattern { printf "%s: %s\n", filename, $0 }

Essentially this tells AWK to look inside $0 (the whole (current) line) for the stuff we have stored in the variable pattern (i.e. $_P) and if there is a match, print it out, first the value stored in the filename variable, then a colon and a space, followed by the entire line that matched ($0) and finally a newline.

Rinse, repeat, until we abort the command with Ctrl-C.

Now, this is not the end of the story though. We wanted to write this to a file. But when its output is redirected to a file rather than a terminal, AWK buffers what it prints and only writes it out once the buffer fills up or the program exits. With a slow trickle of matches that can take a long time, and since tail is following and will not end until we kill it, whatever is still sitting in the buffer at that point may be lost. (My original explanation here, that AWK waits until it has processed all of the input, was wrong; pesa pointed this out, go downwards and read his comment for the real deal.)


What we need is a way to tell AWK to write to file directly after it has found a match and printed. fflush() is the function to make this happen for us:

$0 ~ pattern { printf "%s: %s\n", filename, $0; fflush() }

Now all we have to do is wrap the whole thing up, redirect to the filename of the condensed log we want to establish:

tail -f foo.*.log | awk -v pattern="$_P" '
    BEGIN { filename = "" }
    /==> .* <==/ { filename = $2; next }
    $0 ~ pattern { printf "%s: %s\n", filename, $0; fflush() }
' > condensedLog.log
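To try the AWK program without waiting for live logs, one can feed it canned input that mimics what tail -f prints. The following is only a sketch: the sample lines, the function name sample_tail_output and the pattern value are made up for illustration, and the header match is anchored a little more tightly than in the one-liner above.

```shell
# Simulated `tail -f` output: header lines followed by content lines.
# (sample_tail_output and its contents are hypothetical test data.)
sample_tail_output() {
    printf '%s\n' \
        '==> foo.1.log <==' \
        'nothing to see here' \
        '' \
        '==> foo.2.log <==' \
        'PATTERN, some more stuff' \
        'PATTERN, even more stuff' \
        '' \
        '==> foo.3.log <==' \
        'PATTERN in the third log'
}

_P='PATTERN'   # the pattern to look for, as in the post

condensed=$(sample_tail_output | awk -v pattern="$_P" '
    # Remember which file the following lines belong to.
    /^==> .* <==$/ { filename = $2; next }
    # Prefix every matching line with the remembered file name.
    $0 ~ pattern   { printf "%s: %s\n", filename, $0; fflush() }
')

printf '%s\n' "$condensed"
```

This prints the two matching lines from foo.2.log followed by the one from foo.3.log, each prefixed with its file name, while the non-matching line from foo.1.log is dropped.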

There is one big gotcha here though:

If you are starting a tail on log files which are already populated, tail will first output the last lines (ten by default) of the first log (which file is read first I suppose depends on some system settings (locales come to mind)) and then move on to the next file, etc. So regardless of the chronological appearance of matching lines in the different files, the first file’s matching lines will always be presented first.

All matching lines entered AFTER this one-liner has started running, will appear in the correct chronological order.

This simply means that you will have to be prepared for this, and possibly do processing on the condensed log file as well (hopefully the logs you are tailing all have timestamps in them so that sort has something easy to operate on once the lines are entered into the condensed log)
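As a rough sketch of that post-processing, assume (and this is purely an assumption, your logs may look different) that every matched line starts with a sortable date and time such as 2012-09-13 10:00:05. In the condensed log those end up as fields two and three, right after the file name, so sort can be pointed at them:

```shell
# Hypothetical condensed log: "<file>: <date> <time> <rest of line>".
condensed_sample='foo.1.log: 2012-09-13 10:00:05 PATTERN a
foo.1.log: 2012-09-13 10:00:09 PATTERN b
foo.2.log: 2012-09-13 10:00:01 PATTERN c
foo.2.log: 2012-09-13 10:00:07 PATTERN d'

# Sort on fields 2-3 (date and time); -s keeps the original order of
# lines which share a timestamp. LC_ALL=C avoids locale surprises.
sorted=$(printf '%s\n' "$condensed_sample" | LC_ALL=C sort -s -k2,3)

printf '%s\n' "$sorted"
```

The -s (stable) flag is not in POSIX but is available in both GNU and BSD sort. Timestamps in other formats (e.g. "Sep 13 10:00:05") would need different sort keys.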



Sunday, June 24th, 2012

Quite a while since I wrote a post now, I’ve not been sick or anything, but there has been a lot of work abound, and outside work I prioritized sleeping over writing. But now I’m back for the moment, so let’s get down to business :)

Since last time I’ve come up with new ways of abusing awk, such as having it find the highest value from a command outputting in the following syntax:

\t<characters, integers, fullstop>: <integer>\n

To make it a little more different, the command also spits out a header, as well as an additional newline after the end of output.

I just now, while writing this, came up with a different solution, which doesn’t use awk:

theCommand | grep -v '^[^[:space:]]' | tr -d ' ' | cut -d':' -f2 | sort -rn | head -n 1

but what I ended up using was:

theCommand | awk 'BEGIN { highest = 0 } $0 ~ /^[ \t]/ { if ( $2 > highest ) { highest = $2 } } END { print highest }'

In this case, from what I can gather, awk is the more efficient solution. One process versus five.

Update: As Werner points out, the if statement isn’t really necessary (which also makes it possible to cut out the BEGIN statement as well):

theCommand | awk '/^[ \t]/ && $2 > highest { highest = $2 } END { printf "%d\n", highest }'
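A quick way to convince oneself that this does what it should is to replace theCommand with a stand-in which prints a header, a few tab-indented "name: value" lines and a trailing empty line. The stand-in and its values below are made up; the $2+0 is a defensive addition of mine which forces a numeric comparison, not something the original one-liner needed.

```shell
# Hypothetical stand-in for the real command: header, data lines, empty line.
theCommand() {
    printf 'Some header\n'
    printf '\tfoo.1: 12\n'
    printf '\tbar2.x: 107\n'
    printf '\tbaz: 9\n'
    printf '\n'
}

# Only lines starting with whitespace are considered; the highest second
# field seen so far is kept and printed at the end.
highest=$(theCommand | awk '/^[ \t]/ && $2+0 > highest { highest = $2+0 }
                            END { printf "%d\n", highest }')

echo "$highest"   # prints 107
```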


  • ditaa (a.k.a DIagrams Through Ascii Art) makes it easy to generate nice-looking diagram images from… rather nice-looking ASCII diagrams
  • docopt, a command-line interface description language, which also seems to support generating the parser for the CLI being described
  • Peity for generating different types of charts using jQuery and <canvas>
  • interacting with web pages, programmatically

As of late I have been thinking a great deal about backups and the project which seems the most interesting to me is Duplicity.




Sunday, November 20th, 2011

First of all: this is really disturbing.

Commands and flags

I think I’ve already mentioned watch, and how that could be useful at times (e.g. $ watch -n 10 -d 'ls -l')

I just found out about a value which can optionally be appended to the -d flag: -d=cumulative

It has its own flag as well --cumulative, and quoting the man-page it makes highlighting “sticky”, presenting a running display of all positions that have ever changed.

Also, this week I learnt about sdiff, which seems neat if you’re on a system which doesn’t have vim (and thus vimdiff) installed.

Another nice flag I just found for grep is -m <int> which tells grep to stop looking after the first INT matches.

Scripting Vim

Ok, so I’ve been running into this problem where I am using my own .vimrc configuration in other places, in systems where the vim version isn’t the same as the one I use myself.

This has proven problematic as some of the configuration options I use (most notably set cul (which gives me a better indication about which line the cursor is on)) don’t exist in … say a vim version less than 7.

Which meant that if I loaded the same .vimrc config on a system running a vim version earlier than 7, I’d get a warning at startup, which I’d have to press enter to pass by. Irritating.

As luck would have it, it isn’t all that difficult to make a little conditional to check which version is currently loading the config and just ignore the settings which won’t work for that version, such as:

if v:version >= 700
    set cul
endif

Finally, at this year’s FSCONS I was introduced to the site where people can go to either contribute CPU-cycles, or get CPU-cycles, (or both) to help speed up rendering.



Sunday, October 23rd, 2011


Progress! This week I wrote my first perl script, to parse some data on one of my colleague’s nodes. In doing so I also, inadvertently, made another one of my colleagues express something along the lines of “very nice, now we have another scripting guy on our team.” ;D

grep count occurrences on single line

Say you have a line (or multiple, that you are iterating through one at a time) of data structured in some way representable and matchable by a regular expression, and that you feel an overwhelming need to count the number of occurrences in each line.

Did you ever imagine that grep and a couple of pipes were all you’d ever need to realize this wish?

$ echo "foo foo foo" | grep -o 'foo' | grep -c 'foo'
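Since grep -o puts every match on a line of its own, the second grep only counts lines, so wc -l does the same job. Below is a small sketch of the "iterating through one line at a time" case; the helper names count_matches and count_per_line are my own and not from any tool.

```shell
# Count how many times pattern $1 occurs in the single line $2.
count_matches() {
    # grep -o emits one match per line; wc -l counts them.
    # tr strips the padding some wc implementations add.
    printf '%s\n' "$2" | grep -o -- "$1" | wc -l | tr -d ' '
}

# Read lines from stdin and print "<count>: <line>" for each of them.
count_per_line() {
    while IFS= read -r line; do
        printf '%s: %s\n' "$(count_matches "$1" "$line")" "$line"
    done
}

report=$(printf '%s\n' 'foo foo foo' 'bar foo' 'no match here' | count_per_line 'foo')
printf '%s\n' "$report"
```

For the three sample lines this reports 3, 1 and 0 occurrences respectively.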

Important Dates Notifier

Saturday, January 29th, 2011

I have never been especially good at remembering dates or appointments. My memory just doesn’t seem constructed to handle things like “on YYYY-MM-DD do X” (peculiarly enough, “on Friday next week, have X done” works much better).

I guess things relative to “now” works better for me (at least in the short term) than some abstract date some time into the future.

Luckily enough, I don’t seem to be the only one suffering from having an “appointment-impaired memory” so others have created calendar applications and what not.

These work great, the one I presently use, remind is awesome. But sometimes it isn’t “intrusive” enough.

There are some appointments/dates that are important enough that I would like for remind to hunt me down an alley and smack me over the head with the notification, as opposed to presenting it to me when I query it for today’s appointments.

So I put together a little shell script which pushes notifications (the ones I feel are the most important) to me via jabber.

The solution involves cron, remind, grep and a nifty little program called sendxmpp.

This script I have placed on my “laptop-made-server” which coincidentally also happens to be the central mercurial server, through which I synchronize the repositories on my desktop and netbook.

Which means that if I just take care to clone the specific repository containing my remind files to some place local on the server, I could have a cronjob pull and update that repository and it would thus always (as long as I have pushed changes made from the desktop/netbook) have the most up to date files available.

If setting up a repository server seems too big of a hassle, one could of course (at least with remind) have a master ~/.reminders file, which then invokes an INCLUDE expression.

This makes remind look in the specified directory for other files, and in that directory have one file for each computer (along the lines of .rem) and have each individual machine scp (and overwrite) their individual file every now and then.

In any case, once the server has fresh and updated sources of data, all it need do is execute the script once every day (preferably early, this will work pretty well as the jabber-server I use caches all messages sent to me while I was offline, so once I log in again, I’ll get the notifications then).

As sendxmpp takes the message from STDIN, recipient-addresses as parameters, and parses a configuration file to divine what account should be used to send the message (I set up a “bot” account to send from, and then just authorized that bot in my primary jabber-account), I see no reason why someone couldn’t modify the script to instead (or in addition) use something like msmtp to send out an email instead.

The script itself, in its current form, is rather straightforward, although I’m sure there is still room for optimizations.


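# TAGS holds the space-separated tags to look for (its definition is not shown here)
for t in $TAGS; do
    rem | grep -i "$t" | while read -r line; do
        # the recipient address(es) go after sendxmpp as parameters
        echo "$line" | sendxmpp
    done
done
exit 0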

Relevant parts of my crontab:

0   7   *   *   *       cd /home/patrik/remind-repo; /usr/bin/hg pull -u 2>&1
5   7   *   *   *       cd /home/patrik/idn-repo; /usr/bin/hg pull -u 2>&1
10  7   *   *   *       /bin/bash /home/patrik/bin/ 2>&1

In /home/patrik I have created symlinks /home/patrik/.remind -> /home/patrik/remind-repo/.remind and /home/patrik/.reminders -> /home/patrik/remind-repo/.reminders

And in /home/patrik/bin/ I have a symlink ( to /home/patrik/idn-repo/ So in case I change the script, like add a tag to look for or something (ok, that should be moved out to a configuration file, that will be part of the next version) that will be picked up as well, before the notifications go out.

And that’s about it. Risk of forgetting something important: mitigated.


Awk hackery

Sunday, May 30th, 2010

I’ve always leaned more towards sed than awk, since I’ve always gotten the impression that they have more or less the same capabilities, just with different syntaxes.

But the more command line parsing I do, the more I’ve begun to realize that there are certain things I find easier to do in awk, while some things are easier with sed.

One of these things, that I find awk a better tool for, is getting specific columns of data from structured files (most often, but not limited to, logs).

I have for some time known about

cat somefile | awk '{ print $1 }'

Which will output the first column of every line from somefile. A couple of weeks ago, I needed to fetch two columns from a file (I can’t remember now what the file or task was, I’ll substitute with a poor example instead)

ls -l | awk '{ print $1, $8 }'

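This will give you the permissions and names of all directories and files in pwd (which field holds the name depends on the date format ls uses; with the default C locale format it is $9 rather than $8). One could of course switch places of $1 and $8 (i.e. print $8, $1) to get names first and then the permissions.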

Recently I found myself needing to find all the commands executed from a crontab (part of a script, to create another script which was to verify that a migration had gone right, by allowing me to execute those commands whenever I wanted, and not just whenever the crontab specified)

Luckily for me, and this blogpost ;), none of those commands were executed with parameters, and since I am too lazy to actually count how many fields there are in a crontab file, I got to use:

crontab -l | grep '^[*0-9]' | awk '{ print $(NF) }'

Which lists the content of the present user’s crontab, finds all lines which either begin with a number or an asterisk, and then prints the last column of that line. Magic!
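To see the filter and the $(NF) trick at work without touching a real crontab, here is a sketch where crontab -l is replaced by a function printing a made-up crontab (the job paths are hypothetical):

```shell
# Hypothetical crontab contents: a comment, two jobs and a variable assignment.
sample_crontab() {
    cat <<'EOF'
# m h dom mon dow command
0 7 * * * /usr/local/bin/first-job
*/5 * * * * /usr/local/bin/second-job
MAILTO=nobody
EOF
}

# Keep only lines starting with a digit or an asterisk (the job lines),
# then print the last whitespace-separated field of each.
commands=$(sample_crontab | grep '^[*0-9]' | awk '{ print $NF }')

printf '%s\n' "$commands"
```

Only the two job paths come out; the comment and the MAILTO line are filtered away by grep. As noted above, this only works as long as the commands take no parameters, since $NF would otherwise be the last parameter rather than the command.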

Netbooks, bash-scripting and rmmod

Saturday, September 12th, 2009

I recently bought a netbook (Acer Aspire One A531H) which I promptly installed Ubuntu Netbook Remix on. This has worked very well so far, and except for an early problem with wlan (which was fixed after a couple of minutes worth of searching and reading) the only real problem I have had with this little guy is something I experience with all laptops.

The Problem

The sensitive touch-pad of doom. Perhaps I am doing something wrong, I don’t know, but touch-pads ALWAYS give me trouble (mostly by “conveniently” moving the cursor to another part of the text while I am writing something).

I tried a workaround, using syndaemon with the flag -d, to have the touch-pad temporarily disabled while using the keyboard, moving that into a script and configuring that script to run at startup. It is a nice idea, but it re-enables the touch-pad too quickly again, so while cutting the number of incidents by more than half, I still wasn’t satisfied.

On my regular laptop, which I always connect an external trackball to, I have permanently disabled the touch-pad (sudo rmmod psmouse at startup) but permanently disabling it on the netbook wouldn’t work either, since some tasks (like web-surfing, no I haven’t gotten around to learning the vimperator add-on just yet) are quite a lot easier with a mouse than without it.

The Solution

So what I really wanted was a convenient way of quickly enabling and disabling the touch-pad, when I needed to.

Reusing old knowledge about how to manually add shortcuts to metacity, all I had to do was to create a script which ascertained the status of the psmouse module (loaded or not) and upon that, either removed or loaded it.

To get the state of a module Foo, one can use lsmod | grep Foo, which in this case leads to lsmod | grep psmouse. This will either yield nothing (module not loaded) or a line (module loaded).

We can improve on this a bit, making sure we always get some kind of value returned, something like lsmod | grep Foo | wc -l. Since the last command in the chain now counts the number of lines that were returned from grep, we now either get 0 or 1 returned.
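The same 0-or-1 answer can be had from grep itself, since grep -c counts matching lines. A small sketch, with lsmod replaced by a function printing made-up module data so that it runs anywhere; anchoring the pattern to the start of the line is my own addition, to avoid counting other modules that merely mention the name in their "Used by" column:

```shell
# Hypothetical stand-in for lsmod, so the example does not depend on the host.
fake_lsmod() {
    printf '%s\n' \
        'Module                  Size  Used by' \
        'psmouse                73312  0' \
        'snd_pcm                98304  2'
}

# Print 1 if module $1 is listed, 0 otherwise.
module_count() {
    fake_lsmod | grep -c "^$1 "
}

module_count psmouse    # prints 1
module_count nosuchmod  # prints 0
```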

So there I was, thinking I am done, having entered the gconf-editor, pointed the script to command_2 (apps > metacity > keybinding_commands) and assigned a key-binding (<Control><Alt>t) to run_command_2 (apps > metacity > global_keybindings). Life was playing, all was well. Except for the fact that hitting that key combination did absolutely nothing to shut down the touch-pad.

Which was odd, since running the script worked. The individual commands to disable and enable the touch-pad (sudo rmmod psmouse and sudo modprobe psmouse, respectively) worked flawlessly… so why didn’t this work?

Then it hit me. Running either of those commands from the command-line, would result in it prompting me for my password, something a poor script without any ability to accept input from stdin can’t do. It couldn’t even tell me about it since there was no stdout for it to use either.

gksudo to the rescue. Since gksudo pushes up a graphical password prompt, the script could once more ask me for a password, and I could again supply it. And now it works nicely :D

In closing, the script:


# psmouse not loaded (no lines from grep): load it; otherwise unload it
if [ "$(lsmod | grep psmouse | wc -l)" -eq 0 ]; then
    gksudo modprobe psmouse
else
    gksudo rmmod psmouse
fi
exit 0

Strange things you find out about your system half past six on a Thursday morning

Thursday, May 28th, 2009

Woke up somewhere around 0500 hours, heartburn… couldn’t go back to sleep so landed in front of the computer. Read an article (in Swedish) at about EU and the Telecoms-package nonsense. Apparently cookies are still unsafe… uh-huh.

There was a comment to that article about Local_Shared_Objects which caught my eye, and after having examined my ~/.macromedia-directory I could conclude that Flash stores its “cookies” there. To my surprise they took up quite some space, so I removed those domain-directories which lay inside the “random-id” directory.

For some reason, while Googling in order to ascertain whether it would be safe to remove the directories (I found nothing that indicated it would be safe, nor that it wouldn’t be safe), I found a post about an Ubuntu user who needed help cleaning up his “filled-to-the-brim” partition, and asking what he could remove.

Some responses told him to set his eye on /var/log among other places, and realizing that it was quite some time since I did that myself, I too headed for /var/log

And I started chopping away at the gzipped archive files there (to be honest, it was one fell “sudo rm *.gz” swoop, but who is counting?)

du -sh . indicated there was still some 309 Mb of “stuff” in /var/log (down from 312 Mb or something) so I was not impressed. What was taking up all that space?

Digging a little further I finally zeroed in on the guilty party. /var/log/acpid occupying 297 Mb of my harddrive. Running tail on that file a couple of times made me realize that it made entries into that log more than once every second…

So just to ensure that this wasn’t all just a spur of logging activity caused by stupid me poking around the system, I told grep to find all lines containing the string “May 27” and counted the lines of that output: grep 'May 27' ./acpid | wc -l, which returned around 1.2 million hits. (In retrospect this would match previous years’ May 27 as well, which means I could have been grepping lines as far back as May 2007. This is a Feisty-box, although I am pretty sure that it took me a while after Feisty was released to give up Edgy. All in all, I don’t think I had Feisty installed by May 27th 2007, so two years’ worth of logs.)

I assume an equal distribution of entries per year, so 600.000 entries made yesterday. 600000 (log entries) / 86400 (seconds in a day) is almost 7 writes a second!
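For anyone wanting to redo the arithmetic, shell integer division would round the result down to 6, so this little sketch lets awk do the division instead (the figures are the ones from the paragraph above):

```shell
# Entries per second, assuming 600000 entries spread evenly over one day.
entries_per_day=600000
seconds_per_day=$((24 * 60 * 60))

# LC_ALL=C makes sure the decimal separator is a dot.
rate=$(LC_ALL=C awk -v e="$entries_per_day" -v s="$seconds_per_day" \
    'BEGIN { printf "%.2f", e / s }')

echo "$seconds_per_day seconds, $rate writes per second"
# prints: 86400 seconds, 6.94 writes per second
```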

This was clearly not acceptable. I hit Google again, what would be the best way to kill all acpi logging? The launchpad bug report I found indicates that the bug is closed, having been fixed, which is good, once I upgrade when my harddrive goes to… whatever place harddrives go when they have served their time, this will not come back to haunt me.

But Feisty isn’t being bug fixed anymore, so how would I do it?

By adding the arguments “-l /dev/null” to whatever script starts the acpi daemon (acpid). I.e. /etc/init.d/acpid

Again, solutions offered in the forums seemed to target a different version (probably older) than Feisty, as I could not find a line containing $ACPID_BIN = /sbin/acpid

I did however find out that my version used start-stop-daemon to umm… start the daemon. Which takes the flag --exec [arg] and, after a -- separator, [args] (arg being the path to a daemon to start, and args being the arguments to pass to the daemon)

Very nice!

start-stop-daemon --start --quiet --exec /usr/sbin/acpid -- -c /etc/acpi/events $OPTIONS

became:

start-stop-daemon --start --quiet --exec /usr/sbin/acpid -- -c /etc/acpi/events $OPTIONS -l /dev/null

I stopped and restarted the acpid (since the restart sequence looked a little different and I didn’t want to muck with that, I know my own illiteracy and incompetence ;)), killed off the acpid log, and my /var/log is now down to 12 Mb in size all in all.

Reading further in the bug report it would seem that this little acpid “I’m gonna log the shit out of you” behaviour is, to some extent, connected to the laptop harddrive-killing bug. Thankfully my harddrive seems to have survived that bug quite well (probably due to my early hacking of /etc/hdparm.conf as per this page).

LaTeX, ligatures and grep

Thursday, April 16th, 2009

Having finally finished a long overdue paper, I thought I’d share a little knowledge, well, semi-knowledge/-ugly hack actually, that I have found useful while working on this paper.

I like justified text, I think it makes the content look sharp. LaTeX seems to agree with me on that point, at least in the style I used (report). Justified text in LaTeX has one drawback however. Sometimes the letter spacing between certain letters becomes too small, resulting in what I surmise typographers call “broken ligatures”. The term “ligature” seems to simply refer to a specific part of a letter. A broken ligature, then, would happen when the ligature in a preceding letter “floats into” the next one.

Justified text is sharp, justified text with broken ligatures… not so much. And LaTeX doesn’t seem to be fully able to handle this on its own, so manual intervention seems necessary. (It could of course just be that the version I use (texlive) is silly, but I recall having similar problems back in Uni while I used tetex)

In any case, ugly-hacking tiem!


First priority: find all occurrences of potential broken ligatures.

One could visually (using the ole trusty eyeball mk.1) scan the generated document for imperfections. That takes a lot of time and there is a large risk that some occurrences “slip through”. Also, in some places the ligatures won’t be broken, because the text has a good fit on the row at present time. But then someone adds a word, a sentence, or just fixes a grammatical bug, whatever, and the fit is not so good anymore.

Of course, it is wholly unnecessary to run this procedure until the document is “frozen” and won’t accept any more additions to it in terms of text. I ran it three times, one time before each “beta”/“release candidate” which I sent to some friends for critique/proof-reading/sanity checking, and then once more after having incorporated the input from my friends.

To identify potential trouble, grep is called in to find every instance of the character combinations which can break. In my experience, these combinations are “ff”, “fi” and “fl”.

$ grep -rn 'f[fil]' chapters/*.tex

Only lower-case letters seem to cause trouble, but that is an assumption I make. I could well see problems stemming from having an initial lower-case f, followed by an upper-case letter. I have never encountered this, so I don’t search for it, but as usual, ymmv.

Now I have a nifty little list with all occurrences of the letter sequences “ff”, “fi” and “fl”, nice! Now what?


The solution should, preferably, be applied to nearly all instances of these sequences, so that a present “good fit” line, if modified, would just automagically work later on as well. This means that the solution should not screw up the formatting of the “good fit” cases, while kicking into action, iff the good fit turns bad.

The solution I use is “\hbox{}”. This is inserted between the characters (f\hbox{}f, f\hbox{}i, f\hbox{}l) What makes this ugly is of course that your LaTeX code is now littered with this… well umm… shit. This method will of course give your spell checker a nervous breakdown.

Now you are probably thinking that this is a non-issue, just create a small shell-script to use sed, and produce new files with the modified content, copy these files into a build directory and have the make script invoke that shell-script before invoking the build command.

There is a potential pitfall in that solution. My paper linked to a couple of websites, as in clickable hyperlinks inside the pdf. Imagine the fun that would be derived when sed would hit upon \url{} and transform that into \url{http://www.openof\hbox{}f\hbox{}}.
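To make the pitfall concrete, here is a sketch of what such a naive sed substitution could look like, and what it does to a link. The sed expression is my guess at the kind of script one would write, not something I actually ran on the paper, and the URL is a made-up example:

```shell
# Naive approach: put \hbox{} between f and a following f, i or l, everywhere.
break_ligatures() {
    sed 's/f\([fil]\)/f\\hbox{}\1/g'
}

# Works as intended on plain prose ...
prose=$(printf '%s\n' 'a different file' | break_ligatures)
printf '%s\n' "$prose"
# prints: a dif\hbox{}ferent f\hbox{}ile

# ... but happily mangles the address inside \url{} too (hypothetical URL).
link=$(printf '%s\n' '\url{http://example.org/offline}' | break_ligatures)
printf '%s\n' "$link"
# prints: \url{http://example.org/of\hbox{}fline}
```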

Making sed aware of the \url{} tag, and verbatim quotes (probably all of the quoting systems), and making it leave the content inside well enough alone is probably doable, but having my favorite text-editor do an interactive search/replace was the method I opted for.


Monday, August 25th, 2008

Grep is one of those tools that every GNU/Linux user should have at least a rudimentary understanding of. You will get by without it of course, but it can speed up things quite a bit.

Just today a friend and former classmate had a problem: In a large C++ code base, find the one file printing a specific error message. Opening every file and manually checking them: not feasible and surely not cost-effective.

He asked me for any insight in searching, and from the ole’ toolbox I brought grep. Now I will readily admit, I am no superuser, or guru or anything of the sort. My grep skills are not what they probably should be, so my first attempts were rather unsuccessful.

Framing the problem even more, the .cpp files were spread over a number of directories, and in the project root directory there were no files, only directories.

Since I mostly program in Python those were the files I had available to test my grep commands on:

$ grep 'import' *.py

I was greeted with an error, *.py no such file or directory. But the syntax was right, right? Went into a sub directory containing python files, ran the same command again, and was rewarded with a list of files.

Ok, so the problem wasn’t the syntax, it was targeting. What about

$ grep -R 'import' *.py

Again with the error message… ok, quick check in the man-page, yes, -R -r or --recursive all work, great, next try:

$ grep -r 'import' ./*.py

That error message is getting tedious… what about

$ grep -r 'import' ./

Now we are rolling, but it is chewing on things I have no interest in listing… like Vim’s .swp files etc. How do we fix that? Enter the man-page again, aha --include

$ grep -r --include '*.py' 'import' ./

Very nice, recursive search throughout all sub directories for files ending with .py containing the string ‘import’. Now to help him out a little more, let’s add -n also, so that he will see on what line the error message is printed.

$ grep -r -n --include '*.py' 'import' ./

And there you have it. Just one of the various uses of grep.
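For the record, the reason the first attempts failed is that it is the shell, not grep, which expands *.py, and it does so against the current directory only. With no .py files there, grep is handed the literal string *.py as a file name, recursive or not. The sketch below reproduces the final command in a throw-away directory; the directory layout and file contents are made up for the occasion:

```shell
# Build a small hypothetical project: no files in the root, only directories.
workdir=$(mktemp -d)
mkdir -p "$workdir/pkg/sub"
printf 'import os\nprint("hi")\n' > "$workdir/pkg/a.py"
printf 'x = 1\nimport sys\n'      > "$workdir/pkg/sub/b.py"
printf 'import swap junk\n'       > "$workdir/pkg/.a.py.swp"

# Recursive search from the root, restricted to *.py, with line numbers.
# The quotes keep the shell from expanding *.py before grep sees it.
hits=$(cd "$workdir" && grep -r -n --include '*.py' 'import' ./ | sort)

printf '%s\n' "$hits"
rm -rf "$workdir"
```

The two .py files are listed together with the line numbers of their import statements, while the .swp file is left out even though it contains the search string.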