AWK hackery++

While I cannot take credit for the solution that follows (the real star is Pontus), what I can do is document it for posterity:

The scenario is as follows: You have a number of log files, foo.X.log for X in [1, 2, ..., N-1, N], and you want to watch all of them for a certain pattern, P.

All of the matching results should be written to a condensed log, in the format <originatingLogFile>: <lineThatMatchedPattern>

You have figured out that tail -f foo.*.log will yield all the lines, separating output from the various files with ==> foo.X.log <== headers.
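This header behavior is easy to see even without a live follow. Given a couple of throwaway files (the names here are purely illustrative), a plain multi-file tail prints the same headers:

```shell
# Create two throwaway logs and tail them both; tail prints a
# "==> file <==" header before each file's output.
printf 'a\n' > foo.1.log
printf 'b\n' > foo.2.log
tail -n 1 foo.*.log
# ==> foo.1.log <==
# a
#
# ==> foo.2.log <==
# b
```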

You might have tried grep to get the correct lines, and perhaps even got it to write to the file using --line-buffered, but you are not out of the woods: that pesky filename at the beginning of the line is still nowhere to be seen.

grep? where we’re going, we don’t need grep

Every time tail outputs a line from a different file than the one last indicated by ==> ... <==, it prints that header again on the line before the output.

E.g.:

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <==

==> foo.3.log <==

Now, from another terminal, while the tail is still following all three of those logs, let us insert some content:

$ echo "PATTERN, some more stuff" >> foo.2.log

Which results in the first terminal now displaying:

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <== 

==> foo.3.log <==

==> foo.2.log <==
PATTERN, some more stuff

OK, so whenever new content is pushed onto any of the tailed files, we get notified of it, along with which file it was pushed to. (Note, though, that if we were to push more content into the same file as before (foo.2.log), it would NOT generate another ==> foo.2.log <== line; it would simply append to the ongoing tail:)

$ echo "PATTERN, even more stuff" >> foo.2.log

$ tail -f foo.*.log
==> foo.1.log <==

==> foo.2.log <==

==> foo.3.log <==

==> foo.2.log <==
PATTERN, some more stuff
PATTERN, even more stuff

This doesn’t matter much, since our solution adapts to it, but it is a gotcha worth noting (although it should be instantly obvious to anyone reproducing this on their own anyway).

Anyway, with that out of the way, let’s get down to business: AWK hackery!

First order of business: We have a pattern we wish to match. This may or may not change at a later date, and sure, we could practice YAGNI (You Ain’t Gonna Need It), but let us prepare for it anyway by assigning the pattern to be matched in the invocation of AWK (this assumes we have stored the pattern to be matched in the variable “$_P”):

awk -v pattern="$_P"

Since AWK operates on lines (a simplification, since one could redefine what constitutes a record) and the line it processed in the previous iteration is gone once the next one arrives, we need to store the name of the file into which the latest line was added.

To achieve this we need to detect whenever new content has been added to a new file (i.e. whenever a new line of the form ==> foo.*.log <== appears).

BEGIN { filename = "" }
/==> .* <==/ { filename = $2; next }

Since BEGIN triggers before AWK starts any processing, it is the perfect place to define our placeholder variable filename, in which we will temporarily store the name of the file into which data is presently being entered.

The second rule matches any line containing ==>, followed by anything, followed by <==. Upon encountering such a line, we store the stuff between the arrows in the filename variable.

$0 signifies the entire line, and everything inside that line, separated by whitespace, is assigned to a numbered variable $1 .. $NF, where NF is the number of whitespace-separated fields in the line. (Again, we could reconfigure AWK to split on something other than whitespace, but in this case there would be no sense in doing that.)
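The field splitting can be seen in action by feeding a single header line to AWK by hand:

```shell
# On "==> foo.2.log <==" the default whitespace splitting yields
# $1 = "==>", $2 = "foo.2.log", $3 = "<==".
echo '==> foo.2.log <==' | awk '{ print $2 }'
# prints: foo.2.log
```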

And just to ensure that we won’t ever match the ==> .* <== line against whatever we put in $_P, we tell AWK that once we have stored the filename, it should move on to the next line.

Next on the agenda is to match the pattern:

$0 ~ pattern { printf "%s: %s\n", filename, $0 }

Essentially this tells AWK to look inside $0 (the whole current line) for the stuff we have stored in the variable pattern (i.e. $_P) and, if there is a match, print the value stored in the filename variable, then a colon and a space, followed by the entire line that matched ($0), and finally a newline.
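The rule can be tried in isolation. Since there is no real tail output here, the filename variable is hard-coded in a BEGIN block purely for the sake of the demonstration:

```shell
printf 'nothing here\nPATTERN, some more stuff\n' |
awk -v pattern="PATTERN" '
    BEGIN { filename = "foo.2.log" }  # stand-in for a real tail header
    $0 ~ pattern { printf "%s: %s\n", filename, $0 }
'
# prints: foo.2.log: PATTERN, some more stuff
```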

Rinse, repeat, until we abort the command with Ctrl+C.

Now, this is not the end of the story though. We wanted to write this to a file, but AWK does not write to the file until it has processed all of its input, and that will never happen since tail is following and will not end until we kill it (and in doing that we lose it all before it has a chance to be written to the file). pesa points out that I got this wrong; scroll down and read his comment for the real deal.

Doh!

What we need is a way to tell AWK to flush its output to the file directly after it has found a match and printed it. fflush() is the function that makes this happen for us:

$0 ~ pattern { printf "%s: %s\n", filename, $0; fflush() }

Now all we have to do is wrap the whole thing up and redirect to the condensed log file we want to establish:

tail -f foo.*.log | awk -v pattern="$_P" '
    BEGIN { filename = "" }
    /==> .* <==/ { filename = $2; next }
    $0 ~ pattern { printf "%s: %s\n", filename, $0; fflush() }
' > condensedLog.log
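Since tail -f never terminates, the easiest way to convince yourself that the script does the right thing is to fake tail’s output with printf and run the very same AWK program over it:

```shell
# Simulated tail output: a header, a non-matching line, a blank line,
# another header, and finally a matching line.
printf '==> foo.1.log <==\nnoise\n\n==> foo.2.log <==\nPATTERN, some more stuff\n' |
awk -v pattern="PATTERN" '
    BEGIN { filename = "" }
    /==> .* <==/ { filename = $2; next }
    $0 ~ pattern { printf "%s: %s\n", filename, $0; fflush() }
'
# prints: foo.2.log: PATTERN, some more stuff
```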

There is one big gotcha here though:

If you start a tail on log files which are already populated, tail will first output everything from the first log (which file counts as “first” presumably depends on some system settings; locales come to mind), then move on to the next file, and so on. So regardless of the chronological appearance of matching lines in the different files, the first file’s matching lines will always be presented first.

All matching lines entered AFTER this one-liner has started running will appear in the correct chronological order.

This simply means that you will have to be prepared for it, and possibly do some post-processing on the condensed log file as well (hopefully the logs you are tailing all have timestamps in them, so that sort has something easy to operate on once the lines are in the condensed log).
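Assuming (and this is purely an assumption about your log format) that each line carries a timestamp right after the filename prefix, a plain sort keyed on the second whitespace-separated field is enough to restore chronological order:

```shell
# Two condensed-log lines that arrived out of chronological order;
# sort -k2 compares from the timestamp (second field) to end of line.
printf '%s\n' \
    'foo.2.log: 2012-01-01 12:00:02 PATTERN b' \
    'foo.1.log: 2012-01-01 12:00:01 PATTERN a' |
sort -k2
# prints the foo.1.log line (the earlier timestamp) first
```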

:wq


2 Responses to “AWK hackery++”

  1. pesa says:

    Nice post :)
    However, I have a comment on one statement regarding awk and fflush:

    “But AWK does not write to file until it has processed all of the input, and this will not happen since tail is following and will not end until we kill it (and in doing that we lose it all before it has a chance to be written to file).”

    This is not entirely correct, I’m afraid. When you redirect the output to a file it will just buffer its output for efficiency purposes. When the buffer is full, it will write the whole buffer to the file you redirect to.
    If the process gets killed (not -9) it should shutdown gracefully by flushing all buffers before exiting.

  2. Patrik says:

    Thanks! I enjoyed writing it :)

    Heh, it all seemed so clear that night, there and then in the moment.
    Just goes to show; the moment you think you’ve got it figured; you’re wrong ;)

    Thanks for catching this my friend :)

    Cheers