Pipelines in the Shell

One of the most powerful features of the *nix shell – and one which is currently not covered in the tutorial at all – is the pipeline. I try to keep this blog and the tutorial from overlapping, but I really must fix this gap on the main site some time.

In the meantime, this is what it is all about.

UNIX (and therefore GNU/Linux) is full of small text-based utilities: wc to count words (and lines, and characters) in a text file; sort to sort a text file; uniq to strip repeated (adjacent) lines from a text file; grep to pull out certain lines (but not others) from a text file, and so on.
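Each one is a one-trick pony. Assuming a plain text file called notes.txt (the name is purely for illustration), they look like this:
$ wc notes.txt           # counts of lines, words and characters
$ sort notes.txt         # the file's lines, in sorted order
$ uniq notes.txt         # the lines, with adjacent duplicates skipped
$ grep shell notes.txt   # only the lines containing "shell"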

Did you see the common trait there? Yes – it’s not just that “everything is a file”; nearly everything is also text. It’s largely from this tradition that HTML, XML, RSS, email (SMTP, POP, IMAP) and the like are all text-based. Contrast this with MS Office, for example, where all the data is in binary files which can only (realistically) be manipulated by the application which created them.

So what? It’s crude and simple, and it works on an old-fashioned green-and-black screen. How could that be relevant in the 21st century?

It’s relevant not because of what each tool provides by itself, but because of what they can do when combined.
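For a quick taste, here’s a classic word-frequency pipeline (again using the hypothetical notes.txt): tr breaks the text into one word per line, sort brings the duplicates together, “uniq -c” counts each one, “sort -rn” puts the biggest counts first, and head keeps the top ten:
$ tr -cs '[:alpha:]' '\n' < notes.txt | sort | uniq -c | sort -rn | head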

Ugly Apache Access Log

Here’s a line from my access.log (Apache 2) – yes, honest, it’s one line. It just looks very ugly:
12.106.111.10 - - [02/May/2007:12:09:58 -0700] "GET /sh/sh.shtml HTTP/1.1" 200 33080 "http://www.google.ca/search?hl=en&q=bourne+shell&btnG=Google+Search&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
That’s interesting (well, kind of), and it already tells us a little about this visitor: they came from Google Canada (google.ca), searched for “bourne shell”, and clicked on the link which took them to http://steve-parker.org/sh/sh.shtml. They’re using Firefox 1.5.0.3 on Windows.

Great. What happened to them? Did they like the site? Did they stay? Did they actually read anything?

Filtering it

We can use grep to pick out just this visitor’s requests. (Strictly speaking, the dots in the pattern below should be escaped, as in “^12\.106\.111\.10 ”, since an unescaped “.” matches any character; the simpler version works fine here, though.) However, this gives us lots of stuff we’re not interested in – all the CSS, PNG and GIF files which support the pages themselves:
$ grep "^12.106.111.10 " access_log
12.106.111.10 - - [02/May/2007:12:09:58 -0700] "GET /sh/sh.shtml HTTP/1.1" 200 33080 "http://www.google.ca/search?hl=en&q=bourne+shell&btnG=Google+Search&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:09:58 -0700] "GET /steve-parker.css HTTP/1.1" 200 8757 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:09:59 -0700] "GET /images/1.png HTTP/1.1" 200 471 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:09:59 -0700] "GET /images/prevd.png HTTP/1.1" 200 1397 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:10:00 -0700] "GET /images/2.png HTTP/1.1" 200 648 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
... etc ...

This is looking ugly already, and not just because of the narrow width of the web page – even at 80 characters wide, this is hard to take in.
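As a quick aside, if all we wanted was a sense of scale, grep’s “-c” option prints just the number of matching lines (that is, how many requests this visitor made in total) instead of the lines themselves:
$ grep -c "^12.106.111.10 " access_log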

Filtering Some More

At first glance, I should just be able to pipe this through a second grep, to keep only the .shtml files:

$ grep "^12.106.111.10 " access_log | grep shtml

However, Apache also logs the referrer, so even the “GET /images/2.png” request above contains “.shtml” (in its referrer field), and a simple grep for shtml matches it. So instead I’ll turn the logic around with “grep -v”, which discards matching lines rather than keeping them. I’ll also add the “-w” option to grep, to say that the search term must match a whole word; without it we could discard too much. For example, “grep -v gif” would throw away a line mentioning “gifford.html”, whereas “grep -vw gif” would let it through. To see the difference in isolation (using echo to fake a line of input):
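$ echo "gifford.html" | grep -v gif
$ echo "gifford.html" | grep -vw gif
gifford.html

The first command prints nothing, because “gif” matches inside “gifford”; the second keeps the line, because “gif” doesn’t appear there as a whole word. Armed with that, here’s the full chain. I’ve added “\” to the code so that you can cut and paste it; a trailing “\” means that although the line breaks, it’s still really part of the same line of code: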
$ grep "^12.106.111.10 " access_log | grep -vw css \
| grep -vw gif | grep -vw jpg \
| grep -vw png | grep -vw ico
12.106.111.10 - - [02/May/2007:12:09:58 -0700] "GET /sh/sh.shtml HTTP/1.1" 200 33080 "http://www.google.ca/search?hl=en&q=bourne+shell&btnG=Google+Search&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:10:32 -0700] "GET /sh/variables1.shtml HTTP/1.1" 200 48431 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:13:23 -0700] "GET /sh/external.shtml HTTP/1.1" 200 41322 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:13:45 -0700] "GET /sh/quickref.shtml HTTP/1.1" 200 42454 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"
12.106.111.10 - - [02/May/2007:12:14:27 -0700] "GET /sh/test.shtml HTTP/1.1" 200 48844 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3"

This pumps the output through a grep which gets rid of the CSS files (“| grep -vw css”), then another for gif, then jpg, then png, then ico. That should leave just the HTML files, which are what we’re really interested in.
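Incidentally, those five greps could be collapsed into one. A sketch using grep’s “-E” option (extended regular expressions, where “|” separates alternatives), which should be equivalent on a log like this:
$ grep "^12.106.111.10 " access_log | grep -vwE "css|gif|jpg|png|ico"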

It’s still quite hard to see what’s going on here. The narrow web page doesn’t help, but the real issue is that we only want a couple of key pieces of information from each line, and those should be nice and simple to read, however we view them.

If you look carefully, you can see that this visitor accessed /sh/sh.shtml, then /sh/variables1.shtml (about half a minute later), then /sh/external.shtml (after nearly 3 minutes), then /sh/quickref.shtml (about 20 seconds after that – and since its referrer is still /sh/sh.shtml, presumably opened in a new tab from the first page). About 40 seconds later, they opened /sh/test.shtml, again with the first page as the referrer, which suggests that they’ve loaded these pages in tabs, to read at their leisure.

Getting the Data we need

However, none of this is really very easy to read. If we just want to know which pages they visited, and when, we need to do some more filtering. awk is a very powerful tool, and we’ll only scratch the surface of its abilities here. By default awk splits each line on whitespace into numbered “fields”, so we can pull out fields 4 and 7 – the timestamp and the URL accessed.
$ grep "^12.106.111.10 " access_log \
| grep -vw css | grep -vw gif \
| grep -vw jpg | grep -vw png \
| grep -vw ico | awk '{ print $4,$7 }'
[02/May/2007:12:09:58 /sh/sh.shtml
[02/May/2007:12:10:32 /sh/variables1.shtml
[02/May/2007:12:13:23 /sh/external.shtml
[02/May/2007:12:13:45 /sh/quickref.shtml
[02/May/2007:12:14:27 /sh/test.shtml
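As an aside, awk could have done the filtering as well: it has already split each line into fields, so it can test the request field ($7) directly, sidestepping the referrer problem without any of the “grep -vw” stages. A sketch against the same log format:
$ awk '$1 == "12.106.111.10" && $7 ~ /\.shtml$/ { print $4, $7 }' access_log
The output is the same either way.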

Okay, it’s the info we wanted, but it’s still not great: that “[” looks out of place now. We can use cut to tidy things up. In this case, we’ll cut by character position, because all we want is to get rid of the first character. cut’s “-c” parameter tells it which characters to keep; “-c2-” means “from the 2nd character onwards”, so we just add that to the end of the pipeline:
$ grep "^12.106.111.10 " access_log | grep -vw css | grep -vw gif | grep -vw jpg | grep -vw png | grep -vw ico | awk '{ print $4,$7 }'|cut -c2-
02/May/2007:12:09:58 /sh/sh.shtml
02/May/2007:12:10:32 /sh/variables1.shtml
02/May/2007:12:13:23 /sh/external.shtml
02/May/2007:12:13:45 /sh/quickref.shtml
02/May/2007:12:14:27 /sh/test.shtml
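For what it’s worth, awk’s substr function could have stripped the “[” itself, merging the last two stages of the pipeline into one; a sketch:
$ grep "^12.106.111.10 " access_log \
| grep -vw css | grep -vw gif \
| grep -vw jpg | grep -vw png \
| grep -vw ico | awk '{ print substr($4, 2), $7 }'
Here substr($4, 2) returns field 4 from its 2nd character onwards. cut is arguably clearer, though; pick whichever reads better to you.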

And that’s the kind of thing that we can do with a pipeline: pull exactly the information we want out of a file.

Moving on

At the start of this post, we mentioned sort and uniq. As it happens, “sort -u” does the same job as “sort | uniq”. So if we want a list of unique visitors, we can take just the first field of each line – the client’s IP address – with “cut -d" " -f1” (“-d" "” makes the space a field delimiter, and “-f1” selects the first field), and sort the result uniquely:
$ cut -d" " -f1 access_log | sort -u
