Happy First Birthday!

January 6, 2008

This blog has now been running for a year; the first post was Hello World on 17th Jan 2007.

I hadn’t realised it had been going for so long; in that time, I’ve made 41 posts, so I haven’t quite managed to make one post per week 😦 I have been a bit slack lately, for which I do apologise. New Years Resolution: I must make more posts here!

In the meantime, my main site, steve-parker.org, has celebrated its seventh birthday, having been born in June 2000 – looking forward to making the 8th birthday celebrations this June!

IFS – Internal Field Separator

September 26, 2007

It seems like an esoteric concept, but it’s actually very useful.

If your input file is “1 apple steve@example.com”, then your script could say:

while read qty product customer
  echo "${customer} wants ${qty} ${product}(s)"

The read command will read in the three variables, because they’re spaced out from each other.

However, critical data is often presented in spreadsheet format. If you save these as CSV files, it will come out like this:


This contains no spaces, and the above code will not be able to understand it. It will take the whole thing as one item – the first thing, quanity, $qty, and set the other two fields as blank.

The way around this, is to tell the entire shell, that “,” (the comma itself) separates fields; it’s the “internal field separator”, or IFS.

The IFS variable is set to space/tab/newline, which isn’t easy to set in the shell, so it’s best to save the original IFS to another variable, so you can put it back again after you’ve messed around with it. I tend to use “oIFS=$IFS” to save the current value into “oIFS”.

Also, when the IFS variable is set to something other than the default, it can really mess with other code.

Here’s a script I wrote today to parse a CSV file:

oIFS=$IFS     # Always keep the original IFS!
IFS=","          # Now set it to what we want the "read" loop to use
while read qty product customer
  # process the information
  IFS=","       # Put it back to the comma, for the loop to go around again
done < myfile.txt

It really is that easy, and it’s very versatile. You do have to be careful to keep a copy of the original (I always use the name oIFS, but whatever suits you), and to put it back as soon as possible, because so many things invisibly use the IFS – grep, cut, you name it. It’s surprising how many things within the “while read” loop actually did depend on the IFS being the default value.

Pipes Primer

May 8, 2007

The previous post dealt with pipes, though the example may not have been the best for those who are not accustomed to the concept.

There are a few concepts to be understood – mainly, that of two (or more) processes operating together, how they put their data out, and how the get their data in. UNIX deals with multiple processes, all running (conceptually, at least) at the same time, on different CPUs, each with a standard input (stdin), and standard output (stdout). Pipes connect one process’s stdout to another’s stdin.

What do we want to pipe? Let’s say we’ve got a small 80×25 terminal screen, and lots of files. The ls command will spew out tons of data, faster than we can read it. There’s a handy utility called “more“, which will show a screen-worth of text, then prompt “more”. When you hit the space bar, it will scroll down a screen. You can hit ENTER to scroll one line.

I’m sure that you’ve worked this out already, but here is how we combine these two commands:

$ ls | more
<the first screenful of files is shown>

What happens here, is that the “more” command is started up first, then the “ls” command. The output of “ls” is piped to the input of “more”, so it can read the data.

Most such tools can also work another way, too:

$ more myfile.txt
<the first screenful of "myfile.txt" is shown>

That is to say, “myfile.txt” is taken as standard input (stdin).

Pipelines in the Shell

May 3, 2007

One of the most powerful things of the *nix shell, and one which is currently not even covered in the tutorial, is the pipeline. I try to keep this blog and the tutorial from overlapping, but I really must rectify this gap in the main site some time.

In the meantime, this is what it is all about.

UNIX (and therefore GNU/Linux) is full of small text-based utilities : wc to count words (and lines, and characters) in a text file; sort to sort a text file; uniq to get only the unique lines from a text file; grep to get certain lines (but not others) from a text file, and so on.

Did you see the common trait there? Yes, it’s not just that “Everything is a file”, nearly everything is also text. It’s largely from this tradition that HTML, XML, RSS, Email (SMTP, POP, IMAP) and the like are all text-based. Contrast with MS Office, for example, where all data is in binary files which can only (really) be manipulated by the application which created them.

So what? It’s crude, simple, works on an old-fashioned green-and-black screen. How could that be relevant in the 21st Century?

It’s relevant, not because of what each tool itself provides, but what they can do when combined.

Ugly Apache Access Log

Here’s a line from my access.log (Apache2) – Yes, honest, it’s one line. It just looks very ugly: - - [02/May/2007:12:09:58 -0700] "GET /sh/sh.shtml HTTP/1.1" 200 33080 "http://www.google.ca/search?hl=en&q=bourne+shell&btnG=Google+Search&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/"
That’s interesting (well, kind of). But it doesn’t tell us much about this visitor. They came from Google Canada (google.ca), searched for “bourne shell”, and clicked on the link to give them http://steve-parker.org/sh/sh.shtml. They’re using FireFox on Windows.

Great. What happened to them? Did they like the site? Did they stay? Did they actually read anything?

Filtering it

We can use grep to filter out just this visitor. However, this gives us lots of stuff we’re not interested in – all the CSS, PNG, GIF files which support the pages themselves:
$ grep "^ " access_log - - [02/May/2007:12:09:58 -0700] "GET /sh/sh.shtml HTTP/1.1" 200 33080 "http://www.google.ca/search?hl=en&q=bourne+shell&btnG=Google+Search&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:09:58 -0700] "GET /steve-parker.css HTTP/1.1" 200 8757 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:09:59 -0700] "GET /images/1.png HTTP/1.1" 200 471 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:09:59 -0700] "GET /images/prevd.png HTTP/1.1" 200 1397 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:10:00 -0700] "GET /images/2.png HTTP/1.1" 200 648 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/"
... etc ...

This is looking ugly already, and not just because of the small width of the webpage – even at 80 characters wide, this is hard to understand.

Filtering Some More

At first glance, I should be able to pipe this through a grep for html files:

$ grep "^ " access_log | grep shtml

However, Apache also logs the referrer, so even the “GET /images/2.png” request above, includes a .shtml request. So I can use “grep -v“. I’ll add the “-w” option to grep to say that the search term must match a whole word. So – “grep -v gif” would let “gifford.html” through, whereas “grep -vw gif” would not. I’ll add “\” to the code, so that you can cut and paste it… the “\” means that although the line breaks, it’s still really part of the same line of code:
$ grep "^ " access_log | grep -vw css \
| grep -vw gif | grep -vw jpg \
| grep -vw png | grep -vw ico - - [02/May/2007:12:09:58 -0700] "GET /sh/sh.shtml HTTP/1.1" 200 33080 "http://www.google.ca/search?hl=en&q=bourne+shell&btnG=Google+Search&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:10:32 -0700] "GET /sh/variables1.shtml HTTP/1.1" 200 48431 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:13:23 -0700] "GET /sh/external.shtml HTTP/1.1" 200 41322 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:13:45 -0700] "GET /sh/quickref.shtml HTTP/1.1" 200 42454 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/" - - [02/May/2007:12:14:27 -0700] "GET /sh/test.shtml HTTP/1.1" 200 48844 "http://steve-parker.org/sh/sh.shtml" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060426 Firefox/"

This pumps the output through a grep which gets rid of CSS files (“| grep -vw css“), then gif, then jpg, then png, then ico. That should just leave us with the HTML files, which is what we’re really interested in.

It’s really hard to see what’s going on here. The narrow web page doesn’t help, but we just want to get the key information out, which should be nice and simple. It should look good however we view it.

If you look carefully, you can see that this visitor accessed /sh/sh.shtml, then /sh/variables1.shtml (after a minute), /sh/external.shtml (after 3 minutes), then /sh/quickref.shtml (about 20 seconds later, presumably – given the referrer is the same, in a new tab). A second later, they opened /sh/test.shtml (which also suggests that they’ve loaded the two pages in tabs, to read at their leisure).

Getting the Data we need

However, none of this is really very easy to read. If we just want to know what pages they visited, and when, we need to do some more filtering. awk is a very powerful tool, of whose abilities we will only scratch the surface here. We will get “fields” 4 and 7 – the timestamp and the URL accessed.
$ grep "^ " access_log \
| grep -vw css | grep -vw gif \
| grep -vw jpg | grep -vw png \
| grep -vw ico | awk '{ print $4,$7 }'
[02/May/2007:12:09:58 /sh/sh.shtml
[02/May/2007:12:10:32 /sh/variables1.shtml
[02/May/2007:12:13:23 /sh/external.shtml
[02/May/2007:12:13:45 /sh/quickref.shtml
[02/May/2007:12:14:27 /sh/test.shtml

Okay, it’s the info we wanted, but it’s still not great. That “[” looks out of place now. We can use cut to tidy things up. In this case, we’ll use its positional spacing, because we want to get rid of the first character. Cut’s “-c” paramater tells it what character to cut from. We want the 2nd character onwards, so we just add it to the end of the pipe line:
$ grep "^ " access_log | grep -vw css | grep -vw gif | grep -vw jpg | grep -vw png | grep -vw ico | awk '{ print $4,$7 }'|cut -c2-
02/May/2007:12:09:58 /sh/sh.shtml
02/May/2007:12:10:32 /sh/variables1.shtml
02/May/2007:12:13:23 /sh/external.shtml
02/May/2007:12:13:45 /sh/quickref.shtml
02/May/2007:12:14:27 /sh/test.shtml

And that’s the kind of thing that we can do with a pipe. We can get exactly what we want from a file.

Moving on

At the start of this post, we mentioned sort and uniq. Actually, “sort -u” will do the same as “sort | uniq“. So if we want to get the unique visitors, we can just get the first field (“cut -d" " -f1“) and sort it uniquely:
$ cut -d" " -f1 access_log | sort -u

Regular Expressions

April 18, 2007

http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html has a good introduction to Regular Expressions – grep, sed, and friends.

It includes a brief discussion on Backreferences (aka “the stuff that * matched”)

Harnessing the flexibility of Regular Expressions with Grep

February 14, 2007

There are two sides to grep – like any command, there’s the learning of syntax, the beginning
of which I covered in the grep tool tip. I’ll
come back to the syntax later, because there is a lot of it.

However, the more powerful side is grep‘s use of regular expressions. Again, there’s not room
here to provide a complete rundown, but it should be enough to cover 90% of usage. Once I’ve got a library
of grep-related stuff, I’ll post an entry with links to them all, with some covering text.

This or That

Without being totally case-insensitive (which -i) does,
we can search for “Hello” or “hello” by specifying the optional
characters in square brackets:

$ grep [Hh]ello *.txt
test1.txt:Hello. This is  test file.
test3.txt:Why, hello there!

If we’re not bothered what the third letter is, then we can say “grep [Hh]e.o *.txt“, because the dot (“.”) will match any single character.

If we don’t care what the third and fourth letters are, so long as it’s “he..o”, then we say exactly that: “grep he..o” will match “hello”, hecko”, heolo”, so long as it is “he” + 1 character + “lo”.

If we want to find anything like that, other than “hello”, we can do that, too:

$ grep he[^l]lo *.txt

Notice how it doesn’t pick up any of the “Hello” variations which have a “llo” in them?

How many?!

We can specify how many times a character can repeat, too. We have to put the expression we’re talking about in [square brackets]:

  • “?” means “it might be there”
  • “+” means “it’s there, but there might be loads of them”
  • “*” means “lots (or none) might be there”

So, we can match “he”, followed by as many “l”s as you like (even none), followed by an “o” with “grep he[l]*o *.txt“:

$ grep he[l]*o *.txt
test3.txt:Why, hello there!

Tool Tip: “grep”

January 17, 2007

A powerful and useful tool in the shell scripter’s arsenal is grep. If you’ve not come across it before, it’s similar to the “find” tool that DOS had; it finds strings in files. Grep stands for “get regular expression”; a “regular expression” is a string, or something more than just a string.

$ grep foo myfile.txt
and Steve said, "foo! that's crazy"

That searches for “foo” in the file called “myfile.txt”. It gets any line (yes, the whole line) which contains the search text.

But you can do other stuff, with “switches”. For example “-i” means “insensitive to case”:
$ grep -i foo myfile.txt
"Foo" is a word, associated with "Bar".
and Steve said, "foo! that's crazy"

This time, grep finds that the word “foo” is actually mentioned twice in “myfile.txt”; once as “Foo” and once as “foo”.

The “-i” flag is a pretty common one, then, because it’s often what we really want it to find.

Here’s a good one, though: Under Linux, a special file /proc/bus/usb/devices lists your USB devices. That’s good, but yuck, it’s a mess of (too much) detailed information:

T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh= 2
B:  Alloc=  0/900 us ( 0%), #Int=  0, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=0000 ProdID=0000 Rev= 2.06
S:  Manufacturer=Linux 2.6.15-27-server uhci_hcd
S:  Product=UHCI Host Controller
S:  SerialNumber=0000:00:07.2
C:* #Ifs= 1 Cfg#= 1 Atr=c0 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

T:  Bus=01 Lev=01 Prnt=01 Port=01 Cnt=01 Dev#=  2 Spd=12  MxCh= 0
D:  Ver= 1.10 Cls=ff(vend.) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=06b9 ProdID=4061 Rev= 0.00
S:  Manufacturer=ALCATEL
S:  Product=Speed Touch USB
S:  SerialNumber=0090D00D0B25
C:* #Ifs= 3 Cfg#= 1 Atr=80 MxPwr=500mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=ff(vend.) Sub=00 Prot=00 Driver=usbfs

How do I just get what I need from the file? One switch to grep, which I don’t use as much as I should, is “-A”, for “After”. (Note that it’s a capital “A”).

After the Vendor ID and Product ID, /proc/bus/usb/devices includes the name of the device, so I can find out what I’ve got installed with a Vendor ID of 06b9 quite easily:

$ grep -A 2 06b9 /proc/bus/usb/devices
P: Vendor=06b9 ProdID=4061 Rev= 0.00
S: Manufacturer=ALCATEL
S: Product=Speed Touch USB

Or what have I got from Alcatel?
$ grep -i -A1 Alcatel /proc/bus/usb/devices
S: Manufacturer=ALCATEL
S: Product=Speed Touch USB

I can also ask: Who made my Speed Touch modem, or what’s its ID? “-B” displays lines before the line that matches:

$ grep -B 2 Speed /proc/bus/usb/devices
P: Vendor=06b9 ProdID=4061 Rev= 0.00
S: Manufacturer=ALCATEL
S: Product=Speed Touch USB

There’s a lot you can do with grep; I’ve only really covered the first line from “man grep”