AWK

AWK is a standard tool on every POSIX-compliant UNIX system. It’s like flex/lex, but run from the command line, and it suits text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons, manual memory management, or static typing. (You should use semicolons anyway, because they are required when you put several statements on one line, as in the one-liners AWK excels at.) You can call it from a shell script, or you can use it as a stand-alone scripting language.
Why use AWK instead of Perl? Readability. AWK is easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.
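As a rough illustration of the one-liner style, the sketch below calls AWK from the shell to split colon-delimited lines and print two of their fields. The function name `user_and_shell` and the sample input are made up for this example; `-F` sets the input field separator and `OFS` sets the output field separator.

```shell
# Hypothetical helper: print the first and last field of each
# colon-delimited line read from standard input, joined by a tab.
user_and_shell() {
    awk -F: 'BEGIN { OFS = "\t" } { print $1, $NF }'
}

# Sample input invented for this sketch (passwd-style lines).
printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' \
              'alice:x:1000:1000:Alice:/home/alice:/bin/bash' | user_and_shell
```

With the sample input, each output line holds a user name and a shell separated by a tab. The same program could be saved in a file and run with `awk -f`, which is the form the rest of this page uses.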
#!/usr/bin/awk -f

# Comments are like this

# AWK programs consist of a collection of patterns and actions.
pattern1 { action; } # just like lex
pattern2 { action; }

# There is an implied loop: AWK automatically reads and parses each
# record of each file supplied. Each record is split into fields by the FS
# delimiter, which defaults to white-space (a run of spaces or tabs counts
# as one separator). You can assign FS either on the command line (-F C) or
# in your BEGIN pattern.

# One of the special patterns is BEGIN. The BEGIN pattern is true
# BEFORE any of the files are read. The END pattern is true after
# an end-of-file from the last file (or standard input if no files are
# specified). There is also an output field separator (OFS) that you can
# assign, which defaults to a single space.

BEGIN {

    # BEGIN will run at the beginning of the program. It's where you put all
    # the preliminary set-up code, before you process any text files. If you
    # have no text files, then think of BEGIN as the main entry point.

    # Variables are global. Just set them or use them, no need to declare.
    count = 0;

    # Operators just like in C and friends
    a = count + 1;
    b = count - 1;
    c = count * 1;
    d = count / 1; # division (floating point, not integer: 7 / 2 is 3.5)
    e = count % 1; # modulus
    f = count ^ 1; # exponentiation

    a += 1;
    b -= 1;
    c *= 1;
    d /= 1;
    e %= 1;
    f ^= 1;

    # Incrementing and decrementing by one
    a++;
    b--;

    # As a prefix operator, it returns the incremented value
    ++a;
    --b;

    # Semicolons at the end of a line are optional, because a newline also
    # terminates a statement. This file uses them anyway, since one-liners
    # require them.

    # Control statements
    if (count == 0)
        print "Starting with count of 0";
    else
        print "Huh?";

    # Or you could use the ternary operator. Parenthesize the whole
    # expression so that print cannot misread the leading parenthesis.
    print (count == 0 ? "Starting with count of 0" : "Huh?");

    # Blocks consisting of multiple lines use braces
    while (a < 10) {
        # A newline ends a statement, so continue a long one with a backslash
        print "String concatenation is done" " with a series" " of" \
            " space-separated strings";
        print a;

        a++;
    }

    for (i = 0; i < 10; i++)
        print "Good ol' for loop";

    # As for comparisons, they're the standards:
    # a < b   # Less than
    # a <= b  # Less than or equal
    # a != b  # Not equal
    # a == b  # Equal
    # a > b   # Greater than
    # a >= b  # Greater than or equal

    # Logical operators as well
    # a && b  # AND
    # a || b  # OR
    # !a      # NOT

    # In addition, there's the super useful regular expression match
    if ("foo" ~ "^fo+$")
        print "Fooey!";
    if ("boo" !~ "^fo+$")
        print "Boo!";

    # Arrays
    arr[0] = "foo";
    arr[1] = "bar";

    # You can also initialize an array with the built-in function split().
    # It clears arr first, stores the pieces in arr[1], arr[2], ... and
    # returns how many there were (here n => 3).

    n = split("foo:bar:baz", arr, ":");

    # You also have associative arrays (indeed, they're all associative arrays)
    assoc["foo"] = "bar";
    assoc["bar"] = "baz";

    # And multi-dimensional arrays, with some limitations
    multidim[0,0] = "foo";
    multidim[0,1] = "bar";
    multidim[1,0] = "baz";
    multidim[1,1] = "boo";

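    # One such limitation, sketched here as an addition to the original
    # example: AWK has no real nested arrays. multidim[0,1] is a single
    # element whose key is the string 0 SUBSEP 1, where SUBSEP is a built-in
    # variable. To loop over the elements, split each key on SUBSEP.
    # (idx is a scratch array introduced only for this sketch.)
    for (key in multidim) {
        split(key, idx, SUBSEP);
        print idx[1], idx[2], multidim[key];
    }

    # Membership tests on such arrays take a parenthesized index list
    if ((0,1) in multidim)
        print "multidim[0,1] exists";
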
    # You can test for array membership
    if ("foo" in assoc)
        print "Fooey!";

    # You can also use the 'in' operator to traverse the keys of an array
    # (the order in which the keys are visited is unspecified)
    for (key in assoc)
        print assoc[key];

    # The command line is in a special array called ARGV
    for (argnum in ARGV)
        print ARGV[argnum];

    # You can remove elements of an array
    # This is particularly useful to prevent AWK from assuming the arguments
    # are files for it to process
    delete ARGV[1];

    # The number of command line arguments is in a variable called ARGC
    # (it counts the program name in ARGV[0] as well)
    print ARGC;

    # AWK has several built-in functions. They fall into three categories. I'll
    # demonstrate each of them in their own functions, defined later.

    return_value = arithmetic_functions(a, b, c);
    string_functions();
    io_functions();
}

# Here's how you define a function
function arithmetic_functions(a, b, c,     d) {

    # Probably the most annoying part of AWK is that there are no local
    # variables. Everything is global. For short scripts, this is fine, even
    # useful, but for longer scripts, this can be a problem.

    # There is a work-around (ahem, hack). Function arguments are local to the
    # function, and AWK allows you to define more function arguments than it
    # needs. So just stick local variables in the function declaration, like I
    # did above. As a convention, stick in some extra whitespace to distinguish
    # between actual function parameters and local variables. In this example,
    # a, b, and c are actual parameters, while d is merely a local variable.

    # Now, to demonstrate the arithmetic functions

    # Most AWK implementations have some standard trig functions
    d = sin(a);
    d = cos(a);
    d = atan2(b, a); # arc tangent of b / a

    # And logarithmic stuff
    d = exp(a);
    d = log(a);

    # Square root
    d = sqrt(a);

    # Truncate floating point to integer
    d = int(5.34); # d => 5

    # Random numbers
    srand(); # Optionally takes a seed. By default, it uses the time of day
    d = rand(); # Random number, 0 <= d < 1

    # Here's how to return a value
    return d;
}

function string_functions(    localvar, arr) {

    # AWK, being a string-processing language, has several string-related
    # functions, many of which rely heavily on regular expressions.

    # Search and replace, first instance (sub) or all instances (gsub)
    # Both return the number of replacements made
    localvar = "fooooobar";
    sub("fo+", "Meet me at the ", localvar); # localvar => "Meet me at the bar"
    gsub("e", ".", localvar); # localvar => "M..t m. at th. bar"

    # Search for a string that matches a regular expression
    # index() does the same thing, but doesn't allow a regular expression
    match(localvar, "t"); # => 4, since the 't' is the fourth character

    # Split on a delimiter
    n = split("foo-bar-baz", arr, "-");
    # result: arr[1] = "foo"; arr[2] = "bar"; arr[3] = "baz"; n = 3

    # Other useful stuff
    sprintf("%s %d %d %d", "Testing", 1, 2, 3); # => "Testing 1 2 3"
    substr("foobar", 2, 3); # => "oob"
    substr("foobar", 4); # => "bar"
    length("foo"); # => 3
    tolower("FOO"); # => "foo"
    toupper("foo"); # => "FOO"
}

function io_functions(    localvar) {

    # You've already seen print
    print "Hello world";

    # There's also printf
    printf("%s %d %d %d\n", "Testing", 1, 2, 3);

    # AWK doesn't have file handles, per se. It will automatically open a file
    # handle for you when you use something that needs one. The string you used
    # for this can be treated as a file handle, for purposes of I/O. This makes
    # it feel sort of like shell scripting, but to refer to the same output
    # again, the string must match exactly, so use a variable:

    outfile = "/tmp/foobar.txt";

    print "foobar" > outfile;

    # Now the string outfile is a file handle. You can close it:
    close(outfile);

    # Here's how you run something in the shell
    system("echo foobar"); # => prints foobar

    # Reads the next line of the main input (the files named on the command
    # line, or standard input if there are none) and stores it in localvar
    getline localvar;

    # Reads a line from a pipe (again, use a string so you close it properly)
    cmd = "echo foobar";
    cmd | getline localvar; # localvar => "foobar"
    close(cmd);

    # Reads a line from a file and stores in localvar
    infile = "/tmp/foobar.txt";
    getline localvar < infile;
    close(infile);
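
    # A sketch of reading a whole file, added to the original example:
    # getline returns 1 when it reads a line, 0 at end of file, and -1 on
    # error, so compare the result against 0 explicitly. A bare
    # while (getline localvar < infile) would loop forever on an unreadable
    # file, because -1 counts as true.
    while ((getline localvar < infile) > 0)
        print localvar;
    close(infile);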
}

# As I said at the beginning, AWK programs consist of a collection of patterns
# and actions. You've already seen the BEGIN pattern. Other
# patterns are used only if you're processing lines from files or standard
# input.
#
# When you pass arguments to AWK, they are treated as file names to process.
# It will process them all, in order. Think of it like an implicit for loop,
# iterating over the lines in these files. These patterns and actions are like
# switch statements inside the loop.

/^fo+bar$/ {

    # This action will execute for every line that matches the regular
    # expression, /^fo+bar$/, and will be skipped for any line that fails to
    # match it. Let's just print the line:

    print;

    # Whoa, no argument! That's because print has a default argument: $0.
    # $0 holds the current line being processed. It is created
    # automatically for you.

    # You can probably guess there are other $ variables. Every line is
    # implicitly split before every action is called, much like the shell
    # does. And, like the shell, each field can be accessed with a dollar sign

    # This will print the second and fourth fields in the line
    print $2, $4;

    # AWK automatically defines many other variables to help you inspect and
    # process each line. The most important one is NF

    # Prints the number of fields on this line
    print NF;

    # Print the last field on this line
    print $NF;
}

# Every pattern is actually a true/false test. The regular expression in the
# last pattern is also a true/false test, but part of it was hidden. If you
# don't give it a string to test, it will assume $0, the line that it's
# currently processing. Thus, the complete version of it is this:

$0 ~ /^fo+bar$/ {
    print "Equivalent to the last pattern";
}

a > 0 {
    # This will execute once for each line, as long as a is positive
}

# You get the idea. Processing text files, reading in a line at a time, and
# doing something with it, particularly splitting on a delimiter, is so common
# in UNIX that AWK is a scripting language that does all of it for you, without
# you needing to ask. All you have to do is write the patterns and actions
# based on what you expect of the input, and what you want to do with it.

# Here's a quick example of a simple script, the sort of thing AWK is perfect
# for. It will read a name from standard input and then will print the average
# age of everyone with that first name. Let's say you supply as an argument the
# name of this data file:
#
# Bob Jones 32
# Jane Doe 22
# Steve Stevens 83
# Bob Smith 29
# Bob Barker 72
#
# Here's the script:

BEGIN {

    # First, ask the user for the name
    print "What name would you like the average age for?";

    # Get a line from standard input, not from files on the command line
    getline name < "/dev/stdin";
}

# Now, match every line whose first field is the given name
$1 == name {

    # Inside here, we have access to a number of useful variables, already
    # pre-loaded for us:
    # $0 is the entire line
    # $3 is the third field, the age, which is what we're interested in here
    # NF is the number of fields, which should be 3
    # NR is the number of records (lines) seen so far
    # FILENAME is the name of the file being processed
    # FS is the field separator being used, which is " " here
    # ...etc. There are plenty more, documented in the man page.

    # Keep track of a running total and how many lines matched
    sum += $3;
    nlines++;
}

# Another special pattern is called END. It runs after all the input has been
# read and processed according to the rules and actions you've provided.
# Unlike BEGIN, it needs input: when no files are given, AWK reads standard
# input to the end before running END. The purpose of it is usually to output
# some kind of final report, or do something with the aggregate of the data
# you've accumulated over the course of the script.

END {
    if (nlines)
        print "The average age for " name " is " sum / nlines;
}
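
One way to try the average-age script is sketched below. It assumes the sample data is saved as `ages.txt` and a condensed copy of the script as `average.awk`; both file names, the scratch directory, and the `average_age` helper are placeholders chosen for this example. It also assumes a system where `/dev/stdin` can be opened, as on Linux.

```shell
# Work in a scratch directory; the file names below are placeholders.
workdir=$(mktemp -d)

# The sample data file described in the comments of the script.
cat > "$workdir/ages.txt" <<'EOF'
Bob Jones 32
Jane Doe 22
Steve Stevens 83
Bob Smith 29
Bob Barker 72
EOF

# A condensed copy of the average-age script, without its comments.
cat > "$workdir/average.awk" <<'EOF'
BEGIN {
    print "What name would you like the average age for?";
    getline name < "/dev/stdin";
}
$1 == name {
    sum += $3;
    nlines++;
}
END {
    if (nlines)
        print "The average age for " name " is " sum / nlines;
}
EOF

# Hypothetical helper: supply the name on standard input and the data
# file as a command-line argument.
average_age() {
    printf '%s\n' "$1" | awk -f "$workdir/average.awk" "$workdir/ages.txt"
}

average_age Bob
```

For Bob, the three matching ages are 32, 29 and 72, so after the prompt the script should print `The average age for Bob is 44.3333`. A name that matches no line prints only the prompt, because the END action is guarded by `if (nlines)`.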