AWK is a standard tool on every POSIX-compliant UNIX system. It’s like flex/lex, but driven from the command line, and well suited to text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons, manual memory management, or static typing. (You should use semicolons anyway: they are needed to separate statements on a single line, and one-liners are something AWK excels at.) You can call it from a shell script, or you can use it as a stand-alone scripting language.
Why use AWK instead of Perl? Readability. AWK is easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.
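As a rough illustration of the "read line by line and split on a delimiter" case, here is a minimal sketch run from the shell. The sample records, the `:` delimiter, and the variable names `records` and `swapped` are made up for this example.

```shell
# Hypothetical sample input: colon-separated "name:age" records.
records='alice:30
bob:25
carol:41'

# -F sets the field separator; the action runs once per input line.
# $1 and $2 are the first and second fields of the current line.
swapped=$(printf '%s\n' "$records" | awk -F: '{ print $2, $1 }')

printf '%s\n' "$swapped"
```

The comma in `print $2, $1` joins the two fields with the output field separator, which is a single space by default.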
#!/usr/bin/awk -f

# Comments are like this

# AWK programs consist of a collection of patterns and actions.
pattern1 { action; } # just like lex
pattern2 { action; }

# There is an implied loop: AWK automatically reads and parses each
# record of each file supplied. Each record is split by the FS delimiter,
# which defaults to whitespace (runs of spaces and tabs count as one).
# You can assign FS either on the command line (-F C) or in your BEGIN
# pattern.

# One of the special patterns is BEGIN. The BEGIN pattern is true
# BEFORE any of the files are read. The END pattern is true after
# an end-of-file from the last file (or standard input if no files are
# specified). There is also an output field separator (OFS) that you can
# assign, which defaults to a single space.

BEGIN {

    # BEGIN will run at the beginning of the program. It's where you put all
    # the preliminary set-up code, before you process any text files. If you
    # have no text files, then think of BEGIN as the main entry point.

    # Variables are global. Just set them or use them, no need to declare.
    count = 0;

    # Operators just like in C and friends
    a = count + 1;
    b = count - 1;
    c = count * 1;
    d = count / 1; # division (floating point, not integer)
    e = count % 1; # modulus
    f = count ^ 1; # exponentiation

    a += 1;
    b -= 1;
    c *= 1;
    d /= 1;
    e %= 1;
    f ^= 1;

    # Incrementing and decrementing by one
    a++;
    b--;

    # As a prefix operator, it returns the incremented value
    ++a;
    --b;

    # Notice, also, that a newline is enough to terminate a statement;
    # the semicolons used here are optional at the end of a line.

    # Control statements
    if (count == 0)
        print "Starting with count of 0";
    else
        print "Huh?";

    # Or you could use the ternary operator
    print (count == 0) ? "Starting with count of 0" : "Huh?";

    # Blocks consisting of multiple lines use braces
    while (a < 10) {
        # A newline ends a statement, so continue a long line with a backslash
        print "String concatenation is done" " with a series" " of" \
            " space-separated strings";
        print a;

        a++;
    }

    for (i = 0; i < 10; i++)
        print "Good ol' for loop";

    # As for comparisons, they're the standards:
    # a < b   # Less than
    # a <= b  # Less than or equal
    # a != b  # Not equal
    # a == b  # Equal
    # a > b   # Greater than
    # a >= b  # Greater than or equal

    # Logical operators as well
    # a && b  # AND
    # a || b  # OR

    # In addition, there's the super useful regular expression match
    if ("foo" ~ "^fo+$")
        print "Fooey!";
    if ("boo" !~ "^fo+$")
        print "Boo!";

    # Arrays
    arr[0] = "foo";
    arr[1] = "bar";

    # You can also initialize an array with the built-in function split().
    # split() clears the array first and numbers the elements from 1.

    n = split("foo:bar:baz", arr, ":");

    # You also have associative arrays (indeed, they're all associative arrays)
    assoc["foo"] = "bar";
    assoc["bar"] = "baz";

    # And multi-dimensional arrays, with some limitations I won't mention here
    multidim[0,0] = "foo";
    multidim[0,1] = "bar";
    multidim[1,0] = "baz";
    multidim[1,1] = "boo";

    # You can test for array membership
    if ("foo" in assoc)
        print "Fooey!";

    # You can also use the 'in' operator to traverse the keys of an array
    # (the order of traversal is unspecified)
    for (key in assoc)
        print assoc[key];

    # The command line is in a special array called ARGV
    for (argnum in ARGV)
        print ARGV[argnum];

    # You can remove elements of an array
    # This is particularly useful to prevent AWK from assuming the arguments
    # are files for it to process
    delete ARGV[1];

    # The number of command line arguments is in a variable called ARGC
    print ARGC;

    # AWK has several built-in functions. They fall into three categories. I'll
    # demonstrate each of them in their own functions, defined later.

    return_value = arithmetic_functions(a, b, c);
    string_functions();
    io_functions();
}

# Here's how you define a function
function arithmetic_functions(a, b, c,     d) {

    # Probably the most annoying part of AWK is that there are no local
    # variables. Everything is global. For short scripts, this is fine, even
    # useful, but for longer scripts, this can be a problem.

    # There is a work-around (ahem, hack). Function arguments are local to the
    # function, and AWK allows you to define more function arguments than it
    # needs. So just stick local variables in the function declaration, like I
    # did above. As a convention, stick in some extra whitespace to distinguish
    # between actual function parameters and local variables. In this example,
    # a, b, and c are actual parameters, while d is merely a local variable.

    # Now, to demonstrate the arithmetic functions

    # Most AWK implementations have some standard trig functions
    d = sin(a);
    d = cos(a);
    d = atan2(b, a); # arc tangent of b / a

    # And logarithmic stuff
    d = exp(a);
    d = log(a); # natural logarithm

    # Square root
    d = sqrt(a);

    # Truncate floating point to integer
    d = int(5.34); # d => 5

    # Random numbers
    srand(); # Supply a seed as an argument. By default, it uses the time of day
    d = rand(); # Random number between 0 and 1.

    # Here's how to return a value
    return d;
}

function string_functions(    localvar, arr) {

    # AWK, being a string-processing language, has several string-related
    # functions, many of which rely heavily on regular expressions.

    # Search and replace, first instance (sub) or all instances (gsub)
    # Both return number of matches replaced
    localvar = "fooooobar";
    sub("fo+", "Meet me at the ", localvar); # localvar => "Meet me at the bar"
    gsub("e", ".", localvar); # localvar => "M..t m. at th. bar"

    # Search for a string that matches a regular expression
    # index() does the same thing, but doesn't allow a regular expression
    match(localvar, "t"); # => 4, since the 't' is the fourth character

    # Split on a delimiter
    n = split("foo-bar-baz", arr, "-");
    # result: arr[1] = "foo"; arr[2] = "bar"; arr[3] = "baz"; n = 3

    # Other useful stuff
    sprintf("%s %d %d %d", "Testing", 1, 2, 3); # => "Testing 1 2 3"
    substr("foobar", 2, 3); # => "oob"
    substr("foobar", 4); # => "bar"
    length("foo"); # => 3
    tolower("FOO"); # => "foo"
    toupper("foo"); # => "FOO"
}

function io_functions(    localvar) {

    # You've already seen print
    print "Hello world";

    # There's also printf
    printf("%s %d %d %d\n", "Testing", 1, 2, 3);

    # AWK doesn't have file handles, per se. It will automatically open a file
    # handle for you when you use something that needs one. The string you used
    # for this can be treated as a file handle, for purposes of I/O. This makes
    # it feel sort of like shell scripting, but to get the same output, the
    # string must match exactly, so use a variable:

    outfile = "/tmp/foobar.txt";

    print "foobar" > outfile;

    # Now the string outfile is a file handle. You can close it:
    close(outfile);

    # Here's how you run something in the shell
    system("echo foobar"); # => prints foobar

    # Reads the next line of the main input (the files named on the command
    # line, or standard input if there are none) and stores it in localvar
    getline localvar;

    # Reads a line from a pipe (again, use a string so you close it properly)
    cmd = "echo foobar";
    cmd | getline localvar; # localvar => "foobar"
    close(cmd);

    # Reads a line from a file and stores in localvar
    infile = "/tmp/foobar.txt";
    getline localvar < infile;
    close(infile);
}

# As I said at the beginning, AWK programs consist of a collection of patterns
# and actions. You've already seen the BEGIN pattern. Other
# patterns are used only if you're processing lines from files or standard
# input.
#
# When you pass arguments to AWK, they are treated as file names to process.
# It will process them all, in order. Think of it like an implicit for loop,
# iterating over the lines in these files. These patterns and actions are like
# switch statements inside the loop.

/^fo+bar$/ {

    # This action will execute for every line that matches the regular
    # expression, /^fo+bar$/, and will be skipped for any line that fails to
    # match it. Let's just print the line:

    print;

    # Whoa, no argument! That's because print has a default argument: $0.
    # $0 holds the current line being processed. It is set automatically
    # for you.

    # You can probably guess there are other $ variables. Every line is
    # implicitly split before every action is called, much like the shell
    # does. And, like the shell, each field can be accessed with a dollar sign.

    # This will print the second and fourth fields in the line
    print $2, $4;

    # AWK automatically defines many other variables to help you inspect and
    # process each line. The most important one is NF.

    # Prints the number of fields on this line
    print NF;

    # Print the last field on this line
    print $NF;
}

# Every pattern is actually a true/false test. The regular expression in the
# last pattern is also a true/false test, but part of it was hidden. If you
# don't give it a string to test, it will assume $0, the line that it's
# currently processing. Thus, the complete version of it is this:

$0 ~ /^fo+bar$/ {
    print "Equivalent to the last pattern";
}

a > 0 {
    # This will execute once for each line, as long as a is positive
}

# You get the idea. Processing text files, reading in a line at a time, and
# doing something with it, particularly splitting on a delimiter, is so common
# in UNIX that AWK is a scripting language that does all of it for you, without
# you needing to ask. All you have to do is write the patterns and actions
# based on what you expect of the input, and what you want to do with it.

# Here's a quick example of a simple script, the sort of thing AWK is perfect
# for. It will read a name from standard input and then will print the average
# age of everyone with that first name. Let's say you supply as an argument the
# name of this data file:
#
# Bob Jones 32
# Jane Doe 22
# Steve Stevens 83
# Bob Smith 29
# Bob Barker 72
#
# Here's the script:

BEGIN {

    # First, ask the user for the name
    print "What name would you like the average age for?";

    # Get a line from standard input, not from files on the command line
    getline name < "/dev/stdin";
}

# Now, match every line whose first field is the given name
$1 == name {

    # Inside here, we have access to a number of useful variables, already
    # pre-loaded for us:
    # $0 is the entire line
    # $3 is the third field, the age, which is what we're interested in here
    # NF is the number of fields, which should be 3
    # NR is the number of records (lines) seen so far
    # FILENAME is the name of the file being processed
    # FS is the field separator being used, which is " " here
    # ...etc. There are plenty more, documented in the man page.

    # Keep track of a running total and how many lines matched
    sum += $3;
    nlines++;
}

# Another special pattern is called END. It will run after processing all the
# text files. Unlike BEGIN, it runs only once all of the input has been read
# and processed according to the rules and actions you've provided. Its
# purpose is usually to output some kind of final report, or do something with
# the aggregate of the data you've accumulated over the course of the script.

END {
    if (nlines)
        print "The average age for " name " is " sum / nlines;
}
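To try the average-age logic without typing at a prompt, one option is to pass the name with awk's `-v` option instead of reading it from standard input. The following is a minimal sketch under that assumption, not the script above verbatim: the temporary file created with `mktemp` and the shell variable names `datafile` and `average` are choices made for this example.

```shell
# Write the sample data from above to a temporary file (path chosen by mktemp).
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
Bob Jones 32
Jane Doe 22
Steve Stevens 83
Bob Smith 29
Bob Barker 72
EOF

# -v assigns an AWK variable before BEGIN runs, replacing the getline prompt.
average=$(awk -v name="Bob" '
    $1 == name { sum += $3; nlines++ }
    END { if (nlines) print "The average age for " name " is " sum / nlines }
' "$datafile")

printf '%s\n' "$average"
rm -f "$datafile"
```

With the sample data, three lines match "Bob" (ages 32, 29, and 72), so this prints `The average age for Bob is 44.3333`; the quotient is formatted with AWK's default number-to-string conversion.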
Further Reading:
- Awk tutorial
- Awk man page
- The GNU Awk User’s Guide (GNU Awk is found on most Linux systems)
- AWK one-liner collection
- Awk alpinelinux wiki: a technical summary and list of "gotchas" (places where different implementations may behave in different or unexpected ways)
- basic libraries for awk