AWK

Personal website of a Go developer from Kazan

AWK is a standard tool on every POSIX-compliant UNIX system. It works like flex/lex from the command line and is well suited to text processing and other scripting needs. Its syntax is C-like, but without mandatory semicolons (use them anyway, because they are required when you put several statements on one line, as in the one-liners AWK excels at), manual memory management, or static typing. You can call it from a shell script or use it as a stand-alone scripting language.
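As a minimal sketch of calling AWK from a shell script, the one-liner below sums a column of colon-separated input. The sample data and the variable name `total` are made up for illustration; note the semicolon separating the two statements inside the single-line action.

```shell
# Sum the second field and count the records of some made-up, colon-separated input.
# The semicolon is needed because both statements share one line.
total=$(printf 'a:1\nb:2\nc:3\n' | awk -F: '{ sum += $2; n++ } END { print sum, n }')
echo "$total"   # prints: 6 3
```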

Why use AWK instead of Perl? Readability. AWK is easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.
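To give a rough idea of what such a script looks like, here is a small sketch that reads comma-separated records line by line and prints the first field of every record whose second field matches. The data and the variable name `users` are hypothetical.

```shell
# Hypothetical sample data: user name, login shell, and home directory per line.
users=$(printf 'alice,/bin/bash,/home/alice\nbob,/bin/sh,/home/bob\ncarol,/bin/bash,/home/carol\n' |
    awk -F, '$2 == "/bin/bash" { print $1 }')
echo "$users"   # prints alice and carol, one per line
```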

#!/usr/bin/awk -f

# Comments are like this


# AWK programs consist of a collection of patterns and actions.
pattern1 { action; } # just like lex
pattern2 { action; }

# There is an implied loop and AWK automatically reads and parses each
# record of each file supplied. Each record is split by the FS delimiter,
# which defaults to white-space (runs of spaces and tabs count as one
# separator). You can assign FS either on the command line (-F C) or in
# your BEGIN pattern.

# One of the special patterns is BEGIN. The BEGIN pattern is true
# BEFORE any of the files are read. The END pattern is true after
# an End-of-file from the last file (or standard-in if no files specified)
# There is also an output field separator (OFS) that you can assign, which
# defaults to a single space

BEGIN {

    # BEGIN will run at the beginning of the program. It's where you put all
    # the preliminary set-up code, before you process any text files. If you
    # have no text files, then think of BEGIN as the main entry point.

    # Variables are global. Just set them or use them, no need to declare.
    count = 0;

    # Operators just like in C and friends
    a = count + 1;
    b = count - 1;
    c = count * 1;
    d = count / 1; # division (floating point, not integer)
    e = count % 1; # modulus
    f = count ^ 1; # exponentiation

    a += 1;
    b -= 1;
    c *= 1;
    d /= 1;
    e %= 1;
    f ^= 1;

    # Incrementing and decrementing by one
    a++;
    b--;

    # As a prefix operator, it returns the incremented value
    ++a;
    --b;

    # Note that the semicolons are optional at the end of a line: a newline
    # also terminates a statement

    # Control statements
    if (count == 0)
        print "Starting with count of 0";
    else
        print "Huh?";

    # Or you could use the ternary operator
    print (count == 0) ? "Starting with count of 0" : "Huh?";

    # Blocks consisting of multiple lines use braces
    while (a < 10) {
        print "String concatenation is done" " with a series" " of"
            " space-separated strings";
        print a;

        a++;
    }

    for (i = 0; i < 10; i++)
        print "Good ol' for loop";

    # As for comparisons, they're the standards:
    # a < b   # Less than
    # a <= b  # Less than or equal
    # a != b  # Not equal
    # a == b  # Equal
    # a > b   # Greater than
    # a >= b  # Greater than or equal

    # Logical operators as well
    # a && b  # AND
    # a || b  # OR

    # In addition, there's the super useful regular expression match
    if ("foo" ~ "^fo+$")
        print "Fooey!";
    if ("boo" !~ "^fo+$")
        print "Boo!";

    # Arrays
    arr[0] = "foo";
    arr[1] = "bar";

    # You can also initialize an array with the built-in function split()

    n = split("foo:bar:baz", arr, ":");

    # You also have associative arrays (indeed, they're all associative arrays)
    assoc["foo"] = "bar";
    assoc["bar"] = "baz";

    # And multi-dimensional arrays, with some limitations I won't mention here
    multidim[0,0] = "foo";
    multidim[0,1] = "bar";
    multidim[1,0] = "baz";
    multidim[1,1] = "boo";

    # You can test for array membership
    if ("foo" in assoc)
        print "Fooey!";

    # You can also use the 'in' operator to traverse the keys of an array
    for (key in assoc)
        print assoc[key];

    # The command line is in a special array called ARGV
    for (argnum in ARGV)
        print ARGV[argnum];

    # You can remove elements of an array
    # This is particularly useful to prevent AWK from assuming the arguments
    # are files for it to process
    delete ARGV[1];

    # The number of command line arguments is in a variable called ARGC
    print ARGC;

    # AWK has several built-in functions. They fall into three categories. I'll
    # demonstrate each of them in their own functions, defined later.

    return_value = arithmetic_functions(a, b, c);
    string_functions();
    io_functions();
}

# Here's how you define a function
function arithmetic_functions(a, b, c,     d) {

    # Probably the most annoying part of AWK is that there are no local
    # variables. Everything is global. For short scripts, this is fine, even
    # useful, but for longer scripts, this can be a problem.

    # There is a work-around (ahem, hack). Function arguments are local to the
    # function, and AWK allows you to define more function arguments than it
    # needs. So just stick local variables in the function declaration, like I
    # did above. As a convention, stick in some extra whitespace to distinguish
    # between actual function parameters and local variables. In this example,
    # a, b, and c are actual parameters, while d is merely a local variable.

    # Now, to demonstrate the arithmetic functions

    # Most AWK implementations have some standard trig functions
    d = sin(a);
    d = cos(a);
    d = atan2(b, a); # arc tangent of b / a

    # And logarithmic stuff
    d = exp(a);
    d = log(a);

    # Square root
    d = sqrt(a);

    # Truncate floating point to integer
    d = int(5.34); # d => 5

    # Random numbers
    srand(); # Supply a seed as an argument. By default, it uses the time of day
    d = rand(); # Random number between 0 and 1.

    # Here's how to return a value
    return d;
}

function string_functions(    localvar, arr) {

    # AWK, being a string-processing language, has several string-related
    # functions, many of which rely heavily on regular expressions.

    # Search and replace, first instance (sub) or all instances (gsub)
    # Both return number of matches replaced
    localvar = "fooooobar";
    sub("fo+", "Meet me at the ", localvar); # localvar => "Meet me at the bar"
    gsub("e", ".", localvar); # localvar => "M..t m. at th. bar"

    # Search for a string that matches a regular expression
    # index() does the same thing, but doesn't allow a regular expression
    match(localvar, "t"); # => 4, since the 't' is the fourth character

    # Split on a delimiter
    n = split("foo-bar-baz", arr, "-");
    # result: arr[1] = "foo"; arr[2] = "bar"; arr[3] = "baz"; n = 3

    # Other useful stuff
    sprintf("%s %d %d %d", "Testing", 1, 2, 3); # => "Testing 1 2 3"
    substr("foobar", 2, 3); # => "oob"
    substr("foobar", 4); # => "bar"
    length("foo"); # => 3
    tolower("FOO"); # => "foo"
    toupper("foo"); # => "FOO"
}

function io_functions(    localvar) {

    # You've already seen print
    print "Hello world";

    # There's also printf
    printf("%s %d %d %d\n", "Testing", 1, 2, 3);

    # AWK doesn't have file handles, per se. It will automatically open a file
    # handle for you when you use something that needs one. The string you used
    # for this can be treated as a file handle, for purposes of I/O. This makes
    # it feel sort of like shell scripting, but to get the same output, the
    # string must match exactly, so use a variable:

    outfile = "/tmp/foobar.txt";

    print "foobar" > outfile;

    # Now the string outfile is a file handle. You can close it:
    close(outfile);

    # Here's how you run something in the shell
    system("echo foobar"); # => prints foobar

    # Reads the next line of the main input (the files named on the command
    # line, or standard input if there are none) and stores it in localvar
    getline localvar;

    # Reads a line from a pipe (again, use a variable so you can close it properly)
    cmd = "echo foobar";
    cmd | getline localvar; # localvar => "foobar"
    close(cmd);

    # Reads a line from a file and stores it in localvar
    infile = "/tmp/foobar.txt";
    getline localvar < infile;
    close(infile);
}

# As I said at the beginning, AWK programs consist of a collection of patterns
# and actions. You've already seen the BEGIN pattern. Other
# patterns are used only if you're processing lines from files or standard
# input.
#
# When you pass arguments to AWK, they are treated as file names to process.
# It will process them all, in order. Think of it like an implicit for loop,
# iterating over the lines in these files. These patterns and actions are like
# switch statements inside the loop.

/^fo+bar$/ {

    # This action will execute for every line that matches the regular
    # expression, /^fo+bar$/, and will be skipped for any line that fails to
    # match it. Let's just print the line:

    print;

    # Whoa, no argument! That's because print has a default argument: $0.
    # $0 holds the current line being processed. It is created
    # automatically for you.

    # You can probably guess there are other $ variables. Every line is
    # implicitly split before every action is called, much like the shell
    # does. And, like the shell, each field can be accessed with a dollar sign

    # This will print the second and fourth fields in the line
    print $2, $4;

    # AWK automatically defines many other variables to help you inspect and
    # process each line. The most important one is NF

    # Prints the number of fields on this line
    print NF;

    # Print the last field on this line
    print $NF;
}

# Every pattern is actually a true/false test. The regular expression in the
# last pattern is also a true/false test, but part of it was hidden. If you
# don't give it a string to test, it will assume $0, the line that it's
# currently processing. Thus, the complete version of it is this:

$0 ~ /^fo+bar$/ {
    print "Equivalent to the last pattern";
}

a > 0 {
    # This will execute once for each line, as long as a is positive
}

# You get the idea. Processing text files, reading in a line at a time, and
# doing something with it, particularly splitting on a delimiter, is so common
# in UNIX that AWK is a scripting language that does all of it for you, without
# you needing to ask. All you have to do is write the patterns and actions
# based on what you expect of the input, and what you want to do with it.

# Here's a quick example of a simple script, the sort of thing AWK is perfect
# for. It will read a name from standard input and then will print the average
# age of everyone with that first name. Let's say you supply as an argument the
# name of this data file:
#
# Bob Jones 32
# Jane Doe 22
# Steve Stevens 83
# Bob Smith 29
# Bob Barker 72
#
# Here's the script:

BEGIN {

    # First, ask the user for the name
    print "What name would you like the average age for?";

    # Get a line from standard input, not from files on the command line
    getline name < "/dev/stdin";
}

# Now, match every line whose first field is the given name
$1 == name {

    # Inside here, we have access to a number of useful variables, already
    # pre-loaded for us:
    # $0 is the entire line
    # $3 is the third field, the age, which is what we're interested in here
    # NF is the number of fields, which should be 3
    # NR is the number of records (lines) seen so far
    # FILENAME is the name of the file being processed
    # FS is the field separator being used, which is " " here
    # ...etc. There are plenty more, documented in the man page.

    # Keep track of a running total and how many lines matched
    sum += $3;
    nlines++;
}

# Another special pattern is called END. It will run after processing all the
# text files. Unlike BEGIN, it will only run if you've given it input to
# process. It will run after all the files have been read and processed
# according to the rules and actions you've provided. The purpose of it is
# usually to output some kind of final report, or do something with the
# aggregate of the data you've accumulated over the course of the script.

END {
    if (nlines)
        print "The average age for " name " is " sum / nlines;
}
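The average-age script that closes the listing can be tried non-interactively with a sketch like the one below. It is a condensed variant, not the script itself: it assumes the name is passed with `-v name=...` instead of being read from standard input, writes the sample data to a temporary file, and formats the result with `printf` to two decimals. The shell variable names `datafile` and `average` are made up for illustration.

```shell
# Write the sample data from the listing to a temporary file.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
Bob Jones 32
Jane Doe 22
Steve Stevens 83
Bob Smith 29
Bob Barker 72
EOF

# Assumption for this sketch: the name comes from -v rather than from getline.
average=$(awk -v name="Bob" '
    $1 == name { sum += $3; nlines++ }               # accumulate ages of matching records
    END { if (nlines) printf "%.2f\n", sum / nlines } # report only if something matched
' "$datafile")

echo "The average age for Bob is $average"   # prints: The average age for Bob is 44.33
rm -f "$datafile"
```

Passing the name with `-v` keeps standard input free, which makes the script easier to use in pipelines; the interactive `getline` version suits ad-hoc use.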

Further Reading: