Brief primer on AWK on Unix/Linux - Suvendra's Playground

AWK is a scripting language that has shipped with Unix for a very long time. It is named after the initials of its three authors: Alfred Aho, Peter Weinberger and Brian Kernighan (of legendary C fame). All three were working at Bell Laboratories at the time.

Even though AWK is used far less these days, it is still a good tool to keep under your belt. For very quick text manipulation, it can be used in any Unix-like environment. I will be working on my MacBook, and the version I am using is shown below (there is no standard way of getting the AWK version, but the one I am using supports --version).

% awk --version
awk version 20200816

Use Cases for AWK

I have already given a one-line history of AWK above, so I will not go into it any further. Instead, I will describe some situations where I just write a small script to solve my problem. One of the best use cases that comes to mind is working with data exported from a database. Normally the export will be in CSV, so I can easily manipulate the data using AWK. For example, think about a use case where I want to create a simple one-time report for the business: instead of writing a full-fledged program, an AWK script will suffice. Similarly, when I am trying to analyze a program and want to extract some information from it for easier understanding, instead of going through it line by line, I can write a script to extract the specific information I am looking for.

Basics Before we Start

Let’s start with some basics. AWK is built to loop through files one record at a time. Before I start explaining, let me introduce the two files that I will use as examples.

% cat prices.txt
PRODUCT, PRICE, MANUFACTURER
Executive Chair, 300.00, Henredon
Manager Office Chair, 170.00, La-Z-Boy
Mesh Office Chair, 120.00, True Innovations
Office Task Chair, 90.00, True Innovations
48 inches Swivel Desk, 370.00, Camden
Adjustable Height Desk, 350.00, ApexDesk
Panorama Desk, 790.00, Bestar
Executive Desk, 3400.00, Harrington
3-piece Bookcase, 4000.00, Tuscan
3-piece Bookcase, 3000.00, Harrington

% cat employees.txt
NAME, DESIGNATION, FURNITURE, BRAND
Leon Rollins, CEO, Executive Chair, Henredon
Leon Rollins, CEO, Panorama Desk, Bestar
Leon Rollins, CEO, 3-piece Bookcase, Tuscan
Angelo Barnett, CFO, Executive Chair, Henredon
Angelo Barnett, CFO, 48 inches Swivel Desk, Camden
Angelo Barnett, CFO, 3-piece Bookcase, Harrington
Dane Rubio, Senior Manager, Mesh Office Chair, True Innovations
Dane Rubio, Senior Manager, Adjustable Height Desk, ApexDesk
Kenji Dyer, Manager, Office Task Chair, True Innovations
Kenji Dyer, Manager, Adjustable Height Desk, ApexDesk

Consider prices.txt above. When we run a script, AWK will loop through the heading first, then the first record and so on. The action to run for each record is written between two braces {}. We can set up some variables before the program starts looping in a BEGIN block. Anything that should run after the loop goes in an END block. So, here is how it looks.

BEGIN {
    # Before the main loop starts
}
{
    # Main Loop
}
END {
    # After the main loop
}
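
To make the skeleton concrete, here is a minimal sketch that sums a price column. The three sample rows are piped in with printf so the snippet runs on its own, and the names total, count and result are my own choices for illustration.

```shell
# Three rows in the same comma-separated layout as prices.txt (header + 2 records).
result=$(printf 'PRODUCT, PRICE, MANUFACTURER\nExecutive Chair, 300.00, Henredon\nPanorama Desk, 790.00, Bestar\n' |
awk '
BEGIN  { FS = ","; total = 0; count = 0 }   # runs once, before any input is read
NR > 1 { total += $2; count++ }             # runs for every record after the header
END    { printf("%d items, total %.2f\n", count, total) }   # runs once, after the last record
')
echo "$result"
```

This prints "2 items, total 1090.00". The main block here carries a condition (NR > 1) in front of its braces, so it only fires for the records that satisfy it.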

Next we will talk about the $ variables. When AWK reads a line, the entire line is kept in the $0 variable by default. Individual fields are available in $1, $2 and so on. Let’s take an example and see how these variables print for employees.txt.

# $ variables assigned:
# NAME, DESIGNATION, FURNITURE, BRAND
#  $1       $2          $3       $4
% awk -F, '{printf("|%-15s|%-20s|%-20s|\n", $4, $3, $1)}' employees.txt | head -5
| BRAND         | FURNITURE          |NAME                |
| Henredon      | Executive Chair    |Leon Rollins        |
| Bestar        | Panorama Desk      |Leon Rollins        |
| Tuscan        | 3-piece Bookcase   |Leon Rollins        |
| Henredon      | Executive Chair    |Angelo Barnett      |

There are a few things that may not be clear at this time. However, this example is purely for showing how the $ variables work.

AWK Built-in Variables

AWK provides some built-in variables for convenience.

We will keep using the two sample files from above. One of them contains prices for some furniture; the other lists the furniture that has been provided to some employees.

Let’s start on the variables now. Specifically we will look at FS, OFS, RS, ORS, NR, NF, FNR and FILENAME variables.

NR (Number of Records)

NR holds the number of records read so far. Inside an END block that is the total number of records in the input, so we do not have to keep a counter ourselves as AWK already does that for us.

% awk 'END{print NR}' prices.txt
11

Here we are printing the total number of records present in the file.
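
Since NR is updated for every record, it can also be used inside the main block as a running line number. A small sketch, assuming three made-up input lines piped in with printf (the variable name result is just for illustration):

```shell
# Print each record prefixed with its record number, then the final count.
result=$(printf 'alpha\nbeta\ngamma\n' |
awk '{ print NR ": " $0 } END { print "total: " NR }')
echo "$result"
```

The three lines come out numbered 1 to 3, followed by "total: 3".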

NF (Number of Fields)

NF holds the number of fields in the current record.

% awk '{print "Field Count:", NF, "::", $0}' prices.txt | head -4
Field Count: 3 :: PRODUCT, PRICE, MANUFACTURER
Field Count: 4 :: Executive Chair, 300.00, Henredon
Field Count: 5 :: Manager Office Chair, 170.00, La-Z-Boy
Field Count: 6 :: Mesh Office Chair, 120.00, True Innovations

We know that each record has three fields, but the field counts AWK reports are inconsistent. This is because, by default, AWK splits fields on whitespace (spaces and tabs). In this file, however, the field separator is a comma.

FS (Input Field Separator)

Let’s see if we can fix the problem from above using the Field Separator variable.

% awk 'BEGIN{FS=","}{print "Field Count:", NF, "::", $0}' prices.txt | head -5
Field Count: 3 :: PRODUCT, PRICE, MANUFACTURER
Field Count: 3 :: Executive Chair, 300.00, Henredon
Field Count: 3 :: Manager Office Chair, 170.00, La-Z-Boy
Field Count: 3 :: Mesh Office Chair, 120.00, True Innovations
Field Count: 3 :: Office Task Chair, 90.00, True Innovations

Better! Now every record reports 3 as the number of fields.

OFS (Output Field Separator)

For the next examples, we will not use the files; we will just pipe a short string into AWK. I use printf rather than echo because not every shell's echo expands \n. OFS is the separator AWK places between output fields. In the example below, we have asked AWK to use : as the output field separator.

% printf 'Bye Cruel World\nHello New World\n' | awk 'BEGIN{OFS=":"}{print $1, $2, $3}'
Bye:Cruel:World
Hello:New:World
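
One thing worth knowing is that OFS only shows up where AWK itself joins fields: between comma-separated arguments to print, or when $0 is rebuilt after a field assignment. The sketch below, using a made-up one-line input, shows the three cases side by side.

```shell
# OFS is applied when $0 is rebuilt (here by the no-op assignment $1 = $1),
# but not to an untouched $0 and not to plain string concatenation.
result=$(printf 'a,b,c\n' |
awk 'BEGIN { FS = ","; OFS = ":" }
{
    print $0       # untouched record: still comma-separated
    $1 = $1        # assigning any field forces awk to rebuild $0 using OFS
    print $0       # now colon-separated
    print $1 $2    # no comma between arguments: plain concatenation, no OFS
}')
echo "$result"
```

This prints a,b,c, then a:b:c, then ab. The $1 = $1 trick is a common way to convert a file from one delimiter to another.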

RS (Record Separator)

Now let’s check the Record Separator. By default, a newline is the record separator. Let’s switch to |.

% echo "Bye Cruel World | Hello New World" | awk 'BEGIN{RS="|"}{print $1, $2, $3}'
Bye Cruel World
Hello New World

ORS (Output Record Separator)

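Let’s see if we can use ^ as the output record separator. Keep in mind that ORS is written after every record, including the last one, so the output ends with a trailing ^ instead of a newline.

% printf 'Bye Cruel World\nHello New World\n' | awk 'BEGIN{ORS=" ^ "}{print $1, $2, $3}'
Bye Cruel World ^ Hello New World ^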

FNR (Number of Records in Current File)

FNR is a bit different. This variable is used when we are dealing with more than one file. Let’s assume that we are reading two files. FNR will always give the record number for the current file being read. On the other hand, NR keeps a running count.
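
The difference is easiest to see side by side. Here is a small sketch that assumes two throwaway files, first.txt and second.txt, created in a temporary directory just for this demonstration.

```shell
# Create two hypothetical sample files in a scratch directory.
dir=$(mktemp -d)
printf 'a1\na2\n' > "$dir/first.txt"
printf 'b1\nb2\nb3\n' > "$dir/second.txt"

# NR keeps counting across files; FNR restarts at 1 for each new file.
result=$(awk '{ print "NR=" NR, "FNR=" FNR, $0 }' "$dir/first.txt" "$dir/second.txt")
echo "$result"

rm -r "$dir"
```

NR runs from 1 to 5 across both files, while FNR goes 1, 2 for the first file and then starts again at 1, 2, 3 for the second.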

FILENAME (Name of the data file)

FILENAME, as the name suggests, holds the name of the data file currently being read.
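
A typical use is printing a banner whenever AWK moves on to a new file. The sketch below assumes two throwaway files, one.txt and two.txt, created only for this demonstration.

```shell
# Create two hypothetical sample files in a scratch directory.
dir=$(mktemp -d)
printf 'x\ny\n' > "$dir/one.txt"
printf 'z\n' > "$dir/two.txt"

# FNR == 1 is true on the first record of every file, so the banner prints once per file.
# The subshell cd keeps the FILENAME values short without changing our own directory.
result=$(cd "$dir" && awk 'FNR == 1 { print "--- " FILENAME " ---" } { print FNR ": " $0 }' one.txt two.txt)
echo "$result"

rm -r "$dir"
```

FILENAME contains the name exactly as it was given on the command line, so passing a full path would print the full path in the banner.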

Sample Calls for Experimenting

Here are some sample calls to experiment with. Most of the time I will not be dumping the full output.

Loop through the file and dump

% awk '{print $0}' prices.txt | head -5
PRODUCT, PRICE, MANUFACTURER
Executive Chair, 300.00, Henredon
Manager Office Chair, 170.00, La-Z-Boy
Mesh Office Chair, 120.00, True Innovations
Office Task Chair, 90.00, True Innovations

Reformat the Prices and Print

% awk -F, 'NR>1{printf("%-20s%-20s %6.2f\n", $3, $1, $2)}' prices.txt | head -5
 Henredon           Executive Chair      300.00
 La-Z-Boy           Manager Office Chair 170.00
 True Innovations   Mesh Office Chair    120.00
 True Innovations   Office Task Chair     90.00
 Camden             48 inches Swivel Desk 370.00

There are a few things of interest here.

-F, signifies that the field separator is a comma. This is the other way of setting FS.
NR>1 tells AWK to skip record 1 and start processing from record 2. This ensures that the header is not printed. We can print a static header in a BEGIN block instead.
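
As a rough sketch of that last point, the snippet below replaces the header from the input with a static one printed in BEGIN. The sample rows are piped in with printf so it runs on its own; the column widths are arbitrary choices of mine.

```shell
# Skip the header row of the input and print our own header from BEGIN.
result=$(printf 'PRODUCT, PRICE, MANUFACTURER\nExecutive Chair, 300.00, Henredon\nOffice Task Chair, 90.00, True Innovations\n' |
awk -F, '
BEGIN  { printf("%-20s %8s\n", "PRODUCT", "PRICE") }   # static header, printed once
NR > 1 { printf("%-20s %8.2f\n", $1, $2) }             # data rows only
')
echo "$result"
```

Because the header is generated by the script, it can use whatever labels and alignment suit the report, independent of what the input file calls its columns.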

Find all Bookcases

% awk 'BEGIN{FS=","}/Bookcase/{print $0}' prices.txt
3-piece Bookcase, 4000.00, Tuscan
3-piece Bookcase, 3000.00, Harrington

Here we are filtering only for bookcases.
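
The pattern in front of the braces does not have to be a bare regular expression. It can be any condition, including a match against a single field or a numeric comparison. A minimal sketch, assuming a few rows in the same layout as prices.txt and an arbitrary price threshold of 3500:

```shell
# Keep only rows whose first field mentions Bookcase and whose price exceeds 3500.
# Adding 0 forces the price field (which has a leading space) to be treated as a number.
result=$(printf 'PRODUCT, PRICE, MANUFACTURER\nExecutive Desk, 3400.00, Harrington\n3-piece Bookcase, 4000.00, Tuscan\n3-piece Bookcase, 3000.00, Harrington\n' |
awk -F, '$1 ~ /Bookcase/ && ($2 + 0) > 3500 { print $0 }')
echo "$result"
```

Only the Tuscan bookcase at 4000.00 passes both conditions.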

Larger Program

Now let’s work on a larger program. In this one we will read all furniture prices from prices.txt, then add up the cost per employee for each employee in employees.txt. We will write a program called empcost.awk.

Let’s write the program first and explain.

# empcost.awk
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
function trim(s)  { return rtrim(ltrim(s)); }
BEGIN {
	FS=","
	printf("%-25s%-25s%8s\n", "EMPLOYEE NAME", "FURNITURE", "COST");
	printf("----------------------------------------------------------\n");
}
{
	if (FNR != 1) {
		if (FNR==NR) {
			a=trim($1) "-" trim($3);
			prices[a]=$2;
			next;
		} else {
			a=trim($3) "-" trim($4);
			printf("%-25s%-25s%8s\n", $1, $3, prices[a]);
			empcost[trim($1)] = empcost[trim($1)] + prices[a];
		}
	}
}
END {
	printf("----------------------------------------------------------\n");
	print("\n\nCost per Employee...");
	printf("%-25s%8s\n",  "EMPLOYEE NAME", "EXPENSE");
	printf("---------------------------------\n");
	for (key in empcost) {
		printf("%-25s%8.2f\n", key, empcost[key]);
	}
	printf("---------------------------------\n");
}

Some interesting things are happening here. Lines #2, #3 and #4 define routines to trim the text. Improbable as it may seem, AWK does not have a built-in trim() function, so we have to define our own.

Since we are working on two files, we have used FNR (file specific NR). FNR != 1 just means skip line 1 in both files. We are skipping it because in both cases, it just has a heading.

FNR==NR means we are reading the first file. FNR resets for every file, whereas NR keeps counting across files. So FNR and NR are only equal while we are working on the first file. In this case we build a map of prices, keyed by product and manufacturer.

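When we are reading the second file, we print the furniture cost for each employee and add it to that employee’s total. Finally, the totals are reported in the END block. Note that for (key in empcost) visits the keys in no particular order, which is why the summary does not follow the order of the input file.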

% awk -f empcost.awk prices.txt employees.txt

EMPLOYEE NAME            FURNITURE                    COST
----------------------------------------------------------
Leon Rollins              Executive Chair           300.00
Leon Rollins              Panorama Desk             790.00
Leon Rollins              3-piece Bookcase         4000.00
Angelo Barnett            Executive Chair           300.00
Angelo Barnett            48 inches Swivel Desk     370.00
Angelo Barnett            3-piece Bookcase         3000.00
Dane Rubio                Mesh Office Chair         120.00
Dane Rubio                Adjustable Height Desk    350.00
Kenji Dyer                Office Task Chair          90.00
Kenji Dyer                Adjustable Height Desk    350.00
----------------------------------------------------------


Cost per Employee...
EMPLOYEE NAME             EXPENSE
---------------------------------
Leon Rollins              5090.00
Dane Rubio                 470.00
Kenji Dyer                 440.00
Angelo Barnett            3670.00
---------------------------------
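
If a predictable order matters for the summary, one simple option is to let AWK print only the data rows and pipe them through sort. This is a sketch of that idea on a tiny made-up input, not a change to empcost.awk itself; the names total and result are mine.

```shell
# Sum an amount per name, then sort the summary lines alphabetically.
# for (name in total) has no guaranteed order, so sort makes the output stable.
result=$(printf 'Leon, 300\nDane, 120\nLeon, 790\nDane, 350\n' |
awk -F, '{ total[$1] += $2 }
END { for (name in total) printf("%s %.2f\n", name, total[name]) }' |
sort)
echo "$result"
```

This prints Dane 470.00 followed by Leon 1090.00, whatever order AWK happens to walk the array in.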

Conclusion

I thought about writing on AWK because I keep going back to it for every small one-time requirement. For anything larger, I will resort to Python. I will not use Java for any of these, as it is far more verbose. I hope this blog inspires you to start using AWK for small one-timers. Ciao for now!