Project Description

Awk is a very unique scripting language used in Linux. It’s technically a command, such as “cd” or “ls”, but for input it accepts written scripts, like an interpreter. It is especially useful for data manipulation and analysis, since it’s primary feature is scanning documents line by line, matching and manipulating them according to how the inputted script is written.

Although its functionality is incredibly useful, it becomes more tedious when working with tsv, or tab separated, files. There is no built in functionality for referencing headers, as the program only runs line by line.

Thus, the goal of awkcel is to create a wrapper for awk, where columns in the file can be referenced by their headers, and data can be collected and analyzed solely through reference.

#!/bin/bash

awkstring=$1
file=$2

#Get headers from file and store in bash array

headers=($(awk -F'\t' '
	BEGIN {
	HEADER=0
	}
	substr($1,1,1) !="#" && HEADER==0 {
	HEADER=1;
	for (i=1;i<=NF;i++) {
		print $i
	}
}' $file))

#Skip header line and comments

varstring="BEGIN {HEADER=0;"

for (( i=0 ; i < ${#headers[@]} ; i++))
do
	varstring="${varstring}COLNAMES[$i]=\"${headers[$i]}\";"
done

varstring="$varstring} substr(\$1,1,1)==\"#\"{next} HEADER==0{HEADER=1;next}{"


for (( i=0 ; i < ${#headers[@]} ; i++))
do
	#OPTIMIZATION::: Check if variable is called in the awk command passed by user
	#so unused columns are not loaded into memory needlessly

	#NOTE: This runs the risk of loading unused columns if column names are refrenced
	#not as variables but rather as parts of strings
	if [[ $awkstring =~ ${headers[$i]} ]]; then
		varstring=${varstring}${headers[$i]}=\$$(($i+1))\;
	fi
done
varstring=${varstring}}

#Load variables into user inputted awk command

awkstring="${varstring}${awkstring}"

#Finally, call awk

awk "$awkstring" $file

Below is a quick example of how the script is used, where column names can be reference directly in the scripting language. This is an example scenario with historical stock data, where each column header is the name of a new stock. The goal here is to determine Amazon’s worst single day performance.

SHELL COMMAND:


./awkcel 'BEGIN {
	count=0;
}

count==0{
	max=AMZN;
	count++;
}

count==1{
	max=max-AMZN;
	prev=AMZN
	prevdate=date;
	worst=date
	count++;
}

{
	if ((prev - AMZN) > max && prev!="_" && AMZN!="_") {
		max=prev-AMZN
		worst=date
	}
	prev=AMZN
	prevdate=date
}

END {
	printf "%s: %d\n", worst, max;
}
' historical.2011.tsv

---------------------------
OUTPUT:

2011-10-26: 29

---------------------------