Awk: Sum up column values across multiple files with identical column layout - linux

I have a number of files with the same header:
COL1, COL2, COL3, COL4
You can ignore COL1-COL3. COL4 contains a number. Each file contains about 200 rows. I am trying to sum COL4 row by row across the files. For example:
File 1
COL1 COL2 COL3 COL4
x y z 3
a b c 4
File 2
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Then a new file is returned:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
Is there a simple way to do this without AWK? I will use AWK if need be; I just thought there might be a simple one-liner I could run right away. The AWK script I have in mind feels a bit long.
Thanks

Combining paste with awk, as in Kristo Mägi's answer, is your best bet:
paste merges the corresponding lines from the input files,
which sends a single stream of input lines to awk, with each input line containing all fields to sum up.
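For the question's two sample files, the paste step alone produces one combined line per input line (joined with a tab by default):
$ paste file1 file2
COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10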
Assuming a fixed number of input files and columns, Kristo's answer can be simplified to the following (making processing much more efficient):
paste file1 file2 | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
Note: The above produces space-separated output columns, because awk's default value for OFS, the output-field separator, is a single space.
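If you want tab-separated output instead, set OFS explicitly; since print joins its comma-separated arguments with OFS, this small variation is all it takes:
paste file1 file2 | awk -v OFS='\t' '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'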
Assuming that all files have the same column structure and line count, below is a generalization of the solution, which:
generalizes to more than 2 input files (and more than 2 data rows)
generalizes to any number of fields, as long as the field to sum up is the last one.
#!/bin/bash
files=( file1 file2 ) # array of input files
paste "${files[#]}" | awk -v numFiles=${#files[#]} -v OFS='\t' '
{
row = sep = ""
for(i=1; i < NF/numFiles; ++i) { row = row sep $i; sep = OFS }
sum = $(NF/numFiles) # last header col. / (1st) data col. to sum
if (NR > 1) { for(i=2; i<=numFiles; ++i) sum += $(NF/numFiles * i) } # add other cols.
printf "%s%s%s\n", row, OFS, sum
}
'
Note that \t (the tab char.) is used to separate output fields and that, due to relying on awk's default line-splitting into fields, preserving the exact input whitespace between fields is not guaranteed.
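As a usage sketch: with the question's file1 and file2 plus a third file containing rows x y z 10 and a b c 15 (the same data as f3 in a later answer below), setting files=( file1 file2 file3 ) would print the tab-separated result:
COL1  COL2  COL3  COL4
x     y     z     18
a     b     c     29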

If all files have the same header, here is an awk solution:
awk '!f && FNR==1{ f=1; print $0 }FNR>1{ s[FNR]+=$NF; $NF=""; r[FNR]=$0 }
END{ for(i=2;i<=FNR;i++) print r[i],s[i] }' File[12]
The output (for 2 files):
COL1 COL2 COL3 COL4
x y z 8
a b c 14
This approach can be applied to any number of files (in that case specify a glob such as File* for filename expansion). The first clause prints the header only once; for each data row, s[FNR] accumulates the last field across files while r[FNR] stores the remaining fields, and the END block prints them back together.

One more option.
The command:
paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
The result:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
What it does:
Test files:
$ cat f1.txt
COL1 COL2 COL3 COL4
x y z 3
a b c 4
$ cat f2.txt
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Command: paste f{1,2}.txt
Joins 2 files and gives output:
COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10
Command: sed '1d'
Temporarily removes the header
Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-COL3 and sums $4 and $8 from the paste result.
Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back
EDIT:
Following @mklement0's comment, he is right about the header handling, as I forgot the NR==1 part.
So I'll mirror his updated version here as well:
paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'

You state you have "a number of files", i.e., more than 2.
Given these 3 files (the approach should work with any number):
$ cat f1 f2 f3
COL1 COL2 COL3 COL4
x y z 3
a b c 4
COL1 COL2 COL3 COL4
x y z 5
a b c 10
COL1 COL2 COL3 COL4
x y z 10
a b c 15
You can do:
$ awk 'FNR==1{next}
{sum[$1]+=$4}
END{print "COL1 COL4";
for (e in sum) print e, sum[e]} ' f1 f2 f3
COL1 COL4
x 18
a 29
It is unclear what you intend to do with COL2 or COL3, so I did not add that.
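If you do want COL2 and COL3 carried through, and assuming (as in the samples) they are identical across files for a given COL1 key, here is a sketch that remembers them alongside the running sum (note the for (k in sum) loop visits keys in no particular order):
awk 'FNR==1 { next }                      # skip each file's header
     { pre[$1] = $1 OFS $2 OFS $3         # remember COL1-COL3 for this key
       sum[$1] += $4 }                    # accumulate COL4 across files
     END { print "COL1 COL2 COL3 COL4"
           for (k in sum) print pre[k], sum[k] }' f1 f2 f3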

$ awk '
NR==1 { print }
{ sum[FNR]+=$NF; sub(/[^[:space:]]+[[:space:]]*$/,""); pfx[FNR]=$0 }
END { for(i=2;i<=FNR;i++) print pfx[i] sum[i] }
' file1 file2
COL1 COL2 COL3 COL4
x y z 8
a b c 14
The above will work robustly and efficiently with any awk on any UNIX system, with any number of input files and with any contents of those files. The only potential problem is that it has to retain the equivalent of one of those files in memory, so if each file were absolutely massive you could exhaust available memory.

Related

Bash: Reading CSV text file and finding average of rows

This is the sample input (the data has user-IDs and the number of hours spent by the user):
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
I need to read the data, find all User-IDs ending in even numbers (2, 4, 6, 8, ...), and find the average number of hours spent (over five days).
I wrote the following script:
hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [[ $col2 == *"2" ]]; then
#echo "$col2"
((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"4" ]]; then
#echo "$col2"
((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"6" ]]; then
#echo "$col2"
((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"8" ]]; then
#echo "$col2"
((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"10" ]]; then
#echo "$col2"
((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"
This is not a very good way of doing this. Also, the numbers aren't adding up correctly.
I am getting the following output (for the first one, User2):
27
5
I am expecting the following output:
27
5.4
What would be a better way to do it? Any help would be appreciated.
TIA
Your description is fairly imprecise, but here's an attempt primarily based on the sample output:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file
20
4
27
5.4
$2~/[24680]$/ makes sure we only look at "even" user-IDs.
for(i=3;i<=7;i++){} iterates over the day columns and adds them.
Edit 1:
Accommodating the new requirement:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad
User4 4
User2 5.4
Sample data showing userIDs with even and odd endings, a userID showing up more than once (e.g., User2), and some non-integer values:
$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
One awk solution to find total hours plus averages across the 5 days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt
Where:
-F ',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)
The above generates:
User120 27 5.4
User2 55 11
User4 20 4
Depending on the OP's desired output, the print can be replaced with printf and a suitable format string ...
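For instance, this hypothetical variant pins the average to one decimal place:
END { for ( i in tot ) printf "%s %d %.1f\n", i, tot[i], tot[i]/5 }
which would print User120 27 5.4, User2 55 11.0 and User4 20 4.0.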
Your issue is echo "$((hoursarray[0]/5))". Bash arithmetic is integer-only (it has no floating point), so it returns the integer portion only.
Easy to demonstrate:
$ hours=27
$ echo "$((hours/5))"
5
If you want to stick to Bash, you could use bc for the floating-point result:
$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
Or use awk, perl, python, ruby etc.
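For example, the awk route as a one-liner:
$ awk -v h=27 'BEGIN { print h/5 }'
5.4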
Here is an awk solution you can pick apart, easily modified to your use case (which is a little unclear to me):
awk -F, 'FNR==1{print $2; next}
{arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User1 27 5.4
User2 27 5.4
User3 22 4.4
User4 20 4
User5 40 8
If you only want even users, filter for user IDs that end in any of 0, 2, 4, 6, 8:
awk -F, 'FNR==1{print $2; next}
$2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User2 27 5.4
User4 20 4
Here is your script modified a little bit:
while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
do
(( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
done < <(tail -n+2 list.txt)
Because col3 is the last variable given to read, it receives the rest of the line (all five day columns); tr , + then turns that into a sum for bc. This prints:
For Computer3 User4 average is: 4.00000000000000000000
For Computer5 User2 average is: 5.40000000000000000000

Linux: I want to fetch the top 10 records of column 3. The column has some missing data. I have sorted the file

I am not sure how to ignore the missing data here.
My ; separated file looks like (writing it with spaces so that it is readable):
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
and so on....
I want to fetch the records with the maximum values of column 3:
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
I sorted the file on column3 in reverse order. Not sure how to proceed further.
sort -t';' -k3 -r original.txt > newfile.txt
Something great about the command line is that you can easily pick the best tool for each job, either chaining output with pipes (|) or creating temporary files like newfile.txt.
In this case, using sort is the apt choice for sorting your data. Once it's sorted, you can use a separate tool that's very efficient at parsing data, awk.
Starting from your sorted newfile.txt, this awk operation will only print lines with 5 fields (assuming your missing data is actually missing and there are no empty separators, i.e. a line looks like 45;c; rather than 45;c;;;):
awk -F';' 'NF == 5 { print }' newfile.txt
However, in the case that the empty fields are delimited (e.g. 45;c;;;), and assuming that only columns 3 through 5 may have missing data, this will handle it:
awk -F';' 'NF == 5 && $3 && $4 && $5 { print }' newfile.txt
Note that since the default behavior of awk is to print, the above { print } is actually unnecessary, but included pedagogically.
Thus, from start to finish, you can get your desired result with,
sort -t ';' -rk3 original.txt | awk 'NF==5 && $3 && $4 && $5' > result.txt
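If, as in the question's sample, missing values are a literal ? rather than empty fields, you can filter on the placeholder instead (a sketch assuming only columns 3-5 can hold it):
sort -t ';' -rk3 original.txt | awk -F';' '$3 != "?" && $4 != "?" && $5 != "?"' > result.txt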
You can use the following command:
$ head -1 fileToSort; (sort -k3 -n -r <(tail -n +2 fileToSort) | head)
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
1 b ? ? ?
12 a ? ? ?
where fileToSort is
cat fileToSort
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
Explanations:
Use -t';' if your field separator is ;
<(tail -n +2 fileToSort) will exclude the header of the input file
You then sort on the 3rd key in reverse numeric order using -r and -n
head limits the output to the first 10 records
head -1 fileToSort; will print the header line before printing the top 10 records
If you need an awk solution:
awk 'NR==1;NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/{buff[$3]=$0}END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}}' fileToSort
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
You might need to add -F';' right after the awk command if your file does use ; instead of spaces. The command then becomes awk -F';' ...
NR==1; print the first line
NF == 5 && $3~/^[0-9]+(\.[0-9]+)+$/ && $4~/^[0-9]+(\.[0-9]+)+$/ && $5~/^[0-9]+(\.[0-9]+)+$/ check that you have 5 fields and that the values of the 3 last columns are numerical
{buff[$3]=$0} save each line in a buffer indexed by the col3 value
END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}} at the end of the processing just order the array depending on the value of the index and print it in the reverse order.

Swap two columns depending on condition for third column in linux

I have a file with 3 columns like this
Col1 Col2 Col3
A B <-
C D ->
E F ->
I want to swap the entries of Col1 and Col2 whenever there is
<-
in the third column. I want my output file to be like
Col1 Col2 Col3
B A ->
C D ->
E F ->
awk '($3=="<-"){$3=$2;$2=$1;$1=$3;$3="->"}1' <file>
Essentially, if $3=="<-", then swap the columns and redefine $3. Then print.
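An equivalent sketch with an ordinary temporary variable, if you find reusing $3 as scratch space too clever:
awk '$3=="<-" { t=$1; $1=$2; $2=t; $3="->" } 1' <file>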
A short awk example is
awk '{if (match($3,"<-")){print $2,$1,$3}else{print $1,$2,$3}}' foooo
where foooo is the file name.
If you also want to change the "<-" to "->", the code would be
awk '{if (match($3,"<-")){print $2,$1,"->"}else{print $1,$2,$3}}' foooo

awk difference between subsequent lines

This is a great example of how to solve the problem when you want to print differences between subsequent lines of a single column:
awk 'NR>1{print $1-p} {p=$1}' file
But how would I do it if the file has an unknown number of columns and I want the differences for all of them, e.g. (note that the number of columns is not necessarily 3; it can be 10, 15, or more):
col1 col2 col3
---- ---- ----
1 3 2
2 4 10
1 9 -3
. . .
the output would be:
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
. . .
Instead of saving the first column, save the entire line; you can then split it and print the differences using a loop:
awk 'NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i]; print ""}
{split($0, a)}' file
If you need the column title then you can print it using BEGIN.
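For instance, assuming the input file contains only data rows (no header), a sketch that prepends the titles from the example:
awk 'BEGIN { print "col1 col2 col3"; print "---- ---- ----" }
     NR>1 { for(i=1;i<=NF;i++) printf "%d ", $i - a[i]; print "" }
     { split($0, a) }' file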
$ awk 'NR<3; NR>3{for (i=1;i<=NF;i++) printf "%d%s", $i-p[i], (i<NF?OFS:ORS)} {split($0,p)}' file | column -t
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13

How do I check if a field is empty or null in a text file using awk and bash

I have two text files and I want to compare their corresponding values according to their rows and columns. Each value (field) in the text files is separated by tabs.
Here are the files:
file1.txt
Name Col1 Col2 Col3
-----------------------
row1 1 4 7
row2 2 5 8
row3 3 6 9
file2.txt
Name Col1 Col2 Col3
-----------------------
row2 1 4 11
row1 2 5 12
row3 3 9
Here is the code I have so far:
awk '
FNR < 2 {next}
FNR == NR {
for (i = 2; i <= NF; i++) {
a[i,$1] = $i;
}
next;
}
# only compare if a row in file2 exists in file1
($1 in b) {
for (i = 2; i <= NF; i++)
{
if (a[i,$1] == $i)
{
print "EQUAL"
}
else if ( //condition that checks if value is null// )
{
print "NULL"
}
else
{
print "NOT EQUAL"
}
}
}' file1.txt file2.txt
I am having difficulties with checking if there is a null value (row3 and col2 in file2.txt) in file2.txt. I don't even get an output for that null value. So far I tried if ($i == "") and it is still not giving me any output. Any suggestions? Thanks. (I'm using gnu awk in a bash script)
Let me know if further explanation is required.
Just set the FS to tab:
awk -F'\t' '....'
With the default FS, awk splits on runs of whitespace, so the empty tab-delimited column (row3, Col2 of file2.txt) simply disappears; that line just ends up with one field fewer, and your $i == "" test never sees it. With -F'\t', every tab delimits a field, and the missing value shows up as an empty $i.
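A pared-down sketch of how the comparison then falls out; note I skip the dashed separator line along with the header (FNR <= 2) and drop the ($1 in b) guard, since b is never populated in your script (to keep only rows present in file1, you could test ((2,$1) in a) instead):
awk -F'\t' '
FNR <= 2  { next }                                   # skip header and ----- line
FNR == NR { for (i = 2; i <= NF; i++) a[i,$1] = $i; next }
{
    for (i = 2; i <= NF; i++) {
        if ($i == "")           print "NULL"         # empty tab-delimited field
        else if (a[i,$1] == $i) print "EQUAL"
        else                    print "NOT EQUAL"
    }
}' file1.txt file2.txt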
