How do I check if a field is empty or null in a text file using awk and bash - linux

I have two text files and I want to compare their corresponding values by row and column. Each value (field) in the text files is separated by tabs.
Here are the files:
file1.txt
Name Col1 Col2 Col3
-----------------------
row1 1 4 7
row2 2 5 8
row3 3 6 9
file2.txt
Name Col1 Col2 Col3
-----------------------
row2 1 4 11
row1 2 5 12
row3 3 9
Here is the code I have so far:
awk '
FNR < 2 { next }
FNR == NR {
    for (i = 2; i <= NF; i++) {
        a[i,$1] = $i;
    }
    next;
}
# only compare if a row in file2 exists in file1
($1 in b) {
    for (i = 2; i <= NF; i++) {
        if (a[i,$1] == $i) {
            print "EQUAL"
        }
        else if ( //condition that checks if value is null// ) {
            print "NULL"
        }
        else {
            print "NOT EQUAL"
        }
    }
}' file1.txt file2.txt
I am having difficulties checking whether there is a null value (row3, Col2 in file2.txt). I don't even get any output for that null value. So far I have tried if ($i == "") and it is still not giving me any output. Any suggestions? Thanks. (I'm using GNU awk in a bash script.)
Let me know if further explanation is required.

Just set the FS to tab:
awk -F'\t' '....'
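With the default FS, awk splits each record on runs of whitespace, so an empty tab-delimited field simply disappears and NF shrinks; with FS set to a single tab, the empty field survives and the $i == "" test works. Here is a minimal sketch of the fixed script (assuming the missing value in file2.txt really is an empty field between two tabs, and that the dashed rule shown above is display formatting, not part of the files):

awk -F'\t' '
FNR < 2 { next }
FNR == NR {                     # first file: cache fields, keyed by row name
    for (i = 2; i <= NF; i++)
        a[i,$1] = $i
    rows[$1] = 1                # remember which rows file1 contains
    next
}
$1 in rows {                    # only compare rows that exist in file1
    for (i = 2; i <= NF; i++) {
        if ($i == "")           # an empty field between two tabs
            print "NULL"
        else if (a[i,$1] == $i)
            print "EQUAL"
        else
            print "NOT EQUAL"
    }
}' file1.txt file2.txt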

Related

Bash: Reading CSV text file and finding average of rows

This is the sample input (the data has user-IDs and the number of hours spent by the user):
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
I need to read the data, find all user IDs ending in even numbers (2, 4, 6, 8, ...), and find the average number of hours spent (over five days).
I wrote the following script:
hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
    if [[ $col2 == *"2" ]]; then
        #echo "$col2"
        ((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"4" ]]; then
        #echo "$col2"
        ((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"6" ]]; then
        #echo "$col2"
        ((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"8" ]]; then
        #echo "$col2"
        ((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"10" ]]; then
        #echo "$col2"
        ((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
    fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"
This is not a very good way of doing this. Also, the numbers aren't adding up correctly.
I am getting the following output (for the first one - user2):
27
5
I am expecting the following output:
27
5.4
What would be a better way to do it? Any help would be appreciated.
TIA
Your description is fairly imprecise, but here's an attempt primarily based on the sample output:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file
20
4
27
5.4
$2~/[24680]$/ makes sure we only look at "even" user-IDs.
for(i=3;i<=7;i++){} iterates over the day columns and adds them.
Edit 1:
Accommodating new requirement:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad
User4 4
User2 5.4
Sample data showing user IDs with even and odd endings, a user ID showing up more than once (e.g., User2), and some non-integer values:
$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
One awk solution to find total hours plus averages, across 5x days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt
Where:
-F ',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)
The above generates:
User120 27 5.4
User2 55 11
User4 20 4
Depending on the OP's desired output, the print can be replaced with printf and the desired format string ...
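For example, to mirror the two-significant-digit averages from the earlier answer (the format string here is just an illustrative choice):

awk -F',' 'FNR==1 { next }
           $2 ~ /[02468]$/ { tot[$2] += ($3+$4+$5+$6+$7) }
           END { for (i in tot) printf "%s\t%s\t%.2g\n", i, tot[i], tot[i]/5 }' user-list.txt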
Your issue is echo "$((hoursarray[0]/5))": Bash arithmetic does not support floating point, so it returns the integer portion only.
Easy to demonstrate:
$ hours=27
$ echo "$((hours/5))"
5
If you want to stick to Bash, you could use bc for the floating point result:
$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
Or use awk, perl, python, ruby etc.
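For instance, a quick awk equivalent of the bc call above:

$ awk -v h=27 'BEGIN { printf "%.1f\n", h/5 }'
5.4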
Here is an awk script you can pick apart. It is easily modified to your use case (which is a little unclear to me):
awk -F, 'FNR==1{print $2; next}
{arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User1 27 5.4
User2 27 5.4
User3 22 4.4
User4 20 4
User5 40 8
If you only want even users, filter for User that end in any of 0,2,4,6,8:
awk -F, 'FNR==1{print $2; next}
$2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User2 27 5.4
User4 20 4
Here is your script modified a little bit. The sed call strips the non-digit prefix from the user ID, and since (( n % 2 )) succeeds only for odd numbers, the || branch computes the average only for even-numbered IDs (tr turns the comma-separated hours into a sum for bc):
while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
do
    (( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
done < <(tail -n+2 list.txt)
prints:
For Computer3 User4 average is: 4.00000000000000000000
For Computer5 User2 average is: 5.40000000000000000000

How to sum column values of items with shared substring in first column using bash

I am trying to sum values across rows of a dataframe for rows which have a shared substring in the first column. The data looks like this:
ID Data_1 Data_2 Data_3 Data_4
SRW8002300_T01 1 2 3 4
SRW8002300_T02 1 2 3 4
SRW8002300_T03 1 2 3 4
SRW8004500_T01 1 2 3 4
SRW8004500_T02 1 2 3 4
SRW8006000_T01 1 2 3 4
I want to sum the 2nd to 5th column values when the first part of the ID (the part before the underscore) is shared. So the above would become:
ID Data_1 Data_2 Data_3 Data_4
SRW8002300 3 6 9 12
SRW8004500 2 4 6 8
SRW8006000 1 2 3 4
So far I've got an awk command that can strip the IDs of the string after the underscore:
awk '{print $1}' filename | awk -F'_' '{print $1}'
And another to sum column values if the value in the first column is shared:
awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5} END {for (i in a) print i, a[i], b[i], c[i], d[i]}' filename
However, I am struggling to combine these two commands to create a new dataframe with summed values for the shared IDs.
I usually code in python but am trying to get into the habit of writing bash scripts for these sorts of tasks.
Thank you for any help.
Assuming your key values are contiguous as shown in your sample input:
$ cat tst.awk
NR==1 { print; next }
{
    curr = $1
    sub(/_.*/,"",curr)
    if ( curr != prev ) {
        prt()
    }
    for (i=2; i<=NF; i++) {
        sum[i] += $i
    }
    prev = curr
}
END { prt() }

function prt() {
    if ( prev != "" ) {
        printf "%s%s", prev, OFS
        for (i=2; i<=NF; i++) {
            printf "%d%s", sum[i], (i<NF ? OFS : ORS)
        }
        delete sum
    }
}
$ awk -f tst.awk file
ID Data_1 Data_2 Data_3 Data_4
SRW8002300 3 6 9 12
SRW8004500 2 4 6 8
SRW8006000 1 2 3 4
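If the groups were ever not contiguous, here is a sketch of an array-based variant, essentially combining the two commands from the question. Note that plain awk's for-in traversal order is unspecified, so rows may come out in a different order (GNU awk's PROCINFO["sorted_in"] can control it):

awk '
NR == 1 { print; next }        # pass the header through unchanged
{
    id = $1
    sub(/_.*/, "", id)         # strip the part after the underscore
    ids[id] = 1                # remember the bare ID
    for (i = 2; i <= NF; i++)
        sum[id, i] += $i
    n = NF                     # assumes every data row has the same column count
}
END {
    for (id in ids) {
        out = id
        for (i = 2; i <= n; i++)
            out = out OFS sum[id, i]
        print out
    }
}' filename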

Insert a row and a column in a matrix using awk

I have a gridded dataset with 250 rows x 300 columns in matrix form:
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values at the first column and longitude values at the top. Which looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33
I can do it manually for a small matrix, but I can't see how to get my output in the desired format. I was writing a script along the following lines, but it is completely useless:
echo 20 > latitude.txt
for i in `seq 1 250`; do
    i1=$(( i + 0.33 ))   # bash can't recognize fractions
    echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`; do
    j1=$(( j + 0.33 ))
    echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt
$ cat tst.awk
BEGIN {
    lat = 100
    lon = 20
    latWid = lonWid = 6
    latDel = lonDel = 0.33
    latFmt = lonFmt = "%*.2f"
}
NR==1 {
    printf "%*s", latWid, ""
    for (i=1; i<=NF; i++) {
        printf lonFmt, lonWid, lon
        lon += lonDel
    }
    print ""
}
{
    printf latFmt, latWid, lat
    lat += latDel
    for (i=1; i<=NF; i++) {
        printf "%*s", lonWid, $i
    }
    print ""
}
$ awk -f tst.awk file
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
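A note on the format strings: with GNU awk (and most current awks), the * in %*.2f takes the field width from the preceding printf argument, which is what lets the widths be configured once in the BEGIN block. A tiny demo:

$ awk 'BEGIN { printf "[%*.2f]\n", 8, 3.14159 }'
[    3.14]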
The following awk may also help here:
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {col+=.33;$1=$1;print col OFS $0}' OFS="\t" Input_file
And here is the non-one-liner form of the above solution:
awk -v col=100 -v row=20 '
FNR==1{
    printf OFS;
    for(i=1;i<=NF;i++){
        printf row OFS;
        row=row+.33;
    };
    print ""
}
{
    col+=.33;
    $1=$1;
    print col OFS $0
}
' OFS="\t" Input_file
Awk solution:
awk 'NR == 1{
         long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
         for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
     NR > 1{ lat += 0.33 }
     {
         printf "%.2f%6s", lat, "";
         for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
     }' file
With perl
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
              print join "\t", 100+(0.33*$i++), @F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a to auto-split input on whitespace, result saved in the @F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
if $.==1 for the first line of input
map {20.00+$_*0.33} 0..$#F iterates based on the size of the @F array; each iteration yields a value from the expression inside {}, where $_ will be 0, 1, etc. up to the last index of @F
print join "\t", "", map... use tab separator to print empty element and results of map
For all lines, print the contents of the @F array prefixed with the result of 100+(0.33*$i++), where $i is initially 0 in numeric context. Again, tab is used as the separator while joining these values
Use sprintf if needed for formatting, also $, can be initialized instead of using join
perl -lane 'BEGIN{$,="\t"; $st=0.33}
            print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
            print sprintf("%.2f", 100+($st*$i++)), @F' ip.txt

Awk: Sum up column values across multiple files with identical column layout

I have a number of files with the same header:
COL1, COL2, COL3, COL4
You can ignore COL1-COL3. COL4 contains a number. Each file contains about 200 rows. I am trying to sum up across the rows. For example:
File 1
COL1 COL2 COL3 COL4
x y z 3
a b c 4
File 2
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Then a new file is returned:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
Is there a simple way to do this without AWK? I will use AWK if need be, I just thought there might be a simple one-liner that I could just run right away. The AWK script I have in mind feels a bit long.
Thanks
Combining paste with awk, as in Kristo Mägi's answer, is your best bet:
paste merges the corresponding lines from the input files,
which sends a single stream of input lines to awk, with each input line containing all fields to sum up.
Assuming a fixed number of input files and columns, Kristo's answer can be simplified to (making processing much more efficient):
paste file1 file2 | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
Note: The above produces space-separated output columns, because awk's default value for OFS, the output-field separator, is a single space.
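If tab-separated output is wanted instead, set OFS explicitly (a small variant of the same command):

paste file1 file2 | awk -v OFS='\t' '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'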
Assuming that all files have the same column structure and line count, below is a generalization of the solution, which:
generalizes to more than 2 input files (and more than 2 data rows)
generalizes to any number of fields, as long as the field to sum up is the last one.
#!/bin/bash
files=( file1 file2 )   # array of input files
paste "${files[@]}" | awk -v numFiles=${#files[@]} -v OFS='\t' '
{
    row = sep = ""
    for(i=1; i < NF/numFiles; ++i) { row = row sep $i; sep = OFS }
    sum = $(NF/numFiles)   # last header col. / (1st) data col. to sum
    if (NR > 1) { for(i=2; i<=numFiles; ++i) sum += $(NF/numFiles * i) }   # add the other cols.
    printf "%s%s%s\n", row, OFS, sum
}
'
Note that \t (the tab char.) is used to separate output fields and that, due to relying on awk's default line-splitting into fields, preserving the exact input whitespace between fields is not guaranteed.
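Assuming file1 and file2 contain the question's sample data, the script prints (tab-separated):

COL1 COL2 COL3 COL4
x y z 8
a b c 14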
If all files have the same header, here's an awk solution:
awk '!f && FNR==1{ f=1; print $0 }                    # print the header only once
     FNR>1{ s[FNR]+=$NF; $NF=""; r[FNR]=$0 }          # per row: sum the last field, remember the rest
     END{ for(i=2;i<=FNR;i++) print r[i],s[i] }' File[12]
The output (for 2 files):
COL1 COL2 COL3 COL4
x y z 8
a b c 14
This approach can be applied to multiple files (in that case you may specify globbing File* for filename expansion)
One more option.
The command:
paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
The result:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
What it does:
Test files:
$ cat f1.txt
COL1 COL2 COL3 COL4
x y z 3
a b c 4
$ cat f2.txt
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Command: paste f{1,2}.txt
Joins 2 files and gives output:
COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10
Command: sed '1d'
Is meant to remove header temporarily
Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-3 and sums $4 and $8 from paste result.
Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back
EDIT:
Following @mklement0's comment: he is right about the header handling, as I forgot the NR==1 part.
So I'll mirror his updated version here as well:
paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
You state you have "a number of files", i.e., more than 2.
Given these 3 files (it should work with any number):
$ cat f1 f2 f3
COL1 COL2 COL3 COL4
x y z 3
a b c 4
COL1 COL2 COL3 COL4
x y z 5
a b c 10
COL1 COL2 COL3 COL4
x y z 10
a b c 15
You can do:
$ awk 'FNR==1{next}
{sum[$1]+=$4}
END{print "COL1 COL4";
for (e in sum) print e, sum[e]} ' f1 f2 f3
COL1 COL4
x 18
a 29
It is unclear what you intend to do with COL2 or COL3, so I did not add that.
$ awk '
NR==1 { print }                                                        # header from the first file
{ sum[FNR]+=$NF; sub(/[^[:space:]]+[[:space:]]*$/,""); pfx[FNR]=$0 }   # sum the last field; keep the rest of the row
END { for(i=2;i<=FNR;i++) print pfx[i] sum[i] }
' file1 file2
COL1 COL2 COL3 COL4
x y z 8
a b c 14
The above will work robustly and efficiently with any awk on any UNIX system, with any number of input files and with any contents of those files. The only potential problem would be that it has to retain the equivalent of 1 of those files in memory so if each file was absolutely massive then you may exhaust available memory.

Calculate mean of each column ignoring missing data with awk

I have a large tab-separated data table with thousands of rows and dozens of columns and it has missing data marked as "na". For example,
na 0.93 na 0 na 0.51
1 1 na 1 na 1
1 1 na 0.97 na 1
0.92 1 na 1 0.01 0.34
I would like to calculate the mean of each column, but making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk but I am not sure how to construct the command to do this for all columns and account for missing data.
All I know how to do is to calculate the mean of a single column but it treats the missing data as 0 rather than leaving it out of the calculation.
awk '{sum+=$1} END {print sum/NR}' filename
This is obscure, but it works for your example:
awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
EDIT:
Here is how it works:
awk '{for(i=1; i<=NF; i++){ #for each column
sum[i] += $i; #add the sum to the "sum" array
if($i != "na"){ #if value is not "na"
count[i]+=1} #increment the column "count"
} #endif
} #endfor
END { #at the end
for(i=1; i<=NF; i++){ #for each column
if(count[i]!=0){ #if the column count is not 0
v = sum[i]/count[i] #then calculate the column mean (here represented with "v")
}else{ #else (if column count is 0)
v = 0 #then let mean be 0 (note: you can set this to be "na")
}; #endif col count is not 0
if(i<NF){ #if the column is before the last column
printf "%f\t",v #print mean + TAB
}else{ #else (if it is the last column)
print v} #print mean + NEWLINE
}; #endif
}' input.txt #endfor (note: input.txt is the input file)
A possible solution:
awk -F"\t" '{for(i=1; i <= NF; i++)
{if($i == $i+0){sum[i]+=$i; denom[i] += 1;}}}
END{for(i=1; i<= NF; i++){line=line""sum[i]/(denom[i]?denom[i]:1)FS}
print line}' inputFile
The output for the given data:
0.973333 0.9825 0 0.7425 0.01 0.7125
Note that the third column contains only "na" and the output is 0. If you want the output to be na, then change the END{...}-block to:
END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS}
print line}'
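Put together as a runnable command, the na-preserving variant (identical except for the ternary in the END block) yields, for the sample input:

awk -F"\t" '{for(i=1; i <= NF; i++)
             {if($i == $i+0){sum[i]+=$i; denom[i] += 1;}}}
     END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS}
     print line}' inputFile
0.973333 0.9825 na 0.7425 0.01 0.7125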
