Calculate mean of each column ignoring missing data with awk - linux

I have a large tab-separated data table with thousands of rows and dozens of columns, with missing data marked as "na". For example,
na 0.93 na 0 na 0.51
1 1 na 1 na 1
1 1 na 0.97 na 1
0.92 1 na 1 0.01 0.34
I would like to calculate the mean of each column, but making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk but I am not sure how to construct the command to do this for all columns and account for missing data.
All I know how to do is calculate the mean of a single column, but that treats the missing data as 0 rather than leaving it out of the calculation.
awk '{sum+=$1} END {print sum/NR}' filename

This is obscure, but works for your example
awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
EDIT:
Here is how it works:
awk '{for(i=1; i<=NF; i++){     #for each column
        sum[i] += $i;           #add the value to the "sum" array ("na" coerces to 0, so it adds nothing)
        if($i != "na"){         #if the value is not "na"
            count[i]+=1         #increment the column "count"
        }                       #endif
    }                           #endfor
}
END {                           #at the end
    for(i=1; i<=NF; i++){       #for each column
        if(count[i]!=0){        #if the column count is not 0
            v = sum[i]/count[i] #then calculate the column mean (here represented with "v")
        }else{                  #else (if the column count is 0)
            v = 0               #then let the mean be 0 (note: you can set this to "na")
        }                       #endif
        if(i<NF){               #if the column is before the last column
            printf "%f\t",v     #print the mean + TAB
        }else{                  #else (it is the last column)
            print v             #print the mean + NEWLINE
        }                       #endif
    }                           #endfor
}' input.txt                    #input.txt is the input file
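For a quick end-to-end check, here is the same logic as a paste-ready sketch (the file name input.txt is an assumption); the only change is that the sum is accumulated inside the if, so it no longer relies on "na" coercing to 0:

```shell
# Sample table from the question, tab-separated, with "na" for missing cells
printf 'na\t0.93\tna\t0\tna\t0.51\n1\t1\tna\t1\tna\t1\n1\t1\tna\t0.97\tna\t1\n0.92\t1\tna\t1\t0.01\t0.34\n' > input.txt

# Same logic, slightly tightened: only non-"na" cells are summed and counted
result=$(awk '{for(i=1; i<=NF; i++) if($i != "na"){sum[i] += $i; count[i]++}}
              END {for(i=1; i<=NF; i++){v = (count[i] ? sum[i]/count[i] : 0);
                   printf "%f%s", v, (i<NF ? "\t" : "\n")}}' input.txt)
echo "$result"
```

Column 1 comes out as 0.973333, i.e. (1+1+0.92)/3, as the question asks.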

A possible solution:
awk -F"\t" '{for(i=1; i <= NF; i++)
{if($i == $i+0){sum[i]+=$i; denom[i] += 1;}}}
END{for(i=1; i<= NF; i++){line=line""sum[i]/(denom[i]?denom[i]:1)FS}
print line}' inputFile
The output for the given data:
0.973333 0.9825 0 0.7425 0.01 0.7125
Note that the third column contains only "na" and the output is 0. If you want the output to be na, then change the END{...}-block to:
END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS}
print line}'
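To confirm the behavior of the na variant on the sample data (the file name inputFile is an assumption), this sketch runs the modified END block; note that column 3, which is all missing, now prints as na:

```shell
# Same sample table as in the question, tab-separated
printf 'na\t0.93\tna\t0\tna\t0.51\n1\t1\tna\t1\tna\t1\n1\t1\tna\t0.97\tna\t1\n0.92\t1\tna\t1\t0.01\t0.34\n' > inputFile

# "na"-preserving END block: columns with no defined values print as "na"
result=$(awk -F"\t" '{for(i=1; i <= NF; i++)
             {if($i == $i+0){sum[i]+=$i; denom[i] += 1;}}}
         END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS}
             print line}' inputFile)
echo "$result"
```

The output is tab-separated (with a trailing FS, since the loop appends a separator after every field, including the last).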

Related

How to sum column values of items with shared substring in first column using bash

I am trying to sum values across rows of a dataframe for rows which have a shared substring in the first column. The data looks like this:
ID Data_1 Data_2 Data_3 Data_4
SRW8002300_T01 1 2 3 4
SRW8002300_T02 1 2 3 4
SRW8002300_T03 1 2 3 4
SRW8004500_T01 1 2 3 4
SRW8004500_T02 1 2 3 4
SRW8006000_T01 1 2 3 4
I want to sum the 2nd to 5th column values when the first part of the ID (the part before the underscore) is shared. So the above would become:
ID Data_1 Data_2 Data_3 Data_4
SRW8002300 3 6 9 12
SRW8004500 2 4 6 8
SRW8006000 1 2 3 4
So far I've got an awk command that can strip the IDs of the string after the underscore:
awk '{print $1}' filename | awk -F'_' '{print $1}'
And another to sum column values if the value in the first column is shared:
awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5} END {for (i in a) print i, a[i], b[i], c[i], d[i]}' filename
However, I am struggling to combine these two commands to create a new dataframe with summed values for the shared IDs.
I usually code in python but am trying to get into the habit of writing bash scripts for these sorts of tasks.
Thank you for any help.
Assuming your key values are contiguous as shown in your sample input:
$ cat tst.awk
NR==1 { print; next }
{
    curr = $1
    sub(/_.*/,"",curr)
    if ( curr != prev ) {
        prt()
    }
    for (i=2; i<=NF; i++) {
        sum[i] += $i
    }
    prev = curr
}
END { prt() }

function prt() {
    if ( prev != "" ) {
        printf "%s%s", prev, OFS
        for (i=2; i<=NF; i++) {
            printf "%d%s", sum[i], (i<NF ? OFS : ORS)
        }
        delete sum
    }
}
$ awk -f tst.awk file
ID Data_1 Data_2 Data_3 Data_4
SRW8002300 3 6 9 12
SRW8004500 2 4 6 8
SRW8006000 1 2 3 4
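If the key values are not guaranteed to be contiguous, a hedged alternative (the file name and the 4-data-column width are assumptions) keys the sums on the ID prefix and remembers first-seen order:

```shell
# Sample input from the question (file name is an assumption)
cat > file <<'EOF'
ID Data_1 Data_2 Data_3 Data_4
SRW8002300_T01 1 2 3 4
SRW8002300_T02 1 2 3 4
SRW8002300_T03 1 2 3 4
SRW8004500_T01 1 2 3 4
SRW8004500_T02 1 2 3 4
SRW8006000_T01 1 2 3 4
EOF

# Accumulate per-prefix sums keyed by the ID before the underscore,
# remembering first-seen order; exactly 4 data columns are assumed
result=$(awk 'NR==1 { print; next }
              { key = $1; sub(/_.*/, "", key)
                if (!(key in seen)) { seen[key] = 1; order[++n] = key }
                for (i = 2; i <= 5; i++) sum[key, i] += $i }
              END { for (k = 1; k <= n; k++) {
                      printf "%s", order[k]
                      for (i = 2; i <= 5; i++) printf " %d", sum[order[k], i]
                      printf "\n" } }' file)
echo "$result"
```

This trades the single prev variable for associative arrays, so rows for the same prefix can appear anywhere in the file.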

Compute average if satisfies the given condition in shell script

I have a dataset with many kinds of missing values such as 9990, 9999, 9999000, 999999 and many more, but all are greater than or equal to 9990. I would like to take the average of every 24 values. I am trying the following command but am not getting my desired output.
awk '{if ($1 < 9990) sum += $1; count++} NR%24==0{print count ? (sum) :9999;sum=count=0}' ifile
For example: I need the average of every 3 lines in the following data
3
3
4
9999
4
99990
13
3
999999
9999
9991
99954
I tried with this, but it shows a different result:
awk '{if ($1 < 9990)sum += $1; count++} NR%3==0{print count ? (sum/count) :9999;sum=count=0}' ifile
My desired output is
3.33
4 Average of 9999, 4, 99990 is done with 4/1, because 9999 and 99990 are undefined values.
8 Average of 13, 3, 999999 is done with (13+3)/2, because 999999 is an undefined value, so it is excluded from the average.
9999 All are undefined values, so denoted as 9999.
$1 < 9990 {
    sum += $1;
    count++;
}
NR % 3 == 0 {
    if (count == 0) {
        print "9999";
    } else {
        print sum / count;
    }
    sum = 0;
    count = 0;
}
Your mistake is to increment count even when the value is "undefined". If you write
{if ($1 < 9990) sum += $1; count++}
then the if statement ends at the semicolon, not at the closing brace, so count++ runs on every line.
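Running the corrected pattern-action program on the sample data (file name ifile as in the question) gives the intended grouping; note that awk's default OFMT prints 3.33333 rather than the rounded 3.33:

```shell
# The twelve sample values from the question
printf '3\n3\n4\n9999\n4\n99990\n13\n3\n999999\n9999\n9991\n99954\n' > ifile

# Corrected program: count is only incremented for defined values (< 9990)
result=$(awk '$1 < 9990 { sum += $1; count++ }
              NR % 3 == 0 { print (count ? sum / count : 9999); sum = count = 0 }' ifile)
echo "$result"
```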

How to remove the columns which contains NA in linux

I would like to remove every column that contains any NA values. I used this command
awk ' $0 !="NA" {print $0}' file
but it does not work.
For example, the file is as following
1 2 3 NA 6 male
4 6 2 1 NA female
NA 2 2 NA 3 male
7 2 2 7 NA male
I want the output file to be
2 3 male
6 2 female
2 2 male
2 2 male
You need to make two passes over the data. The first pass should save all the input in an array, find the column numbers that contain NA, and save that in another array. Then at the end you print all the saved data, but skip over the columns that are in the second array.
awk '{ lines[NR] = $0; for (i = 1; i <= NF; i++) if ($i == "NA") skip[i] = 1; }
     END { for (i = 1; i <= NR; i++) {
               nf = split(lines[i], fields);
               for (j = 1; j <= nf; j++) if (!(j in skip)) printf("%s ", fields[j]);
               printf("\n");
           }
     }' inputfile > outputfile
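As a quick sanity check, the two-pass program can be run on the sample as-is (note that each output line carries a trailing space from the printf("%s ", ...)):

```shell
# Sample file from the question (file names are assumptions)
cat > inputfile <<'EOF'
1 2 3 NA 6 male
4 6 2 1 NA female
NA 2 2 NA 3 male
7 2 2 7 NA male
EOF

# Pass 1: save lines and record column numbers that ever contain "NA";
# pass 2 (END): reprint every saved line, skipping the recorded columns
result=$(awk '{ lines[NR] = $0; for (i = 1; i <= NF; i++) if ($i == "NA") skip[i] = 1 }
              END { for (i = 1; i <= NR; i++) {
                        nf = split(lines[i], fields)
                        for (j = 1; j <= nf; j++) if (!(j in skip)) printf("%s ", fields[j])
                        printf("\n") } }' inputfile)
echo "$result"
```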

How do I check if a field is empty or null in a text file using awk and bash

I have two text files and I want to compare their correspondent values according to their rows and columns. Each value (field) in the text file is separated by tabs.
Here are the files:
file1.txt
Name Col1 Col2 Col3
-----------------------
row1 1 4 7
row2 2 5 8
row3 3 6 9
file2.txt
Name Col1 Col2 Col3
-----------------------
row2 1 4 11
row1 2 5 12
row3 3 9
Here is the code I have so far:
awk '
FNR < 2 {next}
FNR == NR {
for (i = 2; i <= NF; i++) {
a[i,$1] = $i;
}
next;
}
# only compare if a row in file2 exists in file1
($1 in b) {
for (i = 2; i <= NF; i++)
{
if (a[i,$1] == $i)
{
print "EQUAL"
}
else if ( //condition that checks if value is null// )
{
print "NULL"
}
else
{
print "NOT EQUAL"
}
}
}' file1.txt file2.txt
I am having difficulties with checking if there is a null value (row3 and col2 in file2.txt) in file2.txt. I don't even get an output for that null value. So far I tried if ($i == "") and it is still not giving me any output. Any suggestions? Thanks. (I'm using gnu awk in a bash script)
Let me know if further explanation is required.
Just set the FS to tab. With the default FS, any run of spaces or tabs counts as a single separator, so the empty field in row3 simply disappears; with a literal-tab separator it is preserved as an empty string that $i == "" can match:
awk -F'\t' '....'
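The difference is easy to see in isolation; this sketch (the file name and message text are illustrative) shows the $i == "" test firing once the separator is a literal tab:

```shell
# A row with an empty middle field, tab-separated (like row3 in file2.txt)
printf 'row3\t3\t\t9\n' > file2.txt

# With -F'\t' the empty field survives field splitting and $i == "" matches;
# with the default FS the run of tabs would collapse into one separator
result=$(awk -F'\t' '{ for (i = 2; i <= NF; i++)
                         if ($i == "") print "NULL at " $1 " field " i }' file2.txt)
echo "$result"
```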

How to loop an awk command on every column of a table and output to a single output file?

I have a multi-column file composed of single-unit 1s, 2s and 3s. There are a lot of repeats of a unit in each column, and sometimes it switches from one to another. I want to count how many times this switch happens in every column. For example, in column 1 the value switches from 1 to 2 to 3 to 1, so there are 3 switches and the output should be 3. In the second column there are 2s the entire column, so the change count is 0 and the output is 0.
My input file has 4000 columns so it is impossible to do it by hand. The file is space separated.
For example:
Input:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 ......
3 2 2 1 2
3 2 2 1 1
1 2 2 1 1
1 2 2 1 2
1 2 2 1 1
Desired output:
3 ## column 1 switch times
0 ## column 2 switch times
3 .....
0
5
I was using:
awk '{print $1}' <inputfile> | uniq | wc -l
awk '{print $2}' <inputfile> | uniq | wc -l
awk '{print $3}' <inputfile> | uniq | wc -l
....
This executes one column at a time. It gives me the output "4" for the first column, and afterwards I just calculate 4-1=3 to get my desired output. But is there a way to write this awk command as a loop that runs on each column and outputs everything to one file?
Thanks!
awk tells you how many fields are in a given row in the variable NF, so you can create two arrays to keep track of the information you need. One array will keep the value of the last row in the given column. The other will count the number of switches in a given column. You'll also keep a track of the maximum number of columns (and set the counts for new columns to zero so that they are printed appropriately in the output at the end if the number of switches is 0 for that column). You'll also make sure you don't count the transition from an empty string to a non-empty string — which happens when the column is encountered for the first time.
If, in fact, the file is uniformly the same number of columns, that will only affect the first row of data. If subsequent rows actually have more columns than the first line, then it adds them. If a column stops appearing for a bit, I've assumed it should resume where it left off (as if the missing columns had the same value as before). You can decide on different algorithms; that could count as two transitions (from number to blank, and from blank to number). If that's the case, you have to modify the counting code. Or, perhaps more sensibly, you could decide that irregular numbers of columns are simply not allowed, in which case you can bail out early if the number of columns in the current row is not the same as in the previous row (beware blank lines, or are they outlawed too?).
And you won't try writing the whole program on one line because it will be incomprehensible and it really isn't necessary.
awk '{ if (NF > maxNF)
       {
           for (i = maxNF + 1; i <= NF; i++)
               count[i] = 0;
           maxNF = NF;
       }
       for (i = 1; i <= NF; i++)
       {
           if (col[i] != "" && $i != col[i])
               count[i]++;
           col[i] = $i;
       }
     }
     END {
         for (i = 1; i <= maxNF; i++)
             print count[i];
     }' data-file-with-4000-columns
Given your sample data (with the dots removed), the output from the script is as requested:
3
0
3
0
5
This alternative data file with jagged rows:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 1 1 1
3 2 2 1 2 2 1
3 2 2 1 1
1 2 2 1 1 2 2 1
1 2 2 1
1 2 2 1 1 3
produces the output:
3
0
3
0
3
2
1
0
Which is correct according to the rules I formulated — but if you decide you want different rules to cover the data, you can end up with different answers.
If you used printf("%d\n", count[i]); in the final loop, you'd not need to set the count values to zero in a loop. You pays your money and takes your pick.
Use a loop, and keep one array for each column's current value and another array for the corresponding count:
awk '{for(i=0;i<5;i++) if(c[i]!=$(i+1)) {c[i]=$(i+1); t[i]++}} END{for(i=0;i<5;i++)print t[i]-1}' filename
Note that this assumes that the column values are not zero (an uninitialized array entry compares equal to both "" and 0). If you happen to have zero values, then just initialize the array c to some unique sentinel value which will not be present in the file.
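To illustrate the zero-value caveat, here is a sketch of the sentinel initialization (the column count of 2 and the "\001" sentinel are assumptions for the demo):

```shell
# Two columns; the first starts with zeros, which defeats the default
# empty-string initialization of c[] (uninitialized compares equal to 0)
printf '0 1\n0 1\n1 1\n' > filename

# Seed c[] with a sentinel that cannot appear in the data, then count
# transitions and subtract the initial sentinel-to-value one at the end
result=$(awk 'BEGIN { for (i = 0; i < 2; i++) c[i] = "\001" }
              { for (i = 0; i < 2; i++) if (c[i] != $(i+1)) { c[i] = $(i+1); t[i]++ } }
              END { for (i = 0; i < 2; i++) print t[i] - 1 }' filename)
echo "$result"
```

Without the BEGIN block, the leading zeros in column 1 would compare equal to the uninitialized c[0] and the first real switch would go uncounted.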
Coded out for ease of viewing; SaveColx and CountColx should really be arrays. I'd print the column number itself in the results, at least for checking :-)
BEGIN {
    SaveCol1 = " "
    CountCol1 = 0
    CountCol2 = 0
    CountCol3 = 0
    CountCol4 = 0
    CountCol5 = 0
}
{
    if ( SaveCol1 == " " ) {
        SaveCol1 = $1
        SaveCol2 = $2
        SaveCol3 = $3
        SaveCol4 = $4
        SaveCol5 = $5
        next
    }
    if ( $1 != SaveCol1 ) {
        CountCol1++
        SaveCol1 = $1
    }
    if ( $2 != SaveCol2 ) {
        CountCol2++
        SaveCol2 = $2
    }
    if ( $3 != SaveCol3 ) {
        CountCol3++
        SaveCol3 = $3
    }
    if ( $4 != SaveCol4 ) {
        CountCol4++
        SaveCol4 = $4
    }
    if ( $5 != SaveCol5 ) {
        CountCol5++
        SaveCol5 = $5
    }
}
END {
    print CountCol1
    print CountCol2
    print CountCol3
    print CountCol4
    print CountCol5
}
