Find the average of multiple columns for each distinct variable in column 1 - linux

Hi, I have a file with six columns, and I wish to know the average of three of them (columns 2, 3, 4) and the sum of the last two (columns 5 and 6) for each unique variable in column one.
A1234 0.526 0.123 0.456 0.986 1.123
A1234 0.423 0.256 0.397 0.876 0.999
A1234 0.645 0.321 0.402 0.903 1.101
A1234 0.555 0.155 0.406 0.888 1.009
B5678 0.111 0.345 0.285 0.888 0.789
B5678 0.221 0.215 0.305 0.768 0.987
B5678 0.336 0.289 0.320 0.789 0.921
I have come across code that will get the average of column 2 grouped by column one, but is there any way I can expand this across columns? Thanks
awk '{a[$1]+=$2; c[$1]++} END{for (i in a) printf "%s%s%.2f\n", i, OFS, a[i]/c[i]}'
I would like the output to be in the following format; each variable in column one can also have a different number of rows:
A1234 0.53725 0.21375 0.41525 3.653 4.232
B5678 0.22233 0.283 0.30333 2.445 2.697
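As a worked check of the expected numbers: the A1234 average in column 2 is (0.526 + 0.423 + 0.645 + 0.555) / 4 = 0.53725, and its column 5 sum is 0.986 + 0.876 + 0.903 + 0.888 = 3.653.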

awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5;e[$1]+=$6;f[$1]++} END{for (i in a) print i,a[i]/f[i],b[i]/f[i],c[i]/f[i],d[i],e[i]}' file
Output:
B5678 0.222667 0.283 0.303333 2.445 2.697
A1234 0.53725 0.21375 0.41525 3.653 4.232
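If you would rather not declare one array per column, here is a loop-based sketch of the same idea (assuming the same six-column layout; the hard-coded column ranges 2-4 and 5-6 would need adjusting for other files):
awk '{
    cnt[$1]++                                   # rows seen per key
    for (i = 2; i <= 4; i++) avg[$1, i] += $i   # columns to average
    for (i = 5; i <= 6; i++) tot[$1, i] += $i   # columns to sum
}
END {
    for (k in cnt)
        print k, avg[k, 2]/cnt[k], avg[k, 3]/cnt[k], avg[k, 4]/cnt[k], tot[k, 5], tot[k, 6]
}' file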

Try the following once and let me know if this helps you.
awk '{A[$1]=A[$1]?A[$1]+$5+$6:$5+$6;C[$1]=C[$1]?C[$1]+$2+$3+$4:$2+$3+$4;B[$1]++} END{for(i in A){print "Avg. for " i" =\t",C[i]/(B[i]*3) RS "Sum for " i" =\t",A[i]}}' Input_file
EDIT: Adding a non-one-liner form of the solution too now.
awk '{
  A[$1]=A[$1]?A[$1]+$5+$6:$5+$6;
  C[$1]=C[$1]?C[$1]+$2+$3+$4:$2+$3+$4;
  B[$1]++
}
END{
  for(i in A){
    print "Avg. for " i" =\t",C[i]/(B[i]*3) RS "Sum for " i" =\t",A[i]
  }
}
' Input_file
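Note: the A[$1]=A[$1]?A[$1]+$5+$6:$5+$6 ternaries are equivalent to plain A[$1]+=$5+$6, since awk treats an unset variable as 0 in a numeric context.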

awk solution:
awk '{ a[$1]++; avg[$1]+=$2+$3+$4; sum[$1]+=$5+$6 }
END{ for(i in a) printf "%s%.2f%s%.2f\n",i OFS,avg[i]/(a[i]*3),OFS,sum[i] }' file
The output (the 2nd column is the average of columns 2-4, the 3rd column the sum of columns 5-6):
B5678 0.27 5.14
A1234 0.39 7.88

To calculate the average of columns 2, 3, and 4 over the whole file:
awk '{ sum += $2 + $3 + $4 } END { print sum / (NR * 3) }'
To calculate the sum of columns 5 and 6 grouped by column 1:
awk '{ arr[$1] += $5 + $6 } END { for (a in arr) if (a) print a, arr[a] }'
To calculate the sum of columns 5 and 6 of the last row:
tail -1 file | awk '{sum += $5 + $6} END {print sum}'

Related

Compare multiple rows to pick the one with smallest value

I would like to compare the rows on the second column and get the row with the highest values in the subsequent columns, with priority column 3 > 4 > 5. I sorted my dataset on the second column so that rows with the same value are together.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
Here, if the second column of a row has the same value as another row's, then I compare their values in the third column and pick the row with the highest value in the third column.
If the rows have the same second and third columns, then I go to the fourth column, compare their values in this column, and take the row with the highest value.
If the rows sharing a second-column value still share the values in the third and fourth columns, then I pick the row with the highest value in the fifth column.
If the second, third, fourth, and fifth columns are all the same (complete duplicates), then I print them all, but append 'duplicate' to their fifth column.
If a row does not share its value in the second column with any other row, then there is no comparison and I keep that row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I tried so far fails, because I can only compare on the second column, not with conditions over multiple columns, and I cannot keep complete duplicates.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
I would appreciate learning how to fix it.
I'm afraid the code below is not sophisticated, but how about:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
NR==FNR {
    vals = $3","$4","$5
    if (max[$2] == "") max[$2] = vals
    else if (max[$2] == vals) dupe[$2] = 1
    next
} {
    vals = $3","$4","$5
    if (dupe[$2]) $6 = "duplicate"
    if (max[$2] == vals) print
}' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
It saves the sorted result in a temporary file "tmp.txt".
The second awk script processes the temporary file in two passes.
In the first pass, it records the "max value" (the $3,$4,$5 triple of the first, i.e. best, row) for each second-column key.
It also detects duplications and sets the dupe flag when found.
In the second pass, it assigns the string "duplicate" to field $6
if the line's key has the dupe flag.
Then it prints only the line(s) which carry the max value for their second-column key.
This may not be the most elegant solution, but it works:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
    rec[NR] = $0
    idx = sprintf("%s %s %s %s",$2,$3,$4,$5)
    irec[NR] = idx
    dup[idx]++
}
END{
    for(i in rec){
        if(dup[irec[i]] > 1){
            print rec[i] "duplicate"
        }else{
            print rec[i]
        }
    }
}'
rm /tmp/fgrep.$$
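For comparison, here is a sketch of a sort-plus-single-pass variant (a sketch only, assuming GNU sort and the whitespace-separated data shown above): it buffers every row tied with its group's maximum and appends 'duplicate' only when a group flushes more than one row.
sort -k2,2 -k3,3nr -k4,4nr -k5,5nr data.txt |
awk '
function flush(   i) {
    for (i = 1; i <= n; i++)
        print line[i] (n > 1 ? "duplicate" : "")
    n = 0
}
{
    key = $3 FS $4 FS $5
    if ($2 != grp) {            # new group: emit the buffered rows
        flush()
        grp = $2; best = key    # first row of a group holds the max after sorting
    }
    if (key == best)            # keep only rows tied with the group max
        line[++n] = $0
}
END { flush() }'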

Convert column to matrix format using awk

I have a gridded data file in column format as:
ifile.txt
x y value
20.5 20.5 -4.1
21.5 20.5 -6.2
22.5 20.5 0.0
20.5 21.5 1.2
21.5 21.5 4.3
22.5 21.5 6.0
20.5 22.5 7.0
21.5 22.5 10.4
22.5 22.5 16.7
I would like to convert it to matrix format as:
ofile.txt
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
where the top row (20.5 21.5 22.5) holds the y values, the side column holds the x values, and the inner cells hold the corresponding grid values.
I found a similar question here, Convert a 3 column file to matrix format, but the script is not working in my case.
The script is:
awk '{ h[$1,$2] = h[$2,$1] = $3 }
END {
for(i=1; i<=$1; i++) {
for(j=1; j<=$2; j++)
printf h[i,j] OFS
printf "\n"
}
}' ifile
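(The script fails here because its END block loops over integer indices 1..$1 and 1..$2, where $1 and $2 are simply fields of the last line read, while h is keyed by the actual coordinate values such as 20.5, so h[i,j] never matches anything.)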
The following awk script handles:
any size of matrix;
no relation between row and column indices, so it keeps track of them separately;
if a certain row/column index does not appear, its value defaults to zero.
This is done in this way:
awk '
BEGIN{ PROCINFO["sorted_in"] = "#ind_num_asc" }
(NR==1){ next }
{ row[$1]=1; col[$2]=1; val[$1" "$2]=$3 }
END {
    printf "%8s",""; for (j in col) { printf "%8.3f",j }; printf "\n"
    for (i in row) {
        printf "%8.3f",i; for (j in col) { printf "%8.3f",val[i" "j] }; printf "\n"
    }
}' <file>
How does it work:
PROCINFO["sorted_in"] = "#ind_num_asc" states that all arrays are traversed in ascending numeric index order.
(NR==1){next}: skip the first (header) line.
{row[$1]=1;col[$2]=1;val[$1" "$2]=$3}: process the line by storing the row index, the column index, and the accompanying value.
The END statement does all the printing.
This outputs:
20.500 21.500 22.500
20.500 -4.100 1.200 7.000
21.500 -6.200 4.300 10.400
22.500 0.000 6.000 16.700
Note: the usage of PROCINFO["sorted_in"] is a gawk feature.
However, if you make a couple of assumptions, you can do it much shorter:
the file contains all possible entries, with no missing values;
you do not want the indices of the rows and columns printed out;
the indices are sorted in column-major order.
Then you can use the following short versions:
sort -g <file> | awk '($1+0!=$1){next}
($1!=o)&&(NR!=1){printf "\n"}
{printf "%8.3f",$3; o=$1 }'
which outputs
-4.100 1.200 7.000
-6.200 4.300 10.400
0.000 6.000 16.700
or for the transposed:
awk '(NR==1){next}
($2!=o)&&(NR!=2){printf "\n"}
{printf "%8.3f",$3; o=$2 }' <file>
This outputs
-4.100 -6.200 0.000
1.200 4.300 6.000
7.000 10.400 16.700
Adjusted my old GNU awk solution for your current input data:
matrixize.awk script:
#!/bin/awk -f
BEGIN { PROCINFO["sorted_in"]="#ind_num_asc"; OFS="\t" }
NR==1{ next }
{
b[$1]; # accumulating unique indices
($1 != $2)? a[$1][$2] = $3 : a[$2][$1] = $3; # set `diagonal` relation between different indices
}
END {
h = "";
for (i in b) {
h = h OFS i # form header columns
}
print h; # print header column values
for (i in b) {
row = i; # index column
# iterating through the row values (for each intersection point)
for (j in a[i]) {
row = row OFS a[i][j]
}
print row
}
}
Usage:
awk -f matrixize.awk yourfile
The output:
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
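Note: this script also requires gawk, both for PROCINFO["sorted_in"] and for the arrays-of-arrays syntax a[$1][$2] (gawk 4.0 or later).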
Perl solution:
#!/usr/bin/perl -an
$h{ $F[0] }{ $F[1] } = $F[2] unless 1 == $.;
END {
    @s = sort { $a <=> $b } keys %h;
    print ' ' x 5;
    printf '%5.1f' x @s, @s;
    print "\n";
    for my $u (@s) {
        print "$u ";
        printf '%5.1f', $h{$u}{$_} for @s;
        print "\n";
    }
}
}
-n reads the input line by line
-a splits each line on whitespace into the @F array
See sort, print, printf, and keys.
awk solution:
sort -n ifile.txt | awk '
BEGIN{ header="\t" }
NR>1{
    if((NR-1)%3==1){ header=header sprintf("%4.1f\t",$1); matrix=matrix sprintf("%4.1f\t",$1) }
    matrix=matrix sprintf("%4.1f\t",$3)
    if((NR-1)%3==0 && NR!=10) matrix=matrix "\n"
}
END{ print header; print matrix }'
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
Explanations:
sort -n ifile.txt sorts the file numerically.
The header variable stores all the data necessary to create the header line. It is initialized to header="\t" and appended with the necessary information via header=header sprintf("%4.1f\t",$1) for lines satisfying (NR-1)%3==1.
In the same way you construct the matrix using the matrix variable: matrix=matrix sprintf("%4.1f\t",$1) creates the first column, matrix=matrix sprintf("%4.1f\t",$3) populates the matrix with the content, and if((NR-1)%3==0 && NR!=10) matrix=matrix "\n" adds the adequate end-of-line.

Calculate mean of each column ignoring missing data with awk

I have a large tab-separated data table with thousands of rows and dozens of columns and it has missing data marked as "na". For example,
na 0.93 na 0 na 0.51
1 1 na 1 na 1
1 1 na 0.97 na 1
0.92 1 na 1 0.01 0.34
I would like to calculate the mean of each column, but making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk but I am not sure how to construct the command to do this for all columns and account for missing data.
All I know how to do is to calculate the mean of a single column but it treats the missing data as 0 rather than leaving it out of the calculation.
awk '{sum+=$1} END {print sum/NR}' filename
This is obscure, but it works for your example:
awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
EDIT:
Here is how it works:
awk '{for(i=1; i<=NF; i++){ #for each column
sum[i] += $i; #add the field to the "sum" array ("na" adds 0)
if($i != "na"){ #if value is not "na"
count[i]+=1} #increment the column "count"
} #endfor
} #end of the per-line block
END { #at the end
for(i=1; i<=NF; i++){ #for each column
if(count[i]!=0){ #if the column count is not 0
v = sum[i]/count[i] #then calculate the column mean (here represented with "v")
}else{ #else (if column count is 0)
v = 0 #then let the mean be 0 (note: you can set this to "na")
}; #endif (column count check)
if(i<NF){ #if the column is before the last column
printf "%f\t",v #print mean + TAB
}else{ #else (if it is the last column)
print v} #print mean + NEWLINE
}; #endfor
}' input.txt #end of END block (input.txt is the input file)
A possible solution:
awk -F"\t" '{for(i=1; i <= NF; i++)
{if($i == $i+0){sum[i]+=$i; denom[i] += 1;}}}
END{for(i=1; i<= NF; i++){line=line""sum[i]/(denom[i]?denom[i]:1)FS}
print line}' inputFile
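The test $i == $i+0 holds only when the field is numeric, so the "na" cells are left out of both the sum and the denominator.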
The output for the given data:
0.973333 0.9825 0 0.7425 0.01 0.7125
Note that the third column contains only "na" and the output is 0. If you want the output to be na, then change the END{...}-block to:
END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS}
print line}'
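As a quick sanity check on the first column: the non-na values are 1, 1, and 0.92, so the mean is 2.92/3 ≈ 0.973333, matching the output above.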

Biggest and smallest of all lines

I have an output like this:
3.69
0.25
0.80
1.78
3.04
1.99
0.71
0.50
0.94
I want to find the biggest number and the smallest number in the above output.
I need output like:
smallest is 0.25 and biggest is 3.69
Just sort your input numerically first and print the first and last values. One method:
$ sort -n file | awk 'NR==1{min=$1}END{print "Smallest",min,"Biggest",$0}'
Smallest 0.25 Biggest 3.69
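If you want to avoid the sort altogether, a minimal single-pass awk sketch (assuming one number per line, as in the input above):
awk 'NR==1 { min = max = $1 }        # seed both with the first value
     $1 < min { min = $1 }           # track the running minimum
     $1 > max { max = $1 }           # track the running maximum
     END { print "smallest is", min, "and biggest is", max }' file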
Hope this helps.
OUTPUT="3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94"
SORTED=`echo $OUTPUT | tr ' ' '\n' | sort -n`
SMALLEST=`echo "$SORTED" | head -n 1`
BIGGEST=`echo "$SORTED" | tail -n 1`
echo "Smallest is $SMALLEST"
echo "Biggest is $BIGGEST"
Added the awk one-liner the OP requested.
I'm not good at awk, but this works anyway. :)
echo "3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94" | awk '{
for (i=1; i<=NF; i++) {
if (length(s) == 0) s = $i;
if (length(b) == 0) b = $i;
if ($i < s) s = $i;
if (b < $i) b = $i;
}
print "Smallest is", s;
print "Biggest is", b;
}'
You want an awk solution?
echo "3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94" | \
awk -v RS=' ' '/.+/ { biggest = ((biggest == "") || ($1 > biggest)) ? $1 : biggest;
smallest = ((smallest == "") || ($1 < smallest)) ? $1:smallest}
END { print biggest, smallest}'
This produces the following output:
3.69 0.25
You can also use this method:
sort -n file | echo -e `sed -nr '1{s/(.*)/smallest is :\1/gp};${s/(.*)/biggest no is :\1/gp}'`
TXR solution:
$ txr -e '(let ((nums [mapcar tofloat (gun (get-line))]))
(if nums
(pprinl `smallest is #(find-min nums) and biggest is #(find-max nums)`)
(pprinl "empty input")))'
0.1
-1.0
3.5
2.4
smallest is -1.0 and biggest is 3.5

Permutation columns without repetition

Can anybody give me a piece of code, an algorithm, or something else to solve the following problem?
I have several files, each with a different number of columns, like:
$> cat file-1
1 2
$> cat file-2
1 2 3
$> cat file-3
1 2 3 4
For each row, I would like to take the absolute difference of each pair of columns (each combination only once, without repeated column pairs) and divide it by the sum of all values in the row:
in the file-1 case I need to get:
0.3333 # because |1-2|/(1+2)
in the file-2 case I need to get:
0.1666 0.1666 0.3333 # because |1-2|/(1+2+3) and |2-3|/(1+2+3) and |1-3|/(1+2+3)
in the file-3 case I need to get:
0.1 0.2 0.3 0.1 0.2 0.1 # because |1-2|/(1+2+3+4) and |1-3|/(1+2+3+4) and |1-4|/(1+2+3+4) and |2-3|/(1+2+3+4) and |2-4|/(1+2+3+4) and |3-4|/(1+2+3+4)
This should work, though I am guessing you have made a minor mistake in your expected output. Based on your third pattern, the file-2 case should be ordered differently.
Instead of:
0.1666 0.1666 0.3333 # because |1-2|/(1+2+3) and |2-3|/(1+2+3) and |1-3|/(1+2+3)
it should be:
0.1666 0.3333 0.1666 # because |1-2|/(1+2+3) and |1-3|/(1+2+3) and |2-3|/(1+2+3)
Here is the awk one-liner:
awk '
NF{
    a=0;
    for(i=1;i<=NF;i++)            # a = sum of all fields in the row
        a+=$i;
    for(j=1;j<=NF;j++)
    {
        for(k=j;k<NF;k++)         # each unordered pair (j, k+1) exactly once
            printf("%s ",-($j-$(k+1))/a)   # ($(k+1)-$j)/a; positive here because each sample row ascends
    }
    print "";
    next;
}1' file
Short version:
awk '
NF{for (i=1;i<=NF;i++) a+=$i;
for (j=1;j<=NF;j++){for (k=j;k<NF;k++) printf("%2.4f ",-($j-$(k+1))/a)}
print "";a=0;next;}1' file
Input File:
[jaypal:~/Temp] cat file
1 2
1 2 3
1 2 3 4
Test:
[jaypal:~/Temp] awk '
NF{
a=0;
for(i=1;i<=NF;i++)
a+=$i;
for(j=1;j<=NF;j++)
{
for(k=j;k<NF;k++)
printf("%s ",-($j-$(k+1))/a)
}
print "";
next;
}1' file
0.333333
0.166667 0.333333 0.166667
0.1 0.2 0.3 0.1 0.2 0.1
Test from shorter version:
[jaypal:~/Temp] awk '
NF{for (i=1;i<=NF;i++) a+=$i;
for (j=1;j<=NF;j++){for (k=j;k<NF;k++) printf("%2.4f ",-($j-$(k+1))/a)}
print "";a=0;next;}1' file
0.3333
0.1667 0.3333 0.1667
0.1000 0.2000 0.3000 0.1000 0.2000 0.1000
@Jaypal just beat me to it! Here's what I had:
awk '{for (x=1;x<=NF;x++) sum += $x; for (i=1;i<=NF;i++) for (j=2;j<=NF;j++) if (i < j) printf ("%.1f ",-($i-$j)/sum)} END {print ""}' file.txt
Output:
0.1 0.2 0.3 0.1 0.2 0.1
It prints to one decimal place.
@Jaypal, is there a quick way to printf an absolute value? Perhaps something like abs(value)?
EDIT:
@Jaypal, yes, I've tried searching too and couldn't find anything simple :-( It seems if ($i < 0) $i = -$i is the way to go. I guess you could use sed to remove any minus signs:
awk '{for (x=1;x<=NF;x++) sum += $x; for (i=1;i<=NF;i++) for (j=2;j<=NF;j++) if (i < j) printf ("%.1f ", ($i-$j)/sum)} {print ""}' file.txt | sed "s%-%%g"
Cheers!
As it looks like homework, I will act accordingly.
To count how many numbers are present in the file, you can use:
cat filename | wc -w
Find the first_number with:
cat filename | cut -d " " -f 1
To find the sum of all numbers in the file:
cat filename | tr " " "+" | bc
Now that you have the total_nos, use something like:
for i in $(seq 1 1 $total_nos)
do
#Find the numerator by first_number - $i
#Use the sum you got from above to get the desired value.
done
