Compare multiple rows to pick the one with the highest value - linux

I would like to compare rows on the second column and, among rows that share it, get the row with the highest values in the following columns, with priority column 3 > 4 > 5. I sorted my dataset on the second column so that rows with the same value are adjacent.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
In here, I would like to say: if the second column of a row has the same value as that of another row, then I compare their values in the third column and pick the row with the highest value there.
If the rows also share the same third column, then I go to the fourth column, compare their values in this column, and keep the row with the highest value.
If the rows sharing the second column still share the values in the third and fourth columns, then I pick the row with the highest value in the fifth column.
If the second, third, fourth, and fifth columns are all the same (complete duplicates), then I print them all, but add 'duplicate' next to their fifth column.
If a row does not share its value in the second column with any other row, then there is no comparison and I keep that row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I tried so far fails, because it only compares on the second column without the multi-column tie-breaking, and it cannot keep complete duplicates.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
I would appreciate learning how to fix it.

I'm afraid the code below is not sophisticated, but how about this:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
NR==FNR {                                # 1st pass: collect per-group maxima
    vals = $3 "," $4 "," $5
    if (max[$2] == "") max[$2] = vals    # first row of a group is its maximum (input is sorted)
    else if (max[$2] == vals) dupe[$2] = 1
    next
}
{                                        # 2nd pass: print the winners
    vals = $3 "," $4 "," $5
    if (dupe[$2]) $6 = "duplicate"
    if (max[$2] == vals) print
}' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
It saves the sorted result in a temporary file tmp.txt.
The 2nd awk script processes the temporary file in two passes.
In the 1st pass, it extracts the "max value" (columns 3-5 of the first, i.e. largest, row) for each value of the 2nd column.
It also detects complete duplications and sets the dupe flag if it finds any.
In the 2nd pass, it assigns the string duplicate to the 6th field $6
if the line's group carries the dupe flag.
Then it prints only the line(s) which hold the max value for their 2nd-column group.
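As a side note, here is a single-pass sketch that buffers each sorted group in memory instead of re-reading a temporary file (my own variant, not part of the answer above; it assumes, like your spec, that rows tying on all of columns 3-5 within a group are complete duplicates):
awk -v OFS="\t" '$1=$1' data.txt | sort -k2,2 -k3nr -k4nr -k5nr |
awk '
function flush(   i) {                        # print the buffered group; tag complete duplicates
    for (i = 1; i <= n; i++)
        print buf[i] (n > 1 ? "duplicate" : "")
}
{ vals = $3 "," $4 "," $5 }
$2 != key { flush(); key = $2; best = vals; n = 0 }   # new group: its first row holds the maximum
vals == best { buf[++n] = $0 }                # keep every row tied with the group maximum
END { flush() }'
Because the stream is sorted descending on columns 3-5 within each group, only rows equal to the group's first row get buffered and printed.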

This may not be the most elegant solution, but it works:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
    rec[NR] = $0                                  # keep the full record
    idx = sprintf("%s %s %s %s",$2,$3,$4,$5)      # key built from columns 2-5
    irec[NR] = idx
    dup[idx]++                                    # count the records per key
}
END{
    for(i in rec){
        if(dup[irec[i]] > 1){
            print rec[i] "duplicate"              # complete duplicates get tagged
        }else{
            print rec[i]
        }
    }
}'
rm /tmp/fgrep.$$

Related

Compare column 2 between two files and print the output with all other columns

I would like to compare two files based on column 2 and print all other columns in the output.
File 1:
p1 p2 rg se p
F Fht 0.3 0.01 0.05
F Tom 0.01 0.004 0.34
File 2:
p1 p2 rg se p
M Fht 0.2 0.02 0.06
M Ram 0.03 0.004 0.32
Desired output:
p1 p2 rg se p p1 p2 rg se p
M Fht 0.2 0.02 0.06 F Fht 0.3 0.01 0.05
I figured out how to print the differences, but not the common rows:
awk 'NR==FNR{++a[$2];next} !($2 in a)' file1 file2
You may use this awk:
awk 'NR == FNR {map[$2] = $0; next} $2 in map {print $0, map[$2]}' f1 f2 | column -t
p1 p2 rg se p p1 p2 rg se p
M Fht 0.2 0.02 0.06 F Fht 0.3 0.01 0.05
I used column -t for tabular output here.
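If you also want to keep the f2 rows that have no match in f1, a small variant of the same idiom pads them instead of dropping them (a sketch; the NA placeholder is my own choice, not part of the question):
awk 'NR == FNR { map[$2] = $0; next }
     { print $0, ($2 in map ? map[$2] : "NA") }' f1 f2 | column -t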

Convert column to matrix format using awk

I have a gridded data file in column format as:
ifile.txt
x y value
20.5 20.5 -4.1
21.5 20.5 -6.2
22.5 20.5 0.0
20.5 21.5 1.2
21.5 21.5 4.3
22.5 21.5 6.0
20.5 22.5 7.0
21.5 22.5 10.4
22.5 22.5 16.7
I would like to convert it to matrix format as:
ofile.txt
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
Here the top row 20.5 21.5 22.5 holds the y values, the side column holds the x values, and the inner cells hold the corresponding grid values.
I found a similar question here, Convert a 3 column file to matrix format, but the script is not working in my case.
The script is
awk '{ h[$1,$2] = h[$2,$1] = $3 }
END {
    for(i=1; i<=$1; i++) {
        for(j=1; j<=$2; j++)
            printf h[i,j] OFS
        printf "\n"
    }
}' ifile
The following awk script handles:
any size of matrix;
no assumed relation between row and column indices, so it keeps track of them separately;
missing entries: if a certain row/column index does not appear, the value defaults to zero.
This is done in the following way:
awk '
BEGIN { PROCINFO["sorted_in"] = "#ind_num_asc" }
(NR==1) { next }
{ row[$1]=1; col[$2]=1; val[$1" "$2]=$3 }
END {
    printf "%8s", ""; for (j in col) { printf "%8.3f", j }; printf "\n"
    for (i in row) {
        printf "%8.3f", i; for (j in col) { printf "%8.3f", val[i" "j] }; printf "\n"
    }
}' <file>
How does it work:
PROCINFO["sorted_in"] = "#ind_num_asc" states that all arrays are traversed in numerically ascending index order.
(NR==1){next} skips the first line (the header).
{row[$1]=1;col[$2]=1;val[$1" "$2]=$3} processes each line by storing the row index, the column index, and the accompanying value.
The END statement does all the printing.
This outputs:
20.500 21.500 22.500
20.500 -4.100 1.200 7.000
21.500 -6.200 4.300 10.400
22.500 0.000 6.000 16.700
Note: the usage of PROCINFO is a gawk (GNU awk) feature; with another awk the arrays are traversed in arbitrary order.
However, if you make a couple of assumptions, you can do it much shorter:
the file contains all possible entries, no missing values;
you do not want the indices of the rows and columns printed out;
the indices are sorted in column-major order.
Then you can use the following short versions:
sort -g <file> | awk '($1+0!=$1){next}
($1!=o)&&(NR!=1){printf "\n"}
{printf "%8.3f",$3; o=$1 }'
which outputs
-4.100 1.200 7.000
-6.200 4.300 10.400
0.000 6.000 16.700
or, for the transpose:
awk '(NR==1){next}
($2!=o)&&(NR!=2){printf "\n"}
{printf "%8.3f",$3; o=$2 }' <file>
This outputs
-4.100 -6.200 0.000
1.200 4.300 6.000
7.000 10.400 16.700
Adjusted my old GNU awk solution for your current input data:
matrixize.awk script:
#!/bin/awk -f
BEGIN { PROCINFO["sorted_in"]="#ind_num_asc"; OFS="\t" }
NR==1{ next }
{
b[$1]; # accumulating unique indices
($1 != $2)? a[$1][$2] = $3 : a[$2][$1] = $3; # set `diagonal` relation between different indices
}
END {
h = "";
for (i in b) {
h = h OFS i # form header columns
}
print h; # print header column values
for (i in b) {
row = i; # index column
# iterating through the row values (for each intersection point)
for (j in a[i]) {
row = row OFS a[i][j]
}
print row
}
}
Usage:
awk -f matrixize.awk yourfile
The output:
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
Perl solution:
#!/usr/bin/perl -an
$h{ $F[0] }{ $F[1] } = $F[2] unless 1 == $.;
END {
    @s = sort { $a <=> $b } keys %h;
    print ' ' x 5;
    printf '%5.1f' x @s, @s;
    print "\n";
    for my $u (@s) {
        print "$u ";
        printf '%5.1f', $h{$u}{$_} for @s;
        print "\n";
    }
}
-n reads the input line by line
-a splits each line on whitespace into the @F array
See sort, print, printf, and keys in the Perl documentation.
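A minimal way to run it, assuming it is saved as matrix.pl (a hypothetical file name; perl also honors the switches on the #! line when the script is invoked explicitly):
chmod +x matrix.pl
./matrix.pl ifile.txt    # or: perl matrix.pl ifile.txt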
awk solution:
sort -n ifile.txt | awk 'BEGIN{header="\t"}NR>1{if((NR-1)%3==1){header=header sprintf("%4.1f\t",$1); matrix=matrix sprintf("%4.1f\t",$1)}matrix= matrix sprintf("%4.1f\t",$3); if((NR-1)%3==0 && NR!=10)matrix=matrix "\n"}END{print header; print matrix}';
20.5 21.5 22.5
20.5 -4.1 1.2 7.0
21.5 -6.2 4.3 10.4
22.5 0.0 6.0 16.7
Explanations:
sort -n ifile.txt sorts the file numerically.
The header variable stores all the data necessary to create the header line; it is initialized to header="\t" and appended with the necessary information via header=header sprintf("%4.1f\t",$1) for the lines satisfying (NR-1)%3==1.
The matrix is constructed the same way in the matrix variable: matrix=matrix sprintf("%4.1f\t",$1) creates the first column, matrix=matrix sprintf("%4.1f\t",$3) populates the matrix with the grid values, and if((NR-1)%3==0 && NR!=10) matrix=matrix "\n" adds the line break at the end of each matrix row except the last.
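For readability, here is the same one-liner in expanded form (unchanged logic; note that the %3 tests and the NR!=10 guard hard-code the 3x3 grid of this example):
sort -n ifile.txt | awk '
BEGIN { header = "\t" }
NR > 1 {
    if ((NR-1) % 3 == 1) {                    # first cell of a new matrix row
        header = header sprintf("%4.1f\t", $1)
        matrix = matrix sprintf("%4.1f\t", $1)
    }
    matrix = matrix sprintf("%4.1f\t", $3)    # append the grid value
    if ((NR-1) % 3 == 0 && NR != 10)          # end of a matrix row, except the last
        matrix = matrix "\n"
}
END { print header; print matrix }'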

Compare 2nd column of two or more files and print union of all files

I have four tab-separated files 1.txt, 2.txt, 3.txt, 4.txt, each with the following format:
89 ABI1 0.19
93 ABL1 0.15
94 ABL2 0.07
170 ACSL3 0.21
I want to compare the 2nd column of all files and print their union (keyed on the 2nd column) into a new file, like the following:
1.txt 2.txt 3.txt 4.txt
ABL2 0.07 0.01 0.11 0.009
AKT1 0.31 0.05 0.05 0.017
AKT2 0.33 0.05 0.01 0.004
How is it possible in awk?
I tried the following, but it only compares the first columns:
awk 'NR==FNR {h[$1] = $0; next} {print $1,h[$1]}' OFS="\t" 2.txt 1.txt
When I change it to compare the 2nd column, it doesn't work:
awk 'NR==FNR {h[$2] = $0; next} {print $1,h[$2]}' OFS="\t" 2.txt 1.txt
Also, this only works on two files at a time.
Is there any way to do it on four files by comparing the 2nd column in awk?
This uses join on sorted input files and assumes a shell that understands process substitutions with <(...). (I've used a copy of the data that you provided for every input file, just adding a line at the top of each for identification; this is the AAA line.)
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 1.txt 2.txt ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 3.txt 4.txt )
AAA 1 2 3 4
ABI1 0.19 0.19 0.19 0.19
ABL1 0.15 0.15 0.15 0.15
ABL2 0.07 0.07 0.07 0.07
ACSL3 0.21 0.21 0.21 0.21
There are three joins here. The first two to be performed are the ones in <(...). The first of these joins the first two files, while the second joins the last two files. The result of one of these joins looks like
AAA 1 2
ABI1 0.19 0.19
ABL1 0.15 0.15
ABL2 0.07 0.07
ACSL3 0.21 0.21
The option -o 0,1.3,2.3 means "output the join field along with field 3 from both files". -1 2 -2 2 means "use field 2 of each file as the join field (rather than field 1)".
The outermost join takes the two results and performs the final join that produces the output.
If the input files are not sorted on the join field:
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 1.txt) <(sort -k2,2 2.txt) ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 3.txt) <(sort -k2,2 4.txt) )
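Note that join by itself produces the intersection of the keys. For a true union (keys present in any of the files), a gawk sketch along these lines may help (not part of the join answer above; ARGIND and arrays of arrays are gawk extensions, and the NA placeholder for missing values is my own choice):
gawk -v OFS='\t' '
{ val[$2][ARGIND] = $3 }                     # per key, remember the value from file number ARGIND
END {
    for (k in val) {
        line = k
        for (i = 1; i < ARGC; i++)           # one column per input file
            line = line OFS ((i in val[k]) ? val[k][i] : "NA")
        print line
    }
}' 1.txt 2.txt 3.txt 4.txt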

Remove duplicates and keep line which contains max value from one column - LINUX

Hi everyone!
I'd like to remove duplicates and keep the lines with the highest value in one column (the 4th) in a file with 4 fields. I have to do this on a Linux server.
Before
gene subj e-value ident
g1 h1 0.05 75.5
g1 h2 0.03 60.6
g2 h7 0.00 80.5
g2 h9 0.00 50.3
g2 h4 0.03 90.7
g3 h5 0.10 30.5
g3 h8 0.00 76.8
g4 h11 0.00 80.7
After
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
Thank you so much, and I'm sorry if this has been asked before, but I didn't find an answer to my problem.
You can try this, if it is no problem to get the output without the header:
tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1
Explanation:
tail -n +2 file.txt
will remove the headers so they don't get involved in all the sorting.
sort -k1,1 -k4,4rn
will sort by column 1 first (-k1,1) and then by column 4 numerically and in reverse order (-k4,4rn).
Finally:
sort -uk1,1
will remove duplicates, taking into account just the first column.
Be aware that -k1,1 means from column one to column one, hence -k4,4 is from column 4 to column 4. Adjust these to fit your columns.
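If you prefer a single pass without sorting, an awk sketch that also keeps the header could look like this (an illustration of the same idea, not part of the answer above; the groups come out in arbitrary order):
awk 'NR == 1 { print; next }                               # keep the header
     !($1 in max) || $4 > max[$1] { max[$1] = $4; line[$1] = $0 }
     END { for (g in max) print line[g] }' file.txt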
With the GNU datamash tool (--headers handles the header line, -W splits fields on whitespace, -f prints the full input line along with the result, -s sorts the input, -g1 groups by column 1, and the cut trims the extra max column that datamash appends):
datamash --headers -Wfs -g1 max 4 < file | cut -f1-4
The output:
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
An awk solution too, though I like archimiro's version for its simplicity:
awk '
NR>1 && $1 in arr {                 # gene seen before: keep the row with the larger 4th column
    if ($4 > arr[$1][4])
        split($0, arr[$1])
    next
}
NR>1 {                              # first occurrence of this gene
    arr[$1][1] = ""                 # make arr[$1] a subarray (gawk arrays of arrays)
    split($0, arr[$1])
}
END {
    for(i in arr) {
        for(j in arr[i])
            printf arr[i][j] "\t"
        print ""
    }
}
' data.file
The result:
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7

Find the average of multiple columns for each distinct variable in column 1

Hi, I have a file with 6 columns and I wish to know the average of three of these (columns 2, 3, 4) and the sum of the last two (columns 5 and 6) for each unique variable in column one.
A1234 0.526 0.123 0.456 0.986 1.123
A1234 0.423 0.256 0.397 0.876 0.999
A1234 0.645 0.321 0.402 0.903 1.101
A1234 0.555 0.155 0.406 0.888 1.009
B5678 0.111 0.345 0.285 0.888 0.789
B5678 0.221 0.215 0.305 0.768 0.987
B5678 0.336 0.289 0.320 0.789 0.921
I have come across code that will get the average of column 2 grouped by column one, but is there any way I can expand this across columns? Thanks.
awk '{a[$1]+=$2; c[$1]++} END{for (i in a) printf "%s%s%.2f\n", i, OFS, a[i]/c[i]}'
I would like the output to be in the following format; each variable in column one can also have a different number of rows:
A1234 0.53725 0.21375 0.41525 3.653 4.232
B5678 0.22233 0.283 0.30333 2.445 2.697
awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5;e[$1]+=$6;f[$1]++} END{for (i in a) print i,a[i]/f[i],b[i]/f[i],c[i]/f[i],d[i],e[i]}' file
Output:
B5678 0.222667 0.283 0.303333 2.445 2.697
A1234 0.53725 0.21375 0.41525 3.653 4.232
Try the following and let me know if this helps you.
awk '{A[$1]=A[$1]?A[$1]+$5+$6:$5+$6;C[$1]=C[$1]?C[$1]+$2+$3+$4:$2+$3+$4;B[$1]++} END{for(i in A){print "Avg. for " i" =\t",C[i]/(B[i]*3) RS "Count for " i" =\t",A[i]}}' Input_file
EDIT: Also adding a non-one-liner form of the solution now.
awk '{
    A[$1]=A[$1]?A[$1]+$5+$6:$5+$6;
    C[$1]=C[$1]?C[$1]+$2+$3+$4:$2+$3+$4;
    B[$1]++
}
END{
    for(i in A){
        print "Avg. for " i" =\t",C[i]/(B[i]*3) RS "Count for " i" =\t",A[i]
    }
}
' Input_file
awk solution:
awk '{ a[$1]++; avg[$1]+=$2+$3+$4; sum[$1]+=$5+$6 }
END{ for(i in a) printf "%s%.2f%s%.2f\n",i OFS,avg[i]/(a[i]*3),OFS,sum[i] }' file
The output (the 2nd column is the average value, the 3rd column the sum value):
B5678 0.27 5.14
A1234 0.39 7.88
To calculate the average of columns 2, 3, and 4 over the whole file:
awk '{ sum += $2 + $3 + $4 } END { print sum / (NR * 3) }'
To calculate the sum of columns 5 and 6 grouped by column 1:
awk '{ arr[$1] += $5 + $6 } END { for (a in arr) if (a) print a, arr[a] }'
To calculate the sum of columns 5 and 6 of the last row:
tail -n 1 file | awk '{sum += $5 + $6} END {print sum}'
