Remove duplicates and keep the line which contains the max value from one column

Hi everyone!
I'd like to remove duplicates and keep the line with the highest value in one column (the 4th) from a file with 4 fields. I need to do this on a Linux server.
Before
gene subj e-value ident
g1 h1 0.05 75.5
g1 h2 0.03 60.6
g2 h7 0.00 80.5
g2 h9 0.00 50.3
g2 h4 0.03 90.7
g3 h5 0.10 30.5
g3 h8 0.00 76.8
g4 h11 0.00 80.7
After
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
Thank you so much, and I'm sorry if this has been asked before! I couldn't find an answer to my problem.

You can try this if it's not a problem to get the output without the header:
tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1
Explanation:
tail -n +2 file.txt
will remove the header line so it doesn't get involved in the sorting.
sort -k1,1 -k4,4rn
will sort by column 1 first (-k1,1) and then by column 4 numerically and in reverse order (-k4,4rn).
Finally:
sort -uk1,1
will remove duplicates, considering only the first column.
Be aware that -k1,1 means from column 1 to column 1; likewise, -k4,4 means from column 4 to column 4. Adjust these to fit your columns.
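If you also need the header in the output, one way (just a sketch, reusing file.txt from above; result.txt is an assumed output name) is to print it separately before the pipeline in a grouped command:
( head -n 1 file.txt
  tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1 ) > result.txt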

With the GNU datamash tool:
datamash --headers -Wfs -g1 max 4 < file | cut -f1-4
Here --headers handles the header row, -W treats whitespace as the field separator, -f prints the entire input line alongside the result, -s sorts the input, -g1 groups by column 1, and max 4 takes the per-group maximum of column 4; cut -f1-4 then drops the extra max column that datamash appends.
The output:
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7

An awk solution, though I like archimiro's version for its simplicity. Note this one needs GNU awk (gawk 4.0 or later), since it uses arrays of arrays.
awk '
NR > 1 && $1 in arr {              # gene seen before
    if ($4 > arr[$1][4])           # keep the line with the larger 4th field
        split($0, arr[$1])
    next
}
NR > 1 {                           # first occurrence of this gene
    arr[$1][1] = ""                # make arr[$1] a subarray
    split($0, arr[$1])             # store the fields of the whole line
}
END {
    for (i in arr) {
        for (j in arr[i])
            printf "%s\t", arr[i][j]
        print ""
    }
}
' data.file
The result:
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
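For completeness, a more portable sketch that should run in any POSIX awk (no arrays of arrays needed), assuming file.txt is the input from the question:
awk '
NR == 1 { print; next }                 # pass the header through
!($1 in max) || $4 + 0 > max[$1] + 0 {  # new gene, or a better line for it
    max[$1] = $4                        # remember the best 4th-column value
    line[$1] = $0                       # and the whole line that carried it
}
END { for (g in line) print line[g] }   # group order here is unspecified
' file.txt
Pipe the result through sort -k1,1 if you need the groups in order.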

Related

Awk average of column by moving difference of grouping column variable

I have a file that looks like this:
1 snp1 0.0 4
1 snp2 0.2 6
1 snp3 0.3 4
1 snp4 0.4 3
1 snp5 0.5 5
1 snp6 0.6 6
1 snp7 1.3 5
1 snp8 1.3 3
1 snp9 1.9 4
The file is sorted by column 3. I want the average of the 4th column, grouped by column 3 in bins 0.5 units apart. For example, it should output something like this:
1 snp1 0.0 4.4
1 snp6 0.6 6.0
1 snp7 1.3 4.0
1 snp9 1.9 4.0
I can print the first position of each group, without the average, like this:
awk 'NR==1 {pos=$3; print $0} $3>=pos+0.5{pos=$3; print $0}' input
But I am not able to figure out how to print the average of the 4th column. It would be great if someone could help me find a solution to this problem. Thanks!
Something like this, maybe:
awk '
NR==1 {c1=$1; c2=$2; v=$3; n=1; s=$4; next}   # open the first bin
$3 > v+0.5 {                                  # current row starts a new bin:
    print c1, c2, v, s/n                      # print the finished bin average
    c1=$1; c2=$2; v=$3; n=1; s=$4; next       # and reset the accumulators
}
{n += 1; s += $4}                             # accumulate within the bin
END {print c1, c2, v, s/n}                    # flush the last bin
' input
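If you want the averages printed with one decimal, exactly as in the expected output (awk's default formatting prints 6 and 4 rather than 6.0 and 4.0), a printf variant of the same idea:
awk '
NR==1 {c1=$1; c2=$2; v=$3; n=1; s=$4; next}
$3 > v+0.5 {printf "%s %s %s %.1f\n", c1, c2, v, s/n; c1=$1; c2=$2; v=$3; n=1; s=$4; next}
{n += 1; s += $4}
END {printf "%s %s %s %.1f\n", c1, c2, v, s/n}
' input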

Compare column 2 between two files and print the output with all other columns

I would like to compare two files based on column 2 and print all other columns in the output.
File 1:
p1 p2 rg se p
F Fht 0.3 0.01 0.05
F Tom 0.01 0.004 0.34
File 2:
p1 p2 rg se p
M Fht 0.2 0.02 0.06
M Ram 0.03 0.004 0.32
Desired output:
p1 p2 rg se p p1 p2 rg se p
M Fht 0.2 0.02 0.06 F Fht 0.3 0.01 0.05
I figured out how to print the differences, but not the lines the files have in common.
awk 'NR==FNR{++a[$2];next} !($2 in a)' file1 file2
You may use this awk:
awk 'NR == FNR {map[$2] = $0; next} $2 in map {print $0, map[$2]}' f1 f2 | column -t
p1 p2 rg se p p1 p2 rg se p
M Fht 0.2 0.02 0.06 F Fht 0.3 0.01 0.05
I used column -t for tabular output here. (The header line comes along for free, since both headers share p2 in column 2.)
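If you also want to see the lines of the second file that have no partner in the first one, a small variant (just a sketch; the NA padding is an assumption about how you want gaps shown):
awk 'NR == FNR {map[$2] = $0; next}
     {print $0, ($2 in map ? map[$2] : "NA NA NA NA NA")}' f1 f2 | column -t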

Compare multiple rows to pick the one with smallest value

I would like to compare rows on the second column and keep the row with the highest value in the subsequent columns, with priority column 3 > 4 > 5. I sorted my dataset on the second column so that equal values end up together.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
That is: if the second column of a row has the same value as another row's, I compare their values in the third column and pick the row with the highest third column.
If the rows have the same second and third columns, I go to the fourth column, compare their values there, and take the row with the highest value.
If the rows sharing the second column still share the values in the third and fourth columns, I pick the row with the highest value in the fifth column.
If the second, third, fourth, and fifth columns are all the same (complete duplicates), I print them all, but add 'duplicate' next to their fifth column.
If a row does not share its second-column value with any other row, there is no comparison and I keep that row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I have tried so far fails, because it only compares on the second column rather than on multiple columns, and it cannot keep complete duplicates.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
I would appreciate learning how to fix it.
I'm afraid the code below is not very sophisticated, but how about this:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
NR==FNR {
    vals = $3 "," $4 "," $5
    if (max[$2] == "") max[$2] = vals
    else if (max[$2] == vals) dupe[$2] = 1
    next
} {
    vals = $3 "," $4 "," $5
    if (dupe[$2]) $6 = "duplicate"
    if (max[$2] == vals) print
}' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
It saves the sorted result in a temporary file, tmp.txt.
The 2nd awk script then processes that file in two passes.
In the 1st pass, it records the "max value" (the first, i.e. largest, $3,$4,$5 combination) for each 2nd-column group.
It also detects complete duplicates and sets the dupe flag when it finds one.
In the 2nd pass, it assigns the string duplicate to $6 if the line's group has the dupe flag.
Then it prints only the line(s) that carry the max value for their 2nd-column group.
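By the way, if your shell supports process substitution, the temporary file can be avoided at the cost of sorting twice; a sketch of the same two-pass script:
awk -v OFS="\t" '
NR==FNR {
    vals = $3 "," $4 "," $5
    if (max[$2] == "") max[$2] = vals
    else if (max[$2] == vals) dupe[$2] = 1
    next
} {
    vals = $3 "," $4 "," $5
    if (dupe[$2]) $6 = "duplicate"
    if (max[$2] == vals) print
}' <(awk -v OFS="\t" '$1=$1' data.txt | sort -k2,2 -k3nr -k4nr -k5nr) \
   <(awk -v OFS="\t" '$1=$1' data.txt | sort -k2,2 -k3nr -k4nr -k5nr)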
This may not be the most elegant solution, but it works: the first pipeline extracts the winning field 2-5 combinations, fgrep pulls every original row matching one of them (the winners plus their exact duplicates), and the final awk appends "duplicate" to any row whose field 2-5 combination occurs more than once.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
    rec[NR] = $0
    idx = sprintf("%s %s %s %s", $2, $3, $4, $5)
    irec[NR] = idx
    dup[idx]++
}
END {
    for (i in rec) {
        if (dup[irec[i]] > 1) {
            print rec[i] "duplicate"
        } else {
            print rec[i]
        }
    }
}'
rm /tmp/fgrep.$$

Compare 2nd column of two or more files and print union of all files

I have four tab separated files 1.txt, 2.txt, 3.txt, 4.txt. Each having following format
89 ABI1 0.19
93 ABL1 0.15
94 ABL2 0.07
170 ACSL3 0.21
I want to compare the 2nd column across all files and print the union (based on the 2nd column) into a new file, like the following:
1.txt 2.txt 3.txt 4.txt
ABL2 0.07 0.01 0.11 0.009
AKT1 0.31 0.05 0.05 0.017
AKT2 0.33 0.05 0.01 0.004
How is it possible in awk?
I tried the following, but it only compares the first columns:
awk 'NR==FNR {h[$1] = $0; next} {print $1,h[$1]}' OFS="\t" 2.txt 1.txt
but when I change it to compare the 2nd column, it doesn't work:
awk 'NR==FNR {h[$2] = $0; next} {print $1,h[$2]}' OFS="\t" 2.txt 1.txt
Also, this only works on two files at a time.
Is there any way to do it on four files by comparing the 2nd column in awk?
Using join on sorted input files, and assuming a shell that understands process substitution with <(...). (I've used a copy of the data that you provided for every input file, just adding a line at the top of each for identification; this is the AAA line.)
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 1.txt 2.txt ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 3.txt 4.txt )
AAA 1 2 3 4
ABI1 0.19 0.19 0.19 0.19
ABL1 0.15 0.15 0.15 0.15
ABL2 0.07 0.07 0.07 0.07
ACSL3 0.21 0.21 0.21 0.21
There are three joins here. The first two to be performed are the ones in <(...). The first of these joins the first two files, while the second joins the last two files. The result of one of these joins looks like
AAA 1 2
ABI1 0.19 0.19
ABL1 0.15 0.15
ABL2 0.07 0.07
ACSL3 0.21 0.21
The option -o 0,1.3,2.3 means "output the join field along with field 3 from both files". -1 2 -2 2 means "use field 2 of each file as the join field (rather than the default field 1)".
The outermost join takes the two results and performs the final join that produces the output.
If the input files are not sorted on the join field:
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 1.txt) <(sort -k2,2 2.txt) ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 3.txt) <(sort -k2,2 4.txt) )
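An awk take on the same task, for comparison: it reads any number of files in one pass and produces a true union (note that plain join gives the intersection unless you add its -a options). The NA padding for keys missing from a file is my assumption:
awk -v OFS="\t" '
FNR == 1 { nf++ }                # entering the next input file
{ seen[$2]; val[$2, nf] = $3 }   # remember column 3 per key and per file
END {
    for (k in seen) {            # unspecified order; pipe to sort if needed
        line = k
        for (i = 1; i <= nf; i++)
            line = line OFS ((k, i) in val ? val[k, i] : "NA")
        print line
    }
}' 1.txt 2.txt 3.txt 4.txt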

Use part of a column in one file as search term in other file

I have two files. The output file I am searching has earthquake locations and has the following format:
19090212 1323 30.12 36 19.41 103 28.24 7.29 0.00 4 149 25.8 0.02 5.7 9.8 D - 0
19090216 1828 49.61 36 13.27 101 35.38 10.94 0.00 13 54 38.5 0.07 0.3 0.7 B 0
19090711 2114 54.11 35 1.07 99 56.42 7.00 0.00 7 177 18.7 4.00 63.3 53.2 D # 0
I want to use the last 6 digits of the first column (i.e. '090418' out of '19090418') with the first 3 digits of the second column (i.e. '072' out of '0728') as my search term. The file I am searching has the following format:
SC17 P 090212132329.89
X25A P 090212132330.50
AMTX P 090216182814.12
X29A P 090216182813.70
Y28A P 090216182822.36
MSTX P 090216182826.80
Y27A P 090216182831.43
After I search the second file for a term, I need to figure out how many lines are in that section. For this example, searching the second file above, I'd want to learn that there are 2 lines for 090212132 and 5 lines for 090216182.
This is my first post, so please let me know how I can improve clarity or conciseness in my posts. Thanks for the help!
awk to the rescue!
$ awk 'NR==FNR {a[substr($1,3) substr($2,1,3)]; next}
       {k = substr($3,1,9)}
       k in a {a[k]++}
       END {for (k in a) if (a[k] > 0) print k, a[k]}' file1 file2
With your sample input files, the output is:
090212132 2
090216182 5
The answer karakfa suggested worked! My output looks like this:
100224194 7
100117172 18
091004005 11
090520220 10
090526143 21
090122033 20
Thanks for the help!
karakfa's answer, with explanation:
awk 'NR==FNR {                    # first file
         $1 = substr($1, 3)       # keep the last 6 characters of the 1st col
         $2 = substr($2, 1, 3)    # keep the first 3 characters of the 2nd col
         a[$1 $2]                 # add the combined key to an array
         next                     # move to the next record of the first file
     }
     # second file from here on
     {k = substr($3, 1, 9)}       # first 9 characters of the 3rd col
     k in a {a[k]++}              # if the key is in a, increment its count
     END {
         for (k in a)             # iterate over the array
             if (a[k] > 0)        # if the pattern was matched
                 print k, a[k]    # print the pattern and its count
     }' file1 file2
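And if you only need to spot-check a single key, a one-off along the same lines (the key and file name here are placeholders for your own values):
awk -v key=090216182 'substr($3, 1, 9) == key { n++ } END { print key, n + 0 }' file2
With the sample second file above, this prints 090216182 5.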
