Compare column 2 between two files and print the output with all other columns - linux

I would like to compare two files based on column 2 and print all other columns in the output.
File 1:
p1 p2 rg se p
F Fht 0.3 0.01 0.05
F Tom 0.01 0.004 0.34
File 2:
p1 p2 rg se p
M Fht 0.2 0.02 0.06
M Ram 0.03 0.004 0.32
Desired output:
p1 p2 rg se p p1 p2 rg se p
M Fht 0.2 0.02 0.06 F Fht 0.3 0.01 0.05
I figured out how to print the rows that differ, but not the ones that share column 2.
awk 'NR==FNR{++a[$2];next} !($2 in a)' file1 file2

You may use this awk:
awk 'NR == FNR {map[$2] = $0; next} $2 in map {print $0, map[$2]}' f1 f2 | column -t
p1 p2 rg se p p1 p2 rg se p
M Fht 0.2 0.02 0.06 F Fht 0.3 0.01 0.05
I used column -t for tabular output here.
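If you would rather not repeat the join column in the output, a small variant of the same idea (a sketch; same f1 and f2 as above) blanks field 2 before storing each line of the first file, so the key is printed only once:
awk 'NR == FNR {key = $2; $2 = ""; map[key] = $0; next} $2 in map {print $0, map[$2]}' f1 f2 | column -t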

Related

Awk average of column by moving difference of grouping column variable

I have a file that looks like this:
1 snp1 0.0 4
1 snp2 0.2 6
1 snp3 0.3 4
1 snp4 0.4 3
1 snp5 0.5 5
1 snp6 0.6 6
1 snp7 1.3 5
1 snp8 1.3 3
1 snp9 1.9 4
The file is sorted by column 3. I want the average of the 4th column, grouped by column 3 in bins 0.5 units apart: a new group starts whenever column 3 exceeds the start of the current group by more than 0.5. For example, the output should look like this:
1 snp1 0.0 4.4
1 snp6 0.6 6.0
1 snp7 1.3 4.0
1 snp9 1.9 4.0
I can print all positions without the average like this:
awk 'NR==1 {pos=$3; print $0} $3>=pos+0.5{pos=$3; print $0}' input
But I am not able to figure out how to print the average of the 4th column. It would be great if someone could help me find a solution to this problem. Thanks!
Something like this, maybe:
awk '
NR==1 {c1=$1; c2=$2; v=$3; n=1; s=$4; next}
$3>v+0.5 {print c1, c2, v, s/n; c1=$1; c2=$2; v=$3; n=1; s=$4; next}
{n+=1; s+=$4}
END {print c1, c2, v, s/n}
' input
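If you want the averages printed with one decimal place, as in the desired output (6.0 rather than 6), the same logic works with printf; a lightly commented sketch:
awk '
NR==1 {c1=$1; c2=$2; v=$3; n=1; s=$4; next}              # first line opens the first group
$3>v+0.5 {printf "%s %s %s %.1f\n", c1, c2, v, s/n;      # group finished: print its average
          c1=$1; c2=$2; v=$3; n=1; s=$4; next}           # start a new group at this line
{n+=1; s+=$4}                                            # still inside the group: accumulate
END {printf "%s %s %s %.1f\n", c1, c2, v, s/n}           # flush the last group
' input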

Compare multiple rows to pick the one with smallest value

I would like to compare rows that share the same value in the second column, and keep the row with the highest value in the subsequent columns, with priority column 3 > 4 > 5. I sorted my dataset on the second column so the same values are together.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
Here, if the second column has the same value as in another row, I compare their values in the third column and pick the row with the highest value in the third column.
If the rows have the same second and third columns, then I go to the fourth column, compare their values there, and keep the row with the highest value.
If the rows sharing the second column still share the values in the third and fourth columns, then I pick the row with the highest value in the fifth column.
If the second, third, fourth and fifth columns are all the same (complete duplicates), then I print them all, but add 'duplicate' next to their fifth column.
If a row does not share its value in the second column with any other row, then there is no comparison and I keep that row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I have tried so far fails, because it only compares on the second column rather than conditioning on multiple columns, and it cannot keep complete duplicates.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
I would appreciate learning how to fix it.
I'm afraid the code below is not sophisticated, but how about this:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
NR==FNR {
vals = $3","$4","$5
if (max[$2] == "") max[$2] = vals
else if (max[$2] == vals) dupe[$2] = 1
next
} {
vals = $3","$4","$5
if (dupe[$2]) $6 = "duplicate"
if (max[$2] == vals) print
}' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
It saves the sorted result in a temporary file "tmp.txt".
The 2nd awk script processes the temporary file in two passes.
In the 1st pass, it extracts the "max value" (columns 3-5 combined) for each value of the 2nd column.
It also detects duplications and sets the dupe flag when found.
In the 2nd pass, it assigns the string duplicate to field $6
if the line has the dupe flag.
Then it prints only the line(s) which have the max value for each 2nd column.
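If you would rather not create a temporary file and bash is available, the same two passes can read the sorted data through process substitution. A sketch (sorted_data is just a helper name introduced here; the awk body is unchanged, though the sort pipeline now runs twice):
sorted_data() { awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr; }
awk -v OFS="\t" '
NR==FNR {
vals = $3","$4","$5
if (max[$2] == "") max[$2] = vals
else if (max[$2] == vals) dupe[$2] = 1
next
} {
vals = $3","$4","$5
if (dupe[$2]) $6 = "duplicate"
if (max[$2] == vals) print
}' <(sorted_data) <(sorted_data)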
This may not be the most elegant solution, but it works:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
rec[NR] = $0
idx = sprintf("%s %s %s %s",$2,$3,$4,$5)
irec[NR] = idx
dup[idx]++
}
END{
for(i in rec){
if(dup[irec[i]]> 1){
print rec[i] "duplicate"
}else{
print rec[i]
}
}
}'
rm /tmp/fgrep.$$
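One caveat about the END block above: for(i in rec) does not guarantee the original input order in awk. If the order matters, here is a sketch of the same second command walking the stored records by record number instead:
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
rec[NR] = $0
idx = sprintf("%s %s %s %s",$2,$3,$4,$5)
irec[NR] = idx
dup[idx]++
}
END{
for(i=1; i<=NR; i++){
if(dup[irec[i]] > 1){
print rec[i] "duplicate"
}else{
print rec[i]
}
}
}'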

Remove duplicates and keep line which contains max value from one column - LINUX

Hi everyone!
I'd like to remove duplicates and keep the lines with the highest value in one column (the 4th column) in a file with 4 fields. I must do this on a Linux server.
Before
gene subj e-value ident
g1 h1 0.05 75.5
g1 h2 0.03 60.6
g2 h7 0.00 80.5
g2 h9 0.00 50.3
g2 h4 0.03 90.7
g3 h5 0.10 30.5
g3 h8 0.00 76.8
g4 h11 0.00 80.7
After
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
Thank you so much, and I'm sorry if I'm asking something that has already been asked! But I didn't find an answer to my problem.
You can try this, if it is no problem to get the output without the header:
tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1
Explanation:
tail -n +2 file.txt
will remove the headers so they don't get involved in all the sorting.
sort -k1,1 -k4,4rn
will sort by column 1 first (-k1,1) and then by column 4 numerically and in reverse order (-k4,4rn)
Finally:
sort -uk1,1
will remove duplicates, taking into account just the first column.
Be aware that -k1,1 means from column one to column one, hence -k4,4 is from column 4 to column 4. Adjust to fit your columns.
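If you also want to keep the header line (as in the desired "After" output), one simple sketch is to print it separately before the pipeline:
head -n 1 file.txt
tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1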
With GNU datamash tool:
datamash --headers -Wfs -g1 max 4 < file | cut -f1-4
The output:
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
An awk solution, though I like archimiro's version for its simplicity.
awk '
NR>1 && $1 in arr {
if ($4 > arr[$1][4])
split($0, arr[$1])
next
}
NR>1 {
arr[$1][1] = ""
split($0, arr[$1])
}
END {
for(i in arr) {
for(j in arr[i])
printf "%s\t", arr[i][j]
print ""
}
}
' data.file
The result:
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
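Note that the arr[$1][4] and split($0, arr[$1]) constructs need GNU awk 4.0 or newer (true arrays of arrays). A rough, more portable sketch that keeps the whole winning line per gene in plain arrays (output order is likewise not guaranteed):
awk '
NR > 1 {
if (!($1 in val) || $4+0 > val[$1]+0) {   # first line for this gene, or a higher ident
best[$1] = $0
val[$1] = $4
}
}
END {
for (g in best) print best[g]
}' data.file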

awk for comparing, selecting and processing columns

I have a list list.txt
1 10691 0.12 54 + 1 10692 0.13 55 -
2 10720 0.23 -1 + 2 10721 0.13 43 -
3 10832 0.43 123 + 3 10833 0.13 88 -
4 11032 0.22 -1 + 4 11033 0.13 -1 -
5 11248 0.12 45 + 5 11249 0.13 -1 -
6 15214 0.88 33 + 6 15215 0.13 45 -
I wish to extract data from columns 3 ($3) and 8 ($8) using a few rules:
Compare columns 4 ($4) and 9 ($9):
i) If both are negative, output "-1".
ii) If $4 > 0 and $9 < 0, output $3; if $4 < 0 and $9 > 0, output $8.
iii) If both $4 and $9 > 0, output $3+$8.
So I tried something like this:
awk '{a[$4]; b[$9]}
END{
for (x in a) {
for (y in b) {
if (x >0 && y >0) {
print $3+$8
}
else if (x >0 && y <=0) {
print $3;
}
else if (x <= 0 && y >0) {
print $8;
}
else if (x <=0 && y <=0) {
print "-1";
}
}
}
}' list.txt
Somehow this script doesn't give the correct number of lines (should be equal to list.txt) or the right data :(
Using list.txt one should get
0.25
0.13
0.56
-1
0.12
1.01
By using the nested for loops, you are comparing all the values of column 4 with all the values of column 9 instead of comparing the values on corresponding rows (and since the printing happens in the END block, $3 and $8 there refer only to the last line read).
Working with each line as it is read is probably more what you want:
awk '{
x=$4; y=$9;
if (x >0 && y >0) {
print $3+$8
}
else if (x >0 && y <=0) {
print $3;
}
else if (x <= 0 && y >0) {
print $8;
}
else if (x <=0 && y <=0) {
print "-1";
}
}' list.txt
Although there is an accepted answer, I don't think it's idiomatic awk. You definitely can get rid of the if/else blocks, and should.
$ awk '{x=$4>0;y=$9>0} x&&y{w=$3+$8} x&&!y{w=$3} !x&&y{w=$8} !x&&!y{w=-1} {print w}' xy
0.25
0.13
0.56
-1
0.12
1.01
better yet, resetting w on every line so a value cannot carry over from the previous record:
$ awk '{x=$4>0; y=$9>0; w=x?$3:0} y{w+=$8} !x&&!y{w=-1} {print w}' xy

Biggest and smallest of all lines

I have an output like this:
3.69
0.25
0.80
1.78
3.04
1.99
0.71
0.50
0.94
I want to find the biggest number and the smallest number in the above output
I need output like
smallest is 0.25 and biggest as 3.69
Just sort your input numerically and print the first and last values. One method:
$ sort -n file | awk 'NR==1{min=$1}END{print "Smallest",min,"Biggest",$0}'
Smallest 0.25 Biggest 3.69
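If you prefer a single awk pass with no sort at all, a sketch (assuming one number per line, as in the question, in a file named file):
awk 'NR==1 {min=max=$1}                  # first value seeds both
$1+0 < min {min=$1}                      # new smallest
$1+0 > max {max=$1}                      # new biggest
END {print "smallest is " min " and biggest as " max}' file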
Hope this helps.
OUTPUT="3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94"
SORTED=`echo $OUTPUT | tr ' ' '\n' | sort -n`
SMALLEST=`echo "$SORTED" | head -n 1`
BIGGEST=`echo "$SORTED" | tail -n 1`
echo "Smallest is $SMALLEST"
echo "Biggest is $BIGGEST"
Added the OP's requested awk one-liner.
I'm not good at awk, but this works anyway. :)
echo "3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94" | awk '{
for (i=1; i<=NF; i++) {
if (length(s) == 0) s = $i;
if (length(b) == 0) b = $i;
if ($i < s) s = $i;
if (b < $i) b = $i;
}
print "Smallest is", s;
print "Biggest is", b;
}'
You want an awk solution?
echo "3.69 0.25 0.80 1.78 3.04 1.99 0.71 0.50 0.94" | \
awk -v RS=' ' '/.+/ { biggest = ((biggest == "") || ($1 > biggest)) ? $1 : biggest;
smallest = ((smallest == "") || ($1 < smallest)) ? $1:smallest}
END { print biggest, smallest}'
This produces the following output:
3.69 0.25
You can also use this method:
sort -n file | echo -e `sed -nr '1{s/(.*)/smallest is :\1/gp};${s/(.*)/biggest no is :\1/gp}'`
TXR solution:
$ txr -e '(let ((nums [mapcar tofloat (gun (get-line))]))
(if nums
(pprinl `smallest is #(find-min nums) and biggest is #(find-max nums)`)
(pprinl "empty input")))'
0.1
-1.0
3.5
2.4
smallest is -1.0 and biggest is 3.5
