Compare 2nd column of two or more files and print union of all files - Linux

I have four tab-separated files 1.txt, 2.txt, 3.txt, 4.txt, each with the following format:
89 ABI1 0.19
93 ABL1 0.15
94 ABL2 0.07
170 ACSL3 0.21
I want to compare the 2nd column of all files and print the union (based on the 2nd column) into a new file, like the following:
1.txt 2.txt 3.txt 4.txt
ABL2 0.07 0.01 0.11 0.009
AKT1 0.31 0.05 0.05 0.017
AKT2 0.33 0.05 0.01 0.004
How can I do this in awk?
I tried the following, but it only compares the first columns:
awk 'NR==FNR {h[$1] = $0; next} {print $1,h[$1]}' OFS="\t" 2.txt 1.txt
but when I change it to compare the 2nd column, it doesn't work:
awk 'NR==FNR {h[$2] = $0; next} {print $1,h[$2]}' OFS="\t" 2.txt 1.txt
Also, this only works on two files at a time.
Is there any way to do it on four files by comparing the 2nd column in awk?

Using join on sorted input files, and assuming a shell that understands process substitution with <(...). (For every input file I've used a copy of the data that you provided, just adding a line at the top for identification; this is the AAA line.)
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 1.txt 2.txt ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 3.txt 4.txt )
AAA 1 2 3 4
ABI1 0.19 0.19 0.19 0.19
ABL1 0.15 0.15 0.15 0.15
ABL2 0.07 0.07 0.07 0.07
ACSL3 0.21 0.21 0.21 0.21
There are three joins here. The first two to be performed are the ones in <(...). The first of these joins the first two files, while the second joins the last two files. The result of one of these inner joins looks like this:
AAA 1 2
ABI1 0.19 0.19
ABL1 0.15 0.15
ABL2 0.07 0.07
ACSL3 0.21 0.21
The option -o 0,1.3,2.3 means "output the join field along with field 3 from both files". -1 2 -2 2 means "use field 2 of each file as the join field (rather than field 1)".
The outermost join takes the two results and performs the final join that produces the output.
If the input files are not sorted on the join field:
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 1.txt) <(sort -k2,2 2.txt) ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 3.txt) <(sort -k2,2 4.txt) )
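To answer the awk part of the question: below is a minimal sketch in GNU awk (it relies on the gawk-only ARGIND variable and arrays of arrays); "NA" is a placeholder I chose for genes missing from a file, and the final sort merely makes the row order deterministic. A header line, if wanted, would need to be printed separately.
awk -v OFS="\t" '
{ val[$2][ARGIND] = $3 }            # record the 3rd-column value, keyed by gene and file index
END {
    for (g in val) {                # every key seen in any file, i.e. the union
        line = g
        for (i = 1; i <= ARGIND; i++)
            line = line OFS ((i in val[g]) ? val[g][i] : "NA")
        print line
    }
}' 1.txt 2.txt 3.txt 4.txt | sort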

Related

I want to find strings/words in columns 1 and 2 of file1 that match column 1 in file2 and replace them with the column 2 strings/words from file2

I'm still learning to code on the Linux platform. I have searched for problems similar to mine, but the ones I found were either too specific or focused only on changing the entire column 1.
Here are examples of my files:
File 1
abc Gamma 3.44
bcd abc 5.77
abc Alpha 1.99
beta abc 0.88
bcd Alpha 5.66
File 2
Gamma Bacteria
Alpha Bacteria
Beta Bacteria
Output file3
abc Bacteria 3.44
bcd abc 5.77
abc Bacteria 1.99
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried the following awk commands:
$ awk 'FNR==NR{a[$1]=$2;next} {if ($1,$2 in a){$1,$2=a[$1,$2]}; print $0}' file2 file1
$ awk 'NR==FNR {a[FNR]=$0; next} /$1|$2/ {$1 $2=a[FNR]} 1' file2 file1
They gave me:
abc Gamma 3.44
abc 5.77
abc Alpha 1.99
Bacteria abc 0.88
bcd Alpha 5.66
These only change $1 and remove the other text strings in column 1 that are not found in $2 of file2.
And this one:
$ awk -F'\t' -v OFS='\t' 'FNR==1 { next }FNR == NR { file2[$1,$2] = $1 FS $2 } FNR != NR { file1[$1,$2,] = $1 FS $2} END { print "Match:"; for (k in file1) if (k in file1) print file2[k] # Or file1[k]}' file2 file1
It didn't work.
Then I tried sed:
$ sed = file2 | sed -r 'N;s/(.*)\n(.*)/\1s|\&$|\2|/' | sed -f - file1
This gave me an error complaining that sed -e was not called properly.
Then, afterwards, I want to take only the row with the smallest $3 if $1 and $2 (or $2 and $1) are similar:
file 4
bcd abc 5.77
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried this code:
$ awk 'NR == $1&$2 || $3 < min {line = $0; min = $3}END{print line}' file3
$ awk '/^$1/{if(h){print h RS m}min=""; h=$0; next}min=="" || $3 < min{min=$3; m=$0}END{print h RS m}' file3
$ awk -F'\t' '$3 != "NF==min"' OFS='\t' file3
$ awk -v a=NODE '{c=a*$3+(1-a)} !($1 in min) || c<min[$1]{min[$1]=c; minLine[$1]=$0} END{for(k in minLine) print minLine[k]}' file3 | column -t
None of them worked. I tried to research what each line means and changed them to fit my problem, but they all failed.
This might work for you (GNU sed):
sed -E 's#(.*) (.*)#/^\1 /Is/\\S+/\2/;/^\\S+ \1 /Is/\\S+/\2/2#' file2 |
sed -Ef - file1
This generates a sed script from file2, which is then run against file1 to produce the required format.
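For the sample file2, the generated script should look like this; each line rewrites the first field when the key starts the line, or the second field (the s///2 form) when the key is the second field, the I flag making the match case-insensitive so that beta also matches Beta:
/^Gamma /Is/\S+/Bacteria/;/^\S+ Gamma /Is/\S+/Bacteria/2
/^Alpha /Is/\S+/Bacteria/;/^\S+ Alpha /Is/\S+/Bacteria/2
/^Beta /Is/\S+/Bacteria/;/^\S+ Beta /Is/\S+/Bacteria/2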

How to match two text files of different lengths and different columns, with headers, using the join command in Linux

I have two text files of different lengths, A.txt and B.txt.
A.txt looks like:
ID pos val1 val2 val3
1 2 0.8 0.5 0.6
2 4 0.9 0.6 0.8
3 6 1.0 1.2 1.3
4 8 2.5 2.2 3.4
5 10 3.2 3.4 3.8
B.txt looks like:
pos category
2 A
4 B
6 A
8 C
10 B
I want to match the pos column in both files and want the output like this:
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I used the join command: join -1 2 -2 1 <(sort -k2 A.txt) <(sort -k1 B.txt) > C.txt
But C.txt comes out without a header:
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I want to get output with a header from join. Kindly help me out.
Thanks in advance
In case you are OK with awk, could you please try the following. It is written and tested with the shown samples in GNU awk.
awk 'FNR==NR{a[$1]=$2;next} ($2 in a){$2=a[$2] OFS $2} 1' B.txt A.txt | column -t
Explanation: a detailed explanation of the above. Note that the header line needs no special casing here: B.txt's own header makes a["pos"]="category", so the same lookup inserts "category" into A.txt's header line.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when B.txt is being read.
a[$1]=$2 ##Creating array a with index of 1st field and value is 2nd field of current line.
next ##next will skip all further statements from here.
}
($2 in a){ ##Checking condition if 2nd field is present in array a then do following.
$2=a[$2] OFS $2 ##Adding array a value along with 2nd field in 2nd field as per output.
}
1 ##1 will print current line.
' B.txt A.txt | column -t ##Mentioning Input_file names and passing awk program output to column to make it look better.
As you requested... It is perfectly possible to get the desired output using just GNU join:
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$
The key to getting the correct output is using the sort -g option, and specifying the join output column order using the -o option.
To "pretty print" the output, pipe to column -t
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5 | column -t
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$
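If you prefer to keep join's inputs in the lexical order it expects (avoiding any "not in sorted order" complaints), a sketch of an alternative is to print the header yourself and join only the data rows; the header wording here is assumed from the desired output:
{
  printf 'ID category pos val1 val2 val3\n'      # hand-written header (assumed wording)
  join -1 2 -2 1 -o 1.1,2.2,0,1.3,1.4,1.5 \
       <(tail -n +2 A.txt | sort -k2,2) \
       <(tail -n +2 B.txt | sort -k1,1) |
  sort -k1,1n                                    # restore numeric row order by ID
} | column -t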

Compare multiple rows to pick the one with the highest value

I would like to compare the rows by the second column and get the row with the highest value in the subsequent columns, with priority column 3 > 4 > 5. I sorted my dataset on the second column so that rows with the same value are together.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
Here, if the second column of a row has the same value as in another row, then I compare their values in the third column and pick the row with the highest value there.
If the rows have the same second and third columns, then I go to the fourth column, compare their values in this column, and take the row with the highest value.
If the rows sharing the second column still share the values in the third and fourth columns, then I pick the row with the highest value in the fifth column.
If the second, third, fourth, and fifth columns are all the same (complete duplicates), then I print them all, but add 'duplicate' next to their fifth column.
If a row does not share its second-column value with any other row, then there is no comparison and I keep that row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I have tried so far fails, because I can only compare based on the second column, not with conditions over multiple columns, and I cannot keep complete duplicates.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
I would appreciate learning how to fix it.
I'm afraid the code below is not sophisticated, but how about:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
NR==FNR {
vals = $3","$4","$5
if (max[$2] == "") max[$2] = vals
else if (max[$2] == vals) dupe[$2] = 1
next
} {
vals = $3","$4","$5
if (dupe[$2]) $6 = "duplicate"
if (max[$2] == vals) print
}' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
It saves the sorted result in a temporary file "tmp.txt".
The 2nd awk script processes the temporary file in two passes.
In the 1st pass, it extracts the "max value" for each 2nd-column key.
It also detects duplications and sets the variable dupe if any are found.
In the 2nd pass, it assigns the string duplicate to field $6 if the line has the dupe flag.
Then it prints only the line(s) that have the max value for each 2nd-column key.
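For the shown samples, the state after the 1st pass would look roughly like this (the sort has already put the best line first in each group, so the first vals string stored per key is the maximum):
max["A"] = "0.38,24.68,2.93"
max["B"] = "0.22,11.55,20.08"
max["C"] = "0.22,11.55,31.08"
max["D"] = "0.96,21.15,11.24"
max["E"] = "0.42,0.42,0.42"   and dupe["E"] = 1, the only group whose best line repeats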
This may not be the most elegant solution, but it works:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
rec[NR] = $0
idx = sprintf("%s %s %s %s",$2,$3,$4,$5)
irec[NR] = idx
dup[idx]++
}
END{
for(i in rec){
if(dup[irec[i]]> 1){
print rec[i] "duplicate"
}else{
print rec[i]
}
}
}'
rm /tmp/fgrep.$$
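To see why the fgrep step works: for the shown samples, /tmp/fgrep.$$ holds the winning line of each group with its first column cut away (tab-separated; shown here with spaces), and fgrep -f then keeps every line of data.txt that contains one of these patterns, so complete duplicates all survive into the final awk:
A  0.38  24.68  2.93
B  0.22  11.55  20.08
C  0.22  11.55  31.08
D  0.96  21.15  11.24
E  0.42  0.42   0.42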

How to compare two text files on the first column: if it matches, print the same; if not, put zero?

1.txt contains:
1
2
3
4
5
.
.
180
2.txt contains:
3 0.5
4 0.8
9 9.0
120 3.0
179 2.0
I want my output such that if the first column of 1.txt matches 2.txt, it should print the value of the second column from 2.txt, while if there is no match it should print zero.
The output should be like:
1 0.0
2 0.0
3 0.5
4 0.8
5 0.0
.
.
8 0.0
9 9.0
10 0.0
11 0.0
.
.
.
120 3.0
121 0.0
.
.
150 0.0
.
179 2.0
180 0.0
awk 'NR==FNR{a[$1]=$2;next}{if($1 in a){print $1,a[$1]}else{print $1,"0.0"}}' 2.txt 1.txt
Brief explanation:
NR==FNR{a[$1]=$2;next}: record $2 of 2.txt into array a, indexed by $1.
If the $1 of 1.txt exists in array a, print $1 and a[$1]; else print $1 and "0.0".
Could you please try the following and let me know if it helps you.
awk 'FNR==NR{a[$1];next} {for(i=prev+1;i<=($1-1);i++){print i,"0.0"}}{prev=$1;$1=$1;print}' OFS="\t" 1.txt 2.txt
Explanation of code:
awk '
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1.txt is being read.
a[$1]; ##Creating an array a whose index is $1.
next ##next will skip all further statements from here.
}
{
for(i=prev+1;i<=($1-1);i++){ ##Starting a for loop from prev+1 up to one less than the value of the first field.
print i,"0.0"} ##Printing value of variable i and 0.0 here.
}
{
prev=$1; ##Setting $1 value to variable prev here.
$1=$1; ##Re-assigning $1 so the line is rebuilt with OFS, making the output TAB-delimited.
print ##Printing the current line here.
}' OFS="\t" 1.txt 2.txt ##Setting OFS as TAB and mentioning Input_file(s) name here.
Execution of above code:
Input_file(s):
cat 1.txt
1
2
3
4
5
6
7
cat 2.txt
3 0.5
4 0.8
9 9.0
Output will be as follows:
awk 'FNR==NR{a[$1];next} {for(i=prev+1;i<=($1-1);i++){print i,"0.0"}}{prev=$1;$1=$1;print}' OFS="\t" 1.txt 2.txt
1 0.0
2 0.0
3 0.5
4 0.8
5 0.0
6 0.0
7 0.0
8 0.0
9 9.0
This might work for you (GNU sed):
sed -r 's#^(\S+)\s.*#/^\1\\s*$/c&#' file2 | sed -i -f - -e 's/$/ 0.0/' file1
This creates a sed script from file2 such that, if the first field of a line in file1 matches the first field of a line in file2, the matching line is changed to the contents of that line from file2. All other lines are then zeroed, i.e. lines not changed have 0.0 appended.
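For the sample 2.txt (the file2 here), the generated script should look like the following; each c command replaces a bare-number line in file1 with the full line from file2, and the trailing -e 's/$/ 0.0/' then appends 0.0 only to the lines the c commands did not change:
/^3\s*$/c3 0.5
/^4\s*$/c4 0.8
/^9\s*$/c9 9.0
/^120\s*$/c120 3.0
/^179\s*$/c179 2.0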

Select rows in one file based on specific values in the second file (Linux)

I have two files:
One is "total.txt". It has two columns: the first column is natural numbers (indicator) ranging from 1 to 20, the second column contains random numbers.
1 321
1 423
1 2342
1 7542
2 789
2 809
2 5332
2 6762
2 8976
3 42
3 545
... ...
20 432
20 758
The other one is "index.txt". It has three columns (1: indicator, 2: low value, 3: high value):
1 400 5000
2 600 800
11 300 4000
I want to output the rows of "total.txt" whose first column matches the first column of "index.txt". At the same time, the second column of the output rows must be larger than (>) the second column of "index.txt" and smaller than (<) the third column of "index.txt".
The expected result is as follows:
1 423
1 2342
2 809
2 5332
2 6762
11 ...
11 ...
I have tried this:
awk '$1==(awk 'print($1)' index.txt) && $2 > (awk 'print($2)' index.txt) && $1 < (awk 'print($2)' index.txt)' total.txt > result.txt
But it failed!
Can you help me with this? Thank you!
You need to read both files in the same awk script. When you read index.txt, store the other columns in arrays:
awk 'FNR == NR { low[$1] = $2; high[$1] = $3; next }
$2 > low[$1] && $2 < high[$1] { print }' index.txt total.txt
FNR == NR is the common awk idiom to detect when you're processing the first file.
Use join like Barmar said:
# To join on the first columns
join -11 -21 total.txt index.txt
And if the files aren't sorted in lexical order on the first column:
join -11 -21 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt)
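Note that join on its own only merges the matching rows; the range test still has to be applied afterwards. A sketch of one way to finish the job, assuming the merged lines have the form indicator, value, low, high:
join -1 1 -2 1 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt) |
awk '$2 > $3 && $2 < $4 { print $1, $2 }'      # keep values strictly between low and high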
