Finding matches in 2 files and printing the field above the match - linux

File1:
2987571 2988014
4663633 4668876
4669084 4669827
4669873 4670130
4670212 4670604
4670604 4672469
4672502 4672621
4672723 4673088
4673102 4673518
4673521 4673895
4679698 4680174
5756724 5757680
5757937 5758506
5758855 5759202
5759940 5771528
5772524 5773063
5773005 5773106
5773063 5773452
5773486 5773776
5773836 5774189
File2:
gene complement(6864294..6865061)
/locus_tag="HCH_06747"
CDS complement(6864294..6865061)
/locus_tag="HCH_06747"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33372.1"
/translation="MIKQLVRPLFTGKGPNFSELSAKECGVGEYQLRYKLPGNTIHIG
MPDAPVPARVNLNADLFDSYGPKKLYNRTFVQMEFEKWAYKGRFLQGDSGLLSKMSLH
IDVNHAERHTEFRKGDLDSLELYLKKDLWNYYETERNIDGEQGANWEARYEFDHPDEM
RAKGYVPPDTLVLVRLPEIYERAPINGLEWLHYQIRGEGIPGPRHTFYWVYPMTDSFY
LTFSFWMTTEIGNRELKVQEMYEDAKRIMSMVELRKE"
gene complement(6865197..6865964)
/locus_tag="HCH_06748"
CDS complement(6865197..6865964)
/locus_tag="HCH_06748"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33373.1"
/translation="MIKQIVRPLFTGKGPNFSELNVKECGIGDYLLRYKLPGNTIDIG
MPDAPVPSRVNLNADLFDSYDPKKLYNRTFVQMEFEWWAYRGLFLQGDSGLLSKMSLH
IDVNRINPNSPLGGSDLESLETYLREDYWDYYEAEKNIDGVPGSNWQKRYDFDNPDEV
RAKGYIPVRRLVLVLLPEIYVKERINDVEWLHYSIDGEGIAGTNITYYWAYPLTNNYY
LTFSFRTTTELGRNEQRYQRMLEDAKQIMSMVELCKG"
gene complement(6865961..6867109)
/locus_tag="HCH_06749"
CDS complement(6865961..6867109)
The goal here is to take each number from the first file's first column and check whether that number appears in the second file. If it does, I want to print the "/locus_tag" line directly above the match in file2.
For example, if file1 contains 6864294 and that number is also present in file2, then I'd like to print: /locus_tag="HCH_06747"

Here's a rough sample:
awk '
NR==FNR {                                   # hash file1 into a
    a[$1]
    next
}
{
    q=$0
    while(match($0,/[0-9]+/)) {             # find all numeric strings
        if((substr($0,RSTART,RLENGTH) in a))  # test if it is in a
            print p                         # and output previous record p
        $0=substr($0,RSTART+RLENGTH)        # remove match from record
    }
    p=q                                     # store current record to p
}' file1 file2
/locus_tag="HCH_06747"

Tried this and I think it will work:
for i in $(awk '{print $1; print $2}' file1)
do
    grep -m1 -A1 -w "$i" file2 | tail -1
done
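If file1 is large, that loop spawns one grep per number. A single pass with a pattern file does the same job; a rough sketch, assuming GNU grep and bash, checking only the first column as in the stated goal (-w avoids partial-number matches, -B1 prints the line above each hit, and the final grep keeps only the /locus_tag lines):
grep -w -B1 -f <(awk '{print $1}' file1) file2 | grep '/locus_tag'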

Related

replace pattern in file 2 with pattern in file 1 if contingency is met

I have two tab-delimited data files. file1 looks like:
cluster_j_72 cluster-32 cluster-32 cluster_j_72
cluster_j_75 cluster-33 cluster-33 cluster_j_73
cluster_j_8 cluster-68 cluster-68 cluster_j_8
the file2 looks like:
NODE_148 67545 97045 cluster-32
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster-68
I would like to confirm that, for a given row in file1, columns 2 and 3, as well as columns 1 and 4, are identical. If this is the case, then I would like to take that row's column 2 value (file1), find it in file2, and replace it with the column 1 value (file1). The new output of file2 would then look like this (note that because columns 1 and 4 don't match for cluster 33 in file1, the pattern is not replaced in file2):
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8
I have been able to get the contingency correct (here printing the value from file1 I'd like to use to replace a value in file2):
awk '{if($2==$3 && $1==$4){print $1}}' file1
If I could get sed to draw values ($2 and $1) from file1 while looking in file 2 this would work:
sed 's/$2(from file1)/$1(from file1)/' file2
But I don't seem to be able to nest this sed in the previous awk statement, nor get sed to look for a pattern originating in a different file than the one it's searching.
thanks!
You never need sed when you're using awk since awk can do anything that sed can do.
This might be what you're trying to do:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    if ( ($1 == $4) && ($2 == $3) ) {
        map[$2] = $1
    }
    next
}
$4 in map { $4 = map[$4] }
{ print }
$ awk -f tst.awk file1 file2
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8

Diff 2 settings files and replace the difference

I have 2 files with settings:
file1.txt and file2.txt
A=1 A=2
B=3 B=3
C=5 C=4
D=6 .
. E=7
I am looking for the best approach to replace the values in file1.txt with the differing values from file2.txt, so that file1.txt would look like:
file1.txt:
A=2
B=3
C=4
D=6
E=7
Currently I haven't written any code, but the only approach I can think of is a bash script that diffs both files (provided as positional arguments) and uses sed to replace the non-matching strings. Something in this vein:
./diffreplace.bash file1.txt file2.txt > NEWfile1.txt
I wonder whether something more elegant already exists?
All of the following solutions may change the order of assignments. I assumed that would be ok.
Lazy Solution
If you use these assignments in some way that allows overwriting, then you can simply append file2 to the end of file1. All old values will be overwritten by the new ones when you execute result.
cat old new > result
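This relies on the settings being applied as shell assignments, so the last assignment of each variable wins; a quick check, assuming the sample file1.txt/file2.txt contents above:
$ cat file1.txt file2.txt > result
$ . ./result && echo "A=$A C=$C D=$D"
A=2 C=4 D=6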
Slightly Better Solution
Extending the previous approach, you can concatenate the files with the new one first and, for every variable, keep only the first assignment seen:
cat new old |
awk -F= '{if (a[$1]!="x") {print $0; a[$1]="x"}}'
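Run against the sample files (with new holding the updated values), this keeps the first assignment seen for each key; awk -F= '!seen[$1]++' is an equivalent, more idiomatic spelling:
$ cat new old | awk -F= '!seen[$1]++'
A=2
B=3
C=4
E=7
D=6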
Alternative Solution
Use join to combine both files, then use cut to keep only the key and the first value where a key appears in both. When your files are sorted, use
join -t= -a1 -a2 new old | cut -d= -f1,2
if not, use
join -t= -a1 -a2 <(sort new) <(sort old) |
cut -d= -f1,2
I'm a little puzzled by your comment that the structure of the file must remain untouched. sort mixes the order, so I'm assuming that the As are always on line 1, or that line 1 is ., etc.:
$ awk '
BEGIN { RS="\r?\n" }        # in case of Windows line endings
$0!="." {                   # we don't store . (change it to null if you need to)
    a[FNR]=$0               # hash using the line number as key
}
END {                       # after all that hashing
    for(i=1;i<=FNR;i++)     # iterate in line number order
        print a[i]          # output the last version met
}' file1 file2              # mind the file order
Output:
A=2
B=3
C=4
D=6
E=7
Edit: A version with a whitelist:
$ cat whitelist
A
B
E
Script:
$ awk -F= '
NR==FNR {         # process the whitelist
    a[FNR]=$1     # for a, the line number is the key, the record the value
    b[$1]=FNR     # for b, the record is the key, the line number the value
    n=FNR         # remember the count for END
    next
}                 # process file1 and file2 ... filen
($1 in b) {       # if the record is found in b
    a[b[$1]]=$0   # set a[linenumber]=record
}
END {
    for(i=1;i<=n;i++)  # loop on line numbers, 1 to n
        print a[i]
}' whitelist file1 file2
Output:
A=2
B=3
E=7

compare two columns in two different files in shell script

There is a file1 as below:
21,2018042100
22,2018042101
87,2018042102
98,2018042103
There is file2 as below:
45,2018042100
86,2018042102
87,2018042103
what I need is: (file3)
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
In row 2 of file3, the data for 2018042101 exists in file1 but not in file2, so 0 is inserted in column 3, which belongs to file2.
Kindly assist in finding out how I can create a file like file3.
Thanks.
join seems made for this problem:
join -t',' -a 1 -a 2 -j 2 file1 file2
2018042100,21,45
2018042101,22
2018042102,87,86
2018042103,98,87
except for the missing ",0" in line 2, but maybe you'll find a solution for that in the manpage too. Otherwise you may use sed to correct the issue.
join -t',' -a 1 -a 2 -e "0" -j 2 file1 file2 | sed -r 's/^[^,]+,[^,]+$/&,0/'
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
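Note that join only honors -e when an explicit output format is given with -o, which also makes the sed step unnecessary; a sketch assuming GNU join and input sorted on the join field:
join -t',' -j 2 -a 1 -a 2 -e 0 -o 0,1.1,2.1 file1 file2
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87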
Another using awk:
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$2]=$1;next}{print $2,$1,(a[$2]+0)}' file2 file1
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
Explained:
$ awk '
BEGIN {
    FS=OFS=","             # set field separators
}
NR==FNR {                  # process the first file
    a[$2]=$1               # hash value on date
    next                   # process the next record of the first file
}
{                          # process the second file
    print $2,$1,(a[$2]+0)  # output date, value, and value from the first file if it exists
}' file2 file1             # mind the file order
Notice that (a[$2]+0) expects the first field value to be a number, as in your example. All other values will produce 0.
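That coercion is awk's standard string-to-number conversion; a quick demonstration:
$ awk 'BEGIN { print ("45"+0), ("abc"+0), (""+0) }'
45 0 0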

linux grep pattern in an unknown number of column

I have a text file with many rows and columns and I want to grep a column by the 'column name'.
M121 M125 M123 M124 M131 M126 M211 N
0.41463252 1.00296561 -0.1713496 0.15923644 -1.49682602 -1.9478695 1.45223392 …
-0.46775802 0.14591103 1.122446 0.83648981 -0.3038532 -1.1841548 2.18074729 …
0.67736835 2.12969375 -0.8187298 0.13582824 -1.49290987 -0.6798428 1.04353114 …
0.08673344 -0.40437672 1.8441559 -0.63679375 0.47998832 0.1702844 0.54029264 …
-0.32606297 -0.95551833 0.6157599 0.02819133 1.44818627 -0.9528659 0.09207864 …
-0.51781121 0.88806507 -0.2913757 -0.00463802 0.05037374 0.953773 0.01244763 …
-0.25724472 0.05119051 0.2109025 -0.26083822 -0.52094072 -0.938595 -0.01275275 …
1.94348766 -1.83607523 1.2010512 -0.54109756 -0.88323831 -0.6263788 -0.96973544 …
0.1900408 -0.61025656 0.4586306 -0.69181051 -0.90713834 0.3589271 0.6870383 …
0.54866057 -0.03861159 -1.505861 0.54871682 -0.24602601 -0.3941754 0.85673905 …
For example, I want to grep the M211 column but I don't know the column number. I tried:
awk '$i == "M211"' filename or awk '$0 == "M211"' filename
awk: illegal field $(), name "i"
input record number 1, filename
source line number 1
Is there any solution? Thank you.
awk solution - iterates over the column names of the input file's first line and saves the column number if it matches the desired name, then prints that column. No output if no match is found.
$ awk 'NR==1{ for(i=1;i<=NF;i++){if($i=="M125")c=i;} if(c==0)exit; }
{print $c}' ip.txt
M125
1.00296561
0.14591103
2.12969375
-0.40437672
-0.95551833
0.88806507
0.05119051
-1.83607523
-0.61025656
-0.03861159
Similar solution with perl
$ perl -lane '@i = grep {$F[$_] eq "M123"} 0..$#F if $.==1; exit if !@i;
print @F[@i]' ip.txt
M123
-0.1713496
1.122446
-0.8187298
1.8441559
0.6157599
-0.2913757
0.2109025
1.2010512
0.4586306
-1.505861
@i = grep {$F[$_] eq "M123"} 0..$#F if $.==1 for the header line, get the index of the column whose value matches the string M123
exit if !@i exit if no match found
print @F[@i] print the matched column
assumes there'll be only one column match
for multiple matches, use
perl -lane '@i = grep {$F[$_] =~ /^(M121|M126)$/} 0..$#F if $.==1; exit if !@i;
print join " ", @F[@i]' ip.txt
Another in awk:
$ awk 'NR==1 {for(i=NF;i>0;i--) if($i=="M125") break; if(!i) exit} {print $i}' file
M125
1.00296561
0.14591103
2.12969375
-0.40437672
-0.95551833
0.88806507
0.05119051
-1.83607523
-0.61025656
-0.03861159
Explained:
NR==1 {                       # for the first record
    for(i=NF;i>0;i--)         # iterate fields backwards
        if($i=="M125") break  # until the desired column is found; i remembers it
    if (!i) exit              # if the column was not found, exit
}
{print $i}                    # print the value of the ith field
If you are more familiar with Python:
import csv

column_name = "M125"

with open("file", "rb") as f:
    data_dict = csv.DictReader(f, delimiter=" ")
    print column_name
    for item in data_dict:
        print item[column_name]
To do anything with columns ("fields" in awk) by name rather than number, you should first create an array that maps each field name to its number; from then on, access the fields using that array indexed by field name rather than directly by field number:
$ awk 'NR==1{for (i=1;i<=NF;i++) f[$i]=i} {print $(f["M124"])}' file
M124
0.15923644
0.83648981
0.13582824
-0.63679375
0.02819133
-0.00463802
-0.26083822
-0.54109756
-0.69181051
0.54871682
or if you don't want to hard-code the column name:
$ awk -v c=M124 'NR==1{for (i=1;i<=NF;i++) f[$i]=i} {print $(f[c])}' file
M124
0.15923644
0.83648981
0.13582824
-0.63679375
0.02819133
-0.00463802
-0.26083822
-0.54109756
-0.69181051
0.54871682
and to print any number of columns in the order you choose:
$ awk -v cols='M129 M124' 'NR==1{for (i=1;i<=NF;i++) f[$i]=i; n=split(cols,c)} {for (i=1;i<=n;i++) printf "%s%s", $(f[c[i]]), (i<n ? OFS : ORS)}' file
M129 M124
1.45223392 0.15923644
2.18074729 0.83648981
1.04353114 0.13582824
0.54029264 -0.63679375
0.09207864 0.02819133
0.01244763 -0.00463802
-0.01275275 -0.26083822
-0.96973544 -0.54109756
0.6870383 -0.69181051
0.85673905 0.54871682

Join two TSV files with inner join

I have 2 TSV files:
TSV file 1:
A B
hello 0.5
bye 0.4
TSV file 2:
C D
hello 1
country 5
I want to join the 2 TSV files together based on file1.A=file2.C
How can I do it with the join command in Linux?
Hoping to get this:
Text B D
hello 0.5 1
bye 0.4
country 5
Not getting any output with this:
join -j 1 <(sort -k1 file1.tsv) <(sort -k1 file2.tsv)
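As given, that command should print at least the hello line, so the real files probably differ from the sample; CRLF line endings or non-tab separators are common culprits. Forcing the tab separator and skipping the header rows is worth a try; a sketch assuming GNU join and bash process substitution (it yields only the inner-join rows):
join -t$'\t' <(tail -n +2 file1.tsv | sort) <(tail -n +2 file2.tsv | sort)
hello 0.5 1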
A little hairy, but here is a solution using awk and associative arrays.
awk 'FNR == 1 {h[length(h) + 1] = $2}
FILENAME ~ /test1.tsv/ && FNR > 1 {t1[$1]=$2}
FILENAME ~ /test2.tsv/ && FNR > 1 {t2[$1]=$2}
END{print "Text\t"h[1]"\t"h[2];
for(x in t1){print x"\t"t1[x]"\t"t2[x]}
for(x in t2){print x"\t"t1[x]"\t"t2[x]}}' test1.tsv test2.tsv |
sort | uniq
File1
$ cat file1
A B
hello 0.5
bye 0.4
File2
$ cat file2
C D
hello 1
country 5
Output
$ awk 'NR==1{print "Text","B","D"}FNR==1{next}FNR==NR{A[$1]=$2;next}{print $0,(f=$1 in A ? A[$1] : ""); if(f)delete A[$1]}END{for(i in A)print i,"",A[i]}' OFS='\t' file2 file1
Text B D
hello 0.5 1
bye 0.4
country 5
More Readable Version
awk '
# Print the header when NR == 1; this happens only on the very first
# line awk reads (from the first file argument)
NR==1{ print "Text","B","D" }
# NR is the record number across all input files, while FNR restarts
# for each input file. So FNR==1 is true on the first line of every
# file; skipping it drops the header of each input file.
FNR==1{
    next
}
# FNR==NR is only true while reading the first file argument (file2)
FNR==NR{
    # Build an associative array keyed on the first column of the file,
    # where the array element is the second column
    A[$1]=$2
    # Skip the following blocks and process the next line
    next
}
{
    # Check whether the key ($1 = column 1) of the second file argument
    # (file1) exists in array A; f gets its value if so, otherwise ""
    # Print the current line plus that element of array A
    print $0,( f=$1 in A ? A[$1] : "" )
    # Delete the array element for key $1, if f is set
    if(f)delete A[$1]
}
# Finally, in the END block, print the remaining array elements one by
# one: the entries from file2 that do not exist in file1
END{
    for(i in A)
        print i,"",A[i]
}
' OFS='\t' file2 file1
In your title you state you want to perform an inner join. Your example output suggests you want an outer join.
If you want an inner join as the title suggests, I recommend you use eBay's fabulous tsv-utils, particularly the tsv-join command, as follows:
tsv-join -H --filter-file 1.tsv --key-fields A --data-fields C --append-fields B 2.tsv
No awk magic needed, just a simple, well-documented command with easily understandable options.
The above produces a proper inner join; you'd just need to rename the join key to Text:
C D B
hello 1 0.5
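For the outer join shown in the question's expected output, plain join can get there as well; a rough sketch assuming GNU join and bash, with -a keeping unpaired lines from both files and -o picking the output fields (the Text header would be printed separately; tabs shown as spaces):
join -t$'\t' -a 1 -a 2 -o 0,1.2,2.2 <(tail -n +2 file1.tsv | sort) <(tail -n +2 file2.tsv | sort)
bye 0.4
country  5
hello 0.5 1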
