Matching two files and printing all columns - linux

I have two files I want to match according to column 1 in file 1 and column 2 in file 2.
File 1:
1000019 -0.013936 0.0069218 -0.0048443 -0.0053688
1000054 0.013993 0.0044969 -0.0050022 -0.0043233
File 2:
5131885 1000019
1281471 1000054
I would like to print all columns after matching.
Expected output (file 3):
5131885 1000019 -0.013936 0.0069218 -0.0048443 -0.0053688
1281471 1000054 0.013993 0.0044969 -0.0050022 -0.0043233
I tried the following:
awk 'FNR==NR{arr[$1]=$2;next} ($2 in arr){print $0,arr[$2]}' file1 file2 > file3
join file1 file2 > file3 #after sorting

This awk should work:
awk 'NR==FNR {r[$2]=$1; next} {print r[$1], $0}' file2 file1
Output
5131885 1000019 -0.013936 0.0069218 -0.0048443 -0.0053688
1281471 1000054 0.013993 0.0044969 -0.0050022 -0.0043233
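The join attempt from the question also works once the key and output options are set; a sketch using the sample data above (join requires both inputs sorted on their join keys):

```shell
# Sample data from the question.
printf '%s\n' '1000019 -0.013936 0.0069218 -0.0048443 -0.0053688' \
              '1000054 0.013993 0.0044969 -0.0050022 -0.0043233' > file1
printf '%s\n' '5131885 1000019' '1281471 1000054' > file2

# join needs both inputs sorted on their join keys.
sort -k2,2 file2 > file2.sorted
sort -k1,1 file1 > file1.sorted

# Join file2 column 2 with file1 column 1; -o picks the output columns:
# file2's first column, then the key (0), then file1's four value columns.
join -1 2 -2 1 -o 1.1,0,2.2,2.3,2.4,2.5 file2.sorted file1.sorted > file3
```

Without -o, join would print the key first, which is not the column order the expected file3 shows.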

Related

replace pattern in file 2 with pattern in file 1 if contingency is met

I have two tab-delimited data files. file1 looks like:
cluster_j_72 cluster-32 cluster-32 cluster_j_72
cluster_j_75 cluster-33 cluster-33 cluster_j_73
cluster_j_8 cluster-68 cluster-68 cluster_j_8
and file2 looks like:
NODE_148 67545 97045 cluster-32
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster-68
I would like to confirm that, for a given row in file1, columns 2 and 3, as well as columns 1 and 4, are identical. If so, I would like to take that row's column 2 value (file1), find it in file2, and replace it with the column 1 value (file1). Thus the new file2 would look like this (note: because columns 1 and 4 don't match for cluster-33 in file1, the pattern is not replaced in file2):
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8
I have been able to get the contingency correct (here printing the value from file1 I'd like to use to replace a value in file2):
awk '{if ($2==$3 && $1==$4) {print $1}}' file1
If I could get sed to draw values ($2 and $1) from file1 while searching file2, this would work:
sed 's/$2(from file1)/$1(from file1)/' file2
But I don't seem to be able to nest this sed in the previous awk statement, nor get sed to look for a pattern originating in a different file than the one it's searching.
thanks!
You never need sed when you're using awk since awk can do anything that sed can do.
This might be what you're trying to do:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    if ( ($1 == $4) && ($2 == $3) ) {
        map[$2] = $1
    }
    next
}
$4 in map { $4 = map[$4] }
{ print }
$ awk -f tst.awk file1 file2
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8
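The same logic also fits in a one-liner, shown here with the sample data written out explicitly; this assumes the files really are tab-delimited as described:

```shell
# Tab-delimited sample data from the question.
printf 'cluster_j_72\tcluster-32\tcluster-32\tcluster_j_72\n' >  file1
printf 'cluster_j_75\tcluster-33\tcluster-33\tcluster_j_73\n' >> file1
printf 'cluster_j_8\tcluster-68\tcluster-68\tcluster_j_8\n'   >> file1
printf 'NODE_148\t67545\t97045\tcluster-32\n' >  file2
printf 'NODE_221\t1\t42205\tcluster-33\n'     >> file2
printf 'NODE_168\t1\t24506\tcluster-68\n'     >> file2

# Pass 1 (file1): remember $2 -> $1 only when the row is self-consistent.
# Pass 2 (file2): rewrite column 4 if a mapping exists, then print (the 1).
awk -F'\t' -v OFS='\t' '
    NR==FNR { if ($1 == $4 && $2 == $3) map[$2] = $1; next }
    $4 in map { $4 = map[$4] }
    1' file1 file2
```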

Finding matches in 2 files and printing the field above the match

File1:
2987571 2988014
4663633 4668876
4669084 4669827
4669873 4670130
4670212 4670604
4670604 4672469
4672502 4672621
4672723 4673088
4673102 4673518
4673521 4673895
4679698 4680174
5756724 5757680
5757937 5758506
5758855 5759202
5759940 5771528
5772524 5773063
5773005 5773106
5773063 5773452
5773486 5773776
5773836 5774189
File2:
gene complement(6864294..6865061)
/locus_tag="HCH_06747"
CDS complement(6864294..6865061)
/locus_tag="HCH_06747"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33372.1"
/translation="MIKQLVRPLFTGKGPNFSELSAKECGVGEYQLRYKLPGNTIHIG
MPDAPVPARVNLNADLFDSYGPKKLYNRTFVQMEFEKWAYKGRFLQGDSGLLSKMSLH
IDVNHAERHTEFRKGDLDSLELYLKKDLWNYYETERNIDGEQGANWEARYEFDHPDEM
RAKGYVPPDTLVLVRLPEIYERAPINGLEWLHYQIRGEGIPGPRHTFYWVYPMTDSFY
LTFSFWMTTEIGNRELKVQEMYEDAKRIMSMVELRKE"
gene complement(6865197..6865964)
/locus_tag="HCH_06748"
CDS complement(6865197..6865964)
/locus_tag="HCH_06748"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33373.1"
/translation="MIKQIVRPLFTGKGPNFSELNVKECGIGDYLLRYKLPGNTIDIG
MPDAPVPSRVNLNADLFDSYDPKKLYNRTFVQMEFEWWAYRGLFLQGDSGLLSKMSLH
IDVNRINPNSPLGGSDLESLETYLREDYWDYYEAEKNIDGVPGSNWQKRYDFDNPDEV
RAKGYIPVRRLVLVLLPEIYVKERINDVEWLHYSIDGEGIAGTNITYYWAYPLTNNYY
LTFSFRTTTELGRNEQRYQRMLEDAKQIMSMVELCKG"
gene complement(6865961..6867109)
/locus_tag="HCH_06749"
CDS complement(6865961..6867109)
The goal here is to take each number in the 1st file's 1st column and see if that number appears in the second file. If yes, I want to print the line right above the match in file2: "/locus_tag"
For example, if in file1 we have 6864294, and this number is also present in file2, then I'd like to print: /locus_tag="HCH_06747"
Here's a rough sample:
awk '
NR==FNR {                                      # hash file1 column 1 into a
    a[$1]
    next
}
{
    q = $0
    while (match($0, /[0-9]+/)) {              # find all numeric strings
        if (substr($0, RSTART, RLENGTH) in a)  # test if it is in a
            print p                            # and output previous record p
        $0 = substr($0, RSTART + RLENGTH)      # remove match from record
    }
    p = q                                      # store current record to p
}' file1 file2
/locus_tag="HCH_06747"
Tried this and I think it will work:
for i in $(awk '{print $1; print $2}' file1)
do
    grep -m1 -A1 "$i" file2 | tail -1
done
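An alternative that avoids running grep once per number: build a pattern file from file1's first column and use grep's leading-context option. A sketch on trimmed sample data; it assumes GNU grep, and a number that matches several blocks will print its tag more than once:

```shell
# Trimmed sample files from the question.
printf '%s\n' '6864294 6865061' > file1
printf '%s\n' 'gene complement(6864294..6865061)' \
              '/locus_tag="HCH_06747"' \
              'CDS complement(6864294..6865061)' > file2

awk '{print $1}' file1 > patterns      # one whole-word pattern per number
# -F: fixed strings, -w: whole words, -B1: one line of leading context;
# the final grep keeps only the /locus_tag lines from that context.
grep -B1 -wFf patterns file2 | grep -F '/locus_tag'
```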

How to delete lines in file1 if column 3 is in file2

I have a file with 4 columns and I need to delete lines from file1 if column 3 appears in file2.
Example:
File1:
14769,marty.------#googlemail.com,c076a7b6a52857ddf2f2e60d71dda6bf,49
14770,maryfi-------#googlemail.com,23fc2887a3a8248ddea570b5700b1708,49
14771,n.s------#googlemail.com,e504a6617f375ce04f4e51f1ec66dd93,49
14772,paula------#googlemail.com,f918f5b8df1d6285892d003c2fb9e3cf,49
14773,pkec.------#googlemail.com,4ca2c5d670f324c31a20854873bf63ac,49
14774,squi-------#googlemail.com,d26a0296a361b79afd98ede1af918f6d,49
File 2:
d26a0296a361b79afd98ede1af918f6d
4ca2c5d670f324c31a20854873bf63ac
so the result will be like this:
14769,marty.------#googlemail.com,c076a7b6a52857ddf2f2e60d71dda6bf,49
14770,maryfi-------#googlemail.com,23fc2887a3a8248ddea570b5700b1708,49
14771,n.s------#googlemail.com,e504a6617f375ce04f4e51f1ec66dd93,49
14772,paula------#googlemail.com,f918f5b8df1d6285892d003c2fb9e3cf,49
I have tried this:
awk -F',' 'NR==FNR {a[$1]=$3 ;next} !($3 in a) {print }' OFS='\t' file1 file2
but it's not working.
I can't add a comment (not enough rep), but I've tried your code with gawk and it did remove the two lines as you wanted. The reason you don't get tab-delimited output is that OFS takes effect only after $0 is rebuilt, so you can force the rebuild with a simple assignment like $1=$1 while keeping your OFS='\t' (note the file order: file2 must be read first):
awk -F',' 'NR==FNR {a[$1]=$3; next} !($3 in a) {$1=$1; print}' OFS='\t' file2 file1
Result:
14769 marty.------#googlemail.com c076a7b6a52857ddf2f2e60d71dda6bf 49
14770 maryfi-------#googlemail.com 23fc2887a3a8248ddea570b5700b1708 49
14771 n.s------#googlemail.com e504a6617f375ce04f4e51f1ec66dd93 49
14772 paula------#googlemail.com f918f5b8df1d6285892d003c2fb9e3cf 49
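Since the hashes are unlikely to appear anywhere else on a line, plain grep can also do this; a sketch using the sample data from the question:

```shell
# Sample data from the question.
cat > file1 <<'EOF'
14769,marty.------#googlemail.com,c076a7b6a52857ddf2f2e60d71dda6bf,49
14770,maryfi-------#googlemail.com,23fc2887a3a8248ddea570b5700b1708,49
14771,n.s------#googlemail.com,e504a6617f375ce04f4e51f1ec66dd93,49
14772,paula------#googlemail.com,f918f5b8df1d6285892d003c2fb9e3cf,49
14773,pkec.------#googlemail.com,4ca2c5d670f324c31a20854873bf63ac,49
14774,squi-------#googlemail.com,d26a0296a361b79afd98ede1af918f6d,49
EOF
cat > file2 <<'EOF'
d26a0296a361b79afd98ede1af918f6d
4ca2c5d670f324c31a20854873bf63ac
EOF

# -F: treat file2's lines as literal strings, -f: read patterns from file,
# -v: print only the file1 lines that match none of them.
grep -vFf file2 file1
```

Unlike the awk version, this matches the hash anywhere on the line, not only in column 3, which is fine here but worth knowing.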

linux bash - compare two files and remove duplicate lines having same ending

I have two files containing paths to files.
File 1
/home/anybody/proj1/hello.h
/home/anybody/proj1/engine.h
/home/anybody/proj1/car.h
/home/anybody/proj1/tree.h
/home/anybody/proj1/sun.h
File 2
/home/anybody/proj2/module/include/cat.h
/home/anybody/proj2/module/include/engine.h
/home/anybody/proj2/module/include/tree.h
/home/anybody/proj2/module/include/map.h
/home/anybody/proj2/module/include/sun.h
I need a command, probably using grep, that would compare the two files and output a combination of the two, but in case of duplicate file names, keep the path from File 2.
Expected output:
/home/anybody/proj1/hello.h
/home/anybody/proj1/car.h
/home/anybody/proj2/module/include/cat.h
/home/anybody/proj2/module/include/engine.h
/home/anybody/proj2/module/include/tree.h
/home/anybody/proj2/module/include/map.h
/home/anybody/proj2/module/include/sun.h
This is so I can generate a list of include files for my project's tag database, but some files are duplicated by the build, and I don't want to have two copies of the same file in my database.
This awk command should do the job:
awk -F/ 'NR == FNR{a[$NF]=$0; next} !($NF in a); END{for (i in a) print a[i]}' file2 file1
/home/anybody/proj1/hello.h
/home/anybody/proj1/car.h
/home/anybody/proj2/module/include/map.h
/home/anybody/proj2/module/include/cat.h
/home/anybody/proj2/module/include/engine.h
/home/anybody/proj2/module/include/tree.h
/home/anybody/proj2/module/include/sun.h
This should do it:
cat file2 file1 | awk -F '/' '
{ if (a[$NF] == "") a[$NF] = $0 }
END { for (k in a) print a[k] }' | sort
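Both answers above emit lines in an order chosen by awk's arbitrary for (i in a) traversal or by sort, not the order shown in the expected output. A sketch that preserves each file's original line order: print file1's lines whose basename never occurs in file2, then all of file2:

```shell
# Sample data from the question.
cat > file1 <<'EOF'
/home/anybody/proj1/hello.h
/home/anybody/proj1/engine.h
/home/anybody/proj1/car.h
/home/anybody/proj1/tree.h
/home/anybody/proj1/sun.h
EOF
cat > file2 <<'EOF'
/home/anybody/proj2/module/include/cat.h
/home/anybody/proj2/module/include/engine.h
/home/anybody/proj2/module/include/tree.h
/home/anybody/proj2/module/include/map.h
/home/anybody/proj2/module/include/sun.h
EOF

# First print the file1 paths whose basename ($NF with -F/) is absent
# from file2, then print all of file2; both keep their original order.
awk -F/ 'NR==FNR { seen[$NF]; next } !($NF in seen)' file2 file1
cat file2
```

This reproduces the expected output from the question exactly.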

Comparing two files and updating second file using bash and awk and sorting the second file

I have two files with two columns each, and I want to compare the 1st column of both files. If a value in the 1st column of the first file does not exist in the second file, I then want to append that value to the second file, e.g.
firstFile.log
1457935407,998181
1457964225,998191
1457969802,997896
secondFile.log
1457966024,1
1457967635,1
1457969802,5
1457975246,2
Afterwards, secondFile.log should look like:
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
Note: Second file should be sorted by the first column after being updated.
Using awk and sort:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]; next} {delete a[$1]; print} END{
for (i in a) print i, "null"}' firstFile.log secondFile.log |
sort -t, -k1 > $$.temp && mv $$.temp secondFile.log
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
Using non-awk tools:
$ sort -t, -uk1,1 secondFile.log <(sed 's/,.*/,null/' firstFile.log)
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
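join can also produce the merged, sorted result directly: -a keeps unpaired lines from either file, and -e substitutes a string for the missing value field. A sketch using the sample data (GNU join; -e only takes effect with -o, and both inputs must be pre-sorted on the key):

```shell
# Sample data from the question.
cat > firstFile.log <<'EOF'
1457935407,998181
1457964225,998191
1457969802,997896
EOF
cat > secondFile.log <<'EOF'
1457966024,1
1457967635,1
1457969802,5
1457975246,2
EOF

sort firstFile.log  > first.sorted     # join needs sorted inputs
sort secondFile.log > second.sorted

# -a1 -a2: keep unpaired lines from both files; -o 0,2.2: print the key
# and secondFile's value; -e null: fill the value when it is missing.
join -t, -a1 -a2 -e null -o 0,2.2 first.sorted second.sorted
```

Because the inputs are sorted, the output is already in key order and can be redirected straight over secondFile.log via a temp file.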
