compare two columns in two different files in shell script - linux

there is a file1 as below:
21,2018042100
22,2018042101
87,2018042102
98,2018042103
there is file2 as below:
45,2018042100
86,2018042102
87,2018042103
what I need is: (file3)
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
In row 2 of file3, the date 2018042101 exists in file1 but not in file2, so 0 is inserted in column $3, which holds the file2 value.
Please help me work out how to create a file like file3.
Thanks.

join seems made for this problem:
join -t',' -a 1 -a 2 -j 2 file1 file2
2018042100,21,45
2018042101,22
2018042102,87,86
2018042103,98,87
This works except for the missing ",0" in line 2. The -e option of join is meant for that, but it only takes effect together with an explicit output format (-o), so the -e "0" below does nothing on its own and sed appends the missing field instead:
join -t',' -a 1 -a 2 -e "0" -j 2 file1 file2 | sed -r 's/^[^,]+,[^,]+$/&,0/'
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
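The sed patch only repairs rows that are missing from file2. As a sketch, join can pad either side itself once it is given an explicit output list with -o. The function name merge_by_date and the temporary files are illustrative assumptions, and both inputs are assumed to be sorted on the date column.

```shell
#!/bin/sh
# Hypothetical helper: full outer join of two "value,date" files on the date.
# Assumes both files are already sorted on column 2.
merge_by_date() {
    # -1 2 -2 2    : join on field 2 of each file
    # -a 1 -a 2    : also print unpaired lines from either file
    # -o 0,1.1,2.1 : print the join key, then the value from each file
    # -e 0         : fill a field that is missing on an unpaired line with 0
    join -t, -1 2 -2 2 -a 1 -a 2 -e 0 -o 0,1.1,2.1 "$1" "$2"
}

# Demo with the data from the question.
tmp=$(mktemp -d)
printf '%s\n' 21,2018042100 22,2018042101 87,2018042102 98,2018042103 > "$tmp/file1"
printf '%s\n' 45,2018042100 86,2018042102 87,2018042103 > "$tmp/file2"
merge_by_date "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

Unlike the sed version, a date that appears only in file2 comes out as date,0,value rather than with the 0 in the wrong column.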

Another using awk:
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$2]=$1;next}{print $2,$1,(a[$2]+0)}' file2 file1
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
Explained:
$ awk '
BEGIN {
FS=OFS="," # set field separators
}
NR==FNR { # process first file
a[$2]=$1 # hash value on date
next # process next record in first file
}
{ # process second file
print $2,$1,(a[$2]+0) # output date, value, value from first file if exists
}' file2 file1 # mind the file order
Notice, that (a[$2]+0) expects the first field value to be a number like in your example. All other values will produce 0.
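The awk command above prints only the dates that occur in file1. A sketch of a variant that also keeps dates found only in file2 follows; the function name merge_awk is a hypothetical helper, and the output is piped through sort because the order of `for (k in ...)` is unspecified in awk.

```shell
#!/bin/sh
# Hypothetical helper: outer-join two "value,date" files on the date using awk.
merge_awk() {
    awk 'BEGIN { FS = OFS = "," }
         NR == FNR { a[$2] = $1; seen[$2]; next }   # first file: value keyed by date
                   { b[$2] = $1; seen[$2] }         # second file: value keyed by date
         END { for (k in seen) print k, a[k] + 0, b[k] + 0 }' "$1" "$2" | sort
}

# Demo with the data from the question.
tmp=$(mktemp -d)
printf '%s\n' 21,2018042100 22,2018042101 87,2018042102 98,2018042103 > "$tmp/file1"
printf '%s\n' 45,2018042100 86,2018042102 87,2018042103 > "$tmp/file2"
merge_awk "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

As in the original, the +0 turns a missing entry into 0, so it assumes the values are numeric.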

Related

Diff 2 settings files and replace the difference

I have 2 files with settings:
file1.txt    file2.txt
A=1          A=2
B=3          B=3
C=5          C=4
D=6          .
.            E=7
I am looking for the best approach to replace the values of the file1.txt with the diff values of file2.txt, so the file1.txt would look like:
file1.txt:
A=2
B=3
C=4
D=6
E=7
I haven't written any code yet, but the only approach I can think of is a bash script that diffs both files (provided as positional arguments) and uses sed to replace non-matching strings. Something in this vein:
./diffreplace.bash file1.txt file2.txt > NEWfile1.txt
I wonder whether something more elegant already exists?
All of the following solutions may change the order of assignments. I assumed that would be ok.
Lazy Solution
If you use these assignments in some way that allows overwriting, then you can simply append file2 to the end of file1. All old values will be overwritten by the new ones when you execute the result.
cat old new > result
Slightly Better Solution
Extending the previous approach, you can concatenate the files with the new one first and keep only the first assignment seen for each variable, which is the new value whenever both files define it:
cat new old |
awk -F= '!seen[$1]++'
Alternative Solution
Use join to combine both files, then filter out the values from the first file by using cut. When your files are sorted, use
join -t= -a1 -a2 new old | cut -d= -f1,2
if not, use
join -t= -a1 -a2 <(sort new) <(sort old) |
cut -d= -f1,2
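The solutions above may reorder the assignments. A minimal sketch that keeps the order of the old file and appends keys that exist only in the new file could look like this. The function name merge_settings is hypothetical, and treating a lone "." as a placeholder for "no line here" is an assumption based on the sample data.

```shell
#!/bin/sh
# Hypothetical helper: merge_settings OLD NEW
# Prints OLD with values replaced from NEW, then keys that only NEW defines.
merge_settings() {
    awk -F= '
        NR == FNR {                      # first argument read: the new settings
            if ($0 != ".") { new[$1] = $0; order[++n] = $1 }
            next
        }
        $0 == "." { next }               # assumed placeholder line, skipped
        ($1 in new) { print new[$1]; used[$1] = 1; next }
        { print }                        # key absent from the new file: keep as is
        END {                            # append keys that only the new file has
            for (i = 1; i <= n; i++)
                if (!(order[i] in used)) print new[order[i]]
        }' "$2" "$1"
}

# Demo with the data from the question.
tmp=$(mktemp -d)
printf '%s\n' A=1 B=3 C=5 D=6 . > "$tmp/file1.txt"
printf '%s\n' A=2 B=3 C=4 . E=7 > "$tmp/file2.txt"
merge_settings "$tmp/file1.txt" "$tmp/file2.txt"
rm -rf "$tmp"
```

The new file is read first so that its values are available as a lookup table while the old file is streamed in its original order.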
I'm a little puzzled by your comment that the structure of the file must remain untouched. Sorting changes the order, so I'm assuming that each setting always sits on the same line in both files (the As on line 1, or line 1 is a ".", and so on):
$ awk '
BEGIN { RS="\r?\n" } # in case of Windows line-endings
$0!="." { # we dont store . (change it to null if you need to)
a[FNR]=$0 # hash using line number as key
}
END { # after all that hashing
for(i=1;i<=FNR;i++) # iterate in line number order
print a[i] # output the last met version
}' file1 file2 # mind the file order
Output:
A=2
B=3
C=4
D=6
E=7
Edit: A version with a whitelist:
$ cat whitelist
A
B
E
Script:
$ awk -F= '
NR==FNR {                # process the whitelist
    a[FNR]=$1            # in a, the line number is the key and the name is the value
    b[$1]=FNR            # in b, the name is the key and the line number is the value
    n=FNR                # remember the count for END
    next
}                        # process file1 and file2 ... filen
($1 in b) {              # if the name is found in b
    a[b[$1]]=$0          # store the record in a under the whitelist line number
}
END {
    for(i=1;i<=n;i++)    # here we loop on line numbers, 1 to n
        print a[i]
}' whitelist file1 file2
Output:
A=2
B=3
E=7

how to Merge 2 tables with awk

First of all, sorry for my English. I know there are a lot of topics about awk, but I find it very difficult...
I would like to merge two tables on a common column with awk. The tables have different numbers of rows. The first table is the one I want to modify and the second is a reference table. I would like to compare column 1 of File1 with column 1 of File2 and, when they match, add column 2 of File2 to File1. But I need to keep all the lines of File1.
I give you an example:
File1
Num_id,Name,description1,description2,description3
?,atlanta_1,,,
RO_5,babeni_SW,,,
? ,Bib1,,,
RO_9,BoUba_456,,,
?,Castor,,,
File2
official_Num_id,official_Name
RO_1,America
RO_2,Andre
RO_3,Atlanta
RO_4,Axa
RO_5,Babeni
RO_6,Barba
RO_7,Bib
RO_8,Bilbao
RO_9,Bouba
RO_10,Castor
File3
Num_id,Name,description1,description2,description3,official_Name
?,atlanta_1,,,
RO_5,babeni_SW,,,Babeni
?,Bib1,,,
RO_9,BoUba_456,,,Bouba
?,Castor,,,
I read a lot of solutions on the Internet and it seems that awk could work.
I tried awk 'NR==FNR {h[$1] = $2; next} {print $0,h[$1]}' $File1 $File2 > file3
But my command doesn't work; my File3 looks exactly like File1.
Secondly, I don't know if it's possible to compare the two second columns when the names differ, like atlanta_1 and Atlanta, and add the official_num_id and the official_name to my File1.
Any hero over there?
You had it, except for two small things. First, you need to set your field separators to a comma and, second, reverse the order of your input files on the command line so that the reference file is processed first:
$ awk 'BEGIN {FS=OFS=","} NR==FNR {h[$1] = $2; next} {print $0,h[$1]}' File2 File1
Num_id,Name,description1,description2,description3,
?,atlanta_1,,,,
RO_5,babeni_SW,,,,Babeni
? ,Bib1,,,,
RO_9,BoUba_456,,,,Bouba
?,Castor,,,,
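The output above leaves the new header cell empty and would miss a key such as "? " or "RO_5 " that carries stray spaces. A sketch that names the new column and trims the key before the lookup follows; the function name add_official_name is hypothetical, and skipping the first line of each file as a header is an assumption taken from the sample data.

```shell
#!/bin/sh
# Hypothetical helper: add_official_name REFERENCE TABLE
# Appends the official name from REFERENCE to every row of TABLE.
add_official_name() {
    awk 'BEGIN { FS = OFS = "," }
         NR == FNR { if (FNR > 1) name[$1] = $2; next }   # reference file, header skipped
         FNR == 1  { print $0, "official_Name"; next }    # extend the header row
         {
             key = $1
             gsub(/^[ \t]+|[ \t]+$/, "", key)             # trim spaces around the id
             print $0, ((key in name) ? name[key] : "")   # empty cell when no match
         }' "$1" "$2"
}

# Demo with a shortened version of the data from the question.
tmp=$(mktemp -d)
printf '%s\n' official_Num_id,official_Name RO_5,Babeni RO_9,Bouba > "$tmp/File2"
printf '%s\n' Num_id,Name,description1,description2,description3 \
    '?,atlanta_1,,,' 'RO_5,babeni_SW,,,' > "$tmp/File1"
add_official_name "$tmp/File2" "$tmp/File1"
rm -rf "$tmp"
```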
You can also use the join command for this:
join --header --nocheck-order -t, -1 1 -2 1 -a 1 file1 file2
To answer your question if it's possible to compare my two second columns when names have difference like atlanta_1 and Atlanta and add the official_num_id and the official_name in my File1:
$ awk '
BEGIN { FS=OFS="," }
NR==FNR { # file2
a[tolower($2)]=$0 # hash on lowercase city
next
}
{ # file1
split($2,b,"[^[:alpha:]]") # split on non-alphabet
print $0 (tolower(b[1]) in a?OFS a[tolower(b[1])]:"")
}' file2 file1
Num_id,Name,description1,description2,description3
?,atlanta_1,,,,RO_3,Atlanta
RO_5,babeni_SW,,,,RO_5,Babeni
? ,Bib1,,,,RO_7,Bib
RO_9,BoUba_456,,,,RO_9,Bouba
?,Castor,,,,RO_10,Castor
split breaks the Name field on non-alphabetic characters (the _ in atlanta_1, the 1 in Bib1, and so on), so it might fail on cities with dashes etc.; edit the pattern [^[:alpha:]] in split accordingly. The header no longer matches the appended columns, so rethink the header names.

Scalable way of deleting all lines from a file where the line starts with one of many values

Given an input file of variable values (example):
A
B
D
What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:
A
B
C
D
Would end up being:
C
The input file is of the order of 100,000 variable values. The file to be mangled is of the order of several million lines.
awk '
NR==FNR { # IF this is the first file in the arg list THEN
list[$0] # store the contents of the current record as an index of array "list"
next # skip the rest of the script and so move on to the next input record
} # ENDIF
{ # This MUST be the second file in the arg list
for (i in list) # FOR each index "i" in array "list" DO
if (index($0,i) == 1) # IF "i" starts at the 1st char on the current record THEN
next # move on to the next input record
}
1 # Specify a true condition and so invoke the default action of printing the current record.
' file1 file2
An alternative approach to building up an array and then doing a string comparison on each element would be to build up a Regular Expression, e.g.:
...
list = list "|" $0
...
and then doing an RE comparison:
...
if ($0 ~ list)
next
...
but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:
awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
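The loop over every value costs one string comparison per value per line, which adds up with 100,000 values and millions of lines. One possible alternative, which is not part of the answer above, is to exploit awk's hash lookup: remember the distinct lengths of the values and test only one prefix of the line per length. The function name drop_prefixed is hypothetical, and the gain assumes the values have only a few distinct lengths.

```shell
#!/bin/sh
# Hypothetical helper: drop_prefixed VALUES FILE
# Prints the lines of FILE that do not start with any line of VALUES.
drop_prefixed() {
    awk '
        NR == FNR {                       # first file: the values
            if ($0 != "") { list[$0]; lens[length($0)] }
            next
        }
        {
            for (n in lens)               # one hash lookup per distinct length
                if (substr($0, 1, n) in list) next
            print                         # no value is a prefix of this line
        }' "$1" "$2"
}

# Demo with sample data.
tmp=$(mktemp -d)
printf '%s\n' A B > "$tmp/file1"
printf '%s\n' Asomething B1324 C23sd D2356A Atext CtestA EtestB Bsomething > "$tmp/file2"
drop_prefixed "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

Because the comparison is a plain string lookup, regular expression metacharacters in the values need no escaping.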
You can also achieve this using egrep:
egrep -vf <(sed 's/^/^/' file1) file2
Let's see it in action:
$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB
This would remove lines that start with one of the values in file1.
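If the values in file1 are whole lines rather than prefixes, as in the question's example, you can use comm to display the lines that are not common to both files:
comm -3 file1 file2
This prints (indented by a tab, because the line comes from the second file):
C
For this to work both files have to be sorted; if they aren't, you can sort them on the fly: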
comm -3 <(sort file1) <(sort file2)

Comparing two files using awk and printing contains which are matching from other files

I have two files:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
file2.txt
919167000000
919594000000
Output
919167000000,hutch,mumbai
919594000000,idea,mumbai
How can I achieve this using AWK? I've got a huge file of phone numbers which needs to be compared like this. I believe Awk can handle it; if not please let me know how can I do this.
Extra definitions
Is the common part always a 6-digit number? Yes always 6.
Are the two files already sorted? file1 is not sorted. file2 can be sorted.
Are the trailing digits in file 2 always zeros? No, these are phone numbers this can vary, purpose of this is to get series information of the phone number.
Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? It's one-to-one.
Can there be records in file 1 with no match in file 2, or vice versa? Yes.
If so, do you want to see the unmatched records? Yes I want both records.
Extended data
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
918888,airtel,karnataka
file2.txt
919167838888
919594998484
919212334323
Output Expected:
919167838888,hutch,mumbai
919594998484,idea,mumbai
919212334323,nomatch,nomatch
As I noted in a comment, there's a lot of unstated information needed to give a definitive answer. However, we can make some plausible guesses:
The common number is the first 6 digits of file 2 (we don't care about the trailing digits, but will simply copy them to the output).
The files are sorted in order.
If there are unmatched records in either file, those records will be ignored.
The tools of choice are probably sed and join:
sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt |
join -t, -o 1.2,2.2,2.3 - file1.txt
This edits file2.txt to create a comma-separated first field with the 6-digit phone number followed by all the rest of the line. The input is fed to the join command, which joins on the first column, and outputs the 'rest of the line' (column 2) from file2.txt and columns 2 and 3 from file1.txt.
If the phone numbers are variable length, then the matching operation is horribly complex. For that, I'd drop into Perl (or Python) to do the work. If the data is unsorted, it can be sorted before being fed into the commands. If you want unmatched records, you can specify how to handle those in the options to join.
The extra information needed is now available. The key information is the 6-digits is fixed — phew! Since you're on Linux, I'm assuming bash is available with 'process substitution':
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -a 2 -e 'no-match' - <(sort file1.txt)
If process substitution is not available, simply sort file1.txt in situ:
sort -o file1.txt file1.txt
Then use file1.txt in place of <(sort file1.txt).
I think the comment might be asking for inputs such as:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
902130,airtel,karnataka
file2.txt
919167000000
919594000000
919342313242
Output
no-match,airtel,karnataka
919167000000,hutch,mumbai
919342313242,no-match,no-match
919594000000,idea,mumbai
If that's not what the comment is about, please clarify by editing the question to add the extra data and output in a more readable format than comments allow.
Working with the extended data, this mildly modified command:
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -e 'no-match' - <(sort file1.txt)
produces the output:
919167838888,hutch,mumbai
919212334323,no-match,no-match
919594998484,idea,mumbai
which looks rather like a sorted version of the desired output. The -a n options control whether the unmatched records from file 1 or file 2 (or both) are printed; the -e option controls the value printed for the unmatched fields. All of this is readily available from the man pages for join, of course.
Here's one way using GNU awk. Run like:
awk -f script.awk file2.txt file1.txt
Contents of script.awk:
BEGIN {
    FS=OFS=","
}
FNR==NR {                           # first file (file2.txt): phone numbers
    sub(/[ \t]+$/, "")              # strip trailing whitespace
    line = substr($0, 1, 6)         # awk strings are 1-indexed: take the 6-digit series
    array[line]=$0
    next
}
{                                   # second file (file1.txt): series records
    print (($1 in array) ? $0 : "FILE1 no match --> " $0)
    dup[$1]++
}
END {
    for (i in array) {
        if (!(i in dup)) {
            printf "FILE2 no match --> %s\n", array[i]
        }
    }
}
Alternatively, here's the one-liner:
awk 'BEGIN { FS=OFS="," } FNR==NR { sub(/[ \t]+$/, ""); line = substr($0, 1, 6); array[line]=$0; next } { print (($1 in array) ? $0 : "FILE1 no match --> " $0); dup[$1]++ } END { for (i in array) if (!(i in dup)) printf "FILE2 no match --> %s\n", array[i] }' file2.txt file1.txt
awk -F, 'FNR==NR{a[$1]=$2","$3;next}{for(i in a){if(index($1,i)==1) print $1","a[i]}}' file1.txt file2.txt
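Since the series prefix is always six digits, the lookup can also be a single hash access per phone number instead of a loop over all prefixes. The following is a sketch under that assumption; the function name lookup_series is hypothetical, and it reports unmatched numbers from file2 only, not unmatched series from file1.

```shell
#!/bin/sh
# Hypothetical helper: lookup_series SERIES_FILE NUMBERS_FILE
# Prints each phone number followed by the operator and circle of its series.
lookup_series() {
    awk -F, '
        NR == FNR { info[$1] = $2 "," $3; next }    # series file: details keyed by prefix
        {
            key = substr($1, 1, 6)                  # first six digits of the number
            print $1 "," ((key in info) ? info[key] : "nomatch,nomatch")
        }' "$1" "$2"
}

# Demo with the extended data from the question.
tmp=$(mktemp -d)
printf '%s\n' 919167,hutch,mumbai 919594,idea,mumbai 918888,airtel,karnataka > "$tmp/file1.txt"
printf '%s\n' 919167838888 919594998484 919212334323 > "$tmp/file2.txt"
lookup_series "$tmp/file1.txt" "$tmp/file2.txt"
rm -rf "$tmp"
```

Neither file needs to be sorted, and the output keeps the order of file2.txt.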

extracting data from two list using a shell script

I am trying to create a shell script that pulls a line from a file and checks another file for an instance of the same. If it finds an entry then it adds it to another file, and it loops through the first list until it has gone through the whole file. The data in the first file looks like this -
email#address.com;
email2#address.com;
and so on
The other file in which I am looking for a match and placing the match in the blank file looks like this -
12334 email#address.com;
32213 email2#address.com;
I want it to retain the numbers as well as the matching data. I have an idea of how this should work but need to know how to implement it.
My Idea
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
FILE_DATA=( $( /bin/cat $file1))
FILE_DATA1=( $( /bin/cat $file2))
for I in $((${#FILE_DATA[@]}))
do
echo $FILE_DATA[$i] | grep $FILE_DATA1[$i] >> output.txt
done
I want the output to look like this but only for addresses that match -
12334 email#address.com;
32213 email2#address.com;
Thank You
This is quite like manipulating text using SQL:
$ cat file1
b#address.com
a#address.com
c#address.com
d#address.com
$ cat file2
10712 e#address.com
11457 b#address.com
19985 f#address.com
22519 d#address.com
$ join -1 1 -2 2 <(sort file1) <(sort -k2 file2) | awk '{print $2,$1}'
11457 b#address.com
22519 d#address.com
Sort both inputs on the key (we use the email addresses as keys here).
Join on the keys (file1 column 1, file2 column 2).
Format the output (awk swaps the two columns back).
Now that you've learned about diff and comm, it's time to learn about another tool in the unix toolbox, join.
Join does just what the name indicates: it joins two files together, based on keys embedded in the files.
The number one constraint on using join is that the data must be sorted in both files on the join column.
file1
a abc
b bcd
c cde
file2
a rec1
b rec2
c rec3
join file1 file2
a abc rec1
b bcd rec2
c cde rec3
You can consult the join man page for how to reduce and reorder the columns of output. For example:
$ join -o 1.1,2.2 file1 file2
a rec1
b rec2
c rec3
You can use your code for file name input to turn this into a generalizable script.
Your solution using a pipeline inside a for loop will work for small sets of data, but as the size of data grows, the cost of starting a new process for each word you are searching for will drag down the run time.
I hope this helps.
Read file1.txt line by line, assigning each line to the variable ADDR. grep file2.txt for the content of ADDR and append the output to file_result.txt.
while read -r ADDR; do grep -F -- "$ADDR" file2.txt >> file_result.txt; done < file1.txt
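The loop starts one grep process per address. As a sketch of a single-process alternative, grep can read all the addresses as fixed-string patterns from the first file with -f. The function name match_addresses is hypothetical; note that this matches substrings, so an address that is contained in a longer one would match too, and an empty line in the pattern file would match every line.

```shell
#!/bin/sh
# Hypothetical helper: match_addresses ADDRESS_FILE DATA_FILE
# Prints the lines of DATA_FILE that contain any line of ADDRESS_FILE.
match_addresses() {
    # -F : treat each pattern as a fixed string, not a regular expression
    # -f : read the patterns, one per line, from the first file
    grep -F -f "$1" "$2"
}

# Demo with sample data.
tmp=$(mktemp -d)
printf '%s\n' 'b#address.com' 'a#address.com' 'c#address.com' 'd#address.com' > "$tmp/file1"
printf '%s\n' '10712 e#address.com' '11457 b#address.com' \
    '19985 f#address.com' '22519 d#address.com' > "$tmp/file2"
match_addresses "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```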
This awk one-liner can help you do that -
awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
NR and FNR are awk built-in variables that store record (line) numbers. NR keeps counting across all input files, while FNR restarts for each new file. So NR==FNR is true only while the first file is being read, and during that time we store every address as a key of the array a. Once the first file is finished, we check the second column of the second file: if it is a key in the array, the entire line is written to f3.txt; otherwise it is ignored.
Using data from Kev's solution:
[jaypal:~/Temp] cat f1.txt
b#address.com
a#address.com
c#address.com
d#address.com
[jaypal:~/Temp] cat f2.txt
10712 e#address.com
11457 b#address.com
19985 f#address.com
22519 d#address.com
[jaypal:~/Temp] awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
[jaypal:~/Temp] cat f3.txt
11457 b#address.com
22519 d#address.com
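To see why NR==FNR selects only the first file, it can help to print both counters while awk reads two files. This is a small illustration; the function name show_counters and the sample files are assumptions made for the demo.

```shell
#!/bin/sh
# Hypothetical helper: print NR and FNR for every line of the given files.
show_counters() {
    awk '{ print NR, FNR }' "$@"
}

# Demo: two files with two lines each.
tmp=$(mktemp -d)
printf '%s\n' one two > "$tmp/f1.txt"
printf '%s\n' three four > "$tmp/f2.txt"
show_counters "$tmp/f1.txt" "$tmp/f2.txt"
rm -rf "$tmp"
```

The two numbers are equal on the lines of the first file only; from the first line of the second file onward NR keeps growing while FNR starts again at 1.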

Resources