how to Merge 2 tables with awk - linux

First of all, sorry for my English and I know there's a lot of various topics regarding AWK but it's a very difficult function to me...
I would like to merge two tables using common columns with awk. The tables differ in the amount of rows. I have my first table that I want to modify and the second as a reference table. I would like to compare my colunme1.F1 with my column1.F2. When it matches, add the column2.F2 in my file1. But I need to keep all my lines in file1.
I give you an example:
File1
Num_id,Name,description1,description2,description3
?,atlanta_1,,,
RO_5,babeni_SW,,,
? ,Bib1,,,
RO_9,BoUba_456,,,
?,Castor,,,
File2
official_Num_id,official_Name
RO_1,America
RO_2,Andre
RO_3,Atlanta
RO_4,Axa
RO_5,Babeni
RO_6,Barba
RO_7,Bib
RO_8,Bilbao
RO_9,Bouba
RO_10,Castor
File3
Num_id,Name,description1,description2,description3,official_Name
?,atlanta_1,,,
RO_5,babeni_SW,,,Babeni
?,Bib1,,,
RO_9,BoUba_456,,,Bouba
?,Castor,,,
I read a lot of solution on Internet and it seems that awk could work ..
I tried awk 'NR==FNR {h[$1] = $2; next} {print $0,h[$1]}' $File1 $File2 > file3
But my command doesn't work, my File3 looks exactly that File1.
In a second time, I don't know if it's possible to compare my two second columns when names have difference like atlanta_1 and Atlanta and add the official_num_id and the official_name in my File1.
Any hero over there?

You had it, except for two small things. First you need to set your file separators to , and, second, reverse the order of your input files on the command line so that the reference file is processed first:
$ awk 'BEGIN {FS=OFS=","} NR==FNR {h[$1] = $2; next} {print $0,h[$1]}' File2 File1
Num_id,Name,description1,description2,description3,
?,atlanta_1,,,,
RO_5,babeni_SW,,,,Babeni
? ,Bib1,,,,
RO_9,BoUba_456,,,,Bouba
?,Castor,,,,

You can also use the join command for this:
join --header --nocheck-order -t, -1 1 -2 1 -a 1 file1 file2

To answer your question if it's possible to compare my two second columns when names have difference like atlanta_1 and Atlanta and add the official_num_id and the official_name in my File1:
$ awk '
BEGIN { FS=OFS="," }
NR==FNR { # file2
a[tolower($2)]=$0 # hash on lowercase city
next
}
{ # file1
split($2,b,"[^[:alpha:]]") # split on non-alphabet
print $0 (tolower(b[1]) in a?OFS a[tolower(b[1])]:"")
}' file2 file1
Num_id,Name,description1,description2,description3
?,atlanta_1,,,,RO_3,Atlanta
RO_5,babeni_SW,,,,RO_5,Babeni
? ,Bib1,,,,RO_7,Bib
RO_9,BoUba_456,,,,RO_9,Bouba
?,Castor,,,,RO_10,Castor
split will split Name field on non-alphabetic characters, ie _ in atlanta_1, 1 in Bib1 etc. so it might fail on cities with dashes etc., edit the pattern [^[:alpha:]] in split accordingly. Header doesn't match with those names, rethink the header names.

Related

compare two columns in two different files in shell script

there is a file1 as below:
21,2018042100
22,2018042101
87,2018042102
98,2018042103
there is file2 as below:
45,2018042100
86,2018042102
87,2018042103
what I need is: (file3)
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
in row #2 in file3, data for 2018042101 is exist in file1 but it is not exist in file2. So, 0 is inserted in column $3 which is belong to file2.
kindly please assist to find out how I can create a file like file3.
Thanks.
Join seems like made for that problem:
join -t',' -a 1 -a 2 -j 2 file1 file2
2018042100,21,45
2018042101,22
2018042102,87,86
2018042103,98,87
except for the missing ",0" in line 2, but maybe you find a solution in the manpage for that problem too. Else you may use sed to correct for that issue.
join -t',' -a 1 -a 2 -e "0" -j 2 file1 file2 | sed -r 's/^[^,]+,[^,]+$/&,0/'
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
Another using awk:
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$2]=$1;next}{print $2,$1,(a[$2]+0)}' file2 file1
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
Explained:
$ awk '
BEGIN {
FS=OFS="," # set field separators
}
NR==FNR { # process first file
a[$2]=$1 # hash value on date
next # process next record in first file
}
{ # process second file
print $2,$1,(a[$2]+0) # output date, value, value from first file if exists
}' file2 file1 # mind the file order
Notice, that (a[$2]+0) expects the first field value to be a number like in your example. All other values will produce 0.

Removing Fields from file keeping delimeter intact

We have a requirement to remove certain fields inside a delimited file using shell script, but we do not want to loose delimeter. i.e
$cat file1
col1,col2,col3,col4,col5
1234,4567,7890,9876,6754
abcd,efgh,ijkl,rtyt,cvbn
now we need to generate different file (say file2) out of file1 with second and fourth column removed with delimeter (,) intact
i.e
$cat file2
col1,,col3,,col5
1234,,7890,,6754
abcd,,ijkl,,cvbn
Please suggest what would be easiest and efficient way of achieving this, also as file is having around 300 fields/columns AWK is not working because of its limitation related to number of fields.
awk '{$2 = $4 = ""}1' FS=, OFS=, file1 > file2
awk 'BEGIN { FS = OFS = ","; } { $2 = $4 = ""; print }' file1 > file2
Which, simply saying:
sets input & output field separator to , at start,
empties fields 2 & 4 for each line, and then prints the line back.

how to read a file in awk command

I have two files that look like:
**file1.txt**
"a","1","11","111"
"b","2","22","222"
"c","3","33","333"
"d","4","44","444"
"e","5","55","555"
"f","6","66","666"
**file2.txt**
"b"
"d"
"a"
"c"
"e"
"f"
I need to create a script that changes the order of file1 and begin with the order of file2. e.g.:
"b","2","22","222"
"d","4","44","444"
"a","1","11","111"
"c","3","33","333"
"e","5","55","555"
"f","6","66","666"
I created a command that looks like:
nawk '/^("b")/' file1 ; nawk '/^("d")/' file1 ; nawk '/^("a")/' file1 ; nawk '/^("c")/' file1 ; nawk '/^("e")/' file1 ; nawk '/^("f")/' file1
It does the trick, however I would like to further automate it, but don't know how to proceed. How could I create a command or variable that would look at line 1 of file2("b") and put it the above command, then look at line 2 of file2("d"), and put it in the above command, and so on. Basically if possible, I would like the command to look at file 2 and fill in the blanks in the above command. Any other more convenient commands you guys can suggest would be appreciated. Note that I currently have to manually insert the letters from file 2 in the above command.
The actual file may contain well over 100 lines
awk -F, 'NR==FNR { a[$1]=$0; next }
($1 in a) { print a[$1] }' file1 file2
This reads all of file1 into memory, then prints in the order of file2. If file1 is very large, this may not be feasible.
This is a common Awk idiom; search the many near-duplicates if you need a more detailed explanation.

Scalable way of deleting all lines from a file where the line starts with one of many values

Given an input file of variable values (example):
A
B
D
What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:
A
B
C
D
Would end up being:
C
The input file is of the order of 100,000 variable values. The file to be mangled is of the order of several million lines.
awk '
NR==FNR { # IF this is the first file in the arg list THEN
list[$0] # store the contents of the current record as an index or array "list"
next # skip the rest of the script and so move on to the next input record
} # ENDIF
{ # This MUST be the second file in the arg list
for (i in list) # FOR each index "i" in array "list" DO
if (index($0,i) == 1) # IF "i" starts at the 1st char on the current record THEN
next # move on to the next input record
}
1 # Specify a true condition and so invoke the default action of printing the current record.
' file1 file2
An alternative approach to building up an array and then doing a string comparison on each element would be to build up a Regular Expression, e.g.:
...
list = list "|" $0
...
and then doing an RE comparison:
...
if ($0 ~ list)
next
...
but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:
awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
You can also achieve this using egrep:
egrep -vf <(sed 's/^/^/' file1) file2
Lets see it in action:
$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB
This would remove lines that start with one of the values in file1.
You can use comm to display the lines that are not common to both files, like this:
comm -3 file1 file2
Will print:
C
Notice that for this for this to work, both files have to be sorted, if they aren't sorted you can bypass that using
comm -3 <(sort file1) <(sort file2)

Comparing two files using awk and printing contains which are matching from other files

I have two files:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
file2.txt
919167000000
919594000000
Output
919167000000,hutch,mumbai
919594000000,idea,mumbai
How can I achieve this using AWK? I've got a huge file of phone numbers which needs to be compared like this. I believe Awk can handle it; if not please let me know how can I do this.
Extra definitions
Is the common part always a 6-digit number? Yes always 6.
Are the two files already sorted? file1 is not sorted. file2 can be sorted.
Are the trailing digits in file 2 always zeros? No, these are phone numbers this can vary, purpose of this is to get series information of the phone number.
Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? It's one-to-one.
Can there be records in file 1 with no match in file 2, or vice versa?_ Yes.
If so, do you want to see the unmatched records? Yes I want both records.
Extended data
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
918888,airtel,karnataka
file2.txt
919167838888
919594998484
919212334323
Output Expected:
919167838888,hutch,mumbai
919594998484,idea,mumbai
919212334323,nomatch,nomatch
As I noted in a comment, there's a lot of unstated information needed to give a definitive answer. However, we can make some plausible guesses:
The common number is the first 6 digits of file 2 (we don't care about the trailing digits, but will simply copy them to the output).
The files are sorted in order.
If there are unmatched records in either file, those records will be ignored.
The tools of choice are probably sed and join:
sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt |
join -t, -o 1.2,2.2,2.3 - file1.txt
This edits file2.txt to create a comma-separated first field with the 6-digit phone number followed by all the rest of the line. The input is fed to the join command, which joins on the first column, and outputs the 'rest of the line' (column 2) from file2.txt and columns 2 and 3 from file1.txt.
If the phone numbers are variable length, then the matching operation is horribly complex. For that, I'd drop into Perl (or Python) to do the work. If the data is unsorted, it can be sorted before being fed into the commands. If you want unmatched records, you can specify how to handle those in the options to join.
The extra information needed is now available. The key information is the 6-digits is fixed — phew! Since you're on Linux, I'm assuming bash is available with 'process substitution':
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -a 2 -e 'no-match' - <(sort file1.txt)
If process substitution is not available, simply sort file1.txt in situ:
sort -o file1.txt file1.txt
Then use file1.txt in place of <(sort file1.txt).
I think the comment might be asking for inputs such as:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
902130,airtel,karnataka
file2.txt
919167000000
919594000000
919342313242
Output
no-match,airtel,karnataka
919167000000,hutch,mumbai
919342313242,no-match,no-match
919594000000,idea,mumbai
If that's not what the comment is about, please clarify by editing the question to add the extra data and output in a more readable format than comments allow.
Working with the extended data, this mildly modified command:
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -e 'no-match' - <(sort file1.txt)
produces the output:
919167838888,hutch,mumbai
919212334323,no-match,no-match
919594998484,idea,mumbai
which looks rather like a sorted version of the desired output. The -a n options control whether the unmatched records from file 1 or file 2 (or both) are printed; the -e option controls the value printed for the unmatched fields. All of this is readily available from the man pages for join, of course.
Here's one way using GNU awk. Run like:
awk -f script.awk file2.txt file1.txt
Contents of script.awk:
BEGIN {
FS=OFS=","
}
FNR==NR {
sub(/[ \t]+$/, "")
line = substr($0, 0, 6)
array[line]=$0
next
}
{
printf ($1 in array) ? $0"\n" : "FILE1 no match --> "$0"\n"
dup[$1]++
}
END {
for (i in array) {
if (!(i in dup)) {
printf "FILE2 no match --> %s\n", array[i]
}
}
}
Alternatively, here's the one-liner:
awk 'BEGIN { FS=OFS="," } FNR==NR { sub(/[ \t]+$/, ""); line = substr($0, 0, 6); array[line]=$0; next } { printf ($1 in array) ? $0"\n" : "FILE1 no match --> "$0"\n"; dup[$1]++} END { for (i in array) if (!(i in dup)) printf "FILE2 no match --> %s\n", array[i] }' file2.txt file1.txt
awk -F, 'FNR==NR{a[$1]=$2","$3;next}{for(i in a){if($1~/i/) print $1","a[i]}}' your_file

Resources