Merge two files using AWK with conditions

I am new to bash scripting and need help with the question below. I parsed a log file to get the data below and am now stuck on the next part.
I have a file1.csv with this content:
mac-test-1,10.32.9.12,15
mac-test-2,10.32.9.13,10
mac-test-3,10.32.9.14,11
mac-test-4,10.32.9.15,13
and a second file, file2.csv, with this content:
mac-test-3,10.32.9.14
mac-test-4,10.32.9.15
I want to compare the two files and, if a line in the second file matches a line in the first file, change the content of file1.csv as below:
mac-test-1,10.32.9.12, 15, no match
mac-test-2,10.32.9.13, 10, no match
mac-test-3,10.32.9.14, 11, matched
mac-test-4,10.32.9.15, 13, matched
I tried this
awk -F "," 'NR==FNR{a[$1]; next} $1 in a {print $0",""matched"}' file2.csv file1.csv
but it prints the following and doesn't include the non-matching records:
mac-test-3,10.32.9.14,11,matched
mac-test-4,10.32.9.15,13,matched
Also, in some cases file2.csv can be empty, so the result should be like this:
mac-test-1,10.32.9.12,15, no match
mac-test-2,10.32.9.13,10, no match
mac-test-3,10.32.9.14,11, no match
mac-test-4,10.32.9.15,13, no match

With your shown samples, please try the following awk code. There is no need to test the condition first and then print: when you guard the block with $1 in a, lines that do not exist in the array never enter that block. It is better to print every line of file1.csv followed by its status, Matched or Not-matched, depending on whether it exists in the array.
awk '
BEGIN { FS=OFS="," }
FNR==NR{
arr[$0]
next
}
{
print $0,(($1 OFS $2) in arr)?"Matched":"Not-matched"
}
' file2.csv file1.csv
EDIT: Adding a solution that handles an empty file2.csv. The concept is the same as above; the only difference is the extra check for an empty file2.csv (with an empty first file, FNR==NR stays true while file1.csv is read, so nothing would be printed).
awk -v lines=$(wc -l < file2.csv) '
BEGIN { FS=OFS=","}
(lines==0){
print $0,"Not-matched"
next
}
FNR==NR{
arr[$0]
next
}
{
print $0,(($1 OFS $2) in arr)?"Matched":"Not-matched"
}
' file2.csv file1.csv
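To see why the guard matters, here is a small reproduction. It is only a sketch: the temporary directory, the shortened sample data and the variable names naive and guarded are assumptions made for illustration.

```shell
# Throwaway sample data in a temporary directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'mac-test-1,10.32.9.12,15' 'mac-test-2,10.32.9.13,10' > "$tmp/file1.csv"
: > "$tmp/file2.csv"    # empty lookup file

# Without a guard, FNR==NR is also true for file1.csv, so every line is
# stored in arr and skipped by next: nothing is printed.
naive=$(awk '
BEGIN { FS = OFS = "," }
FNR == NR { arr[$0]; next }
{ print $0, ((($1 OFS $2) in arr) ? "Matched" : "Not-matched") }
' "$tmp/file2.csv" "$tmp/file1.csv")

# With the line-count guard, every line of file1.csv is reported.
guarded=$(awk -v lines="$(wc -l < "$tmp/file2.csv")" '
BEGIN { FS = OFS = "," }
lines == 0 { print $0, "Not-matched"; next }
FNR == NR { arr[$0]; next }
{ print $0, ((($1 OFS $2) in arr) ? "Matched" : "Not-matched") }
' "$tmp/file2.csv" "$tmp/file1.csv")

printf 'naive: [%s]\n' "$naive"
printf '%s\n' "$guarded"
rm -rf "$tmp"
```

The first command prints nothing at all, while the second appends Not-matched to both sample lines.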

You are not printing the else case:
awk -F "," 'NR==FNR{a[$1]; next}
{
if ($1 in a) {
print $0 ",matched"
} else {
print $0 ",no match"
}
}' file2.csv file1.csv
Output
mac-test-1,10.32.9.12,15,no match
mac-test-2,10.32.9.13,10,no match
mac-test-3,10.32.9.14,11,matched
mac-test-4,10.32.9.15,13,matched
Or in short, without manually printing the comma but using OFS:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next}{ print $0 OFS (($1 in a)?"matched":"no match")}' file2.csv file1.csv
Edit
I found a solution on this page for handling FNR==NR when the first file is empty.
When file2.csv is empty, every output line will end in no match, for example:
mac-test-1,10.32.9.12,15,no match
Example
awk -F "," '
ARGV[1] == FILENAME{a[$1];next}
{
if ($1 in a) {
print $0 ",matched"
} else {
print $0 ",no match"
}
}' file2.csv file1.csv

Each of @RavinderSingh13's and @Thefourthbird's answers contains large parts of the solution, but here it is all together:
awk '
BEGIN { FS=OFS="," }
{ key = $1 FS $2 }
FILENAME == ARGV[1] {
arr[key]
next
}
{
print $0, ( key in arr ? "matched" : "no match")
}
' file2.csv file1.csv
or if you prefer:
awk '
BEGIN { FS=OFS="," }
{ key = $1 FS $2 }
!f {
arr[key]
next
}
{
print $0, ( key in arr ? "matched" : "no match")
}
' file2.csv f=1 file1.csv
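The f=1 placed between the two file names in the second variant is an awk command-line assignment: awk applies it only when it reaches that argument, so f is still unset while file2.csv is being read. A minimal sketch of that behaviour, using made-up files a.txt and b.txt in a temporary directory:

```shell
# a.txt and b.txt are hypothetical files created just for this demonstration.
tmp=$(mktemp -d)
printf 'x\n' > "$tmp/a.txt"
printf 'y\n' > "$tmp/b.txt"

# f is unset while a.txt is read and becomes 1 before b.txt is opened.
out=$(awk '{ which = f ? "second" : "first"; print which, $0 }' \
    "$tmp/a.txt" f=1 "$tmp/b.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Unlike FNR==NR, this test does not depend on how many lines the first file contains, which is why it also works when that file is empty.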

Related

Merge two files AWK

I have to merge two files and need help with:
File1.csv
mac-test-2,10.57.8.2,Compliant
mac-test-6,10.57.8.6,Compliant
mac-test-12,10.57.8.12,Compliant
mac-test-17,10.57.8.17,Noncompliant
File2.csv
mac-test-17,10.57.8.17,2022-10-21
After Merge the content should be
Merge.csv
mac-test-2,10.57.8.2,Compliant,NA
mac-test-6,10.57.8.6,Compliant,NA
mac-test-12,10.57.8.12,Compliant,NA
mac-test-17,10.57.8.17,Noncompliant,2022-10-21
So the logic is: if a record in File1.csv has no matching record in File2.csv, "NA" should be inserted; if there is a match, the date should be inserted in the fourth column.
I have written the following:
awk -F "," '
ARGV[1] == FILENAME{a[$1];next}
{
if ($1 in a) {
print $0 ","
} else {
print $0 ",NA"
}
}
' File2.csv File1.csv
But this prints:
mac-test-2,10.57.8.2,Compliant,NA
mac-test-6,10.57.8.6,Compliant,NA
mac-test-12,10.57.8.12,Compliant,NA
mac-test-17,10.57.8.17,Noncompliant,
I am not sure how I can print the date if it matches.

With your shown samples, please try the following awk code. Written and tested with your shown samples only.
awk '
BEGIN{ FS=OFS="," }
FNR==NR{
arr[$1]=$NF
next
}
{
print $0,($1 in arr?arr[$1]:"NA")
}
' File2.csv File1.csv
To handle an empty File2.csv, please try the following awk program.
awk '
BEGIN{ FS=OFS="," }
ARGV[1] == FILENAME{
arr[$1]=$NF
next
}
{
if ($1 in arr) {
print $0,arr[$1]
}
else{
print $0,"NA"
}
}' File2.csv File1.csv
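As a quick check of the lookup idea (remember the last field of File2.csv per host name, then append either that value or NA), here is a sketch run against shortened sample data. The temporary directory, the array name date and the variables merged and merged_empty are illustrative assumptions.

```shell
# Shortened copies of the question's samples in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'mac-test-2,10.57.8.2,Compliant' \
              'mac-test-17,10.57.8.17,Noncompliant' > "$tmp/File1.csv"
printf '%s\n' 'mac-test-17,10.57.8.17,2022-10-21' > "$tmp/File2.csv"
: > "$tmp/Empty.csv"    # stands in for an empty File2.csv

prog='
BEGIN { FS = OFS = "," }
FILENAME == ARGV[1] { date[$1] = $NF; next }   # remember the date per host
{ print $0, (($1 in date) ? date[$1] : "NA") } # append the date or NA
'

merged=$(awk "$prog" "$tmp/File2.csv" "$tmp/File1.csv")
merged_empty=$(awk "$prog" "$tmp/Empty.csv" "$tmp/File1.csv")

printf '%s\n' "$merged" "$merged_empty"
rm -rf "$tmp"
```

With the populated lookup file only mac-test-17 receives a date; with the empty one both lines end in NA.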

implementing Excel-vlookup-like function with awk

I have a question about vlookup function implementation with awk. I have a csv file having id-score pairs like this (say 1.csv):
id,score
1,16
3,12
5,13
11,8
13,32
17,37
23,74
29,7
31,70
41,83
There are "unscored" guys. I also have a CSV file that includes all registered guys, both scored and unscored, like this (say, 2.csv; transposed here to save space):
id,1,3,5,7,11,13,17,19,23,29,31,37,41
I would like to generate id-score pairs according to the second CSV file so that both scored and unscored guys are included. For unscored guys, NAN should be used in place of the score.
In other words, final result is desired to be like this:
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83
When I tried to create the new table with the following awk command, it did not work for me. Thanks in advance for any advice.
awk 'FNR==NR{a[$1]++; next} {print $0, (a[$1]) ? a[$2] : "NAN"}' 1.csv 2.csv

Here is your script with fixes: set the field separators, save the score value for each id, and print the value from the lookup, or NAN if it is missing.
$ awk 'BEGIN {FS=OFS=","}
FNR==NR {a[$1]=$2; next}
{print $1, (($1 in a)?a[$1]:"NAN")}' file1 file2
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83
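If 2.csv really is a single comma-separated row, as displayed in the question, one possible way to reuse the same lookup is to split the row with tr and hand it to awk on standard input through the - operand. This is a sketch under that assumption, with shortened sample data and an illustrative array name score:

```shell
# Shortened samples; 2.csv is kept as one comma-separated row.
tmp=$(mktemp -d)
printf '%s\n' 'id,score' '1,16' '3,12' '5,13' > "$tmp/1.csv"
printf '%s\n' 'id,1,3,5,7' > "$tmp/2.csv"

# tr puts one id per line; awk reads 1.csv first, then stdin ("-").
result=$(tr ',' '\n' < "$tmp/2.csv" | awk '
BEGIN { FS = OFS = "," }
FNR == NR { score[$1] = $2; next }                 # 1.csv: id -> score
{ print $1, (($1 in score) ? score[$1] : "NAN") }  # stdin: look each id up
' "$tmp/1.csv" -)

printf '%s\n' "$result"
rm -rf "$tmp"
```

The header survives because the pair id,score is stored like any other row, so the id line coming from 2.csv looks up score.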

With bash and join:
echo "id,score"
join --header -j 1 -t ',' <(sort 1.csv | grep -v '^id') <(tr ',' '\n' < 2.csv | grep -v '^id' | sort) -e "NAN" -a 2 -o 2.1,1.2 | sort -n
Output:
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83
See: man join

With awk, could you please try the following, written with the shown samples in GNU awk. It assumes (as in your shown samples) that both input files have a header on their first line.
awk -v counter=2 '
FNR==1{
next
}
FNR==NR{
a[FNR]=$0
b[FNR]=$1
next
}
{
if($0==b[counter]){
print a[counter]
counter++
}
else{
print $0",NAN"
}
}
' FS="," 1.csv <(tr ',' '\n' < 2.csv)
Explanation: Adding detailed explanation for above.
awk -v counter=2 ' ##Starting awk program from here and setting counter as 2.
FNR==1{ ##Checking condition if line is 1st then do following.
next ##next will skip all further statements from here.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when Input_file 1.csv is being read.
a[FNR]=$0 ##Creating array a with index FNR and value of current line.
b[FNR]=$1 ##Creating array b with index FNR and value of 1st field of current line.
next ##next will skip all further statements from here.
}
{
if($0==b[counter]){ ##Checking condition if current line is same as array b with index counter value then do following.
print a[counter] ##Printing array a with index of counter here.
counter++ ##Increasing count of counter by 1 each time cursor comes here.
}
else{ ##Else part of for above if condition starts here.
print $0",NAN" ##Printing current line and NAN here.
}
}
' FS="," 1.csv <(tr ',' '\n' < 2.csv) ##Setting FS as , for Input_file 1.csv and sending 2.csv output by changing comma to new line to awk.

An awk solution could be:
awk -v FS=, -v OFS=, '
NR == 1 { print; next }
NR == FNR { score[$1] = $2; next }
{ for (i = 2; i <= NF; ++i)
print $i, score[$i] == "" ? "NAN" : score[$i] }
' 1.csv 2.csv

extract data using sed or awk in linux

I am trying to merge data from 2 text files based on some condition.
I have two files:
1.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||0f8caa3ced5dc172901a427410d20540
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||f139d8b49085d012af9048bb1cba3534
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||d7ca4d78fc79a795695ae1c161ce82ea
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
2.txt
f139d8b49085d012af9048bb1cba3534: 12883 #: "#
d7ca4d78fc79a795695ae1c161ce82ea: 123422
0f8caa3ced5dc172901a427410d20540 :: demo
The first output file should contain the matching lines from 1.txt, with the hash replaced by the corresponding value from 2.txt:
result.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||: demo
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
The second output file should contain the non-matching lines from 1.txt:
left.txt
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
The script I am trying is :
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print c[j] > "left.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt > result.txt
But it gives me an empty result.txt.
I am having difficulty debugging the issue.
Any help would be highly appreciated.

Try the following awk (based entirely on your shown input files, and assuming that 2.txt contains no duplicates) and let me know if this helps.
awk 'FNR==NR{a[$NF]=$0;next} $1~/:/{sub(/:/,"",$1);flag=1} ($1 in a){val=$1;if($0 ~ /:/ && !flag){sub(/[^:]*/,"");sub(/:/,"")};print a[val] OFS $0 > "result.txt";flag="";delete a[val]} END{for(i in a){print a[i]>"left.txt"}}' FS="|" 1.txt FS=" " OFS="||o||" 2.txt
The output will be two files named result.txt and left.txt.
Here is a non-one-liner form of the solution, with an explanation:
awk '
FNR==NR{ ##FNR and NR both are awk out of the box variables and they denote line numbers in Input_file(s), difference between them is FNR value will be RESET when it complete reading 1 Input_file and NR value will be keep increasing till it completes reading all the Input_file(s).
a[$NF]=$0; ##Creating an array named a whose index is $NF(value of last field of current line) and value is current line.
next ##next is awk out of the box keyword which will skip all further statements now.
}
$1~/:/{ ##Checking condition here if current lines 1st field has a colon in it then do following:
sub(/:/,"",$1); ##Using sub function of awk which will substitute colon with NULL of 1st field of current line of current Input_file.
flag=1 ##Setting a variable named flag here (basically to record that the 1st colon has already been substituted, so there is no need for another colon removal).
}
($1 in a){ ##Checking a condition here if current line $1 is present in array a then do following:
val=$1; ##Setting variable named val value to $1 here.
if($0 ~ /:/ && !flag){ ##Checking condition here if current line is having colon and variable flag is NULL (not set) then do following:
sub(/[^:]*/,""); ##Substituting all the values from starting to till colon comes with NULL.
sub(/:/,"")}; ##Then substituting only 1 colon here.
print a[val] OFS $0 > "result.txt"; ##printing the value of array a whose index is variable val, then OFS(output field separator), then current line values to output file named result.txt here.
flag=""; ##Unsetting the value of variable flag here.
delete a[val] ##Deleting the value of array a whose index is variable val here.
}
END{ ##Starting end section of this awk program here. which will be executed once all Input_file(s) have been read.
for(i in a){ ##Traversing through the array a now.
print a[i]>"left.txt"} ##Printing the value of array a(which will basically provide those values which are NOT matched in both files) in left.txt file.
}
' FS="|" 1.txt FS=" " OFS="||o||" 2.txt ##Setting FS="|" for 1.txt Input_file and then setting FS=" " and OFS="||o||" for 2.txt Input_file, 1.txt and 2.txt are Input_files for this program to run.

This awk script may also help.
$ awk 'BEGIN{FS=OFS="|"}NR==FNR{data[$1]=$2;}
NR!=FNR{if($NF in data){
$NF=data[$NF];print >"result.txt"
}else{
print >"left.txt"}
}' <( sed 's/\s*:\s*/|/' 2.txt) 1.txt 2>/dev/null
Output
$ cat result.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||: demo
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
$ cat left.txt
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
We preprocessed 2.txt (the first input given to awk) with sed so that its field delimiter is |, and used process substitution to pass the result to awk.
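The effect of that sed step can be checked in isolation. The sketch below applies the same substitution to the three sample lines of 2.txt; as an assumption for portability it uses the POSIX class [[:space:]] instead of the GNU-specific \s, and the temporary directory and the variable name converted are illustrative.

```shell
# The three sample lines of 2.txt, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'f139d8b49085d012af9048bb1cba3534: 12883 #: "#' \
              'd7ca4d78fc79a795695ae1c161ce82ea: 123422' \
              '0f8caa3ced5dc172901a427410d20540 :: demo' > "$tmp/2.txt"

# Replace only the first colon (plus surrounding blanks) on each line with |.
converted=$(sed 's/[[:space:]]*:[[:space:]]*/|/' "$tmp/2.txt")

printf '%s\n' "$converted"
rm -rf "$tmp"
```

Only the first colon on each line becomes a delimiter, so later colons in the value (such as the one in ": demo") are preserved.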

Matching files using awk in linux

I have 2 files:
1.txt:
e10adc3949ba59abbe56e057f20f883e
f8b46e989c5794eec4e268605b63eb59
e3ceb5881a0a1fdaad01296d7554868d
2.txt:
e10adc3949ba59abbe56e057f20f883e:1111
679ab793796da4cbd0dda3d0daf74ec1:1234
f8b46e989c5794eec4e268605b63eb59:1#/233:
I want 2 files as output:
One is result.txt which contains lines from 2.txt whose match is in 1.txt
and another is left.txt which contains lines from 1.txt whose match is not in 2.txt
Expected output of both files is below:
result.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d
I tried one or two approaches with awk but did not succeed. Any help would be highly appreciated.
My script:
awk '
FNR==NR{
val=$1;
sub(/[^:]*/,"");
sub(/:/,"");
a[val]=$0;
next
}
!($NF in a){
print > "left.txt";
next
}
{
print $1,$2,a[$NF]> "result.txt"
}
' FS=":" 2.txt FS=":" OFS=":" 1.txt

The following awk may help you with this.
awk 'FNR==NR{a[$1]=$0;next} ($0 in a){print a[$0] > "results.txt";next} {print > "left.txt"}' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt
EDIT: Adding explanation of code too here.
awk '
FNR==NR{ ##FNR==NR condition will be TRUE when first Input_file is being read by awk. Where FNR and NR are the out of the box variables for awk.
a[$1]=$0; ##creating an array named a whose index is $1 and value is the whole current line from 2.txt Input_file.
next ##next is out of the box keyword from awk and will skip all further statements of awk.
}
($0 in a){ ##Checking here condition if current line of Input_file 1.txt is present in array named a then do following.
print a[$0] > "results.txt"; ##Printing the matching line from 2.txt (stored in array a, which was created from the 1st file) into output file named results.txt.
next ##next is awk keyword which will skip further statements for awk code now.
}
{
print > "left.txt" ##Printing all lines which skip above condition(which means they did not come into array a) to output file named left.txt as per OP need.
}
' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt ##Setting FS(field separator) as colon for 2.txt and Setting FS to space for 1.txt here. yes, we could set multiple field separators for different Input_file(s).

How about this one:
awk 'BEGIN{ FS = ":" }NR==FNR{ a[$0]; next }$1 in a{ print $0 > "results.txt"; delete a[$1]; next }END{ for ( i in a ) print i > "left.txt" }' 1.txt 2.txt
Output:
results.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d
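A self-contained way to try this split is sketched below. It follows the same store, delete and END-loop idea, but as an assumption for the demonstration the output paths are passed in through the hypothetical variables hit and miss so that everything stays inside a temporary directory.

```shell
# Shortened samples in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'e10adc3949ba59abbe56e057f20f883e' \
              'e3ceb5881a0a1fdaad01296d7554868d' > "$tmp/1.txt"
printf '%s\n' 'e10adc3949ba59abbe56e057f20f883e:1111' \
              '679ab793796da4cbd0dda3d0daf74ec1:1234' > "$tmp/2.txt"

awk -v hit="$tmp/results.txt" -v miss="$tmp/left.txt" '
BEGIN { FS = ":" }
NR == FNR  { want[$0]; next }                        # hashes from 1.txt
$1 in want { print > hit; delete want[$1]; next }    # matching line of 2.txt
END        { for (h in want) print h > miss }        # hashes never matched
' "$tmp/1.txt" "$tmp/2.txt"

matched=$(cat "$tmp/results.txt")
unmatched=$(cat "$tmp/left.txt")
printf 'matched: %s\nunmatched: %s\n' "$matched" "$unmatched"
rm -rf "$tmp"
```

Note that for (h in want) visits keys in an unspecified order, so left.txt is not guaranteed to keep the order of 1.txt when several hashes are unmatched.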

Comparing two CSV files in linux

I have two CSV files with me in the following format:
File1:
No.1, No.2
983264,72342349
763498,81243970
736493,83740940
File2:
No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"
I need to compare the two files and output the matched and unmatched values.
I did it through awk:
#!/bin/bash
awk 'BEGIN {
FS = OFS = ","
}
if (FNR==1){next}
NR>1 && NR==FNR {
a[$1];
next
}
FNR>1 {
print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
delete a[$1]
}
END {
for (x in a) {
print x FS "In file1 but not in file2"
}
}'file1 file2
But the output is:
"7938493",In file2 but not in file1
"2153187",In file2 but not in file1
"8172470",In file2 but not in file1
7938493,In file1 but not in file2
2153187,In file1 but not in file2
8172470,In file1 but not in file2
Can you please tell me where I am going wrong?

Here are some corrections to your script:
BEGIN {
# FS = OFS = ","
FS = "[,\"]+"
OFS = ", "
}
# if (FNR==1){next}
FNR == 1 {next}
# NR>1 && NR==FNR {
NR==FNR {
a[$1];
next
}
# FNR>1 {
$2 in a {
# print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
print ($2 in a) ? $2 OFS "Match" : $2 OFS "In file2 but not in file1"
delete a[$2]
}
END {
for (x in a) {
print x, "In file1 but not in file2"
}
}
This is an awk script, so you can run it like awk -f script.awk file1 file2. Doing so gives these results:
$ awk -f script.awk file1 file2
736493, Match
763498, In file1 but not in file2
983264, In file1 but not in file2
The main problem with your script was that it didn't correctly handle the double quotes around the numbers in file2. I changed the input field separator so that the double quotes are treated as part of the separator to deal with this. As a result, the first field $1 in the second file is empty (it is the bit between the start of the line and the first "), so you need to use $2 to refer to the first value you're interested in. Aside from that, I removed some redundant conditions from your other blocks and used OFS rather than FS in your first print statement.
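The shift from $1 to $2 can be seen on a single quoted line. This is a minimal sketch using one sample row from file2; the variable names line and fields are illustrative.

```shell
# One quoted row, as it appears in file2.
line='"736493","83740940"'

# With commas and double quotes treated as separators, the leading quote
# produces an empty $1, and the two numbers land in $2 and $3.
fields=$(printf '%s\n' "$line" | awk -F'[,"]+' '{ printf "[%s][%s][%s]\n", $1, $2, $3 }')

printf '%s\n' "$fields"
```

The brackets make the empty first field visible, which is why the corrected script keys on $2 when reading file2.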
