awk Compare 2 files, print match and difference - text

I need to compare two files, f1.txt and f2.txt, and report both the matches and the non-matches. I want to match on the first field of both files. For a match, print the second field of f2.txt followed by the entire line of f1.txt. When no match is found in f2.txt, print "Not Found" followed by the entire line of f1.txt.
f1.txt
1;2;3;4;5;6;7;8
1a;2;3;4;5;6;7;8
1b;2;3;4;5;6;7;8
2b;2;3;4;5;6;7;8
f2.txt
1;First
1a;Firsta
1b;Firstb
Desired output:
First;1;2;3;4;5;6;7;8
Firsta;1a;2;3;4;5;6;7;8
Firstb;1b;2;3;4;5;6;7;8
Not Found;2b;2;3;4;5;6;7;8
I am able to obtain the matches but not the non-matches:
awk -F ";" -v OFS=";" 'NR==FNR{a[$1]=$2;next}a[$1]{print a[$1],$0}' f2.txt f1.txt
Thanks

This should do:
awk -F";" 'NR==FNR{a[$1]=$2;next}{if (a[$1])print a[$1],$0;else print "Not Found", $0;}' OFS=";" f2.txt f1.txt
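A variant of the same idea, as a minimal sketch: testing `$1 in a` instead of `a[$1]` means a key whose second field in f2.txt is empty or 0 still counts as a match, and the lookup does not create empty array entries. The sample files are recreated from the question; the scratch directory and the variable names `dir`, `out` and `tag` are assumptions added for illustration.

```shell
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '1;2;3;4;5;6;7;8' '1a;2;3;4;5;6;7;8' \
              '1b;2;3;4;5;6;7;8' '2b;2;3;4;5;6;7;8' > "$dir/f1.txt"
printf '%s\n' '1;First' '1a;Firsta' '1b;Firstb' > "$dir/f2.txt"

# First file (f2.txt): remember field 2 under the key in field 1.
# Second file (f1.txt): prefix each line with the stored value,
# or with "Not Found" when the key was never seen.
out=$(awk 'BEGIN { FS = OFS = ";" }
NR == FNR { a[$1] = $2; next }
{
  tag = ($1 in a) ? a[$1] : "Not Found"
  print tag, $0
}' "$dir/f2.txt" "$dir/f1.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```

Note that `NR==FNR` is also true for the second file if the first one is empty, so this sketch assumes f2.txt has at least one line.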

This was very useful.
I changed it slightly to compare two files that each have only one column.
awk 'BEGIN { OFS=FS=";" } FNR==NR { array[$1]=$1; next } { print ($1 in array ? array[$1] : "Not Found"), $0 }' file1 file2

Related

linux shell get multi file intersection

I have a few text files, for example 1.txt, 2.txt, 3.txt and 4.txt.
I want to get the intersection of their contents, that is, the lines common to all four files. So far I have:
cat 1.txt 2.txt | sort | uniq -c > tmp.txt
cat tmp.txt 3.txt | sort | uniq -c > tmp2.txt
and so on ....
Is there a better way?
input text
1.txt
1
2
3
4
2.txt
1
2
3
3.txt
1
2
4.txt
1
5
expected output:
1
With your shown samples, please try the following awk code.
1st solution: This handles the case where the same line may appear more than once within a single input file:
awk '
!arr2[FILENAME,$0]++{
arr1[$0]++
}
END{
for(i in arr1){
if(arr1[i]==(ARGC-1)){
print i
}
}
}
' *.txt
2nd solution: This assumes there are no duplicate lines within any single input file. If that is the case, try the following:
awk '
{
arr[$0]++
}
END{
for(i in arr){
if(arr[i]==(ARGC-1)){
print i
}
}
}
' *.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
arr[$0]++ ##Creating an array named arr with index of $0 and keep increasing its value.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through array arr here.
if(arr[i]==(ARGC-1)){ ##Checking condition if value of current item in arr is Equal to total number of files then print it.
print i
}
}
}
' *.txt ##Passing all .txt files as an input to awk program from here.
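To try the first solution on the sample input, here is a minimal sketch. The scratch directory, the array names `seen` and `count`, and the final `sort` are assumptions added for illustration; `for (line in count)` visits keys in an unspecified order, so `sort` keeps the output stable. The `ARGC - 1` test also assumes every operand is a file name, with no `var=value` assignments among them.

```shell
# Recreate the four sample files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 > "$dir/1.txt"
printf '%s\n' 1 2 3   > "$dir/2.txt"
printf '%s\n' 1 2     > "$dir/3.txt"
printf '%s\n' 1 5     > "$dir/4.txt"

out=$(awk '
!seen[FILENAME, $0]++ { count[$0]++ }   # count each line at most once per file
END {
  for (line in count)
    if (count[line] == ARGC - 1)        # seen in every file operand
      print line
}' "$dir"/*.txt | sort)

printf '%s\n' "$out"
rm -rf "$dir"
```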

How to get 1st field of a file only when 2nd field matches a string?

How to get 1st field of a file only when 2nd field matches a given string?
#cat temp.txt
Ankit pass
amit pass
aman fail
abhay pass
asha fail
ashu fail
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'*
gives no output
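Another syntax with awk:
awk '$2 ~ /^fail$/{print $1}' input_file
This drops the unnecessary cat command.
^ anchors the start of the string
$ anchors the end of the string
Anchoring both ends makes the regex match the whole field rather than a substring of it.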
Either:
Your fields are not tab-separated, or
You have blanks at the end of the relevant lines, or
You have DOS line endings, so there is a CR at the end of every line and therefore at the end of every $2 (see "Why does my tool output overwrite itself and how do I fix it?")
With GNU cat you can run cat -Tev temp.txt to see tabs (^I), CRs (^M) and line endings ($).
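If the file does turn out to have trailing blanks or CRs, one way to make the comparison tolerant of them is sketched below. The sample file is written with tabs and CRLF line endings purely as an assumption for illustration, to reproduce the failure described above.

```shell
dir=$(mktemp -d)
# Sample data with tab separators and DOS (CRLF) line endings.
printf 'Ankit\tpass\r\namit\tpass\r\naman\tfail\r\nabhay\tpass\r\nasha\tfail\r\nashu\tfail\r\n' > "$dir/temp.txt"

# Strip trailing blanks and CRs from the record, which also re-splits the
# fields, then compare. Default field splitting handles tabs and spaces alike.
out=$(awk '{ sub(/[ \t\r]+$/, "") } $2 == "fail" { print $1 }' "$dir/temp.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```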
Your code seems to work fine when I remove the * at the end
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'
The other thing to check is if your file is using tab or spaces. My copy/paste of your data file copied spaces, so I needed this line:
cat temp.txt | awk '$2 == "fail" { print $1 }'
The other way of doing this is with grep:
cat temp.txt | grep 'fail$' | awk '{ print $1 }'

extract data using sed or awk in linux

I am trying to merge data from 2 text files based on some condition.
I have two files:
1.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||0f8caa3ced5dc172901a427410d20540
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||f139d8b49085d012af9048bb1cba3534
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||d7ca4d78fc79a795695ae1c161ce82ea
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
2.txt
f139d8b49085d012af9048bb1cba3534: 12883 #: "#
d7ca4d78fc79a795695ae1c161ce82ea: 123422
0f8caa3ced5dc172901a427410d20540 :: demo
The first output should contain the matching lines from 1.txt, with the hash replaced by the corresponding value from 2.txt:
result.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||: demo
glen-666-||glen-666#hotmail.com||||84.196.42.167||||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
The second output should contain the non-matching lines from 1.txt:
left.txt
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
The script I am trying is :
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print c[j] > "left.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt > result.txt
But it gives me an empty result.txt.
I am having difficulty debugging the issue.
Any help would be highly appreciated.
Try the following awk (based entirely on your shown input files, and assuming 2.txt has no duplicates) and let me know if it helps.
awk 'FNR==NR{a[$NF]=$0;next} $1~/:/{sub(/:/,"",$1);flag=1} ($1 in a){val=$1;if($0 ~ /:/ && !flag){sub(/[^:]*/,"");sub(/:/,"")};print a[val] OFS $0 > "result.txt";flag="";delete a[val]} END{for(i in a){print a[i]>"left.txt"}}' FS="|" 1.txt FS=" " OFS="||o||" 2.txt
The output will be two files, result.txt and left.txt.
Here is a non-one-liner form of the solution with an explanation:
awk '
FNR==NR{ ##FNR and NR both are awk out of the box variables and they denote line numbers in Input_file(s), difference between them is FNR value will be RESET when it complete reading 1 Input_file and NR value will be keep increasing till it completes reading all the Input_file(s).
a[$NF]=$0; ##Creating an array named a whose index is $NF(value of last field of current line) and value is current line.
next ##next is awk out of the box keyword which will skip all further statements now.
}
$1~/:/{ ##Checking condition here if current lines 1st field has a colon in it then do following:
sub(/:/,"",$1); ##Using sub function of awk which will substitute colon with NULL of 1st field of current line of current Input_file.
flag=1 ##Setting a variable named flag here (to record that the 1st colon has already been removed, so there is no need for another colon removal).
}
($1 in a){ ##Checking a condition here if current line $1 is present in array a then do following:
val=$1; ##Setting variable named val value to $1 here.
if($0 ~ /:/ && !flag){ ##Checking condition here if current line is having colon and variable flag is NULL (not set) then do following:
sub(/[^:]*/,""); ##Substituting all the values from starting to till colon comes with NULL.
sub(/:/,"")}; ##Then substituting only 1 colon here.
print a[val] OFS $0 > "result.txt"; ##printing the value of array a whose index is variable val, then OFS(output field separator), then the current line, to output file named result.txt here.
flag=""; ##Unsetting the value of variable flag here.
delete a[val] ##Deleting the value of array a whose index is variable val here.
}
END{ ##Starting end section of this awk program here. which will be executed once all Input_file(s) have been read.
for(i in a){ ##Traversing through the array a now.
print a[i]>"left.txt"} ##Printing the value of array a(which will basically provide those values which are NOT matched in both files) in left.txt file.
}
' FS="|" 1.txt FS=" " OFS="||o||" 2.txt ##Setting FS="|" for 1.txt Input_file and then setting FS=" " and OFS="||o||" for 2.txt Input_file, 1.txt and 2.txt are Input_files for this program to run.
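The per-file FS assignments on the command line can be seen in isolation in a small sketch. The file names a.txt and b.txt and their contents are hypothetical; awk processes each `var=value` operand when it reaches it in the argument list, so the assignment only affects the files that come after it.

```shell
dir=$(mktemp -d)
printf '%s\n' 'x|y|z' > "$dir/a.txt"   # pipe-separated (hypothetical sample)
printf '%s\n' 'k:v'   > "$dir/b.txt"   # colon-separated (hypothetical sample)

# FS="|" takes effect before a.txt is read, FS=":" before b.txt is read.
out=$(awk '{ print FNR, $2 }' FS='|' "$dir/a.txt" FS=':' "$dir/b.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```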
This awk script may also help.
$ awk 'BEGIN{FS="|";OFS="|"}NR==FNR{data[$1]=$2;}
NR!=FNR{if($NF in data){
$NF=data[$NF];print >"result.txt"
}else{
print >"left.txt"}
}' <( sed 's/\s*:\s*/|/' 2.txt) 1.txt
Output
$ cat result.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||: demo
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
$ cat left.txt
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
We preprocess 2.txt with sed so that its field delimiter becomes |, and use process substitution to pass the result to awk as the first input file.

Matching files using awk in linux

I have 2 files:
1.txt:
e10adc3949ba59abbe56e057f20f883e
f8b46e989c5794eec4e268605b63eb59
e3ceb5881a0a1fdaad01296d7554868d
2.txt:
e10adc3949ba59abbe56e057f20f883e:1111
679ab793796da4cbd0dda3d0daf74ec1:1234
f8b46e989c5794eec4e268605b63eb59:1#/233:
I want 2 files as output:
One is result.txt which contains lines from 2.txt whose match is in 1.txt
and another is left.txt which contains lines from 1.txt whose match is not in 2.txt
Expected output of both files is below:
result.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d
I tried one or two approaches with awk but did not succeed. Any help would be highly appreciated.
My script:
awk '
FNR==NR{
val=$1;
sub(/[^:]*/,"");
sub(/:/,"");
a[val]=$0;
next
}
!($NF in a){
print > "left.txt";
next
}
{
print $1,$2,a[$NF]> "result.txt"
}
' FS=":" 2.txt FS=":" OFS=":" 1.txt
The following awk may help.
awk 'FNR==NR{a[$1]=$0;next} ($0 in a){print a[$0] > "results.txt";next} {print > "left.txt"}' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt
EDIT: Adding explanation of code too here.
awk '
FNR==NR{ ##FNR==NR condition will be TRUE when first Input_file is being read by awk. Where FNR and NR are the out of the box variables for awk.
a[$1]=$0; ##creating an array named a whose index is $1 and value is the whole current line ($0) from 2.txt Input_file.
next ##next is out of the box keyword from awk and will skip all further statements of awk.
}
($0 in a){ ##Checking here condition if current line of Input_file 1.txt is present in array named a then do following.
print a[$0] > "results.txt"; ##Printing the matching line stored from 2.txt into output file named results.txt, since current line of 1.txt is an index of array named a(which was created from 1st file).
next ##next is awk keyword which will skip further statements for awk code now.
}
{
print > "left.txt" ##Printing all lines which skip above condition(which means they did not come into array a) to output file named left.txt as per OP need.
}
' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt ##Setting FS(field separator) as colon for 2.txt and Setting FS to space for 1.txt here. yes, we could set multiple field separators for different Input_file(s).
How about this one:
awk 'BEGIN{ FS = ":" }NR==FNR{ a[$0]; next }$1 in a{ print $0 > "results.txt"; delete a[$1]; next }END{ for ( i in a ) print i > "left.txt" }' 1.txt 2.txt
Output:
results.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d
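A variant of the last approach, as a minimal sketch: instead of deleting a hash after its first match, it records matches in a second array, so repeated hashes in 2.txt are all written to the result file. The output paths are passed in with `-v`; the variable names `res` and `left`, the array names and the scratch directory are assumptions added for illustration.

```shell
dir=$(mktemp -d)
printf '%s\n' e10adc3949ba59abbe56e057f20f883e \
              f8b46e989c5794eec4e268605b63eb59 \
              e3ceb5881a0a1fdaad01296d7554868d > "$dir/1.txt"
printf '%s\n' 'e10adc3949ba59abbe56e057f20f883e:1111' \
              '679ab793796da4cbd0dda3d0daf74ec1:1234' \
              'f8b46e989c5794eec4e268605b63eb59:1#/233:' > "$dir/2.txt"

awk -F: -v res="$dir/result.txt" -v left="$dir/left.txt" '
NR == FNR  { want[$0]; next }                 # 1.txt: remember every hash
$1 in want { print > res; found[$1]; next }   # 2.txt: keep lines whose hash is wanted
END {
  for (h in want)
    if (!(h in found)) print h > left         # hashes that never appeared in 2.txt
}' "$dir/1.txt" "$dir/2.txt"

result=$(cat "$dir/result.txt")
leftover=$(cat "$dir/left.txt")
printf 'result.txt:\n%s\nleft.txt:\n%s\n' "$result" "$leftover"
rm -rf "$dir"
```

With more than one unmatched hash, the order of lines in left.txt is unspecified because it comes from a `for (h in want)` loop.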

formatting text using awk

Hi, I have the following text and I need to use awk or sed to split it into 3 separate columns:
11/13/14 101 HUDSON AUBONPAINJERSEY CITY NJ $4.15
11/22/14 MTAMVM*110TH ST/CATNEW YORK NY $19.05
11/22/14 DUANE READE #14226 0NEW YORK NY $1.26
So I would like to produce one file containing all the dates, another containing all the descriptions, and a third containing all the amounts.
I can use awk to print the first column with print $1, and then use the -F [$] option to print the last column, but I am not able to print just the middle column because it contains spaces. Can I ignore the spaces, or is there a better way of doing this?
Thanking you in advance
Try doing this :
$ awk '
{
print $1 > "dates"; $1=""
print $NF > "prices"; $NF=""
print $0 > "desc"
}
' file
or, assuming the three columns are separated by two or more spaces in the real file :
awk -F'  +' '
{
print $1 > "dates"
print $2 > "desc"
print $3 > "prices"
}
' file
Then :
$ cat dates
$ cat desc
$ cat prices
Wasn't fast enough to be the first to give an awk solution, so here's one with grep and sed...
grep -o '^.*/.*/1.' file #first col
sed 's/^.*\/.*\/1.//;s/\$.*//' file #middle col
grep -o '\$.*$' file #last col
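If the columns are separated by single spaces, as in the sample as shown, the description can still be recovered by trimming the first and last space-separated words from a copy of the line. This is a minimal sketch under that assumption (the date and the price contain no spaces); the scratch directory and the variable names `out`, `dates`, `desc` and `prices` are added for illustration.

```shell
dir=$(mktemp -d)
printf '%s\n' \
  '11/13/14 101 HUDSON AUBONPAINJERSEY CITY NJ $4.15' \
  '11/22/14 MTAMVM*110TH ST/CATNEW YORK NY $19.05' \
  '11/22/14 DUANE READE #14226 0NEW YORK NY $1.26' > "$dir/file"

awk -v out="$dir" '{
  desc = $0
  sub(/^[^ ]+ +/, "", desc)      # drop the leading date
  sub(/ +[^ ]+$/, "", desc)      # drop the trailing price
  print $1   > (out "/dates")
  print desc > (out "/desc")
  print $NF  > (out "/prices")
}' "$dir/file"

dates=$(cat "$dir/dates")
desc=$(cat "$dir/desc")
prices=$(cat "$dir/prices")
printf '%s\n' "$desc"
rm -rf "$dir"
```

Working on a copy of `$0` leaves the fields untouched, so the description keeps its original internal spacing and has no leading or trailing blanks.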

Resources