Fuzzy String Matching using Python - python-3.x

I have a training dataset, for example:
Letter Word
A Apple
B Bat
C Cat
D Dog
E Elephant
and I need to check a dataframe such as:
AD Apple Dog
AE Applet Elephant
DC Dog Cow
EB Elephant Bag
AED Apple Elephant Dog
D Door
ABC All Bat Cat
The instances AD, AE and EB are almost accurate (Apple and Applet are considered close to each other, and similarly Bat and Bag), but DC doesn't match.
Output Required:
Letters Words Status
AD Apple Dog Accept
AE Applet Elephant Accept
DC Dog Cow Reject
EB Elephant Bag Accept
AED Apple Elephant Dog Accept
D Door Reject
ABC All Bat Cat Accept
ABC is accepted because 2 of 3 words match.
Accepted words need to match at 70% or better (fuzzy match), but the threshold is subject to change.
How can I find these matches using Python?

You can use thefuzz to solve your problem:
# Python env: pip install thefuzz
# Conda env: conda install thefuzz
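import numpy as np
from thefuzz import fuzz
# df1: training data (columns Letter, Word); df2: data to check (columns Letters, Words)
THRESHOLD = 70
# Split each Letters string into single letters, look up each word in df1, and rejoin per row
df2['Others'] = (df2['Letters'].agg(list).explode().reset_index()
.merge(df1, left_on='Letters', right_on='Letter')
.groupby('index')['Word'].agg(' '.join))
# Compare the whole given phrase with the whole expected phrase
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')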
Output:
>>> df2
Letters Words Others Ratio Status
0 AD Apple Dog Apple Dog 100 Accept
1 AE Applet Elephant Apple Elephant 97 Accept
2 DC Dog Cow Dog Cat 71 Accept
3 EB Elephant Bag Elephant Bat 92 Accept
4 AED Apple Elephant Dog Apple Dog Elephant 78 Accept
5 D Door Dog 57 Reject
6 ABC All Bat Cat Apple Cat Bat 67 Reject
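Note that the whole-phrase ratio above accepts DC and rejects ABC, which differs from the required output in the question. A per-word variant might get closer. The following is only a sketch under assumptions: it uses the standard library's difflib instead of thefuzz, the names REFERENCE, similarity, matched_fraction and status are made up for illustration, and the rule "accept when more than half of the expected words match" is my reading of "2 of 3 words match", not something the question states exactly.

```python
from difflib import SequenceMatcher

# Reference table from the question (letter -> expected word)
REFERENCE = {"A": "Apple", "B": "Bat", "C": "Cat", "D": "Dog", "E": "Elephant"}

def similarity(a, b):
    """Similarity of two words on a 0-100 scale, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def matched_fraction(letters, words, threshold=70):
    """Fraction of the expected words that have a close match among `words`."""
    expected = [REFERENCE[ch] for ch in letters]
    given = words.split()
    hits = sum(
        1 for ref in expected
        if any(similarity(ref, w) >= threshold for w in given)
    )
    return hits / len(expected)

def status(letters, words, threshold=70, min_fraction=0.5):
    """Assumed rule: accept when more than `min_fraction` of the words match."""
    fraction = matched_fraction(letters, words, threshold)
    return "Accept" if fraction > min_fraction else "Reject"
```

With this measure, Bat and Bag share two of three characters and score about 67, just under 70, so EB is only accepted if the threshold is lowered slightly, for example status("EB", "Elephant Bag", threshold=65).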

Related

How do we find how many different samples are possible?

We have 8 groups of cats:
Group A: 18 cats eat food type A.
Group B: 3 cats eat food type B.
Group C: 4 cats eat food type C.
Group D: 2 cats eat food type D.
Group E: 4 cats eat food type E.
Group F: 2 cats eat food type F.
Group G: 13 cats eat food type G.
Group H: 4 cats eat food type H.
A random sample of 8 cats is taken.
a) How many different samples are possible?
b) How many different samples of size 8 are possible subject to the constraint that no 2 cats may have the same food type?
Would you help me with a and b?
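One way to approach both parts is sketched below. It assumes the cats are distinguishable and that a sample is an unordered set of 8 cats drawn without replacement; the question does not say so explicitly. Under that assumption, part a is a plain combination over all cats, and part b forces exactly one cat from each of the 8 groups, so the group sizes are multiplied. The variable names are illustrative only.

```python
from math import comb, prod

# Group sizes taken from the question (food type -> number of cats)
group_sizes = {"A": 18, "B": 3, "C": 4, "D": 2, "E": 4, "F": 2, "G": 13, "H": 4}

total_cats = sum(group_sizes.values())

# a) Unordered samples of 8 cats out of all cats
samples_any = comb(total_cats, 8)

# b) With 8 cats and 8 food types, "no two cats share a food type"
#    means one cat per group, so multiply the group sizes
samples_distinct_food = prod(group_sizes.values())

print(total_cats, samples_any, samples_distinct_food)
```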

Understanding the sed N command

The sed manual states about the N command:
N
Add a newline to the pattern space, then append the next line of input to the pattern space. If there is no more input then sed exits without processing any more commands.
Now, from what I know, sed reads each line of input, applies the script(s) on it, and (if -n is not specified) prints it as the output stream.
So, given the following example file:
Apple
Banana
Grapes
Melon
Running this:
$ sed '/Banana/N;p'
From what I understand, sed should process each line of input: Apple, Banana, Grapes and Melon.
So, I would think that the output will be:
Apple
Apple
Banana
Grapes # since it read the next line with N
Banana
Grapes
Grapes (!)
Grapes (!)
Melon
Melon
Explanation:
Apple is read into the pattern space. It doesn't match the Banana regex, so only p is applied. It's printed twice: once for the p command, and once because sed prints the pattern space by default.
Next, Banana is read into the pattern space. It matches the regex, so the N command is applied: it appends the next line, Grapes, to the pattern space, and then p prints it: Banana\nGrapes. Next, the pattern space is printed again due to the default behavior.
Now, I would expect that Grapes will be read to the pattern space, so that Grapes will be printed twice, same as for Apple and Melon.
But in reality, this is what I get:
Apple
Apple
Banana
Grapes
Banana
Grapes
Melon
Melon
It seems that once Grapes was read as part of the N command that was applied to Banana, it will no longer be read as a line of its own.
Is that so? And if so, why isn't it emphasized in the docs?
This might explain it (GNU sed):
sed '/Banana/N;p' file --d
SED PROGRAM:
/Banana/ N
p
INPUT: 'file' line 1
PATTERN: Apple
COMMAND: /Banana/ N
COMMAND: p
Apple
END-OF-CYCLE:
Apple
INPUT: 'file' line 2
PATTERN: Banana
COMMAND: /Banana/ N
PATTERN: Banana\nGrapes
COMMAND: p
Banana
Grapes
END-OF-CYCLE:
Banana
Grapes
INPUT: 'file' line 4
PATTERN: Melon
COMMAND: /Banana/ N
COMMAND: p
Melon
END-OF-CYCLE:
Melon
Here --d is short for --debug.
You will see the INPUT: lines go 1,2,4 because the second cycle also grabs input line 3 with the N command.
To debug this, I added = to your script so that you can see what's being emitted at each iteration. The line numbers conveniently demarcate the output from each. Then to identify the default print action at the end of each iteration, I added s/.*/==&==/ so you can see what was printed by sed because you did not specify -n.
sed '/Banana/N;=;p;=;s/.*/==&==/' <<\:
> Apple
> Banana
> Grapes
> Melon
> :
1
Apple
1
==Apple==
3
Banana
Grapes
3
==Banana
Grapes==
4
Melon
4
==Melon==
So, the pattern space containing Banana and Grapes was printed twice, and the first and last lines were printed in isolation twice.
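The behavior shown in both answers can be mimicked with a small Python model of the sed cycle. This is only a sketch of the specific script /Banana/N;p run without -n, not a general sed implementation, and the function name simulate_banana_n_p is made up. The key point it illustrates is that each cycle and each N pull from the same input stream, so a line consumed by N never starts a cycle of its own.

```python
import re

def simulate_banana_n_p(lines, pattern="Banana"):
    """Model `sed '/Banana/N;p'` (no -n) and return the output lines."""
    out = []
    stream = iter(lines)                 # one shared input stream
    for line in stream:                  # each cycle reads ONE new line
        pattern_space = line
        if re.search(pattern, pattern_space):
            nxt = next(stream, None)     # N consumes the next input line
            if nxt is None:
                # No more input: sed exits without running further commands.
                # GNU sed prints the pattern space first (assumed here).
                out.append(pattern_space)
                break
            pattern_space += "\n" + nxt
        out.append(pattern_space)        # the explicit p command
        out.append(pattern_space)        # autoprint at end of cycle
    return "\n".join(out).split("\n")

result = simulate_banana_n_p(["Apple", "Banana", "Grapes", "Melon"])
print("\n".join(result))
```

Because Grapes is taken from the stream inside the Banana cycle, the next cycle starts with Melon, which matches the INPUT lines 1, 2, 4 in the debug output.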

how to separate Unit/Suite/APT/# from an address in Excel

I have a database of 272,000 addresses, but some addresses have a unit/suite/STE/APT part. See the examples below:
16 BRIARWOOD COURT UNIT B MONTVALE, NJ 07645
100 CROWN COURT #471 EDGEWATER, NJ 07020
23-05 HIGH ST APT A FAIR LAWN, NJ 07410
15-01 BROADWAY STE 6 FAIR LAWN, NJ 07410
80 BROADWAY, SUITE 1A CRESSKILL, N.J. 07626
300 GORGE ROAD APT 11 CLIFFSIDE PARK, N.J. 07010
I would like to split the text into the next column when it comes across unit/suite/STE/APT.
I want to separate these so I can use Advanced Filter with unique records and create a master find-and-replace to clean the list.
Any formulas I can use for this would be helpful.
You can batch geocode your file on geocoder.ca
This is the result I got:
rawlocation Latitude Longitude Score StandardCivicNumber StandardAddress StandardCity StandardStateorProvinceAbbrv PostalZip Confidence
16 BRIARWOOD COURT UNIT B MONTVALE NJ 07645 41.035587 -74.06744 1 16 Briarwood Crt Montvale NJ 7677 0.7
100 CROWN COURT #471 EDGEWATER NJ 07020 40.822893 -73.978375 1 100 Crown Crt Edgewater NJ 07020-1137 0.8
23-05 HIGH ST APT A FAIR LAWN NJ 07410 40.940276 -74.120329 1 23 High St Fair Lawn NJ 07410-3574 0.8
15-01 BROADWAY STE 6 FAIR LAWN NJ 07410 40.920501 -74.091153 1 1 S Broadway Fair Lawn NJ 07410-5529 0.8
80 BROADWAY - 0
300 GORGE ROAD APT 11 CLIFFSIDE PARK N.J. 07010 40.814151 -73.990015 1 300 Gorge Rd Cliffside Park NJ 07010-2759 0.8
From the cleaned up version you can then street compare to extract additional entities.
Since not all addresses have a secondary number (such as APT C, or STE 312), I would recommend separating every time you come across a ZIP (5 digits) or a ZIP+4 (like 07010-2759). This will help you break that string into discrete addresses.
If you then want to clean up the list by correcting small typos and standardizing abbreviations, etc, I recommend using an address validation and standardization service like Melissa Data, or SmartyStreets. SmartyStreets has tools for validating/cleansing large lists of addresses and even extracting addresses out of text. (Full disclosure) I'm a software developer for SmartyStreets.
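If the list can be processed outside Excel, the split asked for in the question could be sketched in Python with a regular expression. This is not an Excel formula, and it makes assumptions: the only designators handled are the ones named in the question (UNIT, SUITE, STE, APT) plus "#", matching is case-insensitive, and the split happens at the first designator found. The names UNIT_RE and split_at_unit are illustrative.

```python
import re

# Assumed designators: the four named in the question plus "#".
# \b keeps "STE" or "APT" from matching inside longer words.
UNIT_RE = re.compile(r",?\s*(?:\b(?:UNIT|SUITE|STE|APT)\b|#)", re.IGNORECASE)

def split_at_unit(address):
    """Return (street part, rest starting at the unit designator)."""
    m = UNIT_RE.search(address)
    if m is None:
        return address, ""               # no secondary unit found
    head = address[:m.start()].rstrip()
    tail = address[m.start():].lstrip(", ")
    return head, tail

for raw in [
    "16 BRIARWOOD COURT UNIT B MONTVALE, NJ 07645",
    "100 CROWN COURT #471 EDGEWATER, NJ 07020",
    "80 BROADWAY, SUITE 1A CRESSKILL, N.J. 07626",
]:
    print(split_at_unit(raw))
```

The two returned parts correspond to the two columns described in the question; further splitting the unit from the city would need a rule for where the unit value ends.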

Using Spark to merge two or more files content and manipulate the content

Can I use Spark to do the following?
I have three files to merge and change the contents:
First File called column_header.tsv with this content:
first_name last_name address zip_code browser_type
Second file called data_file.tsv with this content:
John Doe 111 New Drive, Ca 11111 34
Mary Doe 133 Creator Blvd, NY 44499 40
Mike Coder 13 Jumping Street UT 66499 28
Third file called browser_type.tsv with content:
34 Chrome
40 Safari
28 FireFox
The final_output.tsv file after Spark processing the above should have this contents:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Is this doable using Spark? I will also consider sed or awk if those tools can do it. I know the above is possible with Python, but I would prefer using Spark for the data manipulation. Any suggestions? Thanks in advance.
Here it is in awk, just in case. Notice the file order:
$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' file3 file1 file2
Output:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Explained:
NR==FNR { # process browser_type file
a[$1]=$2 # map browser id (field 1) to browser name (field 2)
next } # skip to the next record
{ # process the other files
$NF=( $NF in a ? a[$NF] : $NF) } # replace last field with browser from a
1 # implicit print
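For readers less familiar with awk, the same lookup-and-replace logic can be written out in plain Python. This is a sketch that assumes the files are tab-separated and, to stay self-contained, works on in-memory strings instead of reading the three files; the function name replace_browser_ids is made up.

```python
import csv
import io

def replace_browser_ids(header_tsv, data_tsv, browsers_tsv):
    """Swap the last field of each row for its browser name, if known."""
    # First pass (like NR==FNR): build the id -> name lookup
    lookup = {
        row[0]: row[1]
        for row in csv.reader(io.StringIO(browsers_tsv), delimiter="\t")
        if row
    }
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Second pass: header first, then data, as in the awk file order
    for text in (header_tsv, data_tsv):
        for row in csv.reader(io.StringIO(text), delimiter="\t"):
            if row:
                row[-1] = lookup.get(row[-1], row[-1])  # keep unknown values
                writer.writerow(row)
    return out.getvalue()

header = "first_name\tlast_name\taddress\tzip_code\tbrowser_type\n"
data = "John\tDoe\t111 New Drive, Ca\t11111\t34\nMary\tDoe\t133 Creator Blvd, NY\t44499\t40\n"
browsers = "34\tChrome\n40\tSafari\n28\tFireFox\n"
merged = replace_browser_ids(header, data, browsers)
print(merged)
```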
It is possible. Read the header:
with open("column_header.tsv") as fr:
    columns = fr.readline().split()
Read data_file.tsv:
users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
Read browser_type.tsv:
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
.toDF("browser_type", "browser_name")
Join (path is the output directory of your choice):
users.join(browsers, "browser_type", "left").write.option("delimiter", "\t").csv(path)

Remove string with brackets in R

I have a data.frame that looks like this:
name
Lily(1+2)
John(good+1)
Tom()
Jim
Alice(*+#)
.....
I want to remove all brackets and everything inside them in R. What should I do?
I would like my data.frame to look like this:
name
Lily
John
Tom
Jim
Alice
....
Thanks!
# read your sample data (copy it first; readClipboard() is Windows-only):
d <- read.table(text=readClipboard(), header=TRUE, comment='`')
# remove strings in parentheses
transform(d, name=gsub('\\(.*\\)', '', name))
# name
# 1 Lily
# 2 John
# 3 Tom
# 4 Jim
# 5 Alice
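For comparison, the same regular expression carries over almost unchanged to other languages. The sketch below shows it in Python on the sample names from the question; strip_parens is an illustrative name. Like the R version, the pattern is greedy, so it removes everything from the first opening bracket to the last closing bracket in a string.

```python
import re

names = ["Lily(1+2)", "John(good+1)", "Tom()", "Jim", "Alice(*+#)"]

def strip_parens(text):
    """Remove a parenthesised part, brackets included."""
    return re.sub(r"\(.*\)", "", text)

cleaned = [strip_parens(n) for n in names]
print(cleaned)
```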
