python fuzzywuzzy limit, how does it work?

python fuzzywuzzy limit, how does it work? - python-3.x

How exactly does the limit work with pythons fuzzywuzzy module, what does it mean?
matches = process.extract(query, choices, limit=2, scorer=fuzz.partial_ratio)

Limit is generally used in fuzzywuzzy when you need "x" best matching solutions.
So, for example you are comparing the same column of a df to match with each other. It will be the case that 1st match will be the name itself. So, you do limit = 2 do get the 2nd best match.
Ex: column values =['Apple','Banana','Orange','Appl','Banan']
If you want to do fuzzy using same column and see how "Apple" is used in different contexts because of spelling mistakes etc. Now the best match of Apple will be Apple itself, so you do limit=2 do get "Appl" in this case
I hope I was clear

Related

How to use index and match with two columns?

I have the following case
E.g., I want the two cells in yellow to be the same. For this, I need to find the columns Score 1 and Result, and then find the row 03-Jan so that I get the actual score. Do you have any idea how to solve this? I tried with some match and index but I do not get the solution.

Use INDEX with three matches, the first to find the correct row, while the other 2 find the correct column.
=INDEX($E:$N,MATCH($Q9,B:B,0),MATCH(R$8,$E$2:$N$2,0)+MATCH(R$7,$E$3:$I$3,0)-1)

You want to use an Index(Match),Match())
but what you really want to use is a double XLOOKUP.
I also liked this video to help me learn all of the ways to do a 2d lookup.
YouTube Video

Better way to Vlookup

I would like to know if there is a better alternative to Vlookup to find matches between two cells (or Python Dfs).
Say I have the below Dfs,
I want my code to check if the values in DF1 was in DF2, If values exactly match OR if the values partially matche return me the value in the DF2.
Just like the matches in 4th column Row 2,3 returned values.
Thanks Amigo!

Well, as you probably suspected already, you have several options. You can easily search for an exact match, like this.
=VLOOKUP(value,data,column,FALSE)
Here is an example.
https://www.excelfunctions.net/vlookup-example-exact-match.html
Or, consider doing a partial match, as such.
=VLOOKUP(value&"*",data,column,FALSE)
Here is an example.
https://exceljet.net/formula/partial-match-with-vlookup
Oh, you can do a fuzzy match as well. Use the AddIn below for this kind of task.
https://www.microsoft.com/en-us/download/details.aspx?id=15011
In Python, it would be done like this.
matches = []
for c in checklist:
if c in words:
matches.append(c)
Obviously, the items in the square brackets are the items in the list.
For Python fuzzy matches, follow the steps outlined in the link below.
https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/

List of items find almost duplicates

Within excel I have a list of artists, songs, edition.
This list contains over 15000 records.
The problem is the list does contain some "duplicate" records. I say "duplicate" as they aren't a complete match. Some might have a few typo's and I'd like to fix this up and remove those records.
So for example some records:
ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official
Each dash indicates a separate column (so 3 columns A, B, C are filled in)
How would I mark them as duplicates within Excel?
I've found out about the tool Fuzzy Lookup. Yet I'm working on a mac and since it's not available on mac I'm stuck.
Any regex magic or vba script what can help me out?
It'd also be alright to see how much similar the row is (say 80% similar).

One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:
https://stackoverflow.com/a/4243652/1278553
From there, you can use the function directly in your spreadsheet to find similarities between instances:
You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:
select
s1.group, s2.group, s1.song, s2.song,
levenshtein (s1.group, s2.group) as group_match,
levenshtein (s1.song, s2.song) as song_match
from
songs s1
cross join songs s2
order by
group_match, song_match
Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble to the top the most likely duplicates / matches. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limit it to cases where the group matches, nearly matches, begins with the same letter, etc, or pre-filtering out groups where the Levenschtein is greater than x.

You could use an array formula, to indicate the duplicates, and you could modify the below to show the row numbers, this checks the rows beneath the entry for any possible 80% dupes, where 80% is taken as left to right, not total comparison. My data is a1:a15000
=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)
This way will also look back up the list, to indicate the ones found
=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A1)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))
The first entry i.e. row 1 is the first part of the formula, and the last row will need the last part after the +

try this worksheet fucntions in your loop:
=COUNTIF(Range,"*yourtexttofind*")

Excel, Numberplate Clarification

I am working on an excel document for fuel cards at the minute and my current issue is to write in a formula for validating number plates based on UK standard plates (two letters followed by two numbers then three letters i.e. BK08JWZ). At this point in time we are not considering personal plates in this just to keep things simple.
Ideally I need excel to look at the text in the box and confirm it to an agreed layout but I am struggling to find the right formula. The plates are in column 'I' and I have already added in another column after titled 'approved plates' in column 'J'but this can be deleted if it's not needed.
Results wise, I can do this one of two ways, to either get the excel document to highlight and number plates that do not match the DVLA standard , or have a column next to the number plate column that registers a boolean response to the recognition i.e. If it is valid (true) or if not (false).
Either way the plate needs to be able to be seen as it was currently, so if there is something wrong with it, it needs to be visible, not throw up an error message.
Any help would be very welcome.
All the information on UK standard number plates are on this site:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/359317/INF104_160914.pdf

I would do it like this:
1) create a lookup sheet with data from the booklet. One column for allowed "memory tag" identiffiers (first two letters), one column for the allowed "age identiffiers" (first two numbers), and one column for allowed random letters (last three letters, full alphabet except I and Q)
2) strip spaces from the number plate for comparison
3) Use MID(numberplate,1,2), MID(numberplate,3,2) and MID(numberplate,5,3) to compare to each lookup list repectively (using INDEX()>0).
4) when all 3 parts are found in lookup lists the number plate is valid.

Try researching Regular Expressions or RegEx. This is a powerful programming tool to determine whether strings match specific patterns. You can use RegEx expressions to extract the pattern, replace the pattern or test for the pattern. Very efficient but not for the faint-hearted although there is plenty of help on-line. Try this article for starters.
The following RegEx may be what you need..
(?^[A-Z]{2}[0-9]{2}[A-Z]{3}$)|(?^[A-Z][0-9]{1,3}[A-Z]{3}$)|(?^[A-Z]{3}[0-9]{1,3}[A-Z]$)|(?^[0-9]{1,4}[A-Z]{1,2}$)|(?^[0-9]{1,3}[A-Z]{1,3}$)|(?^[A-Z]{1,2}[0-9]{1,4}$)|(?^[A-Z]{1,3}[0-9]{1,3}$)
This was copied from this article which gives a very full explanation using DVLA rules.
EDIT:
To use RegEx within Excel. In the IDE, Tools menu, select References and add the Microsoft VBScript Regular Expressions 5.5 reference.
With acknowlegement to user3616725s helpful observation.

Excel 2013, how to us the "search" function like vlookup

Essentially, I am looking for a way to use the "search" function like the "vlookup" function. In my case, I am have a long list, of say, 1000 descriptions of different types of fasteners and I want to classify them according to what they are, (ie. Nut, bolt, washer etc.). However, I can't sort by description or partnumber because they, alphanumerically, don't line up by class. But he descripotion field does say at some point in it, what it is(ie. Nut, bolt, washer etc.).
As said, I have a table of classes, and I am looking for a formula that would look in the "description" field for all the values in the table,and then return that value, or one associated with it (like vlookup does with cell values).
So that, if it found "nut" in the description, it would return "nut", or if it found "bolt" it would return "bolt."
I hope that this question makes sense. Let me also say that I found a way "manually" do this using the search function, along with others, but the formula was very long and each value in my table had to be specially called out. However, I will include the formula I used to make clear what I was trying to do.
See below.
=IF(ISNUMBER(SEARCH($G$2,C3)),$G$2,IF(ISNUMBER(SEARCH($G$3,C3)),$G$3,IF(ISNUMBER(SEARCH($G$4,C3)),$G$4,IF(ISNUMBER(SEARCH($G$5,C3)),$G$5,...IF(ISNUMBER(SEARCH($G$13,C3)),$G$13,"MISC"))))))))))))
You see that with each item you add to your table, you have to add another if loop. I am hoping there is a better way. (I would call it "vsearch" :-) )

Try this formula
=IFERROR(LOOKUP(2^15,SEARCH($G$3:$G$13,C3),$G$3:$G$13),"MISC")
SEARCH returns an array of numbers or errors depending on whether each term is found in C3. By searching for "bignum" (in this case 2^15) which won't be found, the match is always with the last number, i.e. the last matching term in G3:G13.

MATCH can be used to find the text, and INDEX can be used to return the text
using your example, where you are searching in G:
=MATCH("*"&C3&"*",$G:$G,0)
and then index to return the text
=INDEX($G:$G,MATCH("*"&C3&"*",$G:$G,0))
and as a finishing touch, the #VALUE! replacement
=IFERROR(INDEX($G:$G,MATCH("*"&C3&"*",$G:$G,0)),"MISC")

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

python fuzzywuzzy limit, how does it work? - python-3.x

How exactly does the limit work with pythons fuzzywuzzy module, what does it mean? matches = process.extract(query, choices, limit=2, scorer=fuzz.partial_ratio)

Related

How to use index and match with two columns?

Better way to Vlookup

List of items find almost duplicates

Excel, Numberplate Clarification

Excel 2013, how to us the "search" function like vlookup

Categories

Resources