FuzzyWuzzy for very similar records in Python - python-3.x

I have a dataset with which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
Where t is the string and dev2 is the list to compare to. My problem is that sometimes it has very similar records and options provided by FuzzyWuzzy seems to be lacking. And I've tested with token_sort, token_set, partial_token sort and set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one wanted is obviously the second one, but character by character is closer the first one, which is a different league.
This happens as well with teams. If, let's say I have a string Buchtholz I would obtains Buchtholz II before I get TSV Buchtholz.
My main guess now would be to try and weight the presence and absence of several characters more heavily, like single capital letters at the end of the string, so if there is a difference in the letter or an absence it is weighted as less close. Or for () and special characters.
I don't know if there is a way to take this into account or you guys have a better approach to get the string that really matches.

Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps: 1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio) 2. get the number sequences of these results and pass them through another round of matching with a more strict scorer (token_sort_ratio). Doing this gave me the better result in the example I showed above.
Below is some blocks of code that could be of assistance:
here's a function to get numbers from the sequence. (In your case you might use this to exclude numbers from your string instead?)
def get_numbers_from_string(description):
numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
numbers = ' '.join([nr for nr in numbers.split()])
return numbers
and here is a portion of the code I used to put the description match through two rounds:
try:
# get close match from goods move that has material numbers
df_material = pd.DataFrame(process.extract(description,
corpus_material,
scorer=fuzz.token_set_ratio),
columns=['Similar Text','Score']
)
if df_material['Score'][df_material['Score']>=cut_off_accuracy_materials].count()>=1:
similar_text = df_material['Similar Text'].iloc[0]
score = df_material['Score'].iloc[0]
if nr_description_numbers>4:
# if there are multiple matches found, then get best number combination match
df_material = df_material[df_material['Score']>=cut_off_accuracy_materials]
new_corpus = list(df_material['Similar Text'])
new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
df_material['numbers'] = new_corpus
df_numbers = pd.DataFrame(process.extract(description_numbers,
new_corpus,
scorer=fuzz.token_sort_ratio),
columns=['numbers','Score']
)
similar_text = df_material['Similar Text'][df_material['numbers']==df_numbers['numbers'].iloc[0]].iloc[0]
nr_score = df_numbers['Score'].iloc[0]
hope it helps, and good luck

Related

Regex to find numbers that are not in a phrase

If I have e.g. this string:
Beschreibung Menge VK-Preis MwSt% Betrag
Schadenbewertunginkl.Restwertermittlung 1 25,00€ 19 25,00€
Rechnungsbetragexcl.MwSt.: 25,00€
MwSt.(19%): 4,75€
Rechnungsbetragincl.MwSt.: 123.029,75€
I want to extract all the numbers.
My regexes are:
regex_up_to_thousand = r'\b(?:\d{1,3}){1}(?:,{1}\d{2})\b'
and
regex_every_price = r'\b(?:\d{1,3}(\.|,))+(:?\d{3}(\.|,))(?:\d{2})\b'
My idea was to first get the "big" prices, remove them from the text and get the other numbers.
Which works in most cases, until I have a date that looks like this maybe
Gutachtennummer: 1009126 Leistungsdatum: 11.10.2021
I would get the 11.10 with my second regex, and I don't know how to prevent this.
I thought the \b would help, but sadly not.
Any ideas?
It's not the end of the world, since I do a lot of math in the background, but it's a possibility that a date would fit some values and I calculate something wrong in the end.
You could try the following pattern.
\b\d+(?:(?:\.|,)\d{3})*(?:(?:\.|,)\d{2})\b(?!\W\d)
The main thing is (?!\W\d) at the end which ensures that after your amount you will not have a construct of 1 non-word character followed by 1 digit.
Example: https://regex101.com/r/q1ic9S/1

Using REGEX to grab the information after the match

I ran a PDF through a series of processes to extra the text from it. I was successful in that regard. However, now I want to extract specific text from documents.
The document is set up as a multi lined string (I believe. when I paste it into Word the paragraph character is at the end of each line):
Send Unit: COMPLETE
NOA Selection: 20-0429.07
#for some reason, in this editor, despite the next line having > infront of it, the following line (Pni/Trk) keeps wrapping up to the line above. This doesn't exist in the actual doc.
Pni/Trk: 3 Panel / 3 Track
Panel Stack: STD
Width: 142.0000
The information is want to extract are the numbers following "NOA Selection:".
I know I can do a regex something to the effect of:
pattern = re.compile(r'NOA\sSelection:\s\d*-\d*\.\d*)
but I only want the numbers after the NOA selection, especially because NOA Selection will always be the same but the format of the numbers/letters/./-/etc. can vary pretty wildly. This looked promising but it is in Java and I haven't had much luck recreating it in Python.
I think I need to use (?<=...), but haven't been able to implement it.
Also, several of the examples show the string stored in the python file as a variable, but I'm trying to access it from a .txt file, so I might be going wrong there. This is what I have so far.
with open('export1.txt', 'r') as d:    
contents = d.read()    
p = re.compile('(?<=NOA)')
s = re.search(p, contents)
print(s.group())
Thank you for any help you can provide.
With your shown samples, you could try following too. For sample 20-0429.07 I have kept .07 part optional in regex in case you have values 20-0429 only it should work for those also.
import re
val = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'NOA\s+Selection:\s+(\d+-\d+(?:\.\d+)?)', val)
print(matches)
['20-0429.07']
Explanation: Adding detailed explanation(only for explanation purposes).
NOA\s+Selection:\s+ ##matching NOA spaces(1 or more occurrences) Selection: spaces(1 or more occurrences)
(\d+-\d+(?:\.\d+)?) ##Creating capturing group matching(1 or more occurrences) digits-digits(1 or more occurrences)
##and in a non-capturing group matching dot followed by digits keeping it optional.
Keeping it simple, you could use re.findall here:
inp = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'\bNOA Selection: (\S+)', inp)
print(matches) # ['20-0429.07']

Search the best match comparing prefixes

I have the numbers codes and text codes like in table1 below. And I have the numbers to search like in table2
for which I want to get the best match for a prefix of minimun length of 3 comparing from left to rigth and show as answer the corresponding TEXT CODE.
If there is an exact match, that would be the answer.
If there is no any value that has at least 3 length prefix then answer would be "not found".
I show some comments explaining the conditions applied in answer expected for each Number to search next to table2.
My current attempt shows the exact matches, but I'm not sure how to compare the values to search for the other conditions, when there is no exact match.
ncode = ["88271","1893","107728","4482","3527","71290","404","5081","7129","33751","3","40489","107724"]
tcode = ["RI","NE","JH","XT","LF","NE","RI","XT","QS","XT","YU","WE","RP"]
tosearch = ["50923","712902","404","10772"]
out = []
out.append([])
out.append([])
for code in tosearch:
for nc in ncode:
if code == nc:
indexOfMatched = ncode.index(nc)
out[0].append(nc)
out[1].append(tcode[indexOfMatched])
>>> out
[['404'], ['RI']]
The expected output would be
out = [
['50923', '712902', '404', '10772'],
['NOT FOUND', 'NE', 'RI', 'JH' ]
]
A simple solution you might consider would be the fuzzy-match library. It compares strings and calculates a similarity score. It really shines with strings rather than numbers, but it could easily be applied to find similar results in your prefix numbers.
Check out fuzzy-match here.
Here is a well written fuzzy-match tutorial.

String contains substring and substring not part of longer word (exact match)

I have captured the full text of a PDF-file in a string called pdfText.
Next I am looping through an array containing substrings to be found/searched for in the pdfText-string.
One of the substrings is Invoice.
Both pdfText and the substrings I am searching for are converted to lower case.
If at least one of the substrings are found in the pdfText, a boolean is set to true.
Now, I have an example where the pdtText contains '...Net amount to be invoiced...'. This is the only variant of 'invoice' in the text.
This of course returns true if I use
substring = "Invoice" ... pdfText.contains(substring.ToLower).
But in this case I need it to return false. I need to find only exact matches.
Another example, if the pdfText contains '...This is an invoice. Please pay....Net amount to be invoiced...' the boolean should be set to true because of the first invoice-match, but not the second invoiced-(non)match.
So what I am looking for is to find a substring Invoice in a string pdfText and make sure, that the substring is not part of a longer word invoiced, invoice-process etc.. Note, that invoice. should return True.
I believe this should be possible, but cannot wrap my head around it currently.
I might need to use regex?
This one uses the RegEx, with a slight change, proposed by #Mederic at https://stackoverflow.com/a/45587916/2326360
Use the build in UiPath activity Is Match, found under Programming->String.
Use it inside your loop, with the current settings.
The RegEx is: substring+"[^a-zA-Z]"
I have declared the following variables:
RegEx would be a good approach.
I only started RegEx not long ago but I think this would work fine.
RegEx:
(invoice)[^a-zA-Z]
Explanation:
() Creates a Capture Group
invoice looks for the match for invoice
[^a-zA-Z] Checks there are no characters from a-z or A-Z after
Example:
Sample: This was invoiced
Result: No Result
Sample: This is an invoice.
Result: Match on invoice. Capture group 1 = invoice
Implementation:
Dim m As Match = Regex.Match(pdfText.ToLower,"(invoice)[^a-zA-Z]")
' If successful, write the group.
If (m.Success) Then
Dim key As String = m.Groups(1).Value
Console.WriteLine(key)
End If

How to extract relationships from a text

I am currently new with NLP and need guidance as of how I can solve this problem.
I am currently doing a filtering technique where I need to brand data in a database as either being correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that follows those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[1-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).
This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
"are bigger than" : ">",
"are smaller than" : "<",
"contain" : "LIKE",
...
}
value_desc = {
"numeric characters" : "%[1-9]%",
...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?

Resources