Dataframe text column with spelling mistakes - python-3.x

Sorry for bad title, I wasn't sure how best to describe the issue.
I have a dataframe with a column for occupation, df['occupation']; users can enter their occupation during signup using whatever terms they like.
I'm attempting to do an EDA on the column; however, I'm not sure how to clean it to get from this,
Occupation
a-level student
a level
alavls
university physics student
physics student
6th form student
builder
Into something like this,
Occupation
a-levels
University student
Full time employment
Without writing out hundreds of lines renaming each unique entry.
TYIA
Any help or links to useful modules would be great.

The simplest way to do this is to apply a function that measures the similarity between two strings. There are plenty of similarity measures that could be used in this context, like the Hamming distance; however, they are all relatively limited, and you might be forced at some point, if in production, to use a machine learning model for this task.
import pandas as pd

def hamming_distance(chaine1, chaine2):
    """Measures the similarity between two strings of characters.
    Note: this is very limited, as it only compares the letters at the same positions in both strings.
    """
    return sum(c1 != c2 for c1, c2 in zip(chaine1, chaine2))

OCCUPATIONS = ["Occupation", "a-levels", "University student", "Full time employment"]

def get_most_similar(ocup, OCCUPATIONS):
    """Return the occupation from the unique values OCCUPATIONS most similar to the entry ocup."""
    return min([(oc, hamming_distance(ocup.lower(), oc.lower())) for oc in OCCUPATIONS], key=lambda item: item[1])[0]

column = ["Occupation", "a-level student", "a level", "alavls", "university physics student", "physics student", "6th form student", "builder"]
df = pd.DataFrame(column, columns=['occupation'])  # just a reconstruction of your dataframe; you probably don't need this line
df['occupation'] = df['occupation'].apply(lambda ocup: get_most_similar(ocup, OCCUPATIONS))
df.head(100)
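If the positional Hamming distance proves too strict, a standard-library alternative is difflib, whose similarity ratio tolerates insertions and deletions. A minimal sketch (the category list mirrors the one above, minus the "Occupation" header):

```python
import difflib

# Canonical categories to map free-text entries onto.
CATEGORIES = ["a-levels", "university student", "full time employment"]

def closest_category(entry, categories=CATEGORIES):
    """Return the category whose spelling is closest to entry.

    difflib scores by longest matching blocks, so it tolerates the
    insertions and deletions that a positional Hamming distance punishes.
    """
    matches = difflib.get_close_matches(entry.lower(), categories, n=1, cutoff=0.0)
    return matches[0] if matches else entry

print(closest_category("alavls"))  # a-levels
```

The same function can be dropped into the df['occupation'].apply(...) call in place of get_most_similar.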

Hi, the approach you can possibly use for this problem is similar to the solution covered here: Using applymap with Regex.
The regex approach will allow you to use wildcards for cases you have not explored in your dataset.

Related

Is there a logical function in excel to extract unique text values from a range of similar texts?

I am working on a dataset which has data (text) entries captured in different styles like we see in the table below in 1000's of rows:
School Name
Abirem school
Abirem sec School
Abirem Secondary school
Abirem second. School
Metropolitan elementary
Metropolitan Element.
Metropolitan ele
I need help to extract the unique data values within a group of similar entries regardless of the style it was entered. The output I want should look like we see below:
School Name
Abirem school
Metropolitan elementary
I have tried using the functions EXACT, UNIQUE, MATCH, and even XLOOKUP (with the wildcard option), but none of them gives me the output I want.
Is there a logical function that can be used?
This will prove to be tricky. Excel has no way of knowing whether or not two different names that look similar are actually meant to be similar. Even for us humans it is not trivial. I mean, would School1 ABC be similar to School1 DEF or not? Without actually knowing the geographical locations of these two schools, they could well be two different schools that merely share the first word of their names.
Either way, if you are willing to accept this ambiguity, you could match on the first word of each line and return only the first entry where they match:
Formula in C1:
=LET(a,A1:A7,UNIQUE(XLOOKUP(TEXTSPLIT(a," ")&" *",a&" ",a,,2)))
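For comparison, the same first-word heuristic is easy to sketch outside Excel in Python (data copied from the question; the same ambiguity caveat applies):

```python
# Keep the first entry seen for each distinct first word, mirroring the
# "first match" behaviour of the XLOOKUP formula.
names = [
    "Abirem school",
    "Abirem sec School",
    "Abirem Secondary school",
    "Abirem second. School",
    "Metropolitan elementary",
    "Metropolitan Element.",
    "Metropolitan ele",
]

canonical = {}
for name in names:
    first_word = name.split()[0].lower()
    canonical.setdefault(first_word, name)

print(list(canonical.values()))  # ['Abirem school', 'Metropolitan elementary']
```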

setting a variable equal to the value in the adjacent column in pandas dataframe

I have a dataframe that I need to use to get the full name of the object from its abbreviation, in order to search for it in a different dataframe.
These are the first few lines of the simple dataframe; it lists all of the national parks in the US. I need this for input menus and decision trees in the program.
In bad pseudo code, I need code that is like:
my_var = next line over from park_abbrev in df
so if park_abbrev = DENA then my_var = Denali National Park and Preserve
I need this because I use the initials for user input, and that leads to this function, which picks trails from a separate, very large dataframe depending on the difficulty level the user selects. This dataframe only has the full name of the park, not the abbreviation, and I need that to get only the trails in the park of interest.
thank you for any suggestions.
You can get the park_name when the park_abbrev is DENA using df.loc:
df.loc[df['park_abbrev']=='DENA','park_name']
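A minimal sketch of that lookup (only DENA and its full name are quoted in the question; the second row is invented for illustration):

```python
import pandas as pd

# Reconstruction of the parks lookup table from the question.
parks = pd.DataFrame({
    "park_abbrev": ["DENA", "YELL"],
    "park_name": ["Denali National Park and Preserve", "Yellowstone National Park"],
})

# .loc filters rows where the abbreviation matches and selects park_name;
# .iloc[0] turns the one-row result into a plain string for later use.
my_var = parks.loc[parks["park_abbrev"] == "DENA", "park_name"].iloc[0]
print(my_var)  # Denali National Park and Preserve
```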

how to add columns with 'filled data' after filling missing values in pandas or python using different techniques?

How do I add columns with 'filled data' after filling missing values in pandas or Python, using several different techniques, whether statistical techniques or machine learning techniques?
What I want to do is this: after filling the data, say with the mean, median or standard-deviation values, or with machine learning algorithms like KNN or XGBoost, I want to append the resulting column(s) at the end of the csv or excel file, not below the actual data but towards the right-hand side of the file.
For instance, once I've filled the missing data of a particular column using statistical or ML techniques, I want to add those 'filled values', alongside the original values, in a new column named with the actual column name, an underscore, and the technique used to fill it, added at the right-hand end of the data. For example, if the column or feature is 'phone', then after filling missing values the right-hand side must show the whole original values plus the values calculated by statistical or ML means, under column names like 'phone_Mean', 'phone_interpolation', 'phone_KNN' or 'phone_XGBoost'.
What have I done so far?
I've applied the approaches from the pandas documentation page, and from Stack Overflow as well, the ones which generally rank among the top 7-10 links on Google or DuckDuckGo, but all in vain.
I've been facing this issue for the last few days, and it is crippling my ability to convince my client. So it would be a great help if you could assist me with some code example, using pandas or core Python, to support your answer.
Here's the snippet of the dataset. Let's say I'm applying techniques on a feature/column named 'phone':
One way is by making use of pandas, like:
df_01["phone_mean"] = df_01["phone"].fillna(df_01["phone"].mean())
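A sketch of the wider pattern (the phone values are invented for illustration): each technique fills into its own suffixed column, so the original stays untouched and the new columns land on the right-hand side of the frame:

```python
import pandas as pd

df_01 = pd.DataFrame({"phone": [100.0, None, 300.0, None, 500.0]})

# One suffixed column per fill technique; the original column is kept as-is.
df_01["phone_mean"] = df_01["phone"].fillna(df_01["phone"].mean())
df_01["phone_median"] = df_01["phone"].fillna(df_01["phone"].median())
df_01["phone_interpolation"] = df_01["phone"].interpolate()

print(df_01.columns.tolist())
# ['phone', 'phone_mean', 'phone_median', 'phone_interpolation']

# Writing the frame out keeps all columns side by side in the file.
df_01.to_csv("filled.csv", index=False)
```

The same pattern extends to ML-based imputers: compute the filled Series however you like, then assign it to a new df_01["phone_<technique>"] column.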

Custom Word Tagger

I am new to NLP and am getting to know NLTK, but am having some trouble getting off the ground on something I am trying to accomplish.
I would like to build my own word tagger such that if I pass a string like "The Porsche is red" the function would return ('Porsche','Car', 'red', 'Color').
I already have the dictionaries built that define the categories. I am just struggling with how to get started. Could anyone offer some assistance?
Thanks very much.
UPDATE: The dictionary at this time is a simple two column list in .csv format with the word and its corresponding category.
Example Link: http://www.filedropper.com/carexampledictionary
Sincerely,
Mick
I think a simple lookup in the list might work. First tokenize the text, then iterate through the tokens, looking each one up in your lists of categories.
One problem you might have is overlap between the categories. Is there any word which occurs in more than one category list? If so, you'd need a method to disambiguate which category a given token belongs to. If not, a simple list lookup should work.
More precisely, here is what I would do step-by-step:
Import the data into a dictionary
Tokenize the text
For each token, look up whether the token is in the keys of your dictionary
Tag the word according to what category it belongs to
Hope that helps.
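The steps above can be sketched with plain Python (the word/category pairs are invented stand-ins for the linked .csv):

```python
# Step 1: the lookup dictionary, standing in for the two-column .csv.
word_to_category = {
    "porsche": "Car",
    "ferrari": "Car",
    "red": "Color",
    "blue": "Color",
}

def tag_sentence(sentence):
    """Steps 2-4: tokenize, look each token up, tag the ones found."""
    tags = []
    for token in sentence.split():          # simple whitespace tokenizer
        word = token.strip(".,!?").lower()  # normalize before the lookup
        if word in word_to_category:
            tags.append((token.strip(".,!?"), word_to_category[word]))
    return tags

print(tag_sentence("The Porsche is red"))  # [('Porsche', 'Car'), ('red', 'Color')]
```

nltk.word_tokenize could replace the whitespace split for messier input.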

Weighted search algorithm to find like contacts

I need to write an algorithm that returns the closest match for a contact based on the name and address entered by the user. Both of these are troubling, since there are so many ways to enter a company name and address, for instance:
Company A, 123 Any Street Suite 200, Anytown, AK 99012
Comp. A, 123 Any St., Suite 200, Anytown, AK 99012
CA, 123 Any Street Ste 200, Anytown, AK 99012
I have looked at doing a Levenshtein distance on the name, but that doesn't seem like a great tool, since they could abbreviate the name. I am looking for something that matches on the most information possible.
My initial attempt was to limit the results first by the first 5 digits of the postal code and then try to filter down to one based on other information, but there must be a more standard approach to getting this done. I am working in .NET but will look at any code you can provide to get an idea on how to accomplish this.
I don't know exactly how this is accomplished, but all major delivery companies (FedEx, USPS, UPS) seem to have a way of matching an address you input against their database and transforming it into a normalized form. As I've seen this happen on multiple websites (Amazon comes to mind), I am assuming that there is an API for this functionality, but I don't know where to look for it or whether it is suitable for your purposes.
Just a thought though.
EDIT: I found the USPS API
I have solved this problem with a combination of address normalization, Metaphone, and Levenshtein distance. You will need to separate the name from the address since they have different characteristics. Here are the steps you need to do:
1) Narrow down your list of matches by using the (first six characters of the) zip code. Basically you will need to calculate the Levenshtein distance of the two strings and select the ones that have a distance of 1 or 2 at most. You can potentially precompute a table of zip codes and their "Levenshtein neighbors" if you really need to speed up the search.
http://en.wikipedia.org/wiki/Levenshtein_distance
2) Convert all the address abbreviations to a standard format using the list of official prefix and suffix abbreviations from the USPS. This will help make sure your results for the next step are more uniform:
https://www.usps.com/send/official-abbreviations.htm
3) Convert the address to a short code using the Metaphone algorithm. This will get rid of most common spelling mistakes. Just make sure that your implementation eliminates all non-word characters, passes numbers through intact and handles multiple words (make sure each word is separated by a single space):
http://en.wikipedia.org/wiki/Metaphone
4) Once you have the Metaphone results, compare the address strings using the Levenshtein distance. Calculate a percentage-of-change score by dividing the result by the number of characters in the longer string.
5) Repeat steps 3 and 4 but now use the names instead of the addresses.
6) Compute the score for each entry using this formula: (Weight for address * Address score) + (Weight for name * Name score). Pick your weights based on what is more important. I would start with .9 for the address (since the address is more specific) and .1 for the name, but the weights may depend on your application. Pick the entry with the lowest score. If the score is too high (say over .15), you may declare that there are no matches.
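A pure-Python sketch of steps 4 and 6 (the Metaphone step is omitted here; a phonetic-coding library would supply it):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def change_score(a, b):
    """Step 4: distance as a fraction of the longer string's length."""
    return levenshtein(a, b) / max(len(a), len(b))

def combined_score(addr_a, addr_b, name_a, name_b, w_addr=0.9, w_name=0.1):
    """Step 6: weighted sum of the two scores; lower means closer."""
    return w_addr * change_score(addr_a, addr_b) + w_name * change_score(name_a, name_b)

print(levenshtein("kitten", "sitting"))  # 3
```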
I think filtering based on zip code first would be the easiest, as finding it is fairly unambiguous. From there you can probably extract the city and street. I'm not sure how you would go about finding the name, but matching it against the address seems feasible if you already have a database of (name, address) pairs.
Dun & Bradstreet do this. They charge money because it's really hard. There's no "standard" solution. It's mostly a painful choice between a service like D&B or roll your own.
As a start, I'd probably do a word-indexed search. That would mean two stages:
Offline stage: Generate an index of all the addresses by their keywords. For example, "Company", "A" and "123" would all become keywords for the address you provided above. You could do some stemming, which would mean that for words like "street" you'd also add "st" to its index.
Online stage: The user gives you a search query. Break the search query down into its keywords, and find all possible matches of each keyword in the database. Tally the number of matched keywords on each address, then sort the results by the number of matched keywords. This should be quite quick if there aren't too many matches, as it's just a few sorted-list merges and increments, followed finally by a sort.
Given that you know the domain of your problem, you could specialise the algorithm to use knowledge about the domain - for example the zip code filtering mentioned before.
Also, just to enable me to provide you with a better answer: are you using an SQL database at all? I ask because I would store the keyword index in the SQL database, and then the SQL query to search by keyword becomes quite easy, since the database does all the work.
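An in-memory sketch of the two stages (in production the keyword index would live in a SQL table; the stemming table here is a crude stand-in):

```python
from collections import defaultdict

addresses = [
    "Company A, 123 Any Street Suite 200, Anytown, AK 99012",
    "Comp. A, 123 Any St., Suite 200, Anytown, AK 99012",
    "CA, 123 Any Street Ste 200, Anytown, AK 99012",
]

STEMS = {"street": "st", "suite": "ste"}  # tiny illustrative stemming table

def keywords(text):
    words = {w.strip(".,").lower() for w in text.split()}
    return words | {STEMS[w] for w in words if w in STEMS}

# Offline stage: inverted index from keyword to address ids.
index = defaultdict(set)
for i, addr in enumerate(addresses):
    for kw in keywords(addr):
        index[kw].add(i)

# Online stage: tally keyword hits per address and rank by the tally.
def search(query):
    tally = defaultdict(int)
    for kw in keywords(query):
        for i in index.get(kw, ()):
            tally[i] += 1
    return sorted(tally, key=tally.get, reverse=True)

print(search("123 Any Street, Anytown AK"))  # best matches first
```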
Maybe instead of using Levenshtein on the name only, it could be useful applied to the entire string representation of a contact. For instance, the distance from your first example to the second is 7, and to the third 9. Considering the strings have lengths 54, 50 and 45, this seems to be a relatively useful and quite simple similarity measure.
This is what I would do. I am not aware of algorithms, so I just use what makes sense.
I am assuming that the person would provide name, street address, city name, state name, and zipcode.
If the zipcode is provided as 9 numbers, or has a hyphen, I would strip it down to 5 numbers. I would then search the database for all of the addresses that have that zipcode. [query 1]
Then I would compare the state letters with the ones from the database. If they don't match, I would tell that to the user. The same goes for the city name.
From what I understand, a street name does not contain numbers; only the house on a street has a number in it. Furthermore, the house number usually comes at the beginning, unless it is an apartment or suite number.
So I would use a regex to search for the numbers and the space or comma next to them. Then I would find the position of the first word that does not have a period (.) or end in a comma. That gives me part of the street name, so I could compare it against the rows fetched earlier, or change the query to match the street name with LIKE %streetName%.
I am guessing the database has a beginning number and ending number of the house on a block. I would check against that street row to see if the provided street number is on that street.
By now you would know the correct data to show, and could look up in a different table which name is associated with that house number. I am not sure why you want to compare the name. The only use for name comparison would be to find people whose address was not provided. You can look here for ways of comparing strings: Similar String algorithm
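The number-then-street parsing described above might look like this as a regex sketch (a rough illustration, not a full address grammar):

```python
import re

def split_house_and_street(address):
    """Pull a leading house number and the words that follow it, up to
    the first comma. Assumes the house number comes first, as above."""
    m = re.match(r"\s*(\d+)\s+([^,]+)", address)
    if not m:
        return None, None
    return m.group(1), m.group(2).strip()

number, street = split_house_and_street("123 Any Street Suite 200, Anytown, AK 99012")
print(number, "|", street)  # 123 | Any Street Suite 200
```

The street fragment could then feed the LIKE %streetName% comparison mentioned above.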
If you can reliably figure out general structure of each address (perhaps by the suggestions in the other answers), your best bet would be to run the data through a USPS-certified (meaning: the results are reliable, accurate, and conform to federal standards) address verification service.
@RyanDelucchi, it is a fun problem, but only once you've solved it. So, @SteveBering, I would recommend submitting your list of contacts to a list-processing service which will flag duplicates based on the address, according to USPS guidelines.
Since I work in the address verification field, I would suggest SmartyStreets (which I work for) since it will deliver the most value to your specific need -- however, there are a few CASS-Certified vendors who will do basically similar things.