Fastest way to detect and append duplicates based on specific column in dataframe - python-3.x

Here is the sample data:
name        age  gender  school
Michael Z    21  Male    Lasalle
Lisa M       22  Female  Ateneo
James T      21  Male    UP
Michael Z.   23  Male    TUP
Here is the expected output I need:
name       age  gender  similar name  on_lasalle  on_ateneo  on_up  on_tup
Michael Z   21  Male    Michael Z.    True        False      False  True
Lisa M      22  Female                False       True       False  False
James T     21  Male                  False       False      True   False
I've been trying to use fuzzywuzzy in my Python script. The data is coming from BigQuery; I convert it to a dataframe to clean some stuff, and after that I convert the dataframe to a list of dictionaries.
Notice in the data above that Michael Z. from TUP was appended to Michael Z from school Lasalle, since their names have a 100% similarity rate using fuzz.token_set_ratio.
What I want is to get all rows with similar names and append them to the current dictionary we are looking at (including their school).
Here is the code and the loop to get similar rows based on names:
data_dict_list = data_df.to_dict('records')

for x in range(len(data_dict_list)):
    for y in range(x, len(data_dict_list)):
        if not data_dict_list[x]['is_duplicate']:
            similarity = fuzz.token_set_ratio(data_dict_list[x]['name'], data_dict_list[y]['name'])
            if similarity >= 90:
                data_dict_list[x]['similar_names'].update({'similar_name': data_dict_list[y]['name']})
                ...
                data_dict_list[x]['is_duplicate'] = True
The runtime of this script is very slow: sometimes I am getting 100,000+ records, and the loop has to pass over all of them.
How can I speed this up?
Suggestions using pandas are much appreciated, as I am having a hard time figuring out how to loop over data in it.

As a first step you can simply replace the import of fuzzywuzzy with rapidfuzz:
from rapidfuzz import fuzz
which should already improve performance quite a bit. You can go further by having rapidfuzz compare complete lists of strings in a single call:
>>> import pandas as pd
>>> from rapidfuzz import process, fuzz
>>> df = pd.DataFrame(data={'name': ['test', 'tests']})
>>> process.cdist(df['name'], df['name'], scorer=fuzz.token_set_ratio, score_cutoff=90)
array([[100,   0],
       [  0, 100]], dtype=uint8)
This returns a matrix of results in which every element with a score below 90 is set to 0. For large datasets you can enable multithreading using the workers argument:
process.cdist(df['name'], df['name'], workers=-1, scorer=fuzz.token_set_ratio, score_cutoff=90)
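From that score matrix you can pull out the index pairs that cleared the cutoff, which gets you the duplicate groups without the nested Python loop. A minimal sketch using the sample data from the question (the column names and the pair post-processing are my assumptions, not part of rapidfuzz):
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df = pd.DataFrame({
    'name': ['Michael Z', 'Lisa M', 'James T', 'Michael Z.'],
    'school': ['Lasalle', 'Ateneo', 'UP', 'TUP'],
})

# One vectorized call replaces the whole nested loop.
scores = process.cdist(df['name'], df['name'],
                       scorer=fuzz.token_set_ratio,
                       score_cutoff=90, workers=-1)

np.fill_diagonal(scores, 0)   # every name matches itself; ignore the diagonal
for i, j in np.argwhere(scores >= 90):
    if i < j:                 # report each pair only once
        print(df.loc[i, 'name'], '<->', df.loc[j, 'name'],
              '| schools:', df.loc[i, 'school'], '/', df.loc[j, 'school'])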

Related

pandas check if two values are statistically different

I have a pandas dataframe which has some values for Male and some for Female. I would like to calculate whether the percentage of values for the two genders is significantly different, and report confidence intervals for these rates. Given below is the sample code:
data={}
data['gender']=['male','female','female','male','female','female','male','female','male']
data['values']=[10,2,13,4,11,8,14,19,2]
df_new=pd.DataFrame(data)
df_new.head() # make a simple data frame
   gender  values
0    male      10
1  female       2
2  female      13
3    male       4
4  female      11
df_male=df_new.loc[df_new['gender']=='male']
df_female=df_new.loc[df_new['gender']=='female'] # separate male and female
# calculate percentages
male_percentage=sum(df_male['values'].values)*100/sum(df_new['values'].values)
female_percentage=sum(df_female['values'].values)*100/sum(df_new['values'].values)
# want to tell whether both percentages are statistically different or not and what are their confidence interval rates
print(male_percentage)
print(female_percentage)
Any help will be much appreciated. Thanks!
Use a t-test. In this case, use a two-sample t-test, meaning you are comparing the values/means of two samples.
The alternative hypothesis is A != B.
I test it by testing the null hypothesis A = B. This is done by calculating a p-value: when p falls below a critical value, called alpha, I reject the null hypothesis. The standard value for alpha is 0.05; below a 5% probability, it is unlikely that the sample would produce the observed values if the null hypothesis were true.
Extract the samples, in this case lists of values:
A = df_new[df_new['gender']=='male']['values'].values.tolist()
B = df_new[df_new['gender']=='female']['values'].values.tolist()
Using the scipy library, do the t-test:
from scipy import stats

t_check = stats.ttest_ind(A, B)
t_check
alpha = 0.05
if t_check[1] < alpha:
    print('A different from B')
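The question also asks for confidence intervals. A minimal sketch for a t-based confidence interval around each group's mean, reusing the A and B lists from above (the helper name mean_ci is made up for illustration):
import numpy as np
from scipy import stats

def mean_ci(sample, confidence=0.95):
    # t-based confidence interval for the mean of a single sample
    sample = np.asarray(sample, dtype=float)
    sem = stats.sem(sample)  # standard error of the mean
    margin = sem * stats.t.ppf((1 + confidence) / 2, len(sample) - 1)
    return sample.mean() - margin, sample.mean() + margin

print('male CI:  ', mean_ci(A))
print('female CI:', mean_ci(B))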
Try this:
df_new.groupby('gender')['values'].sum()/df_new['values'].sum()*100
gender
female 63.855422
male 36.144578
Name: values, dtype: float64

How to Put Ages into intervals

I have a list of ages in an existing dataframe. I would like to put these ages into intervals/age groups such as (10-20), (20-30), etc. Please see the example below.
I am unsure where to begin coding this, as I get "bins" errors whenever I use bins-related code.
Here's what you can do:
import pandas as pd

def checkAgeRange(age):
    las_dig = age % 10
    range_age = str.format('{0}-{1}', age - las_dig, (age - las_dig) + 10)
    return range_age

d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)
dataFrame['AgeGroup'] = dataFrame['AGE'].apply(checkAgeRange)
print(dataFrame)
# Output:
   AGE AgeGroup
0   19    10-20
1   13    10-20
2   45    40-50
3   65    60-70
4   23    20-30
5   12    10-20
6   28    20-30
Some explanation of the code above:
d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)
# Making a simple dataframe here
dataFrame['AgeGroup'] = dataFrame['AGE'].apply(checkAgeRange)
# Applying our checkAgeRange function here
def checkAgeRange(age):
    las_dig = age % 10
    range_age = str.format('{0}-{1}', age - las_dig, (age - las_dig) + 10)
    return range_age
# This function extracts the last digit from the age and then forms the range as a string. You can change the data structure here according to your needs.
Hope this answers your question. Cheers!
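Since the question mentions "bins" errors, it is worth noting that pandas has a built-in for exactly this: pd.cut bins a numeric column against explicit edges. A minimal sketch (the edges and labels are my assumptions and must cover every age present in the column):
import pandas as pd

d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)

# One more edge than labels; right=False makes the bins left-inclusive,
# i.e. [10, 20), [20, 30), ... so 20 falls into '20-30'.
bins = [10, 20, 30, 40, 50, 60, 70]
labels = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70']
dataFrame['AgeGroup'] = pd.cut(dataFrame['AGE'], bins=bins, labels=labels, right=False)
print(dataFrame)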

Use KDTree/KNN Return Closest Neighbors

I have two python pandas dataframes. One contains all NFL quarterbacks' college football statistics since 2007 and a label for the type of player they are (Elite, Average, Below Average). The other dataframe contains all of the college football QBs' data from this season, along with a prediction label.
I want to run some sort of analysis to determine the two closest NFL comparisons for every college football QB based on their labels, and I'd like to add the two comparable QBs as two new columns to the second dataframe.
The feature names in both dataframes are the same. Here is what the dataframes look like:
Player    Year  Team  GP  Comp %  YDS   TD  INT  Label
Player A  2020  ASU   12  65.5    3053  25  6    Average
For the example above, I'd like to find the two closest neighbors to Player A that also have the label "Average" in the first dataframe.
The way I thought of doing this was to use SciPy's KDTree and run a query against it:
tree = KDTree(nfl[features], leafsize=nfl[features].shape[0]+1)

closest = []
for row in college.iterrows():
    distances, ndx = tree.query(row[features], k=2)
    closest.append(ndx)
print(closest)
However, the print statement returned an empty list. Is this the right way to solve my problem?
.iterrows() yields pairs of (index, Series), where index is the index of the row and the Series holds the feature values, with the column names as its index (see below).
As you have it, row is being stored as that whole tuple, so row[features] won't really do anything. What you're really after is the Series holding the features and values, i.e. row[1]. You can either index into the tuple directly, or unpack it in your loop with for idx, row in df.iterrows(): and then work with the Series row.
Scikit-learn is a good package to use here (it is actually built on SciPy, so you'll notice the same syntax). You'll have to adapt the code to your specifications (e.g. filter to only the "Average" players; if you are one-hot encoding the category columns, you may need to add those to the features; etc.), but to give you an idea (I made up these dataframes just for the example: the nfl one is roughly accurate, the college one completely made up), the code below builds the KDTree and then takes each row of the college dataframe to see which two rows of the nfl dataframe it is closest to. I have it print out the names, but as you can see with print(closest), the raw index arrays are there for you.
import pandas as pd
from sklearn.neighbors import KDTree

nfl = pd.DataFrame([['Tom Brady', '1999', 'Michigan', 11, 61.0, 2217, 16, 6, 'Average'],
                    ['Aaron Rodgers', '2004', 'California', 12, 66.1, 2566, 24, 8, 'Average'],
                    ['Peyton Manning', '1997', 'Tennessee', 12, 60.2, 3819, 36, 11, 'Average'],
                    ['Drew Brees', '2000', 'Purdue', 12, 60.4, 3668, 26, 12, 'Average'],
                    ['Dan Marino', '1982', 'Pitt', 12, 58.5, 2432, 17, 23, 'Average'],
                    ['Joe Montana', '1978', 'Notre Dame', 11, 54.2, 2010, 10, 9, 'Average']],
                   columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])

college = pd.DataFrame([['Joe Smith', '2019', 'Illinois', 11, 55.6, 1045, 15, 7, 'Average'],
                        ['Mike Thomas', '2019', 'Wisconsin', 11, 67, 2045, 19, 11, 'Average'],
                        ['Steve Johnson', '2019', 'Nebraska', 12, 57.3, 2345, 9, 19, 'Average']],
                       columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])

features = ['GP', 'Comp %', 'YDS', 'TD', 'INT']

tree = KDTree(nfl[features], leaf_size=nfl[features].shape[0] + 1)

closest = []
for idx, row in college.iterrows():
    X = row[features].values.reshape(1, -1)
    distances, ndx = tree.query(X, k=2, return_distance=True)
    closest.append(ndx)
    collegePlayer = college.loc[idx, 'Player']
    closestPlayers = [nfl.loc[x, 'Player'] for x in ndx[0]]
    print('%s closest to: %s' % (collegePlayer, closestPlayers))
print(closest)
Output:
Joe Smith closest to: ['Joe Montana', 'Tom Brady']
Mike Thomas closest to: ['Joe Montana', 'Tom Brady']
Steve Johnson closest to: ['Dan Marino', 'Tom Brady']
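Since the original goal was to attach the two comparisons as new columns on the college dataframe, a short follow-up sketch (the column names 'Comp 1' and 'Comp 2' are made up):
# Map each pair of neighbor indices back to player names and attach them.
neighbors = [[nfl.loc[i, 'Player'] for i in ndx[0]] for ndx in closest]
college[['Comp 1', 'Comp 2']] = pd.DataFrame(neighbors, index=college.index)
print(college[['Player', 'Comp 1', 'Comp 2']])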

How to understand python data frames better

Being a beginner in Python, I often face this problem: let's say I am working with a data frame and want to execute an operation on one of the columns. It could be removing the decimal point from a value, or extracting the month from a date column. But the solutions I find online are generally shown with a single value or data point, like this:
a = 11.0
int(a)
11
Now the same solution can't be applied to a data frame or a column. Again, if I want to combine a time with a date:
d = date.today()
d
datetime.date(2018, 3, 30)
datetime.combine(d, datetime.min.time())
datetime.datetime(2018, 3, 30, 0, 0)
In the same manner, this solution cannot be used for a data frame; it will throw an error. Obviously I have a gap in knowledge here: I am not able to make these work in terms of data frames. Can you please point me towards a topic which might help me understand these problems in terms of data frames, or maybe show an example of how it's done?
You should have a look at the pandas library to manipulate dataframes: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
This is an example of applying a function to each value of a given column:
import pandas as pd

def myFunction(a_string):
    return a_string.upper()

data = pd.read_csv('data.csv')
print(data)
data['City'] = data['City'].apply(myFunction)
print(data)
Data at the beginning:
Name    City    Age
Robert  Paris   32
Max     Dallas  24
Raj     Delhi   27
Data after:
Name    City    Age
Robert  PARIS   32
Max     DALLAS  24
Raj     DELHI   27
Here myFunction uppercases the string, but it could be used the same way for other kinds of operations.
Hope that helps.
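To tie this back to the two examples in the question: most scalar operations have a column-wise (vectorized) counterpart in pandas, so you rarely need an explicit loop. A minimal sketch (the column names are made up for illustration):
import pandas as pd

df = pd.DataFrame({'value': [11.0, 12.5, 13.0],
                   'date': pd.to_datetime(['2018-03-30', '2018-03-31', '2018-04-01'])})

# int(a) applied to a whole column becomes a vectorized cast:
df['value_int'] = df['value'].astype(int)

# datetime.combine(d, datetime.min.time()) applied to a whole column becomes
# .dt.normalize(), which sets the time part of every timestamp to midnight:
df['midnight'] = df['date'].dt.normalize()
print(df)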

Reorder Rows into Columns in Pandas (Python 3, Pandas)

Right now, my code takes scraped web data from a file (BigramCounter.txt) and then finds all the bigrams within that file, so that the data looks like this:
Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})
After this, I try to feed it into a pandas DataFrame, which spits out this df:
     the        on  cash
   first purchases  back
0     45        42    39
This is very close to what I need, but not quite. First off, the DataFrame ignores my attempt to name the columns. Furthermore, I was hoping for something formatted more like this, where there are two columns and the words are not split between cells:
Words         Frequency
the first     45
on purchases  42
cash back     39
For reference, here is my code. I think I may need to reorder an axis somewhere, but I'm not sure how. Any ideas?
import re
from collections import Counter
import pandas as pd

main_c = Counter()
words = re.findall(r'\w+', open('BigramCounter.txt', encoding='utf-8').read())
bigrams = Counter(zip(words, words[1:]))
main_c.update(bigrams)  # at this point it looks like Counter({('the', 'first'): 45, etc...})
comm = [[k, v] for k, v in main_c]
frame = pd.DataFrame(comm)
frame.columns = ['Word', 'Frequency']
frame2 = frame.unstack()
frame2.to_csv('text.csv')
I think I see what you're going for, and there are many ways to get there. You were really close. My first inclination would be to use a Series, especially since you'd (presumably) just be getting rid of the DataFrame index when you write to CSV, but it doesn't make a huge difference.
frequencies = [[" ".join(k), v] for k, v in main_c.items()]
pd.DataFrame(frequencies, columns=['Word', 'Frequency'])

           Word  Frequency
0     the first         45
1     cash back         39
2  on purchases         42
If, as I suspect, you want Word to be the index, add frame.set_index('Word'):
              Frequency
Word
the first            45
cash back            39
on purchases         42
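Putting the fix back into the original script end to end, a sketch might look like this (index=False keeps the 0, 1, 2... index out of the CSV):
import re
from collections import Counter
import pandas as pd

words = re.findall(r'\w+', open('BigramCounter.txt', encoding='utf-8').read())
main_c = Counter(zip(words, words[1:]))

frequencies = [[" ".join(k), v] for k, v in main_c.items()]
frame = pd.DataFrame(frequencies, columns=['Word', 'Frequency'])
frame.to_csv('text.csv', index=False)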
