How to apply a fuzzy matching function on the target and reference columns for pandas dataframes - python-3.x

******Edited with Solution Below*******
I have carefully read the guidelines, hope the question is acceptable.
I have two pandas dataframes. I need to apply a fuzzy matching function on the target and reference columns and merge the data based on the similarity score, preserving the original data.
I have checked similar questions, e.g. see:
is it possible to do fuzzy match merge with python pandas?
but I am not able to use this solution.
So far I have:
df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'],'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
import distance
fulldf=[]
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        if distance.jaccard(name1, name2) < 0.6:
            fulldf.append({'RefName': name1, 'Score': distance.jaccard(name1, name2), 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)
How can I include the 'NameId' and 'Type' (and any other columns) in the final output, e.g.:
df1_NameId RefName df1_Type df1_NewColumn Score df2_NameId TarName df2_Type df2_NewColumn
1 robert johnes Person … 0.0000 1 roberto johnes Person …
Is there a way to code this so that it is easily scalable and can be performed on datasets with hundreds of thousands of rows?
I have solved the original problem by unpacking the dataframes in the loop:
import distance
import pandas as pd
#Create test Dataframes
df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
results=[]
# Create two generator objects to loop through each dataframe row one at a time
# Call each dataframe element that you want in the final output inside the loop
# Append the results to the empty list created above
for a, b, c in df1.itertuples():
    for d, e, f in df2.itertuples():
        results.append((a, b, c, distance.jaccard(c, f), e, d, f))
result_df = pd.DataFrame(results)
print(result_df)
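As a side note, itertuples yields namedtuples, so the same double loop can access columns by name instead of by position; that reads more clearly and keeps working if extra columns such as Type are added. A minimal sketch along the same lines (the df1_NameId/df2_NameId output keys are just the hypothetical names from the desired layout above):
results = []
for row1 in df1.itertuples(index=False):        # fields: NameId, RefName
    for row2 in df2.itertuples(index=False):    # fields: NameId, TarName
        results.append({
            'df1_NameId': row1.NameId,
            'RefName': row1.RefName,
            'Score': distance.jaccard(row1.RefName, row2.TarName),
            'df2_NameId': row2.NameId,
            'TarName': row2.TarName,
        })
result_df = pd.DataFrame(results)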

I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to that product gives the result you require.
df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on = "mergekey")
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis = 1)
Result:
NameId_x RefName Type_x mergekey NameId_y TarName Type_y Distance
0 1 robert johnes Person 0 1 roberto johnes Person 0.000000
1 1 robert johnes Person 0 2 lew malinosky Person 0.705882
2 1 robert johnes Person 0 3 andreatta della blatta Person 0.538462
3 2 lew malinsky Person 0 1 roberto johnes Person 0.764706
4 2 lew malinsky Person 0 2 lew malinosky Person 0.083333
5 2 lew malinsky Person 0 3 andreatta della blatta Person 0.666667
6 3 gioberto delle lanterne Person 0 1 roberto johnes Person 0.533333
7 3 gioberto delle lanterne Person 0 2 lew malinosky Person 0.588235
8 3 gioberto delle lanterne Person 0 3 andreatta della blatta Person 0.250000
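On newer pandas (1.2 and later) the dummy merge key is not needed, because merge supports an explicit cross join. A minimal sketch, assuming the same df1/df2 and the distance package from above:
df_merged = pd.merge(df1, df2, how="cross")   # Cartesian product, no mergekey column required
df_merged["Distance"] = df_merged.apply(
    lambda x: distance.jaccard(x.RefName, x.TarName), axis=1
)
Either way the product has len(df1) * len(df2) rows, so for inputs with hundreds of thousands of rows you will usually want a blocking step (for example, only comparing names that share a first token) before scoring.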

Related

Cast topic modeling outcome to dataframe

I have used BERTopic with KeyBERT to extract some topics from some docs.
from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size= 3)
topics, probs = topic_model.fit_transform(docs)
Now I can access the topic name
freq = topic_model.get_topic_info()
print("Number of topics: {}".format( len(freq)))
freq.head(30)
Topic Count Name
0 -1 1 -1_default_greenbone_gmp_manager
1 0 14 0_http_tls_ssl tls_ssl
2 1 8 1_jboss_console_web_application
and inspect the topics
[('http', 0.0855701486234524),
('tls', 0.061977919455444744),
('ssl tls', 0.061977919455444744),
('ssl', 0.061977919455444744),
('tcp', 0.04551718585531556),
('number', 0.04551718585531556)]
[('jboss', 0.14014705432060262),
('console', 0.09285308122803233),
('web', 0.07323749337563096),
('application', 0.0622930523123512),
('management', 0.0622930523123512),
('apache', 0.05032395169459188)]
What I want is a final dataframe that has the topic name in one column and the elements of the topic in another column.
expected outcome:
class entities
0   http_tls_ssl tls_ssl            HTTP...etc
1   jboss_console_web_application   JBoss, console, etc
and one dataframe with the topic name on different columns
   http_tls_ssl tls_ssl   jboss_console_web_application
0  http                   JBoss
1  tls                    console
2  etc                    etc
I did not find out how to do this. Is there a way?
Here is one way to do it:
Setup
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic()
# To keep the example reproducible in a reasonable time, limit to 3,000 docs
topics, probs = topic_model.fit_transform(docs[:3_000])
df = topic_model.get_topic_info()
print(df)
# Output
Topic Count Name
0 -1 23 -1_the_of_in_to
1 0 2635 0_the_to_of_and
2 1 114 1_the_he_to_in
3 2 103 2_the_to_in_and
4 3 59 3_ditto_was__
5 4 34 4_pool_andy_table_tell
6 5 32 5_the_to_game_and
First dataframe
Using Pandas string methods:
df = (
    df.rename(columns={"Name": "class"})
    .drop(columns=["Topic", "Count"])
    .reset_index(drop=True)
)
df["entities"] = [
    [item[0] if item[0] else pd.NA for item in topics]
    for topics in topic_model.get_topics().values()
]
df = df.loc[~df["class"].str.startswith("-1"), :]  # Remove the -1 outlier topic
df["class"] = df["class"].replace(
    r"^-?\d+_", "", regex=True
)  # remove prefix '1_', '2_', ...
print(df)
# Output
class entities
1 the_to_of_and [the, to, of, and, is, in, that, it, for, you]
2 the_he_to_in [the, he, to, in, and, that, is, of, his, year]
3 the_to_in_and [the, to, in, and, of, he, team, that, was, game]
4 ditto_was__ [ditto, was, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>]
5 pool_andy_table_tell [pool, andy, table, tell, us, well, your, about, <NA>, <NA>]
6 the_to_game_and [the, to, game, and, games, espn, on, in, is, have]
Second dataframe
Using Pandas transpose:
other_df = df.T.reset_index(drop=True)
new_col_labels = other_df.iloc[0] # save first row
other_df = other_df[1:] # remove first row
other_df.columns = new_col_labels
other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})
print(other_df)
# Output
the_to_of_and the_he_to_in the_to_in_and ditto_was__ pool_andy_table_tell the_to_game_and
0 the the the ditto pool the
1 to he to was andy to
2 of to in <NA> table game
3 and in and <NA> tell and
4 is and of <NA> us games
5 in that he <NA> well espn
6 that is team <NA> your on
7 it of that <NA> about in
8 for his was <NA> <NA> is
9 you year game <NA> <NA> have
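As an aside, since every topic above keeps the same number of terms, the same wide frame can be built straight from the class and entities columns of the first dataframe, skipping the transpose; a small sketch under that assumption:
# One column per topic name, filled with that topic's terms
other_df = pd.DataFrame(dict(zip(df["class"], df["entities"])))
print(other_df)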

Algo to identify slightly different uniquely identifiable common names in 3 DataFrame columns

Sample DataFrame df has 3 columns to identify any given person, viz., name, nick_name, initials. They can have slight differences in the way they are specified, but looking at the three columns together it is possible to overcome these differences, separate out all the rows for a given person, and normalize these 3 columns to a single value for each person.
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':range(9), 'name':['Theodore', 'Thomas', 'Theodore', 'Christian', 'Theodore', 'Theodore R', 'Thomas', 'Tomas', 'Cristian'], 'nick_name':['Tedy', 'Tom', 'Ted', 'Chris', 'Ted', 'Ted', 'Tommy', 'Tom', 'Chris'], 'initials':['TR', 'Tb', 'TRo', 'CS', 'TR', 'TR', 'tb', 'TB', 'CS']})
>>> df
ID name nick_name initials
0 0 Theodore Tedy TR
1 1 Thomas Tom Tb
2 2 Theodore Ted TRo
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore R Ted TR
6 6 Thomas Tommy tb
7 7 Tomas Tom TB
8 8 Cristian Chris CS
In this case desired output is as follows:
ID name nick_name initials
0 0 Theodore Ted TR
1 1 Thomas Tom TB
2 2 Theodore Ted TR
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore Ted TR
6 6 Thomas Tom TB
7 7 Thomas Tom TB
8 8 Christian Chris CS
The common value can be anything as long as it is normalized to the same value. For example, the name can be either Theodore or Theodore R - both are fine.
My actual DataFrame is about 4000 rows. Could someone suggest an optimal algorithm to do this?
You'll want to use Levenshtein distance to identify similar strings. A good Python package for this is fuzzywuzzy. Below I used a basic dictionary approach to collect similar rows together, then overwrite each chunk with a designated master row. Note that this leaves a CSV with many duplicate rows; I don't know if that is what you want, but if not, it is easy enough to take the duplicates out.
import pandas as pd
from itertools import chain
from fuzzywuzzy import fuzz
def cluster_rows(df):
    row_clusters = {}
    threshold = 90
    name_rows = list(df.iterrows())
    for i, nr in name_rows:
        name = nr['name']
        new_cluster = True
        for other in row_clusters.keys():
            if fuzz.ratio(name, other) >= threshold:
                row_clusters[other].append(nr)
                new_cluster = False
        if new_cluster:
            row_clusters[name] = [nr]
    return row_clusters

def normalize_rows(row_clusters):
    for name in row_clusters:
        master = row_clusters[name][0]
        for row in row_clusters[name][1:]:
            for key in row.keys():
                row[key] = master[key]
    return row_clusters

if __name__ == '__main__':
    df = pd.read_csv('names.csv')
    rc = cluster_rows(df)
    normalized = normalize_rows(rc)
    pd.DataFrame(chain(*normalized.values())).to_csv('norm-names.csv')
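If you would rather keep each row's original ID (as in the desired output) instead of copying the master row wholesale, one option is to overwrite only the three name columns. A hedged variant of normalize_rows:
def normalize_rows(row_clusters):
    # Overwrite only the identity columns, so each row keeps its own ID
    for name in row_clusters:
        master = row_clusters[name][0]
        for row in row_clusters[name][1:]:
            for key in ('name', 'nick_name', 'initials'):
                row[key] = master[key]
    return row_clusters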

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|Nan|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and calculate the mean of ratio over the rows that meet these conditions: status is finished and start_date is in the range 2019-09 to 2019-10
result_count: group by id and address and count the rows that meet these conditions: status is either finished or failed, and start_date is in the range 2019-09 to 2019-10
The desired output will look like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter rows where start_date is in the range 2019-09 to 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows where status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some rows are excluded entirely by the filter (like id=1), add them back with DataFrame.join against all unique id and address pairs from df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
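If result_count should be an integer with a plain 0..4 index as in the desired output, a small follow-up does it:
df1['result_count'] = df1['result_count'].astype(int)
df1 = df1.reset_index(drop=True)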
Some helpers
def mean_ratio(idf):
    # filter the data
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': np.datetime64, 'end_date': np.datetime64})

# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')

# Final result
pd.concat((mean_ratio, result_count), axis=1)
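The concatenated result is indexed by the (id, address) group keys; to get them back as ordinary columns, as in the desired output, add a reset_index():
result = pd.concat((mean_ratio, result_count), axis=1).reset_index()
print(result)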

How to fill in between rows gap comparing with other dataframe using pandas?

I want to compare df1 with df2 and fill only the blanks without overwriting other values. I have no idea how to achieve this without overwriting values or creating extra columns.
Can I do this by converting df2 into a dictionary and mapping it onto df1?
df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df1:
players name hobbies sports
0 ram jog cricket
1 john basketball
2 ismael photos chess
3 sam kabadi
4 karan studying volleyball
And, df2:
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})
df2:
players name hobbies
0 jagan riding
1 mohan tv
2 john sliding
3 sam jumping
4 karan studying
I want output like this:
  players name   hobbies      sports
0          ram       jog     cricket
1         john   sliding  basketball
2       ismael    photos       chess
3          sam   jumping      kabadi
4        karan  studying  volleyball
Try this:
df1['hobbies'] = (df1['players name'].map(df2.set_index('players name')['hobbies'])
                                     .fillna(df1['hobbies']))
df1
Output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
If the blank spaces are NaN values:
df1 = pd.DataFrame({"players name":["ram","john","ismael","sam","karan"],
"hobbies":["jog",pd.np.NaN,"photos",pd.np.NaN,"studying"],
"sports":["cricket","basketball","chess","kabadi","volleyball"]})
then
dicts = df2.set_index("players name")['hobbies'].to_dict()
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(dicts))
output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
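For the NaN case, Series.combine_first expresses the same "keep existing, fill only missing" logic in one step; a minimal alternative sketch:
# Fill missing hobbies in df1 from df2, keeping any value df1 already has
lookup = df1['players name'].map(df2.set_index('players name')['hobbies'])
df1['hobbies'] = df1['hobbies'].combine_first(lookup)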

How to export multiple excels based on one column

I was trying to export multiple excels based on the value of one column. For example:
import pandas as pd
df = pd.DataFrame({'state':['PA','PA','TX','TX'],'county':['Centre','Berks','Austin','Taylor'],'a':[4,3,2,1],'b':[3,4,5,6]})
df
How can I export this dataframe to multiple Excel files based on the values of the column "state"? For example, export one Excel file containing only the rows with "state" = "PA" and another with "state" = "TX". Thanks.
A solution for n elements in the state column.
1. Imagine that this is your dataframe:
import pandas as pd
df = pd.DataFrame({'state':['PA','PA','TX','TX','RX'],'county':['Centre','Berks','Austin','Taylor','Mike'],'a':[4,3,2,1,0],'b':[3,4,5,6,7]})
print(df)
state county a b
0 PA Centre 4 3
1 PA Berks 3 4
2 TX Austin 2 5
3 TX Taylor 1 6
4 RX Mike 0 7
2. The idea: Series.unique
df['state'].unique()
array(['PA', 'TX', 'RX'], dtype=object)
As you can see, unique returns the distinct elements present in the series, without repetitions.
3. For loop
You can use a for loop to filter the dataframe based on the unique state elements returned by unique:
for state in df['state'].unique():
    print(df[df['state'].eq(state)])
    print('-'*20)
state county a b
0 PA Centre 4 3
1 PA Berks 3 4
--------------------
state county a b
2 TX Austin 2 5
3 TX Taylor 1 6
--------------------
state county a b
4 RX Mike 0 7
4. Send to excel
for state in df['state'].unique():
    df[df['state'].eq(state)].to_excel(state + '.xlsx')
On the use of DataFrame.eq, DataFrame.ne and the operator ~:
My suggestion in your comment to use ~ was because there were only two states.
The following expressions are equivalent:
~df.eq(a)
df.ne(a)
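An equivalent, arguably more idiomatic variant uses groupby, which hands you each state's subset directly:
# Each iteration yields the state value and the rows belonging to it
for state, group in df.groupby('state'):
    group.to_excel(state + '.xlsx', index=False)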
