Python merge two dataframe based on text similarity of their columns - python-3.x

I am working with two dataframes which look like this:
df1

country_1                           column1
united states of america            abcd
Ireland (Republic of Ireland)       efgh
Korea Rep Of                        fsdf
Switzerland (Swiss Confederation)   dsaa

df2

country_2       column2
united states   cdda
Ireland         ddgd
South Korea     rewt
Switzerland     tuut

desired output:

country_1                           column1   country_2       column2
united states of america            abcd      united states   cdda
Ireland (Republic of Ireland)       efgh      Ireland         ddgd
Korea Rep Of                        fsdf      South Korea     rewt
Switzerland (Swiss Confederation)   dsaa      Switzerland     tuut
I am not that familiar with text analytics, so I am unable to find a method to tackle this problem. I have tried exact string matching and regex, but they cannot handle these approximate matches.

You can use difflib.
Data:
data1 = {
    "country_1": ["united states of america", "Ireland (Republic of Ireland)", "Korea Rep Of", "Switzerland (Swiss Confederation)"],
    "column1": ["abcd", "efgh", "fsdf", "dsaa"]
}
df1 = pd.DataFrame(data1)

data2 = {
    "country_2": ["united states", "Ireland", "Korea", "Switzerland"],
    "column2": ["cdda", "ddgd", "rewt", "tuut"]
}
df2 = pd.DataFrame(data2)
Code:
import difflib
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class FuzzyMerge:
    """
    Works like pandas merge except it also merges on approximate matches.
    """
    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3

    def main(self) -> pd.DataFrame:
        # Replace each right-hand key with its closest left-hand match, then merge exactly
        temp = self.right.copy()
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        return self.left.merge(temp, on=self.left_on, how=self.how)

    def get_closest_match(self, value: str, candidates: pd.Series) -> Optional[str]:
        matches = difflib.get_close_matches(value, candidates, cutoff=self.cutoff)
        return matches[0] if matches else None
Call the class:
merged = FuzzyMerge(left=df1, right=df2, left_on="country_1", right_on="country_2").main()
print(merged)
Output:
country_1 column1 country_2 column2
0 united states of america abcd united states cdda
1 Ireland (Republic of Ireland) efgh Ireland ddgd
2 Korea Rep Of fsdf Korea rewt
3 Switzerland (Swiss Confederation) dsaa Switzerland tuut
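The heavy lifting above is done by difflib.get_close_matches, which scores candidates with SequenceMatcher. A minimal sketch of that primitive on the sample country names (the cutoff of 0.3 mirrors the class default above):

```python
import difflib

candidates = [
    "united states of america",
    "Ireland (Republic of Ireland)",
    "Korea Rep Of",
    "Switzerland (Swiss Confederation)",
]

# Best approximate match for one right-hand value; cutoff is the minimum
# SequenceMatcher ratio a candidate must reach to count as a match at all
matches = difflib.get_close_matches("united states", candidates, n=1, cutoff=0.3)
best = matches[0] if matches else None
```

Raising the cutoff makes the matching stricter; with too high a cutoff, loose matches like "Korea Rep Of" vs "Korea" drop out and the merge loses rows.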

You can also approach this with basic pandas operations, i.e. join, merge, and concat; I suggest you start with concat as it is the easiest. Note that pd.concat with axis=1 aligns rows purely by index position, not by similarity, so this only works if both dataframes are already ordered so that matching countries sit on the same row.
ps: make sure the data is in DataFrame form; to convert it into a DataFrame:
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)
using concat:
data = pd.concat([data1, data2], axis=1)
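To make the positional-alignment caveat concrete, here is a small sketch using the first two rows of the sample data; concat simply glues the frames side by side by row position:

```python
import pandas as pd

data1 = pd.DataFrame({"country_1": ["united states of america", "Ireland (Republic of Ireland)"],
                      "column1": ["abcd", "efgh"]})
data2 = pd.DataFrame({"country_2": ["united states", "Ireland"],
                      "column2": ["cdda", "ddgd"]})

# axis=1 pairs rows by index position only; no matching on content happens
data = pd.concat([data1, data2], axis=1)
```

If the two frames were sorted differently, the pairing would be wrong with no error raised, which is why the difflib-based answer above is safer for unordered data.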

Related

Full country name to country code in Dataframe

I have these kinds of countries in the dataframe: some with full country names, some with alpha-2 codes.
Country
------------------------
8836 United Kingdom
1303 ES
7688 United Kingdom
12367 FR
7884 United Kingdom
6844 United Kingdom
3706 United Kingdom
3567 UK
6238 FR
588 UK
4901 United Kingdom
568 UK
4880 United Kingdom
11284 France
1273 Spain
2719 France
1386 UK
12838 United Kingdom
868 France
1608 UK
Name: Country, dtype: object
Note: Some data in Country are empty.
How will I be able to create a new column with the alpha-2 country codes in it?
Country | Country Code
---------------------------------------
United Kingdom | UK
France | FR
FR | FR
UK | UK
Italy | IT
Spain | ES
ES | ES
...
You can try this, as I already mentioned in an earlier comment.
import pandas as pd
df = pd.DataFrame([[1, 'UK'],[2, 'United Kingdom'],[3, 'ES'],[2, 'Spain']], columns=['id', 'Country'])
#Create copy of country column as alpha-2
df['alpha-2'] = df['Country']
#Create a look up with required values
lookup_table = {'United Kingdom':'UK', 'Spain':'ES'}
#replace the alpha-2 column with lookup values.
df = df.replace({'alpha-2':lookup_table})
print(df)
Output:
   id         Country alpha-2
0   1              UK      UK
1   2  United Kingdom      UK
2   3              ES      ES
3   2           Spain      ES
You will have to define a dictionary for the replacements (or find a library that does it for you). The abbreviations look pretty close to the IBAN country codes to me. The biggest standout was United Kingdom => GB, as opposed to UK in your example.
I would start with the IBAN codes and define a big dictionary like this:
mappings = {
    "Afghanistan": "AF",
    "Albania": "AL",
    ...
}
df["Country Code"] = df["Country"].replace(mappings)
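A handy property of Series.replace for this use case: values that are not keys in the dictionary (the rows that already hold alpha-2 codes) pass through unchanged. A minimal sketch with an abbreviated mapping, using GB for United Kingdom per the note above:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["United Kingdom", "ES", "FR", "UK", "Spain"]})

# Only exact key matches are replaced; "ES" and "FR" are already codes and stay as-is
mappings = {"United Kingdom": "GB", "UK": "GB", "Spain": "ES", "France": "FR"}
df["Country Code"] = df["Country"].replace(mappings)
```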

Show differences at row level between columns of 2 dataframes Pandas

I have 2 dataframes containing names and some demographic information, the dataframes are not identical due to monthly changes.
I'd like to create another df to show just the names of people where there are changes in either their COUNTRY or JOBCODE or MANAGERNAME columns, and also show what kind of changes these are.
Have tried the following code so far and am able to detect changes in the country column in the 2 dataframes for the common rows.
But am not so sure how to capture the movement in the MOVEMENT columns. Appreciate any form of help.
# Merge first
dfmerge = pd.merge(df1, df2, how='inner', on='EMAIL')

# create function to get COUNTRY_CHANGE column
def change_in(row):
    if row['COUNTRY_x'] != row['COUNTRY_y']:
        return 'YES'
    else:
        return 'NO'

dfmerge['COUNTRYCHANGE'] = dfmerge.apply(change_in, axis=1)
Dataframe 1

NAME          EMAIL                COUNTRY    JOBCODE  MANAGERNAME
Jason Kelly   jasonkelly@123.com   USA        1221     Jon Gilman
Jon Gilman    jongilman@123.com    CANADA     1222     Cindy Lee
Jessica Lang  jessicalang@123.com  AUSTRALIA  1221     Esther Donato
Bob Wilder    bobwilder@123.com    ROMANIA    1355     Mike Lens
Samir Bala    samirbala@123.com    CANADA     1221     Ricky Easton

Dataframe 2

NAME          EMAIL                COUNTRY    JOBCODE  MANAGERNAME
Jason Kelly   jasonkelly@123.com   VIETNAM    1221     Jon Gilman
Jon Gilman    jongilman@123.com    CANADA     4464     Sheldon Tracey
Jessica Lang  jessicalang@123.com  AUSTRALIA  2224     Esther Donato
Bob Wilder    bobwilder@123.com    ROMANIA    1355     Emilia Tanner
Desired Output

EMAIL                COUNTRY_CHANGE  COUNTRY_MOVEMENT     JOBCODE_CHANGE  JOBCODE_MOVEMENT   MGR_CHANGE  MGR_MOVEMENT
jasonkelly@123.com   YES             FROM USA TO VIETNAM  NO              NO                 NO          NO
jongilman@123.com    NO              NO                   YES             FROM 1222 TO 4464  YES         FROM Cindy Lee TO Sheldon Tracey
jessicalang@123.com  NO              NO                   YES             FROM 1221 TO 2224  NO          NO
bobwilder@123.com    NO              NO                   NO              NO                 YES         FROM Mike Lens TO Emilia Tanner
There is no direct feature in pandas that does this, but we can leverage the merge function: merge the dataframes with suffixes on the overlapping columns, then report the differences column by column.
# Assuming df1 and df2 are the input data frames in your example.
df3 = pd.merge(df1, df2, on=['NAME', 'EMAIL'], suffixes=['past', 'present'])
dfans = pd.DataFrame()  # this is the final output data frame

for column in df1.columns:
    if not (column + 'present' in df3.columns or column + 'past' in df3.columns):
        # The merge keys (NAME, EMAIL) are not suffixed; copy them from the merged
        # frame so the rows line up with the comparison columns built below.
        dfans.loc[:, column] = df3.loc[:, column]
    else:
        # string manipulation to name columns correctly in output
        newColumn1 = '{}_CHANGE'.format(column)
        newColumn2 = '{}_MOVEMENT'.format(column)
        past, present = '{}past'.format(column), '{}present'.format(column)
        # flag changed values and describe the movement
        dfans.loc[:, newColumn1] = (df3[past] == df3[present]).map(lambda x: "NO" if x else "YES")
        dfans.loc[:, newColumn2] = ["FROM {} TO {}".format(x, y) if x != y else "NO"
                                    for x, y in zip(df3[past], df3[present])]
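An alternative, vectorized take on the same idea: compare each suffixed column pair with numpy.where instead of looping with .loc assignments. This is a sketch rebuilt on a cut-down version of the question's sample data (two rows are enough to show the shape; suffix names are my own choice):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "EMAIL": ["jasonkelly@123.com", "jongilman@123.com"],
    "COUNTRY": ["USA", "CANADA"],
    "JOBCODE": [1221, 1222],
    "MANAGERNAME": ["Jon Gilman", "Cindy Lee"],
})
df2 = pd.DataFrame({
    "EMAIL": ["jasonkelly@123.com", "jongilman@123.com"],
    "COUNTRY": ["VIETNAM", "CANADA"],
    "JOBCODE": [1221, 4464],
    "MANAGERNAME": ["Jon Gilman", "Sheldon Tracey"],
})

merged = df1.merge(df2, on="EMAIL", suffixes=("_OLD", "_NEW"))
out = merged[["EMAIL"]].copy()
for col in ["COUNTRY", "JOBCODE", "MANAGERNAME"]:
    same = merged[f"{col}_OLD"].eq(merged[f"{col}_NEW"])
    out[f"{col}_CHANGE"] = np.where(same, "NO", "YES")
    out[f"{col}_MOVEMENT"] = np.where(
        same,
        "NO",
        "FROM " + merged[f"{col}_OLD"].astype(str) + " TO " + merged[f"{col}_NEW"].astype(str),
    )
```

Because everything is column-wise, this avoids the row-by-row list comprehension and keeps all output rows aligned with the merged frame by construction.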

How to select records with not exists condition in pandas dataframe

I have two dataframes as below. I want to rewrite the following SQL data-selection query, which contains a NOT EXISTS condition, into pandas.
SQL
SELECT ORDER_NUM, DRIVER
FROM df
WHERE 1=1
AND NOT EXISTS (
    SELECT 1
    FROM order_addition oa
    WHERE oa.Flag_Value = 'Y'
    AND df.ORDER_NUM = oa.ORDER_NUM
)
Sample data
order_addition.head(10)
ORDER_NUM Flag_Value
22574536 Y
32459745 Y
15642314 Y
12478965 N
25845673 N
36789156 N
df.head(10)
ORDER_NUM REGION DRIVER
22574536 WEST Ravi
32459745 WEST David
15642314 SOUTH Rahul
12478965 NORTH David
25845673 SOUTH Mani
36789156 SOUTH Tim
How can this be done easily in pandas?
IIUC, you can left-merge df2 with the rows of df1 where Flag_Value equals "Y", and then keep the rows where the match came up empty (NaN):
result = df2.merge(df1[df1["Flag_Value"].eq("Y")],how="left",on="ORDER_NUM")
print (result[result["Flag_Value"].isnull()])
ORDER_NUM REGION DRIVER Flag_Value
3 12478965 NORTH David NaN
4 25845673 SOUTH Mani NaN
5 36789156 SOUTH Tim NaN
Or even simpler if your ORDER_NUM are unique:
print (df2.loc[~df2["ORDER_NUM"].isin(df1.loc[df1["Flag_Value"].eq("Y"),"ORDER_NUM"])])
ORDER_NUM REGION DRIVER
3 12478965 NORTH David
4 25845673 SOUTH Mani
5 36789156 SOUTH Tim
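Another equivalent spelling of the NOT EXISTS is an anti-join via merge's indicator flag; a sketch rebuilt from the sample data above, using the question's original frame names:

```python
import pandas as pd

order_addition = pd.DataFrame({
    "ORDER_NUM": [22574536, 32459745, 15642314, 12478965, 25845673, 36789156],
    "Flag_Value": ["Y", "Y", "Y", "N", "N", "N"],
})
df = pd.DataFrame({
    "ORDER_NUM": [22574536, 32459745, 15642314, 12478965, 25845673, 36789156],
    "REGION": ["WEST", "WEST", "SOUTH", "NORTH", "SOUTH", "SOUTH"],
    "DRIVER": ["Ravi", "David", "Rahul", "David", "Mani", "Tim"],
})

# Anti-join: keep only rows of df with no matching "Y" row in order_addition
flagged = order_addition[order_addition["Flag_Value"].eq("Y")]
result = df.merge(flagged, on="ORDER_NUM", how="left", indicator=True)
result = result.loc[result["_merge"].eq("left_only"), ["ORDER_NUM", "DRIVER"]]
```

indicator=True adds a _merge column marking each row as "both" or "left_only", which maps directly onto the EXISTS / NOT EXISTS distinction in the SQL.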

Handling duplicate data with pandas

Hello everyone, I'm having some issues with the pandas Python library. Basically I'm reading a csv file with pandas and want to remove duplicates. I've tried everything and the problem is still there.
import sqlite3
import pandas as pd
import numpy
connection = sqlite3.connect("test.db")
## pandas dataframe
dataframe = pd.read_csv('Countries.csv')
##dataframe.head(3)
countries = dataframe.loc[:, ['Retailer country', 'Continent']]
countries.head(6)
Output of this will be:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 Canada North America
5 United States North America
6 France Europe
I want to drop duplicate values based on the columns from the dataframe above, so that I am left with the unique (country, continent) pairs. The desired output is:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 France Europe
I have tried some of the methods mentioned in "Using pandas for duplicate values" and looked around the net, and realized I could use the df.drop_duplicates() function, but when I use the code below with df.head(3), it displays only one row. What can I do to get those unique rows and finally loop through them?
countries.head(4)
country = countries['Retailer country']
continent = countries['Continent']
df = pd.DataFrame({'a':[country], 'b':[continent]})
df.head(3)
It seems like a simple group-by could solve your problem.
import pandas as pd
na = 'North America'
a = 'Asia'
e = 'Europe'
df = pd.DataFrame({'Retailer': [0, 1, 2, 3, 4, 5, 6],
                   'country': ['United States', 'Canada', 'Japan', 'Italy', 'Canada', 'United States', 'France'],
                   'continent': [na, na, a, e, na, na, e]})
df.groupby(['country', 'continent']).agg('count').reset_index()
The Retailer column now shows a count of the number of times each (country, continent) combination occurs. You can remove it with df = df[['country', 'continent']].
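For what the question actually asked, drop_duplicates on the original two-column frame does it directly. The asker's pd.DataFrame({'a':[country], 'b':[continent]}) wrapped each whole Series in a one-element list, producing a single-row frame, which is why head(3) showed only one row. A sketch on the sample data:

```python
import pandas as pd

countries = pd.DataFrame({
    "Retailer country": ["United States", "Canada", "Japan", "Italy",
                         "Canada", "United States", "France"],
    "Continent": ["North America", "North America", "Asia", "Europe",
                  "North America", "North America", "Europe"],
})

# Keep the first occurrence of each (country, continent) pair
unique = countries.drop_duplicates().reset_index(drop=True)
```

The result can then be looped over with unique.itertuples() like any other dataframe.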

Pandas: Fill rows if 2 column strings are the same

I have a data set with a ton of columns, and I want to back fill rows that are missing values with existing row values. The logic I am trying to implement is: if 'school' and 'country' are the same string, then fill the empty 'state' column with that row's 'state' value.
Here is an example. The problem with my attempt below is that it combines rows; I do not want to split or merge the rows. Is there a way? Thanks!
Sample Data:
import pandas as pd
school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']
df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
Above data gives preview like:
school country state name
UNIV OF CT US CT John
UNIV OF CT US Matt
OXFORD UK John
OXFORD UK ENG Ashley
ABC UNIV John
I am looking for output like this:
school country state name
UNIV OF CT US CT John
UNIV OF CT US CT Matt
OXFORD UK ENG John
OXFORD UK ENG Ashley
ABC UNIV John
Code I tried:
df = df.fillna('')
df = df.reset_index().groupby(['school','country']).agg(';'.join)
df = pd.DataFrame(df).reset_index()
len(df)
You can write a small function that looks up the state, if it is blank, based on the school and country.
def find_state(school, country, state):
    if len(state) > 0:
        return state
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    return max(found_state)
So the full example would be as follows:
import pandas as pd

school = ['Univ of CT','Univ of CT','Oxford','Oxford','ABC Univ']
name = ['John','Matt','John','Ashley','John']
country = ['US','US','UK','UK','']
state = ['CT','','','ENG','']

df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()

def find_state(school, country, state):
    if len(state) > 0:
        return state
    found_state = df['state'][(df['school'] == school) & (df['country'] == country)]
    return max(found_state)

df['state_new'] = [find_state(school, country, state)
                   for school, country, state in df[['school','country','state']].values]
print(df)
school country state name state_new
0 UNIV OF CT US CT John CT
1 UNIV OF CT US Matt CT
2 OXFORD UK John ENG
3 OXFORD UK ENG Ashley ENG
4 ABC UNIV John
Try this: first convert empty strings to NaN, then simply use ffill() and bfill() within each (school, country) group.
import numpy as np

df = pd.DataFrame({'school':school,'country':country,'state':state,'name':name})
df['school'] = df['school'].str.upper()
df['state'] = df['state'].astype(str).replace('', np.nan)
df['state'] = df.groupby(['school', 'country'])['state'].transform(lambda x: x.ffill().bfill())
print(df)
school country state name
UNIV OF CT US CT John
UNIV OF CT US CT Matt
OXFORD UK ENG John
OXFORD UK ENG Ashley
ABC UNIV NaN John
