Merge pandas dataframes based on several conditions for a large dataset - python-3.x

I want to merge two DataFrames based on several conditions. First, I want to match on the full name only; then, for the entries that did not match, I want to use the first and last names as the matching condition. I have two DataFrames as follows:
df1
first_name last_name full_name
John Shoeb John Shoeb
John Shumon John Md Shumon
Abu Babu Abu A Babu
William Curl William Curl
df2
givenName surName displayName
John Shoeb John Shoeb
John Shumon John M Shumon
Abu Babu Abu Babu
Raju Kaju Raju Kaju
Bill Curl Bill Curl
I first merge them based on full name:
df3 = pd.merge(df1, df2, left_on=df1['full_name'].str.lower(), right_on=df2['displayName'].str.lower(), how='left')
Then I add status and log columns:
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'status'] = True
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'log'] = 'Full Name Matching'
So the resultant dataframe df3 now looks like:
first_name last_name full_name givenName surName displayName status log
John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
John Shumon John Md Shumon NaN NaN NaN NaN NaN
Abu Babu Abu A Babu NaN NaN NaN NaN NaN
William Curl William Curl NaN NaN NaN False NaN
Expected Results
Now I want to apply a second matching condition based on df1 (first_name and last_name) and df2 (givenName and surName). The final DataFrame should look as follows:
first_name last_name full_name givenName surName displayName status log
John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
John Shumon John Md Shumon John Shumon John Shumon True FN LN Matching
Abu Babu Abu A Babu Abu Babu Abu Babu True FN LN Matching
William Curl William Curl NaN NaN NaN False NaN
Question: For the second part, i.e. first name and last name matching, I was able to get it done using the DataFrame's itertuples(). However, when the same operations are applied to a huge dataset, it runs forever. I'm looking for an efficient approach that can be applied to a large amount of data.

You can use indicator=True in both merges. Then check, for each row, whether the first merge or the second merge reported "both" (for example with np.where):
df3 = (
pd.merge(
df1,
df2,
left_on=df1["full_name"].str.lower(),
right_on=df2["displayName"].str.lower(),
how="left",
indicator=True,
)
.drop(columns="key_0")
.rename(columns={"_merge": "first_merge"})
)
df3 = pd.merge(
df3,
df2,
left_on=df3["first_name"].str.lower() + " " + df3["last_name"].str.lower(),
right_on=df2["givenName"].str.lower() + " " + df2["surName"].str.lower(),
how="left",
indicator=True,
)
df3["log"] = np.where(
(df3["first_merge"] == "both"),
"Full Name Matching",
np.where(df3["_merge"] == "both", "FN LN Matching", None),
)
df3["status"] = df3["log"].notna()
df3 = df3[
[
"first_name",
"last_name",
"full_name",
"givenName_y",
"surName_y",
"displayName_y",
"status",
"log",
]
].rename(
columns={
"givenName_y": "givenName",
"surName_y": "surName",
"displayName_y": "displayName",
}
)
print(df3)
Prints:
first_name last_name full_name givenName surName displayName status log
0 John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
1 John Shumon John Md Shumon John Shumon John M Shumon True FN LN Matching
2 Abu Babu Abu A Babu Abu Babu Abu Babu True FN LN Matching
3 William Curl William Curl NaN NaN NaN False None
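One caveat with chained left merges on a large dataset is that duplicate keys in df2 multiply rows. The following is a minimal sketch, not the answerer's code, that runs the two lookups independently and then combines them. It assumes each key should resolve to at most one df2 row (so duplicates are dropped), and the helper column names full_key and fl_key are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "first_name": ["John", "John", "Abu", "William"],
    "last_name": ["Shoeb", "Shumon", "Babu", "Curl"],
    "full_name": ["John Shoeb", "John Md Shumon", "Abu A Babu", "William Curl"],
})
df2 = pd.DataFrame({
    "givenName": ["John", "John", "Abu", "Raju", "Bill"],
    "surName": ["Shoeb", "Shumon", "Babu", "Kaju", "Curl"],
    "displayName": ["John Shoeb", "John M Shumon", "Abu Babu", "Raju Kaju", "Bill Curl"],
})

# Lower-cased join keys (hypothetical helper columns)
left = df1.assign(
    full_key=df1["full_name"].str.lower(),
    fl_key=(df1["first_name"] + " " + df1["last_name"]).str.lower(),
)
right = df2.assign(
    full_key=df2["displayName"].str.lower(),
    fl_key=(df2["givenName"] + " " + df2["surName"]).str.lower(),
)

# Assumption: one df2 row per key, so each left merge keeps one row per df1 row
by_full = right.drop_duplicates("full_key").drop(columns="fl_key")
by_fl = right.drop_duplicates("fl_key").drop(columns="full_key")

m1 = left.merge(by_full, on="full_key", how="left")
m2 = left.merge(by_fl, on="fl_key", how="left")

cols = ["givenName", "surName", "displayName"]
full_hit = m1["displayName"].notna()
fl_hit = m2["displayName"].notna()

# Prefer the full-name match; fall back to the first/last-name match
matched = m1[cols].where(full_hit, m2[cols])
result = pd.concat([df1, matched], axis=1)
result["status"] = full_hit | fl_hit
result["log"] = np.where(
    full_hit, "Full Name Matching", np.where(fl_hit, "FN LN Matching", None)
)
print(result)
```

Both merges are hash joins, so this should scale far better than an itertuples() loop. Whether dropping duplicate keys is acceptable depends on the data.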

Related

Updating filtered data frame in pandas

I can't figure out why updating a filtered DataFrame does not work. The code doesn't raise any error either. I'd be grateful for any hints.
The problem arises when I want to update the DataFrame, but only for a given selection of rows.
The DataFrame.update method updates one DataFrame with values from another, aligned on the index. But it does nothing when applied to a filtered DataFrame.
Sample data:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 NaN NaN
df_2
index Name Surname
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 Marie Sklodowska-Curie
Code:
df_1.update(df_2, overwrite = False)
Returns:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
While the code below:
df_1[(df_1["Name"].notna()) & (df_1["Surname"].notna())].update(df_2, overwrite = False) #not working
does not apply any updates to the DataFrame.
Returns:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 NaN NaN
I'm looking for help with solving this and with understanding why it happens. Thanks!
EDIT: If you only need to replace missing values with values from another DataFrame, use DataFrame.fillna or DataFrame.combine_first:
df = df_1.fillna(df_2)
#alternative
#df = df_1.combine_first(df_2)
print (df)
Name Surname
index
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
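It does not work because update modifies the filtered subset in place, and that subset is a new object, so df_1 itself is left untouched. A possible (ugly) solution is to update a filtered copy and then add back the original rows that were not selected: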
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df = df_1[m].copy()
df.update(df_2)
df = pd.concat([df, df_1[~m]]).sort_index()
print (df)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
Possible solution without update:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df_1[m] = df_2
print (df_1)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
update applies modifications in place, so if you select a subset of your DataFrame, only the subset is modified and not your original DataFrame.
Use mask:
df_1.update(df_2.mask(df_1.isna().any(axis=1)))
print(df_1)
# Output:
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
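The behaviour described above can be checked directly. This is a minimal sketch that rebuilds the sample frames from the question. It shows that calling update on a boolean-filtered selection changes only that selection, and it uses DataFrame.where as one more possible alternative (my own suggestion, not taken from the answers).

```python
import numpy as np
import pandas as pd

idx = pd.Index(["R222", "R343", "R377", "R914"], name="index")
df_1 = pd.DataFrame(
    {"Name": ["Katrin", "John", "Steven", np.nan],
     "Surname": ["Johnes", "Doe", "Walkins", np.nan]},
    index=idx,
)
df_2 = pd.DataFrame(
    {"Name": ["Pablo", "Jarque", "Christofer", "Marie"],
     "Surname": ["Picasso", "Berry", "Bishop", "Sklodowska-Curie"]},
    index=idx,
)

# Rows of df_1 where both columns are present
m = df_1["Name"].notna() & df_1["Surname"].notna()

# Boolean indexing returns a new object, so update() changes only that object
subset = df_1[m].copy()
subset.update(df_2)
print(df_1.loc["R222", "Name"])  # still "Katrin"

# Alternative: keep df_1 where the mask is False, take df_2 where it is True
result = df_1.where(~m, df_2)
print(result)
```

where returns a new DataFrame rather than modifying df_1, so assign it back if you want the original name to hold the result.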

Using Regex to change the name values format in a dataframe

I'm pretty sure I'm asking the wrong question here, so here goes. I have two DataFrames; let's call them df1 and df2.
df1 looks like this:
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
df2 looks similar:
data = {'Employee ID' : [12345, 23456, 34567],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID','Employee Name','Department Supervisor'])
My issue is that df1 comes from an Excel file that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in: df2 is a SQL pull from the employee database that I'm using to validate the employee names and supervisor names, to ensure the correct employee ID is used.
Normally I'd be happy to merge the DataFrames to get my desired result, but with the supervisor names being in different formats I'd like to use regex on df1 to turn 'Wendy Davis' into 'Davis, Wendy', along with the other supervisor names, to match what df2 has. So far I'm coming up empty on how to search for an answer. Any suggestions?
IIUC, is this what you need?
df1['DS Corrected'] = df1['Department Supervisor'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
Output:
Employee ID Values Employee Name Department Supervisor DS Corrected
0 12345 123168546543154 Jones, John Wendy Davis Davis, Wendy
1 23456 13513545435145434 Potter, Harry Albus Dumbledore Dumbledore, Albus
2 34567 556423145613 Watts, Wade James Halliday Halliday, James
Since Albus' full name is Albus Percival Wulfric Brian Dumbledore and James' is James Donovan Halliday (if we're talking about Ready Player One) then consider a dataframe of:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Wendy Davis
1 23456 13513545435145434 Potter, Harry Albus Percival Wulfric Brian Dumbledore
2 34567 556423145613 Watts, Wade James Donovan Halliday
So we need to swap the last name to the front with...
import pandas as pd
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Percival Wulfric Brian Dumbledore', 'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
def swap_names(text):
    first, *middle, last = text.split()
    if len(middle) == 0:
        return last + ', ' + first
    else:
        return last + ', ' + first + ' ' + ' '.join(middle)
df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]
print(df1)
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus Percival Wulfric Brian
2 34567 556423145613 Watts, Wade Halliday, James Donovan
Maybe...
df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus
2 34567 556423145613 Watts, Wade Halliday, James
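For comparison, the regex approach can also be adapted to the middle-name case. This is only a sketch under the assumption that the last whitespace-separated word is always the surname. The pattern below is my own, not from the answers above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Employee ID": [12345, 23456, 34567],
    "Department Supervisor": [
        "Wendy Davis",
        "Albus Percival Wulfric Brian Dumbledore",
        "James Donovan Halliday",
    ],
})

# Greedy (.*) takes everything up to the last run of whitespace;
# (\S+) is the final word, treated here as the surname (an assumption).
df1["DS Corrected"] = (
    df1["Department Supervisor"]
    .str.strip()
    .str.replace(r"^(.*)\s+(\S+)$", r"\2, \1", regex=True)
)
print(df1["DS Corrected"].tolist())
```

Single-word names do not match the pattern and are left unchanged. Surnames made of several words (for example "van der Berg") would need a different rule.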

Joining column of different rows in pandas

If I have a DataFrame and I want to merge the ID column based on the Name column without deleting any rows, how would I do this?
Example:
Name ID
John ABC
John XYZ
Lucy MNO
I want to convert the above DataFrame into the one below:
Name ID
John ABC, XYZ
John ABC, XYZ
Lucy MNO
Use GroupBy.transform with join:
df['ID'] = df.groupby('Name')['ID'].transform(', '.join)
print (df)
Name ID
0 John ABC, XYZ
1 John ABC, XYZ
2 Lucy MNO
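To see why transform fits here, it helps to put it next to agg. This is a small self-contained sketch using the question's data; the column name ID_joined is made up so that the original ID column stays visible.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "John", "Lucy"], "ID": ["ABC", "XYZ", "MNO"]})

# agg collapses each group to a single row ...
collapsed = df.groupby("Name")["ID"].agg(", ".join)

# ... while transform broadcasts the group result back to every original row
df["ID_joined"] = df.groupby("Name")["ID"].transform(", ".join)
print(collapsed)
print(df)
```

This relies on the IDs being strings. For other dtypes you would likely need to convert them first, for example with astype(str).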

How do you fill uneven pandas dataframe column with first value in column

import pandas as pd
d = {'Name' : ['John'], 'Last Name': ['Smith'], 'Activity':['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']}
df = pd.DataFrame(d)
How do I make it so 'John' & 'Smith' are populated in each 'Activity' that he does in a dataframe?
Let us try json_normalize
out = pd.json_normalize(d,'Activity',['Name','Last Name'])
Out[160]:
0 Name Last Name
0 Run John Smith
1 Jump John Smith
2 Hide John Smith
3 Swim John Smith
4 Eat John Smith
5 Sleep John Smith
Input
d = {'Name' : ['John'], 'Last Name': ['Smith'], 'Activity':['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']}
If you strictly have one pair of Name/Last Name, you can modify the dictionary so that pandas reads activity as a list
d = {k: [v] if len(v) > 1 else v for k, v in d.items()}
df = pd.DataFrame(d)
df.explode('Activity')
Name Last Name Activity
0 John Smith Run
0 John Smith Jump
0 John Smith Hide
0 John Smith Swim
0 John Smith Eat
0 John Smith Sleep
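Another option, again assuming there is exactly one Name/Last Name pair, is to rely on the fact that the DataFrame constructor repeats scalar values when at least one column is list-like. A minimal sketch (the variable name flat is just for illustration):

```python
import pandas as pd

d = {
    "Name": ["John"],
    "Last Name": ["Smith"],
    "Activity": ["Run", "Jump", "Hide", "Swim", "Eat", "Sleep"],
}

# Unwrap single-element lists into scalars; pandas repeats a scalar for
# every row as long as at least one column is list-like.
flat = {k: v[0] if len(v) == 1 else v for k, v in d.items()}
df = pd.DataFrame(flat)
print(df)
```

Unlike explode, this gives a regular 0..5 index without a further reset_index call.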

Separate a name into first and last name using Pandas

I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
.split(' ',expand = True)
.rename(columns = {0:'name',1:'middle',2:'last'}))
new_df = df2.assign(middle = df2['middle'].where(df2['last'].notnull()),
last = df2['last'].fillna(df2['middle']),
birth = df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
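A regex with an optional middle group is another way to get the same columns in one step. This is a sketch that assumes every name has exactly two or three whitespace-separated parts, as stated in the question. The pattern and the group name first are my own choices.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John Henry Smith", "Hannah Gonzalez", "Michael Thomas Ford", "Michelle Lee"],
    "birth": [1980, 1900, 1950, 1984],
})

# first word, optional middle word, last word
pattern = r"^(?P<first>\S+)(?:\s+(?P<middle>\S+))?\s+(?P<last>\S+)$"

parts = df["name"].str.strip().str.extract(pattern)
out = parts.rename(columns={"first": "name"}).assign(birth=df["birth"])
print(out)
```

Because the pattern uses \s+ and the input is stripped first, stray double or trailing spaces do not shift the columns. Names with four or more parts do not match and come back as all-NaN rows.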
