Updating filtered data frame in pandas - python-3.x

I can't figure out why updating a filtered data frame is not working. The code also does not return any error message. I'd be grateful for hints or help.
The problem appears when I want to update the dataframe, but only for a given selection of rows.
The .update function on DataFrame objects updates one data set from another, aligned on the index. But it does not do anything when applied to a filtered dataframe.
Sample data:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 NaN NaN
df_2
index Name Surname
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 Marie Sklodowska-Curie
Code:
df_1.update(df_2, overwrite=False)
Returns:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
While the code below:
df_1[(df_1["Name"].notna()) & (df_1["Surname"].notna())].update(df_2, overwrite=False)  # not working
does not apply any updates to the given DataFrame.
Returns:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 NaN NaN
I'm looking for help on solving this and understanding why it happens. Thanks!

EDIT: If you need to replace only the missing values from another DataFrame, use DataFrame.fillna or DataFrame.combine_first:
df = df_1.fillna(df_2)
#alternative
#df = df_1.combine_first(df_2)
print (df)
Name Surname
index
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
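For reproducibility, a minimal sketch that rebuilds the two frames from the tables above; it also shows where fillna and combine_first differ. The extra Dept column is purely hypothetical.
import numpy as np
import pandas as pd

# df_1 / df_2 as in the sample tables above
idx = pd.Index(["R222", "R343", "R377", "R914"], name="index")
df_1 = pd.DataFrame({"Name": ["Katrin", "John", "Steven", np.nan],
                     "Surname": ["Johnes", "Doe", "Walkins", np.nan]}, index=idx)
df_2 = pd.DataFrame({"Name": ["Pablo", "Jarque", "Christofer", "Marie"],
                     "Surname": ["Picasso", "Berry", "Bishop", "Sklodowska-Curie"]}, index=idx)

df = df_1.fillna(df_2)            # fills only the missing R914 values
# df = df_1.combine_first(df_2)   # same result with these frames

# The two differ when df_2 carries columns (or rows) that df_1 lacks:
extra = df_2.assign(Dept="Art")   # hypothetical extra column
print(df_1.fillna(extra))         # still only Name / Surname
print(df_1.combine_first(extra))  # union of columns, Dept is included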
It is not working because update modifies the subset of the DataFrame in place; a possible (admittedly ugly) solution is to update the filtered DataFrame df and then add back the original rows that did not match the filter:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df = df_1[m].copy()
df.update(df_2)
df = pd.concat([df, df_1[~m]]).sort_index()
print (df)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
A possible solution without update:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df_1[m] = df_2
print (df_1)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN

update applies modifications in place, so if you select a subset of your dataframe, only that subset (which is a copy) will be modified, not your original dataframe.
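A quick way to see this, reusing the df_1 / df_2 built in the sketch further up: the boolean filter returns a brand-new object, so update modifies that copy while the original is left untouched.
subset = df_1[df_1["Name"].notna() & df_1["Surname"].notna()]  # a copy, not a view

subset.update(df_2)   # modifies the copy in place
print(subset)         # R222 / R343 / R377 now hold the df_2 values
print(df_1)           # unchanged: the update never reached the original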
Use mask:
df_1.update(df_2.mask(df_1.isna().any(axis=1)))
print(df_1)
# Output:
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
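For reference, the intermediate frame produced by the mask step looks like this (same df_1 / df_2 as above); update ignores NaN values in the other frame, which is why R914 stays untouched.
blocked = df_2.mask(df_1.isna().any(axis=1))
print(blocked)         # the R914 row is all NaN, the other rows keep the df_2 values

df_1.update(blocked)   # the all-NaN row is skipped by update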

Related

Merge pandas dataframes based on several conditions for a large dataset

I want to merge two data frames based on certain conditions. First, I want to match on the full name only; then, for the mismatched entries, I would like to use the first and last names as the matching condition. I have two DataFrames as follows:
df1
first_name last_name full_name
John Shoeb John Shoeb
John Shumon John Md Shumon
Abu Babu Abu A Babu
William Curl William Curl
df2
givenName surName displayName
John Shoeb John Shoeb
John Shumon John M Shumon
Abu Babu Abu Babu
Raju Kaju Raju Kaju
Bill Curl Bill Curl
I first merge them based on full name:
df3 = pd.merge(df1, df2, left_on=df1['full_name'].str.lower(), right_on=df2['displayName'].str.lower(), how='left')
And add status and log columns:
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'status'] = True
df3.loc[ (df3.full_name.str.lower()==df3.displayName.str.lower()), 'log'] = 'Full Name Matching'
So the resultant dataframe df3 now looks like:
first_name last_name full_name givenName surName displayName status log
John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
John Shumon John Md Shumon NaN NaN NaN NaN NaN
Abu Babu Abu A Babu NaN NaN NaN NaN NaN
William Curl William Curl NaN NaN NaN False NaN
Expected Results
Now I want to apply a matching condition based on df1 (First Name and Last Name) and df2 (givenName and surName). The final dataframe should look as follows:
first_name last_name full_name givenName surName displayName status log
John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
John Shumon John Md Shumon John Shumon John Shumon True FN LN Matching
Abu Babu Abu A Babu Abu Babu Abu Babu True FN LN Matching
William Curl William Curl NaN NaN NaN False NaN
Question: For the second part, i.e. First Name and Last Name matching, I was able to get it done using the dataframe's itertuples(). However, when the same operations are applied to a huge dataset, it keeps running forever. I'm looking for more efficient ways so it can be applied to a big chunk of data.
You can use indicator=True in your merges. Then check whether the first merge and the second merge were "both" (for example with np.where):
import numpy as np
import pandas as pd

df3 = (
    pd.merge(
        df1,
        df2,
        left_on=df1["full_name"].str.lower(),
        right_on=df2["displayName"].str.lower(),
        how="left",
        indicator=True,
    )
    .drop(columns="key_0")
    .rename(columns={"_merge": "first_merge"})
)
df3 = pd.merge(
    df3,
    df2,
    left_on=df1["first_name"].str.lower() + " " + df1["last_name"].str.lower(),
    right_on=df2["givenName"].str.lower() + " " + df2["surName"].str.lower(),
    how="left",
    indicator=True,
)
df3["log"] = np.where(
    (df3["first_merge"] == "both"),
    "Full Name Matching",
    np.where(df3["_merge"] == "both", "FN LN Matching", None),
)
df3["status"] = df3["log"].notna()
df3 = df3[
    [
        "first_name",
        "last_name",
        "full_name",
        "givenName_y",
        "surName_y",
        "displayName_y",
        "status",
        "log",
    ]
].rename(
    columns={
        "givenName_y": "givenName",
        "surName_y": "surName",
        "displayName_y": "displayName",
    }
)
print(df3)
Prints:
first_name last_name full_name givenName surName displayName status log
0 John Shoeb John Shoeb John Shoeb John Shoeb True Full Name Matching
1 John Shumon John Md Shumon John Shumon John M Shumon True FN LN Matching
2 Abu Babu Abu A Babu Abu Babu Abu Babu True FN LN Matching
3 William Curl William Curl NaN NaN NaN False None
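To reproduce the snippet above, the two input frames can be rebuilt like this (values copied from the question's tables):
import pandas as pd

df1 = pd.DataFrame({
    "first_name": ["John", "John", "Abu", "William"],
    "last_name": ["Shoeb", "Shumon", "Babu", "Curl"],
    "full_name": ["John Shoeb", "John Md Shumon", "Abu A Babu", "William Curl"],
})
df2 = pd.DataFrame({
    "givenName": ["John", "John", "Abu", "Raju", "Bill"],
    "surName": ["Shoeb", "Shumon", "Babu", "Kaju", "Curl"],
    "displayName": ["John Shoeb", "John M Shumon", "Abu Babu", "Raju Kaju", "Bill Curl"],
})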

Text data massaging to conduct distance calculations in python

I am trying to get the text data from dataframe "A" into the rows and the text data from dataframe "B" into the columns of a new dataframe "C", in order to run distance calculations.
Data in dataframe "A" looks like this
Unique -> header
'Amy'
'little'
'sheep'
'dead'
Data in dataframe "B" looks like this
common_words -> header
'Amy'
'George'
'Barbara'
I want the output in dataframe C as:
Amy George Barbara
Amy
little
sheep
dead
Can anyone help me with this?
What should be the actual content of data frame C? Do you only want to initialise it to some value (e.g. 0) in the first step and then fill it with the distance calculations?
You could initialise C in the following way:
import pandas as pd
A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])
C = pd.DataFrame([[0] * len(B)] * len(A), index=A[0], columns=B[0])
C will then look like:
Amy George Barbara
0
Amy 0 0 0
little 0 0 0
sheep 0 0 0
dead 0 0 0
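If the next step is to actually fill C with pairwise string distances, here is a minimal sketch using difflib from the standard library; the choice of similarity metric is an assumption, since the question does not name one.
from difflib import SequenceMatcher

import pandas as pd

A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])

# similarity ratio in [0, 1] for every (A word, B word) pair
C = pd.DataFrame(
    [[SequenceMatcher(None, a, b).ratio() for b in B[0]] for a in A[0]],
    index=A[0],
    columns=B[0],
)
print(C)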
Please use pd.DataFrame(index=[list], columns=[list]).
Extract the relevant lists using list(df.columnname.values).
Dummy data
print(dfA)
Header
0 Amy
1 little
2 sheep
3 dead
print(dfB)
Header
0 Amy
1 George
2 Barbara
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values))
Amy George Barbara
Amy NaN NaN NaN
little NaN NaN NaN
sheep NaN NaN NaN
dead NaN NaN NaN
If you are interested in dfC without NaNs, use:
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values)).fillna(' ')
Amy George Barbara
Amy
little
sheep
dead
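As a side note, the same empty frame can be created in one step by passing a scalar fill value to the DataFrame constructor (a small variation on the snippet above):
dfC = pd.DataFrame(' ', index=list(dfA.Header.values), columns=list(dfB.Header.values))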

How to split a Dataframe column whose data is not unique

I have a column called Users in a dataframe which doesn't have a uniform format. I am doing a data cleanup project, as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the query below, which broke the data frame down as shown:
df2 = df
df2 = df2.join(df['Users'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
and breaking the above df down further on "," with the same approach, I got this output:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want to get the output below instead. I feel the best way to name the columns is to use the key from the value itself (e.g. "Name" from "Name":"Martin"); if we hardcode names using df.rename, the column names will get mismatched.
Company Name_1 Email_1 EmpType_1 Dept_1 Name_2 Email_2 Dept_2
1 Martin name_1#email.com Full Rick name_2#email.com HR
2 John name_2#email.com Full Sales
Is there any way I can get the above output from the original dataframe?
Use:
import ast
import pandas as pd

df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:
First we use ast.literal_eval to evaluate the strings in Users column, then use DataFrame.explode on column Users to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
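For anyone who wants to run the steps above end to end, the input frame can be rebuilt like this (rows copied from the question, keeping its '#' in the email addresses as posted):
import pandas as pd

df = pd.DataFrame({
    "company": ["A", "B"],
    "Users": [
        '[{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})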

Show differences at row level between columns of 2 dataframes Pandas

I have 2 dataframes containing names and some demographic information; the dataframes are not identical due to monthly changes.
I'd like to create another df to show just the names of people where there are changes in either their COUNTRY, JOBCODE or MANAGERNAME columns, and also show what kind of changes these are.
I have tried the following code so far and am able to detect changes in the COUNTRY column between the 2 dataframes for the common rows.
But I am not so sure how to capture the movement in the MOVEMENT columns. I'd appreciate any form of help.
# Merge first
dfmerge = pd.merge(df1, df2, how='inner', on='EMAIL')

# create function to get COUNTRY_CHANGE column
def change_in(dfmerge):
    if dfmerge['COUNTRY_x'] != dfmerge['COUNTRY_y']:
        return 'YES'
    else:
        return 'NO'

dfmerge['COUNTRYCHANGE'] = dfmerge.apply(change_in, axis=1)
Dataframe 1
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com USA 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 1222 Cindy Lee
Jessica Lang jessicalang#123.com AUSTRALIA 1221 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Mike Lens
Samir Bala samirbala#123.com CANADA 1221 Ricky Easton
Dataframe 2
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com VIETNAM 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 4464 Sheldon Tracey
Jessica Lang jessicalang#123.com AUSTRALIA 2224 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Emilia Tanner
Desired Output
EMAIL COUNTRY_CHANGE COUNTRY_MOVEMENT JOBCODE_CHANGE JOBCODE_MOVEMENT MGR_CHANGE MGR_MOVEMENT
jasonkelly#123.com YES FROM USA TO VIETNAM NO NO NO NO
jongilman#123.com NO NO YES FROM 1222 to 4464 YES FROM Cindy Lee to Sheldon Tracey
jessicalang#123.com NO NO YES FROM 1221 to 2224 NO NO
bobwilder#123.com NO NO NO NO YES FROM Mike Lens to Emilia Tanner
There is no direct feature in pandas that can do this, but we can leverage the merge function as follows: we merge the dataframes, providing suffixes for the overlapping columns, and then report their differences with the code below.
# Assuming df1 and df2 are the input data frames in your example.
df3 = pd.merge(df1, df2, on=['NAME', 'EMAIL'], suffixes=['past', 'present'])
dfans = pd.DataFrame()  # this is the final output data frame
for column in df1.columns:
    if not (column + 'present' in df3.columns or column + 'past' in df3.columns):
        # Here we handle those columns which are not suffixed by the merge, like NAME and EMAIL.
        dfans.loc[:, column] = df1.loc[:, column]  # filling NAME and EMAIL as is
    else:
        # string manipulation to name columns correctly in output
        newColumn1 = '{}_CHANGE'.format(column)
        newColumn2 = '{}_MOVEMENT'.format(column)
        past, present = "{}past".format(column), "{}present".format(column)
        # creating the output based on input
        dfans.loc[:, newColumn1] = (df3[past] == df3[present]).map(lambda x: "YES" if x != 1 else "NO")
        dfans.loc[:, newColumn2] = ["FROM {} TO {}".format(x, y) if x != y else "NO"
                                    for x, y in zip(df3[past], df3[present])]
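A small setup sketch for trying the loop above; the frames are copied from the question's tables. Note that the inner merge keeps only the emails present in both frames, so rows that exist only in df1 (here Samir Bala) drop out of the comparison.
import pandas as pd

df1 = pd.DataFrame({
    "NAME": ["Jason Kelly", "Jon Gilman", "Jessica Lang", "Bob Wilder", "Samir Bala"],
    "EMAIL": ["jasonkelly#123.com", "jongilman#123.com", "jessicalang#123.com",
              "bobwilder#123.com", "samirbala#123.com"],
    "COUNTRY": ["USA", "CANADA", "AUSTRALIA", "ROMANIA", "CANADA"],
    "JOBCODE": [1221, 1222, 1221, 1355, 1221],
    "MANAGERNAME": ["Jon Gilman", "Cindy Lee", "Esther Donato", "Mike Lens", "Ricky Easton"],
})
df2 = pd.DataFrame({
    "NAME": ["Jason Kelly", "Jon Gilman", "Jessica Lang", "Bob Wilder"],
    "EMAIL": ["jasonkelly#123.com", "jongilman#123.com", "jessicalang#123.com",
              "bobwilder#123.com"],
    "COUNTRY": ["VIETNAM", "CANADA", "AUSTRALIA", "ROMANIA"],
    "JOBCODE": [1221, 4464, 2224, 1355],
    "MANAGERNAME": ["Jon Gilman", "Sheldon Tracey", "Esther Donato", "Emilia Tanner"],
})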

Separate a name into first and last name using Pandas

I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
                 .split(' ', expand=True)
                 .rename(columns={0: 'name', 1: 'middle', 2: 'last'}))

new_df = df2.assign(middle=df2['middle'].where(df2['last'].notnull()),
                    last=df2['last'].fillna(df2['middle']),
                    birth=df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
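To reproduce the result, the sample frame can be rebuilt like this (values copied from the question):
import pandas as pd

df = pd.DataFrame({
    "name": ["John Henry Smith", "Hannah Gonzalez", "Michael Thomas Ford", "Michelle Lee"],
    "birth": [1980, 1900, 1950, 1984],
})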
