I can't figure out why updating filtered data frames isn't working. The code doesn't return any error message either. I'd be grateful for hints or help.
The problem arises when I want to update the dataframe, but only for a given selection.
The .update function on DataFrame objects updates one data set from another, matching rows by index. But it does not do anything when applied to a filtered dataframe.
Sample data:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 NaN NaN
df_2
index Name Surname
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 Marie Sklodowska-Curie
Code:
df_1.update(df_2, overwrite = False)
Returns:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
While the code below:
df_1[(df_1["Name"].notna()) & (df_1["Surname"].notna())].update(df_2, overwrite = False) #not working
does not apply any updates to the dataframe.
Returns:
df_1
index Name Surname
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 NaN NaN
I'm looking for help on solving this, and on understanding why it happens this way. Thanks!
EDIT: If you need to replace only missing values with values from another DataFrame, use DataFrame.fillna or DataFrame.combine_first:
df = df_1.fillna(df_2)
#alternative
#df = df_1.combine_first(df_2)
print (df)
Name Surname
index
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
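For completeness: DataFrame.fillna(df_2) only fills the NaN cells of df_1, while DataFrame.combine_first(df_2) additionally pulls in rows and columns that exist only in df_2. A small sketch of the difference, using a hypothetical extra row that df_1 does not have:
#hypothetical extra row, purely for illustration
df_2_extra = df_2.copy()
df_2_extra.loc['R999'] = ['Ada', 'Lovelace']
print (df_1.fillna(df_2_extra)) #R999 is not added, only NaN cells are filled
print (df_1.combine_first(df_2_extra)) #R999 is appended as a new row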
It is not working because update modifies a DataFrame in place, and filtering returns a new (copied) DataFrame, so the original never changes. A possible (ugly) solution is to update the filtered DataFrame df and then add back the original rows that were not matched:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df = df_1[m].copy()
df.update(df_2)
df = pd.concat([df, df_1[~m]]).sort_index()
print (df)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
Possible solution without update:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df_1[m] = df_2
print (df_1)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
update applies modifications in place, so if you select a subset of your dataframe, only the subset will be modified and not your original dataframe.
Use mask:
df_1.update(df_2.mask(df_1.isna().any(axis=1)))
print(df_1)
# Output:
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
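To see why the filtered update never reaches df_1: boolean filtering returns a new object, and update modifies only that object. A minimal sketch, reconstructing the question's df_1:
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'Name': ['Katrin', 'John', 'Steven', np.nan],
                     'Surname': ['Johnes', 'Doe', 'Walkins', np.nan]},
                    index=pd.Index(['R222', 'R343', 'R377', 'R914'], name='index'))

filtered = df_1[df_1['Name'].notna() & df_1['Surname'].notna()]
print(filtered is df_1)           # False: the filter returned a copy, not df_1 itself
filtered.update(pd.DataFrame({'Name': ['Pablo']}, index=['R222']))
print(df_1.loc['R222', 'Name'])   # still 'Katrin' - only the copy was updated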
I'm pretty sure I'm asking the wrong question here, so here goes. I have two dataframes; let's call them df1 and df2.
df1 looks like this:
import pandas as pd

data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
df2 looks similar:
data = {'Employee ID' : [12345, 23456, 34567],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID','Employee Name','Department Supervisor'])
My issue is that df1 comes from an Excel file that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in: df2 is a SQL pull from the employee database that I'm using to validate the employee names and supervisor names, to ensure the correct Employee ID is used.
Normally I'd be happy to merge the dataframes to get my desired result, but with the supervisor names being in different formats, I'd like to use regex on df1 to turn 'Wendy Davis' into 'Davis, Wendy' (along with the other supervisor names) so they match what df2 has. So far I'm coming up empty on how to search for an answer. Any suggestions?
IIUC, do you need this?
df1['DS Corrected'] = df1['Department Supervisor'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
Output:
   Employee ID             Values  Employee Name  Department Supervisor       DS Corrected
0        12345    123168546543154    Jones, John            Wendy Davis       Davis, Wendy
1        23456  13513545435145434  Potter, Harry       Albus Dumbledore  Dumbledore, Albus
2        34567       556423145613    Watts, Wade         James Halliday    Halliday, James
Since Albus' full name is Albus Percival Wulfric Brian Dumbledore, and James' is James Donovan Halliday (if we're talking about Ready Player One), consider a dataframe of:
   Employee ID             Values  Employee Name                    Department Supervisor
0        12345    123168546543154    Jones, John                              Wendy Davis
1        23456  13513545435145434  Potter, Harry  Albus Percival Wulfric Brian Dumbledore
2        34567       556423145613    Watts, Wade                   James Donovan Halliday
So we need to swap the last name to the front with...
import pandas as pd

data = {'Employee ID': [12345, 23456, 34567],
        'Values': [123168546543154, 13513545435145434, 556423145613],
        'Employee Name': ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
        'Department Supervisor': ['Wendy Davis',
                                  'Albus Percival Wulfric Brian Dumbledore',
                                  'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID', 'Values', 'Employee Name', 'Department Supervisor'])

def swap_names(text):
    # Split into first word, any middle words, and the last word,
    # then move the last word to the front.
    first, *middle, last = text.split()
    if len(middle) == 0:
        return last + ', ' + first
    else:
        return last + ', ' + first + ' ' + ' '.join(middle)

df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]
print(df1)
Outputs:
   Employee ID             Values  Employee Name                     Department Supervisor
0        12345    123168546543154    Jones, John                              Davis, Wendy
1        23456  13513545435145434  Potter, Harry  Dumbledore, Albus Percival Wulfric Brian
2        34567       556423145613    Watts, Wade                   Halliday, James Donovan
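The same swap can also be done without a Python-level function, for example with a single regex that moves the last word to the front (a vectorized variant of the idea above, not part of the original answer):
# assumes the column still holds names in 'First [Middle ...] Last' order
df1['Department Supervisor'] = df1['Department Supervisor'].str.replace(
    r'^(.*)\s+(\S+)$', r'\2, \1', regex=True)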
Maybe (assuming each supervisor name is exactly two words)...
df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]
Outputs:
   Employee ID             Values  Employee Name  Department Supervisor
0        12345    123168546543154    Jones, John            Davis, Wendy
1        23456  13513545435145434  Potter, Harry       Dumbledore, Albus
2        34567       556423145613    Watts, Wade         Halliday, James
If I have a dataframe and I want to merge the ID column based on the Name column without deleting any rows, how would I do this?
Ex:

Name  ID
John  ABC
John  XYZ
Lucy  MNO

I want to convert the above dataframe into the one below:

Name  ID
John  ABC, XYZ
John  ABC, XYZ
Lucy  MNO
Use GroupBy.transform with join:
df['ID'] = df.groupby('Name')['ID'].transform(', '.join)
print (df)
Name ID
0 John ABC, XYZ
1 John ABC, XYZ
2 Lucy MNO
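If the same Name/ID pair could occur more than once, joining only the unique IDs avoids duplicates in the result (a small variation on the line above, not needed for the sample data):
df['ID'] = df.groupby('Name')['ID'].transform(lambda s: ', '.join(s.unique()))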
import pandas as pd
d = {'Name': ['John'], 'Last Name': ['Smith'], 'Activity': ['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']}
df = pd.DataFrame(d)
How do I make it so 'John' & 'Smith' are populated in each 'Activity' that he does in a dataframe?
Let us try json_normalize
out = pd.json_normalize(d,'Activity',['Name','Last Name'])
Out[160]:
0 Name Last Name
0 Run John Smith
1 Jump John Smith
2 Hide John Smith
3 Swim John Smith
4 Eat John Smith
5 Sleep John Smith
Input
d = {'Name' : ['John'], 'Last Name': ['Smith'], 'Activity':['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']}
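The records column comes out named 0 (as in the output above), so you may want to rename it afterwards:
out = out.rename(columns={0: 'Activity'})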
If you strictly have one pair of Name/Last Name, you can modify the dictionary so that pandas stores the whole Activity list as a single value, then explode it:
d = {k: [v] if len(v) > 1 else v for k, v in d.items()}
df = pd.DataFrame(d)
df.explode('Activity')
Name Last Name Activity
0 John Smith Run
0 John Smith Jump
0 John Smith Hide
0 John Smith Swim
0 John Smith Eat
0 John Smith Sleep
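For reference, after the dict comprehension the Activity value is a one-element list wrapping the original list, which is what makes every column the same length for pd.DataFrame (a quick look at that intermediate step):
d = {k: [v] if len(v) > 1 else v for k, v in d.items()}
print(d['Activity'])
# [['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']]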
I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last", for the middle and last names of each person, respectively. People who have no middle name should have None in the "middle" column.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "list index out of range". This is weird because when I print df.iloc[i,0].split(" ") for i in range(len(df)), I only get lists of length 2 or length 3.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
                 .split(' ', expand=True)
                 .rename(columns={0: 'name', 1: 'middle', 2: 'last'}))
new_df = df2.assign(middle=df2['middle'].where(df2['last'].notnull()),
                    last=df2['last'].fillna(df2['middle']),
                    birth=df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
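A possible alternative to the split/rename route (not from the answer above) is a single regex with an optional middle-name group, assuming every name consists of two or three space-separated words:
# named groups: 'middle' is optional and comes back as NaN when absent
parts = df['name'].str.extract(r'^(?P<first>\S+)(?:\s+(?P<middle>\S+))?\s+(?P<last>\S+)$')
df['middle'] = parts['middle']
df['last'] = parts['last']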