I have two dataframes with different information about a person. In the first dataframe, a person's name may repeat in different rows. I want to add to (or update) the first dataframe with data from the second dataframe wherever the two columns identifying the person match in both. Here is an example of what I need to accomplish:
df1:
name surname
0 john doe
1 mary doe
2 peter someone
3 mary doe
4 john another
5 paul another
df2:
name surname account_id
0 peter someone 100
1 john doe 200
2 mary doe 300
3 john another 400
I need to accomplish this:
df1:
name surname account_id
0 john doe 200
1 mary doe 300
2 peter someone 100
3 mary doe 300
4 john another 400
5 paul another <empty>
Thanks!
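One way to get this result is a left merge on both key columns. The following is a minimal sketch, assuming the pair (name, surname) is unique in df2; the variable name result and the validate check are additions for illustration, not part of the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["john", "mary", "peter", "mary", "john", "paul"],
    "surname": ["doe", "doe", "someone", "doe", "another", "another"],
})
df2 = pd.DataFrame({
    "name": ["peter", "john", "mary", "john"],
    "surname": ["someone", "doe", "doe", "another"],
    "account_id": [100, 200, 300, 400],
})

# A left merge keeps every row of df1 in its original order and leaves
# account_id missing (NaN) where df2 has no matching (name, surname) pair.
# validate="many_to_one" raises if a pair appears more than once in df2.
result = df1.merge(df2, on=["name", "surname"], how="left", validate="many_to_one")
print(result)
```

Because of the missing value for paul another, account_id comes back as a float column; converting it with astype("Int64") is one option if nullable integers are preferred.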
I'm pretty sure I'm asking the wrong question here, so here goes. I have two dataframes; let's call them df1 and df2.
df1 looks like this:
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
df2 looks similar:
data = {'Employee ID' : [12345, 23456, 34567],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID','Employee Name','Department Supervisor'])
My issue is that df1 comes from an Excel file that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in: df2 is a SQL pull from the employee database that I'm using to validate the employee names and supervisor names to ensure the correct employee ID is used.
Normally I'd be happy to merge the dataframes to get my desired result, but because the supervisor names are in different formats, I'd like to use regex on df1 to turn 'Wendy Davis' into 'Davis, Wendy' (and likewise for the other supervisor names) to match what df2 has. So far I'm coming up empty on how to search for an answer. Suggestions?
IIUC, is this what you need?
df1['DS Corrected'] = df1['Department Supervisor'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
Output:
Employee ID Values Employee Name Department Supervisor DS Corrected
0 12345 123168546543154 Jones, John Wendy Davis Davis, Wendy
1 23456 13513545435145434 Potter, Harry Albus Dumbledore Dumbledore, Albus
2 34567 556423145613 Watts, Wade James Halliday Halliday, James
Since Albus' full name is Albus Percival Wulfric Brian Dumbledore and James' is James Donovan Halliday (if we're talking about Ready Player One) then consider a dataframe of:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Wendy Davis
1 23456 13513545435145434 Potter, Harry Albus Percival Wulfric Brian Dumbledore
2 34567 556423145613 Watts, Wade James Donovan Halliday
So we need to swap the last name to the front with...
import pandas as pd
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Percival Wulfric Brian Dumbledore', 'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
def swap_names(text):
    # Split on whitespace: first word, any middle words, last word.
    first, *middle, last = text.split()
    if len(middle) == 0:
        return last + ', ' + first
    else:
        return last + ', ' + first + ' ' + ' '.join(middle)
df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]
print(df1)
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus Percival Wulfric Brian
2 34567 556423145613 Watts, Wade Halliday, James Donovan
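The same last-name-first swap can also be written as a single regex replacement. This is a sketch of an alternative, not the method from the answer above; the names supervisors and swapped are made up for the example, and it assumes each name contains at least two whitespace-separated words.

```python
import pandas as pd

supervisors = pd.Series([
    "Wendy Davis",
    "Albus Percival Wulfric Brian Dumbledore",
    "James Donovan Halliday",
])

# (.+) greedily captures everything before the final run of whitespace,
# (\S+) captures the last word; the replacement moves the last word first.
swapped = supervisors.str.replace(r"^(.+)\s+(\S+)$", r"\2, \1", regex=True)
print(swapped.tolist())
# ['Davis, Wendy', 'Dumbledore, Albus Percival Wulfric Brian', 'Halliday, James Donovan']
```

A name with only one word does not match the pattern and is left unchanged.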
Maybe this, if every supervisor name is exactly two words:
df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus
2 34567 556423145613 Watts, Wade Halliday, James
I am trying to build a new dataframe "C" in which the text data from dataframe "A" becomes the rows and the text data from dataframe "B" becomes the columns, so that I can fill it with distance calculations.
Data in dataframe "A" looks like this
Unique -> header
'Amy'
'little'
'sheep'
'dead'
Data in dataframe "B" looks like this
common_words -> header
'Amy'
'George'
'Barbara'
I want the output in dataframe C to be:
Amy George Barbara
Amy
little
sheep
dead
Can anyone help me with this?
What should be the actual content of data frame C? Do you only want to initialise it to some value (i.e. 0) in the first step and then fill it with the distance calculations?
You could initialise C in the following way:
import pandas as pd
A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])
C = pd.DataFrame([[0] * len(B)] * len(A), index=A[0], columns=B[0])
C will then look like:
Amy George Barbara
0
Amy 0 0 0
little 0 0 0
sheep 0 0 0
dead 0 0 0
Please use pd.DataFrame(index=[list], columns=[list]).
Extract the relevant lists using list(df.columnname.values).
Dummy data
print(dfA)
Header
0 Amy
1 little
2 sheep
3 dead
print(dfB)
Header
0 Amy
1 George
2 Barbara
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values))
Amy George Barbara
Amy NaN NaN NaN
little NaN NaN NaN
sheep NaN NaN NaN
dead NaN NaN NaN
If you want dfC without NaNs, use:
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values)).fillna(' ')
Amy George Barbara
Amy
little
sheep
dead
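If the goal is to go straight to the distance values, the grid can be filled while it is built. The question does not say which distance is wanted, so the distance function below is only a placeholder based on difflib.SequenceMatcher from the standard library; swap in whatever metric you actually need.

```python
from difflib import SequenceMatcher

import pandas as pd

dfA = pd.DataFrame({"Header": ["Amy", "little", "sheep", "dead"]})
dfB = pd.DataFrame({"Header": ["Amy", "George", "Barbara"]})

def distance(a: str, b: str) -> float:
    # Placeholder metric (an assumption): 0.0 for identical strings,
    # closer to 1.0 the less the two strings have in common.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# Rows come from dfA, columns from dfB; each cell holds distance(row, column).
dfC = pd.DataFrame(
    [[distance(a, b) for b in dfB["Header"]] for a in dfA["Header"]],
    index=dfA["Header"].tolist(),
    columns=dfB["Header"].tolist(),
)
print(dfC.round(2))
```

Building the values as a nested list first avoids filling an empty frame cell by cell, which tends to be slow for larger vocabularies.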
I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
.split(' ',expand = True)
.rename(columns = {0:'name',1:'middle',2:'last'}))
new_df = df2.assign(middle = df2['middle'].where(df2['last'].notnull()),
last = df2['last'].fillna(df2['middle']),
birth = df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
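The cause of the "list index out of range" error cannot be pinned down from the question alone. As a sketch of a version that never indexes a fixed position, the helper below (split_name is a made-up name) takes the first and last words and treats everything in between as the middle name. The extra spaces in the last sample row are added on purpose to show that stray whitespace is tolerated; the sketch assumes no name is empty.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John Henry Smith", "Hannah Gonzalez", "Michael Thomas Ford", "Michelle  Lee "],
    "birth": [1980, 1900, 1950, 1984],
})

def split_name(full_name):
    # str.split() with no argument splits on any run of whitespace and
    # ignores leading/trailing spaces, unlike split(" ").
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    # Everything between first and last; None when there is nothing there.
    middle = " ".join(parts[1:-1]) or None
    return first, middle, last

name_parts = pd.DataFrame(
    [split_name(n) for n in df["name"]],
    columns=["name", "middle", "last"],
    index=df.index,
)
new_df = name_parts.assign(birth=df["birth"])
print(new_df)
```

Unlike the three-column rename above, this also copes with people who have more than one middle name, because the middle part is joined back into a single string.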
Background
I have the following df
import pandas as pd
df= pd.DataFrame({'Text' : ['Hi', 'Hello', 'Bye'],
'P_ID': [1,2,3],
'Name' :['Bobby,Bob Lee Brian', 'Tuck,Tom T ', 'Mark, Marky '],
})
Name P_ID Text
0 Bobby,Bob Lee Brian 1 Hi
1 Tuck,Tom T 2 Hello
2 Mark, Marky 3 Bye
Goal
1) rearrange the Name column from e.g. Bobby,Bob Lee Brian to Bob Lee Brian Bobby
2) create new column Rearranged_Name
Desired Output
Name P_ID Text Rearranged_Name
0 Bobby,Bob Lee Brian 1 Hi Bob Lee Brian Bobby
1 Tuck,Tom T 2 Hello Tom T Tuck
2 Mark, Marky 3 Bye Marky Mark
Question
How do I achieve my desired output?
Use Series.str.replace with capture groups for the text before and after the comma; \s* matches optional whitespace after the comma:
df['Rearranged_Name'] = df['Name'].str.replace(r'(.+),\s*(.+)', r'\2 \1', regex=True)
print (df)
Text P_ID Name Rearranged_Name
0 Hi 1 Bobby,Bob Lee Brian Bob Lee Brian Bobby
1 Hello 2 Tuck,Tom T Tom T Tuck
2 Bye 3 Mark, Marky Marky Mark
Or use Series.str.split to build a helper DataFrame and join its columns together:
df1 = df['Name'].str.split(r',\s*', expand=True)
df['Rearranged_Name'] = df1[1] + ' ' + df1[0]
df['Rearranged_Name'] = df1[1] + ' ' + df1[0]
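Note that 'Tuck,Tom T ' and 'Mark, Marky ' in the sample data end with a space, so both approaches above carry that space into the middle of the result (for example 'Tom T  Tuck' with two spaces). A minimal sketch that strips each half before joining, assuming every Name contains at least one comma (the helper name parts is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "Hello", "Bye"],
    "P_ID": [1, 2, 3],
    "Name": ["Bobby,Bob Lee Brian", "Tuck,Tom T ", "Mark, Marky "],
})

# Split once on the first comma, then strip stray spaces from both halves
# so trailing whitespace in the input does not end up mid-string.
parts = df["Name"].str.split(",", n=1, expand=True)
df["Rearranged_Name"] = parts[1].str.strip() + " " + parts[0].str.strip()
print(df["Rearranged_Name"].tolist())
# ['Bob Lee Brian Bobby', 'Tom T Tuck', 'Marky Mark']
```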