Separate a name into first and last name using Pandas - python-3.x

I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!

Use Series.str.replace with expand = True.
df2 = (df['name'].str
.split(' ',expand = True)
.rename(columns = {0:'name',1:'middle',2:'last'}))
new_df = df2.assign(middle = df2['middle'].where(df2['last'].notnull()),
last = df2['last'].fillna(df2['middle']),
birth = df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984

Related

Using Regex to change the name values format in a dataframe

I'm pretty sure I'm asking the wrong question here so here goes. I have a 2 dataframes, lets call them df1 and df2.
df1 looks like this:
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
df2 looks similar:
data = {'Employee ID' : [12345, 23456, 34567],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID','Employee Name','Department Supervisor'])
My issue is that df1 is from an excel file and that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in, df2 is a sql pull from the employee database that I'm using to validate the employee names and supervisor names to ensure the correct employee id is used.
Normally I'd be happy to merge the dataframes to get my desired result but with the supervisor names being in different formats I'd like to use regex on df1 to turn 'Wendy Davis" into 'Davis, Wendy' along with the other supervisor names to match what df2 has. So far I'm coming up empty on how I want to search this for an answer, suggestions?
IIUC, do you need?
df1['DS Corrected'] = df1['Department Supervisor'].str.replace('(\w+) (\w+)','\\2, \\1', regex=True)
Output:
Employee ID Values Employee Name Department Supervisor DS Corrected
0 12345 123168546543154 Jones, John Wendy Davis Davis, Wendy
1 23456 13513545435145434 Potter, Harry Albus Dumbledore Dumbledore, Albus
2 34567 556423145613 Watts, Wade James Halliday Halliday, James
Since Albus' full name is Albus Percival Wulfric Brian Dumbledore and James' is James Donovan Halliday (if we're talking about Ready Player One) then consider a dataframe of:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Wendy Davis
1 23456 13513545435145434 Potter, Harry Albus Percival Wulfric Brian Dumbledore
2 34567 556423145613 Watts, Wade James Donovan Halliday
So we need to swap the last name to the front with...
import pandas as pd
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Percival Wulfric Brian Dumbledore', 'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
def swap_names(text):
first, *middle, last = text.split()
if len(middle) == 0:
return last + ', ' + first
else:
return last + ', ' + first + ' ' + ' '.join(middle)
df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]
print(df1)
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus Percival Wulfric Brian
2 34567 556423145613 Watts, Wade Halliday, James Donovan
Maybe...
df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus
2 34567 556423145613 Watts, Wade Halliday, James

Join on a second column if there is not a match on the first column of a pandas dataframe

I need to be able to match on a second column if there is not a match on the first column of a pandas dataframe (Python 3.x).
Ex.
table_df = pd.DataFrame ( {
'Name': ['James','Tim','John','Emily'],
'NickName': ['Jamie','','','Em'],
'Colour': ['Blue','Black','Red','Purple']
})
lookup_df = pd.DataFrame ( {
'Name': ['Tim','John','Em','Jamie'],
'Pet': ['Cat','Dog','Fox','Dog']
})
table_df
Name NickName Colour
0 James Jamie Blue
1 Tim Black
2 John Red
3 Emily Em Purple
lookup_df
Name Pet
0 Tim Cat
1 John Dog
2 Em Fox
3 Jamie Dog
The result I need:
Name NickName Colour Pet
0 James Jamie Blue Dog
1 Tim Black Cat
2 John Red Dog
3 Emily Em Purple Fox
which is matching on the Name column, and if there is no match, match on the Nickname column,
I tried many different things, including:
pd.merge(table_df,lookup_df, how='left', left_on='Name', right_on='Name')
if Nan -> pd.merge(table_df,lookup_df, how='left', left_on='NickName', right_on='Name')
but it does not do what I need and I want to avoid having a nested loop.
Has anyone an idea on how to do this? Any feedback is really appreciated.
Thanks!
You can map on Name and fillna on NickName:
s = lookup_df.set_index("Name")["Pet"]
table_df["pet"] = table_df["Name"].map(s).fillna(table_df["NickName"].map(s))
print (table_df)
Name NickName Colour pet
0 James Jamie Blue Dog
1 Tim Black Cat
2 John Red Dog
3 Emily Em Purple Fox

Show differences at row level between columns of 2 dataframes Pandas

I have 2 dataframes containing names and some demographic information, the dataframes are not identical due to monthly changes.
I'd like to create another df to show just the names of people where there are changes in either their COUNTRY or JOBCODE or MANAGERNAME columns, and also show what kind of changes these are.
Have tried the following code so far and am able to detect changes in the country column in the 2 dataframes for the common rows.
But am not so sure how to capture the movement in the MOVEMENT columns. Appreciate any form of help.
#Merge first
dfmerge = pd.merge(df1, df2, how ='inner', on ='EMAIL')
#create function to get COUNTRY_CHANGE column
def change_in(dfmerge):
if dfmerge['COUNTRY_x'] != dfmerge['COUNTRY_y']:
return 'YES'
else:
return 'NO'
dfmerge['COUNTRYCHANGE'] = dfmerge.apply(change_in, axis = 1)
Dataframe 1
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com USA 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 1222 Cindy Lee
Jessica Lang jessicalang#123.com AUSTRALIA 1221 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Mike Lens
Samir Bala samirbala#123.com CANADA 1221 Ricky Easton
Dataframe 2
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com VIETNAM 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 4464 Sheldon Tracey
Jessica Lang jessicalang#123.com AUSTRALIA 2224 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Emilia Tanner
Desired Output
EMAIL COUNTRY_CHANGE COUNTRY_MOVEMENT JOBCODE_CHANGE JOBCODE_MOVEMENT MGR_CHANGE MGR_MOVEMENT
jasonkelly#123.com YES FROM USA TO VIETNAM NO NO NO NO
jongilman#123.com NO NO YES FROM 1222 to 4464 YES FROM Cindy Lee to Sheldon Tracey
jessicalang#123.com NO NO YES FROM 1221 to 2224 NO NO
bobwilder#123.com NO NO NO NO YES FROM Mike Lens to Emilia Tanner
There is not direct feature in pandas that can help but we may leverage merge function as follows. We are merging dataframes and providing suffix to merged columns and then reporting their differences via this code.
# Assuming df1 and df2 are input data frames in your example.
df3 = pd.merge(df1, df2, on=['name', 'email'], suffixes=['past', 'present'])
dfans = pd.DataFrame() # this is the final output data frame
for column in df1.columns:
if not (column + 'present' in df3.columns or column + 'past' in df3.columns):
# Here we handle those columns which will not be merged like name and email.
dfans.loc[:, column] = df1.loc[:, column] # filling name and email as it is
else:
# string manipulation to name columns correctly in output
newColumn1 = '{}_CHANGE'.format(column)
newColumn2 = '{}_MOVEMENT'.format(column)
past, present = "{}past".format(column), "{}present".format(column)
# creating the output based on input
dfans.loc[:, newColumn1] = (df3[past] == df3[present]).map(lambda x: "YES" if x != 1 else "NO")
dfans.loc[:, newColumn2] = ["FROM {} TO {}".format(x, y) if x != y else "NO" for x, y in
zip(df3[past], df3[present])]

How to remove first chracter from the string and store the same into new column in Pandas?

I have a column name called Student name and each row has four or five student names -- like this John mills, Tim Harry, Alex win, Kate marry... I want to take the first two student names and store into a new column called Student 1 and Student 2. Names have been separated from comma.
I created a function and i can able to extract first student name . result storing into my dataframe called student_0
def find_student(df2):
for i in range(2):
df2[f"student name_{i}"] = [x.split(',')[i] for x in df2["student name"]]
return df2
new_df = find_student(df2)
df2 is my dataframe name
I AM NOT GETTING SECOND STUDENT NAME. PLEASE ADVISE
Use Series.str.split with select first 2 columns by positions by DataFrame.iloc if need name and surnames:
print (df2)
student name
0 John mills, Tim Harry, Alex win, Kate marry
1 Brando XI, James Caan, Richard S. Castellano
2 Heath Ledger, Aaron Eckhart, Michael Caine
N = 2
df3 = df2["student name"].str.split(', ', expand=True).iloc[:, :N]
#rename columns names
df3.columns = [f"student name_{i+1}" for i in range(len(df3.columns))]
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
Or use list comprehension:
N = 2
L = [x.split(',')[:2] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
If need only names:
N = 2
L = [[y.split()[0] for y in x.split(',')[:2]] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John Tim
1 Brando James
2 Heath Aaron
#join to original if necessary
df2 = df2.join(df3)
try this
def find_student(df2):
for i in range(2):
df2[f"student name_{i}"] = pd.Series(map(lambda x: x.split(',')[i], df2["student name"]))
return df2
Use pandas functionality(str and split), you don't need to write a function.
df = [["John mills, Tim Harry, Alex win, Kate marry"],
["Brando XI, James Caan, Richard S. Castellano"],
["Heath Ledger,Aaron Eckhart, Michael Caine"]]
df2 = pd.DataFrame(df)
df2.columns = ['Student_Name']
df2['student name_1'] = df2.Student_Name.str.split(",").str[0]
df2['student name_2'] = df2.Student_Name.str.split(",").str[1]

Splitting the strings and store into new column in pandas dataframe

I have a data frame which as names column. Name column appear as below
Name
Tim, Brook, morgan han, wang, chen
adam, kate, ken fin, brad chun
Smith, leo
Elisa, Adam, Smith, Brad, james
I want to split the above names and store into new columns as below
Name Name 1 Name 2 Name 3 Name 4 Name 5
Tim Brook Morgan Han Wang Chen
Adam Kate Ken fin brad chun
Smith Leo
Elsa Adam Amith Brad James
I have more than 1000 rows in the column and having 20 columns. Want to split the names using function (def). Some rows has 5 names and other has 4, 3 or 2 names.
Used str.split and it's just giving the first name. But I am not sure how it split other names. Also used the function but it's not working
df[['Name']] = df.Name.str.split(', ', expand = True)
def clean(column_name):
name=set()
for name_string in df[column_name]:
name.update(name_string.split(', '))
name=sorted(name)
return name
df[['Name']] = df[['Name']].apply(clean)
If I use the above function I am getting this error
KeyError: 'Tim, Brook, morgan han, wang, chen'
Please advise. I have gone through all the posts here but not successful.
# to recreate example table
a = pd.DataFrame([str('Tim, Brook, morgan han, wang, chen')])
b = pd.DataFrame([str('Smith, leo')])
c = pd.DataFrame([str('Elisa, Adam, Smith, Brad, james')])
x = pd.concat([a,b,c], axis=0, ignore_index=True)
x.columns = ['Name']
Solution
x['Name1'] = x.Name.apply(lambda x: x.split(',')[0])
x['Name2'] = x.Name.apply(lambda x: x.split(',')[1])
x['Name3'] = x.Name.apply(lambda x: x.split(',')[2] if len(x.split(',')) > 2 else '')
x['Name4'] = x.Name.apply(lambda x: x.split(',')[3] if len(x.split(',')) > 3 else '')
x['Name5'] = x.Name.apply(lambda x: x.split(',')[4] if len(x.split(',')) > 4 else '')
Output
Name Name1 Name2 Name3 Name4 \
0 Tim, Brook, morgan han, wang, chen Tim Brook morgan han wang
1 Smith, leo Smith leo
2 Elisa, Adam, Smith, Brad, james Elisa Adam Smith Brad
Name5
0 chen
1
2 james
You can implement a for-loop out of the solution, depending on the max number of names in each row. Hope this takes you somewhere. Cheers.

Resources