Python dataframe: converting columns into rows

I have a dataframe as follows:
import pandas as pd
d = {'Movie': ['The Shawshank Redemption', 'The Godfather'],
     'FirstName1': ['Tim', 'Marlon'],
     'FirstName2': ['Morgan', 'Al'],
     'LastName1': ['Robbins', 'Brando'],
     'LastName2': ['Freeman', 'Pacino'],
     'ID1': ['TM', 'MB'],
     'ID2': ['MF', 'AP']
}
df = pd.DataFrame(d)
df
I would like to rearrange it into a 4-column dataframe by converting FirstName1, LastName1, ID1 and FirstName2, LastName2, ID2 into rows under three columns (FirstName, LastName, ID), with the Movie column repeated for each row.
In SQL we would do it as follows:
select Movie as Movie, FirstName1 as FirstName, LastName1 as LastName, ID1 as ID from table
union
select Movie as Movie, FirstName2 as FirstName, LastName2 as LastName, ID2 as ID from table
Can we achieve this using pandas?

If the number in the column names can be larger than 9 (more than one digit), use Series.str.extract to split each column name into its text prefix and its integer suffix, build a MultiIndex on the columns from the two parts, and then reshape with DataFrame.stack:
df = df.set_index('Movie')
df1 = df.columns.to_series().str.extract(r'([a-zA-Z]+)(\d+)')
df.columns = pd.MultiIndex.from_arrays([df1[0], df1[1].astype(int)])
df = df.rename_axis((None, None), axis=1).stack().reset_index(level=1, drop=True).reset_index()
print (df)
Movie FirstName ID LastName
0 The Shawshank Redemption Tim TM Robbins
1 The Shawshank Redemption Morgan MF Freeman
2 The Godfather Marlon MB Brando
3 The Godfather Al AP Pacino
If the suffix is always a single digit, use string indexing to split each column name into its last character and everything before it, and pass both parts to MultiIndex.from_arrays:
df = df.set_index('Movie')
df.columns = pd.MultiIndex.from_arrays([df.columns.str[:-1], df.columns.str[-1].astype(int)])
df = df.stack().reset_index(level=1, drop=True).reset_index()
print (df)
Movie FirstName ID LastName
0 The Shawshank Redemption Tim TM Robbins
1 The Shawshank Redemption Morgan MF Freeman
2 The Godfather Marlon MB Brando
3 The Godfather Al AP Pacino
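As a further option that the answer above does not use, pandas also provides pandas.wide_to_long, which targets columns that share a stub name followed by a numeric suffix. The following is only a minimal sketch on the sample data from the question; the level name "num" is a made-up label for the suffix, and it assumes each Movie value is unique.

```python
import pandas as pd

df = pd.DataFrame({
    'Movie': ['The Shawshank Redemption', 'The Godfather'],
    'FirstName1': ['Tim', 'Marlon'],
    'FirstName2': ['Morgan', 'Al'],
    'LastName1': ['Robbins', 'Brando'],
    'LastName2': ['Freeman', 'Pacino'],
    'ID1': ['TM', 'MB'],
    'ID2': ['MF', 'AP'],
})

# "num" is a hypothetical name for the index level that receives the suffix.
long_df = pd.wide_to_long(
    df,
    stubnames=['FirstName', 'LastName', 'ID'],
    i='Movie',  # identifier column; assumed unique per row
    j='num',    # holds the numeric suffix (1, 2, ...)
)

# Drop the suffix level and turn Movie back into a regular column.
result = long_df.reset_index(level='num', drop=True).reset_index()
print(result)
```

The row order may differ from the stack-based output, so sort by Movie afterwards if the order matters.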

df = df.set_index('Movie')
df.columns = pd.MultiIndex.from_tuples([(col[:-1], col[-1:]) for col in df.columns])
df.stack()
# FirstName ID LastName
#Movie
#The Shawshank Redemption 1 Tim TM Robbins
# 2 Morgan MF Freeman
#The Godfather 1 Marlon MB Brando
# 2 Al AP Pacino
Use the power of MultiIndex! With from_tuples you create a DataFrame that has one column for FirstNames, divided in FirstName1 and FirstName2 (see below) and similar for ID and LastName. With stack you convert it into rows for each. Before you do this, make Movie the Index to exclude it from what you are doing. You could use reset_index() to regain everything as columns, but I'm not sure if you want that.
Before stack:
# FirstName LastName ID
# 1 2 1 2 1 2
#Movie
#The Shawshank Redemption Tim Morgan Robbins Freeman TM MF
#The Godfather Marlon Al Brando Pacino MB AP
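To make the explanation concrete, here is a self-contained sketch of the same idea with reset_index() applied at the end, so that Movie becomes an ordinary column again. It assumes single-digit suffixes, as in the question, and the variable names wide and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Movie': ['The Shawshank Redemption', 'The Godfather'],
    'FirstName1': ['Tim', 'Marlon'],
    'FirstName2': ['Morgan', 'Al'],
    'LastName1': ['Robbins', 'Brando'],
    'LastName2': ['Freeman', 'Pacino'],
    'ID1': ['TM', 'MB'],
    'ID2': ['MF', 'AP'],
})

wide = df.set_index('Movie')
# Split each name into (prefix, last character), e.g. 'FirstName1' -> ('FirstName', '1').
wide.columns = pd.MultiIndex.from_tuples([(col[:-1], col[-1:]) for col in wide.columns])

# stack() moves the inner column level (the suffix) into the row index.
stacked = wide.stack()

# Drop the suffix level, then restore Movie as a regular column.
result = stacked.reset_index(level=1, drop=True).reset_index()
print(result)
```

The order of the FirstName, LastName and ID columns can vary between pandas versions, so select them explicitly if a fixed order is needed.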

I think an easy way to do this is to use the copy method from pandas.
You can copy the columns "Movie", "FirstName1", "LastName1", "ID1" to a new table. Then rename the columns to drop the suffix. You can also create a new table for the other set of columns in the same way.
new = df[['Movie', 'FirstName1', 'LastName1', 'ID1']].copy()

Try below:
d1 = df.filter(regex="1$|Movie").rename(columns=lambda x: x[:-1])
d2 = df.filter(regex="2$|Movie").rename(columns=lambda x: x[:-1])
pd.concat([d1, d2]).rename(columns={'Movi': 'Movie'})
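The same select-rename-concatenate idea can be written as a loop over the suffixes, which avoids trimming the last letter of Movie in the first place. This is a rough sketch on the question's sample data; the names parts and result are illustrative, and the suffix list is assumed to be known up front.

```python
import pandas as pd

df = pd.DataFrame({
    'Movie': ['The Shawshank Redemption', 'The Godfather'],
    'FirstName1': ['Tim', 'Marlon'],
    'FirstName2': ['Morgan', 'Al'],
    'LastName1': ['Robbins', 'Brando'],
    'LastName2': ['Freeman', 'Pacino'],
    'ID1': ['TM', 'MB'],
    'ID2': ['MF', 'AP'],
})

parts = []
for suffix in ('1', '2'):  # assumed list of suffixes
    cols = [c for c in df.columns if c.endswith(suffix)]
    # Keep Movie untouched and strip the suffix from the other columns.
    part = df[['Movie'] + cols].rename(columns={c: c[:-len(suffix)] for c in cols})
    parts.append(part)

# Equivalent to SQL UNION ALL; add .drop_duplicates() to mimic a plain UNION.
result = pd.concat(parts, ignore_index=True)
print(result)
```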

Related

Using Regex to change the name values format in a dataframe

I'm pretty sure I'm asking the wrong question here, so here goes. I have 2 dataframes, let's call them df1 and df2.
df1 looks like this:
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
df2 looks similar:
data = {'Employee ID' : [12345, 23456, 34567],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID','Employee Name','Department Supervisor'])
My issue is that df1 comes from an Excel file that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in: df2 is a SQL pull from the employee database that I'm using to validate the employee names and supervisor names to ensure the correct employee ID is used.
Normally I'd be happy to merge the dataframes to get my desired result, but with the supervisor names being in different formats I'd like to use regex on df1 to turn 'Wendy Davis' into 'Davis, Wendy', along with the other supervisor names, to match what df2 has. So far I'm coming up empty on how to search for an answer. Suggestions?
IIUC, is this what you need?
df1['DS Corrected'] = df1['Department Supervisor'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
Output:
Employee ID Values Employee Name Department Supervisor DS Corrected
0 12345 123168546543154 Jones, John Wendy Davis Davis, Wendy
1 23456 13513545435145434 Potter, Harry Albus Dumbledore Dumbledore, Albus
2 34567 556423145613 Watts, Wade James Halliday Halliday, James
Since Albus' full name is Albus Percival Wulfric Brian Dumbledore and James' is James Donovan Halliday (if we're talking about Ready Player One) then consider a dataframe of:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Wendy Davis
1 23456 13513545435145434 Potter, Harry Albus Percival Wulfric Brian Dumbledore
2 34567 556423145613 Watts, Wade James Donovan Halliday
So we need to swap the last name to the front with...
import pandas as pd
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Percival Wulfric Brian Dumbledore', 'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
def swap_names(text):
    first, *middle, last = text.split()
    if len(middle) == 0:
        return last + ', ' + first
    else:
        return last + ', ' + first + ' ' + ' '.join(middle)
df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]
print(df1)
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus Percival Wulfric Brian
2 34567 556423145613 Watts, Wade Halliday, James Donovan
Maybe...
df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus
2 34567 556423145613 Watts, Wade Halliday, James
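For comparison, the regex approach from the first answer can also be stretched to cover middle names by capturing everything before the last word as one group. This is a sketch under the assumption that the last whitespace-separated word is always the surname; the variable names are illustrative.

```python
import pandas as pd

supervisors = pd.Series([
    'Wendy Davis',
    'Albus Percival Wulfric Brian Dumbledore',
    'James Donovan Halliday',
])

# Group 1: everything before the last word; group 2: the last word (assumed surname).
swapped = supervisors.str.replace(r'^(.*)\s+(\S+)$', r'\2, \1', regex=True)
print(swapped.tolist())
```

A value with only one word does not match the pattern and is left unchanged.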

How to split a Dataframe column whose data is not unique

I have a column called users in dataframe which doesn't have a unique format. I am doing a data cleanup project as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the code below, which broke the dataframe down as follows:
df2 = df
df2 = df2.join(df['Users_config'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
and further breaking the above df with "," using the same query I got the output as
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want to get the output shown below. I feel the best way to name each column is to take the key from the value itself (for example "Name" from "Name":"Martin"), because if we hardcode the names with df.rename the column names will get mismatched.
Company Name_1 Email_1 EmpType_1 Dept_1 Name_2 Email_2 Dept_2
1 Martin name_1#email.com Full Rick name_2#email.com "HR"
2 John name_2#email.com" Full Sales
Is there any way I can get the above output from the original dataframe.
Use:
import ast
df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:
First we use ast.literal_eval to evaluate the strings in Users column, then use DataFrame.explode on column Users to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
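Since the Users strings in the question look like JSON, json.loads is a possible alternative to ast.literal_eval for the parsing step. The sketch below runs the whole pipeline end to end on data reconstructed from the question, and additionally drops the columns that are empty for every company. The name counter and the final dropna step are additions made for illustration.

```python
import json
import pandas as pd

df = pd.DataFrame({
    'company': ['A', 'B'],
    'Users': [
        '[{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})

# Parse each JSON string into a list of dicts, then give every user its own row.
df['Users'] = df['Users'].apply(json.loads)
d = df.explode('Users').reset_index(drop=True)

# Expand the dicts into columns; keys missing for a user become NaN.
d = d.join(pd.DataFrame(d.pop('Users').tolist()))

# Number the users within each company (1, 2, ...) and pivot that counter into the columns.
counter = d.groupby('company').cumcount().add(1).astype(str)
d = d.set_index(['company', counter]).unstack()
d.columns = d.columns.map('_'.join)

# Optional: remove columns that hold no value for any company (here EmpType_2).
d = d.dropna(axis=1, how='all')
print(d)
```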

Show differences at row level between columns of 2 dataframes Pandas

I have 2 dataframes containing names and some demographic information, the dataframes are not identical due to monthly changes.
I'd like to create another df to show just the names of people where there are changes in either their COUNTRY or JOBCODE or MANAGERNAME columns, and also show what kind of changes these are.
Have tried the following code so far and am able to detect changes in the country column in the 2 dataframes for the common rows.
But am not so sure how to capture the movement in the MOVEMENT columns. Appreciate any form of help.
#Merge first
dfmerge = pd.merge(df1, df2, how ='inner', on ='EMAIL')
#create function to get COUNTRY_CHANGE column
def change_in(dfmerge):
    if dfmerge['COUNTRY_x'] != dfmerge['COUNTRY_y']:
        return 'YES'
    else:
        return 'NO'
dfmerge['COUNTRYCHANGE'] = dfmerge.apply(change_in, axis = 1)
Dataframe 1
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com USA 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 1222 Cindy Lee
Jessica Lang jessicalang#123.com AUSTRALIA 1221 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Mike Lens
Samir Bala samirbala#123.com CANADA 1221 Ricky Easton
Dataframe 2
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com VIETNAM 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 4464 Sheldon Tracey
Jessica Lang jessicalang#123.com AUSTRALIA 2224 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Emilia Tanner
Desired Output
EMAIL COUNTRY_CHANGE COUNTRY_MOVEMENT JOBCODE_CHANGE JOBCODE_MOVEMENT MGR_CHANGE MGR_MOVEMENT
jasonkelly#123.com YES FROM USA TO VIETNAM NO NO NO NO
jongilman#123.com NO NO YES FROM 1222 to 4464 YES FROM Cindy Lee to Sheldon Tracey
jessicalang#123.com NO NO YES FROM 1221 to 2224 NO NO
bobwilder#123.com NO NO NO NO YES FROM Mike Lens to Emilia Tanner
There is no direct feature in pandas for this, but we can leverage the merge function as follows. We merge the dataframes, give the overlapping columns a suffix, and then report their differences with this code.
# Assuming df1 and df2 are the input data frames from your example.
df3 = pd.merge(df1, df2, on=['NAME', 'EMAIL'], suffixes=['past', 'present'])
dfans = pd.DataFrame()  # this is the final output data frame
for column in df1.columns:
    if not (column + 'present' in df3.columns or column + 'past' in df3.columns):
        # Key columns (NAME and EMAIL) are not suffixed by the merge; copy them as they are.
        dfans[column] = df3[column]
    else:
        # Build the output column names.
        newColumn1 = '{}_CHANGE'.format(column)
        newColumn2 = '{}_MOVEMENT'.format(column)
        past, present = "{}past".format(column), "{}present".format(column)
        # Flag the rows that changed and describe the movement.
        dfans[newColumn1] = (df3[past] != df3[present]).map(lambda x: "YES" if x else "NO")
        dfans[newColumn2] = ["FROM {} TO {}".format(x, y) if x != y else "NO" for x, y in
                             zip(df3[past], df3[present])]
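A vectorized variant of the same merge-and-compare idea is sketched below using numpy.where, on data copied from the question. The suffixes "_past" and "_present", the MGR label for MANAGERNAME, and the names merged and out are choices made for this illustration; it merges on EMAIL only, as the question's own attempt does.

```python
import numpy as np
import pandas as pd

cols = ['NAME', 'EMAIL', 'COUNTRY', 'JOBCODE', 'MANAGERNAME']
df1 = pd.DataFrame([
    ['Jason Kelly', 'jasonkelly#123.com', 'USA', 1221, 'Jon Gilman'],
    ['Jon Gilman', 'jongilman#123.com', 'CANADA', 1222, 'Cindy Lee'],
    ['Jessica Lang', 'jessicalang#123.com', 'AUSTRALIA', 1221, 'Esther Donato'],
    ['Bob Wilder', 'bobwilder#123.com', 'ROMANIA', 1355, 'Mike Lens'],
    ['Samir Bala', 'samirbala#123.com', 'CANADA', 1221, 'Ricky Easton'],
], columns=cols)
df2 = pd.DataFrame([
    ['Jason Kelly', 'jasonkelly#123.com', 'VIETNAM', 1221, 'Jon Gilman'],
    ['Jon Gilman', 'jongilman#123.com', 'CANADA', 4464, 'Sheldon Tracey'],
    ['Jessica Lang', 'jessicalang#123.com', 'AUSTRALIA', 2224, 'Esther Donato'],
    ['Bob Wilder', 'bobwilder#123.com', 'ROMANIA', 1355, 'Emilia Tanner'],
], columns=cols)

# Inner merge keeps only the people present in both months.
merged = df1.merge(df2, on='EMAIL', suffixes=('_past', '_present'))
out = merged[['EMAIL']].copy()

# Map each source column to the label used in the output column names.
for col, label in [('COUNTRY', 'COUNTRY'), ('JOBCODE', 'JOBCODE'), ('MANAGERNAME', 'MGR')]:
    old, new = merged[col + '_past'], merged[col + '_present']
    changed = old != new
    out[label + '_CHANGE'] = np.where(changed, 'YES', 'NO')
    movement = 'FROM ' + old.astype(str) + ' TO ' + new.astype(str)
    out[label + '_MOVEMENT'] = np.where(changed, movement, 'NO')

print(out)
```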

How to remove first character from the string and store the same into new column in Pandas?

I have a column called Student name, and each row has four or five student names, like this: John mills, Tim Harry, Alex win, Kate marry... I want to take the first two student names and store them in new columns called Student 1 and Student 2. The names are separated by commas.
I created a function and I am able to extract the first student name. The result is stored in my dataframe in a column called student_0.
def find_student(df2):
    for i in range(2):
        df2[f"student name_{i}"] = [x.split(',')[i] for x in df2["student name"]]
        return df2
new_df = find_student(df2)
df2 is my dataframe name
I am not getting the second student name. Please advise.
Use Series.str.split with expand=True and select the first 2 columns by position with DataFrame.iloc if you need both first names and surnames:
print (df2)
student name
0 John mills, Tim Harry, Alex win, Kate marry
1 Brando XI, James Caan, Richard S. Castellano
2 Heath Ledger, Aaron Eckhart, Michael Caine
N = 2
df3 = df2["student name"].str.split(', ', expand=True).iloc[:, :N]
#rename columns names
df3.columns = [f"student name_{i+1}" for i in range(len(df3.columns))]
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
Or use list comprehension:
N = 2
L = [x.split(', ')[:N] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
If you need only the first names:
N = 2
L = [[y.split()[0] for y in x.split(',')[:N]] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John Tim
1 Brando James
2 Heath Aaron
#join to original if necessary
df2 = df2.join(df3)
Try this:
def find_student(df2):
    for i in range(2):
        df2[f"student name_{i}"] = pd.Series(map(lambda x: x.split(',')[i], df2["student name"]))
    return df2
Use pandas functionality (str and split); you don't need to write a function.
df = [["John mills, Tim Harry, Alex win, Kate marry"],
["Brando XI, James Caan, Richard S. Castellano"],
["Heath Ledger,Aaron Eckhart, Michael Caine"]]
df2 = pd.DataFrame(df)
df2.columns = ['Student_Name']
df2['student name_1'] = df2.Student_Name.str.split(",").str[0]
df2['student name_2'] = df2.Student_Name.str.split(",").str[1]
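The sample data above mixes "," and ", " as separators, so splitting on a bare comma leaves a leading space on some names. One way to handle that, sketched here with illustrative variable names, is to split on the comma and then strip whitespace from each resulting column.

```python
import pandas as pd

df2 = pd.DataFrame({'student name': [
    'John mills, Tim Harry, Alex win, Kate marry',
    'Brando XI, James Caan, Richard S. Castellano',
    'Heath Ledger,Aaron Eckhart, Michael Caine',
]})

N = 2  # number of leading names to keep
# Split on the comma only, keep the first N columns, then trim stray spaces.
parts = df2['student name'].str.split(',', expand=True).iloc[:, :N]
parts = parts.apply(lambda col: col.str.strip())
parts.columns = [f'student name_{i + 1}' for i in range(parts.shape[1])]

df2 = df2.join(parts)
print(df2)
```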

Appending new elements to a column in pandas dataframe

I have a pandas dataframe like this:
df1:
id name gender
1 Alice Male
2 Jenny Female
3 Bob Male
And now I want to add a new column sport, which will contain values in the form of a list. Let's say I want to add Football to the rows where gender is Male. So df1 will look like:
df1:
id name gender sport
1 Alice Male [Football]
2 Jenny Female NA
3 Bob Male [Football]
Now if I want to add Badminton to rows where gender is female and tennis to rows where gender is male so that final output is:
df1:
id name gender sport
1 Alice Male [Football,Tennis]
2 Jenny Female [Badminton]
3 Bob Male [Football,Tennis]
How to write a general function in python which will accomplish this task of appending new values into the column based upon some other column value?
The code below should work for you. Initialize the column with empty lists and proceed:
import numpy as np
df['sport'] = np.empty((len(df), 0)).tolist()
def append_sport(df, filter_df, sport):
    # list.append returns None, so "or x" hands back the list that was just extended in place
    df.loc[filter_df, 'sport'] = df.loc[filter_df, 'sport'].apply(lambda x: x.append(sport) or x)
    return df
filter_df = (df.gender == 'Male')
df = append_sport(df, filter_df, 'Football')
df = append_sport(df, filter_df, 'Cricket')
Output
id name gender sport
0 1 Alice Male [Football, Cricket]
1 2 Jenny Female []
2 3 Bob Male [Football, Cricket]
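A variant that builds new lists instead of mutating the existing ones in place is sketched below, applied to the three steps from the question (Football and Tennis for Male, Badminton for Female). The function name append_sport is reused from the answer, but this body is an alternative written for illustration, not the answer's implementation.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Jenny', 'Bob'],
    'gender': ['Male', 'Female', 'Male'],
})

# Start every row with its own empty list.
df1['sport'] = pd.Series([[] for _ in range(len(df1))], index=df1.index)

def append_sport(df, mask, sport):
    """Return df with `sport` appended to the list in every row where mask is True."""
    new_values = [sports + [sport] if selected else sports
                  for sports, selected in zip(df['sport'], mask)]
    df['sport'] = pd.Series(new_values, index=df.index)
    return df

df1 = append_sport(df1, df1['gender'] == 'Male', 'Football')
df1 = append_sport(df1, df1['gender'] == 'Female', 'Badminton')
df1 = append_sport(df1, df1['gender'] == 'Male', 'Tennis')
print(df1)
```

Creating a fresh list per row avoids surprises when several rows would otherwise share the same list object.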
