Pandas sum of multi-indexed columns - python-3.x

If I have a data frame with nested headers like this:
          John           Joan
        Smith,  Jones,  Smith,
Index1     234     432     324
Index2    2987     234    4354
...how do I create a new column that sums the values of each row?
I tried df['sum']=df['John']+df['Joan'] but that resulted in this error:
ValueError: Wrong number of items passed 3, placement implies 1

If I understand you correctly:
...how do I create a new column that sums the values of each row?
Solution
The sum of each row is just
df.sum(axis=1)
The trick is getting it into a new column: since the existing columns have two levels of headings, the column you add must have two levels as well.
df.loc[:, ('sum', 'sum')] = df.sum(axis=1)
I'm not happy with it, but it works.
          Joan    John            sum
        Smith,  Jones,  Smith,    sum
Index1     324     432     234    990
Index2    4354     234    2987   7575
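For what it's worth, assigning with a tuple key is a slightly shorter equivalent (a sketch, used instead of the .loc call above, assuming the same two-level columns):

df[('sum', 'sum')] = df.sum(axis=1)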

Dance Party, haven't heard from you in a while.
You want groupby, but with a level and an axis specified. axis=1 means you are grouping the columns rather than the rows, and level=0 refers to the top row of the column headings, so each row is summed within each top-level group.
import pandas as pd

df = pd.DataFrame({
    ('John', 'Smith,'): [234, 2987],
    ('John', 'Jones,'): [432, 234],
    ('Joan', 'Smith,'): [324, 4354]}, index=['Index1', 'Index2'])
>>> df.groupby(level=0, axis=1).sum()
        Joan  John
Index1   324   666
Index2  4354  3221
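Note that recent pandas releases deprecate DataFrame.groupby(..., axis=1). If you hit that warning, a transpose round-trip is an equivalent sketch:

df.T.groupby(level=0).sum().T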

Related

Filter rows of a dataframe based on certain conditions in pandas

While I was handling the dataframe in pandas, I got some unexpected cells which consist of values like these:
E_no         E_name
6654-0984    Elvin-Johnson
430          Fred
663/547/900  Banty/Shon/Crio
87           Arif
546          Zerin
322,76       Chris,Deory
In some rows, more than one E_no and E_name has been assigned, although each cell is supposed to hold a single employee.
My data consists of E_no and E_name, and both columns need to be separated into different rows.
What I want is:
E_no  E_name
6654  Elvin
0984  Johnson
430   Fred
663   Banty
547   Shon
900   Crio
87    Arif
546   Zerin
322   Chris
76    Deory
Separate those values and put them in different rows.
Please help me with this so that I can proceed further; it would also be really helpful if someone could explain the logic, i.e. how to think about this problem.
Thanks in advance.
Let me know if you're facing any difficulty understanding the problem.
Similar to Divyaansh's solution. Just use split, explode and merge.
import pandas as pd

df = pd.DataFrame({'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
                   'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']})
# split on any of the three delimiters and explode each column
# (r'[-,/]' puts '-' first; the original '[,-/]' accidentally formed a ,-/ range that also matches '.')
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
# merge both columns together on their shared index
df2 = pd.merge(x, y, left_index=True, right_index=True)
# print the modified dataframe
print(df2)
Output of this will be:
Original Dataframe:
E_no E_name
0 6654-0984 Elvin-Johnson
1 430 Fred
2 663/547/900 Banty/Shon/Crio
3 87 Arif
4 546 Zerin
5 322,76 Chris,Deory
Modified Dataframe:
E_no E_name
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
Alternatively, you can create a new dataframe from the values in x and y.
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
# create a new dataframe from the exploded values in x and y
df3 = pd.DataFrame({'E_no': x, 'E_name': y})
print(df3)
Same result as before.
Or this:
# explode each column, keeping the original row numbers via reset_index()
x = df['E_no'].str.split(r'[-,/]').explode().reset_index()
y = df['E_name'].str.split(r'[-,/]').explode().reset_index()
# create a new dataframe from the exploded columns
df3 = pd.DataFrame({'E_no': x['E_no'], 'E_name': y['E_name']})
print(df3)
Or you can do:
# explode each column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
# build a two-row frame and transpose it
df4 = pd.DataFrame([x, y]).T
print(df4)
Split, flatten, recombine, rename:
# split on any non-word character, then flatten the resulting lists
a = [item for sublist in df.E_no.str.split(r'\W').tolist() for item in sublist]
b = [item for sublist in df.E_name.str.split(r'\W').tolist() for item in sublist]
# recombine pairwise, reusing the original column names
df2 = pd.DataFrame(list(zip(a, b)), columns=df.columns)
Output:
E_no E_name
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
You should provide multiple delimiters in the read_csv arguments:
pd.read_csv("filepath", sep='-|/|,| ')
This is the best I can do right now without the data tables.
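One caveat: a regex sep like this is only supported by the python parser engine, so pandas falls back to it with a warning unless you pass it explicitly. A minimal sketch ("filepath" is a placeholder):

import pandas as pd

df = pd.read_csv("filepath", sep=r'-|/|,| ', engine='python')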
I think this is actually rather tricky. Here's a solution in which we use the E_no column to build a column of regexes that we will then use to split the two original columns into parts. Finally we construct a new DataFrame from those parts. This method ensures that the second column's format matches the first's.
import re

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"E_no": "6654-0984", "E_name": "Elvin-Johnson"},
        {"E_no": "430", "E_name": "Fred"},
        {"E_no": "663/547/900", "E_name": "Banty/Shon/Crio"},
        {"E_no": "87", "E_name": "Arif"},
        {"E_no": "546", "E_name": "Zerin"},
        {"E_no": "322,76", "E_name": "Chris,Deory"},
        {"E_no": "888+88", "E_name": "FIRST+SEC|OND"},
        {"E_no": "999|99", "E_name": "TH,IRD|FOURTH"},
    ]
)

def get_pattern(e_no, delimiters=None):
    if delimiters is None:
        delimiters = "-/,|+"
    delimiters = "|".join(re.escape(d) for d in delimiters)
    non_match_delims = f"(?:(?!{delimiters}).)*"
    delim_parts = re.findall(f"{non_match_delims}({delimiters})", e_no)
    pattern_parts = []
    for delim_part in delim_parts:
        delim = re.escape(delim_part)
        pattern_parts.append(f"((?:(?!{delim}).)*)")
        pattern_parts.append(delim)
    pattern_parts.append("(.*)")
    return "".join(pattern_parts)

def extract_items(row, delimiters=None):
    pattern = get_pattern(row["E_no"], delimiters)
    nos = re.search(pattern, row["E_no"]).groups()
    names = re.search(pattern, row["E_name"]).groups()
    return (nos, names)

nos, names = map(
    lambda L: [e for tup in L for e in tup],
    zip(*df.apply(extract_items, axis=1))
)

print(pd.DataFrame({"E_no": nos, "E_names": names}))
E_no E_names
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
10 888 FIRST
11 88 SEC|OND
12 999 TH,IRD
13 99 FOURTH
Here is my approach to this:
1) Replace - and / with a comma (,) in both columns.
2) Split each field on the comma (,), expand it, and then stack it.
3) Reset the index of each resulting data frame.
4) Finally, merge the two DFs into one finalDF.
a = {'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
     'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']}
df = pd.DataFrame(a)
# normalise all delimiters to a comma, then split/expand/stack
# (regex=True is required on recent pandas, where str.replace defaults to literal matching)
df1 = df['E_no'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df2 = df['E_name'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df1.drop(['level_0', 'level_1'], axis=1, inplace=True)
df1.rename(columns={0: 'E_no'}, inplace=True)
df2.drop(['level_0', 'level_1'], axis=1, inplace=True)
df2.rename(columns={0: 'E_name'}, inplace=True)
finalDF = pd.merge(df1, df2, left_index=True, right_index=True)
Output: the same ten-row dataframe of split E_no/E_name pairs shown in the answers above.

How to split a Dataframe column whose data is not unique

I have a column called Users in a dataframe which doesn't have a unique format. I am doing a data cleanup project, as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the query below, which broke the data frame down as follows:
df2 = df
df2 = df2.join(df['Users_config'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
and further breaking the above df on "," using the same query, I got this output:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want to get the output as below. I feel the best way to name the columns is to use the key from the value itself ("Name" from "Name":"Martin"); if we hardcode them using df.rename, the column names will get mismatched.
Company Name_1 Email_1 EmpType_1 Dept_1 Name_2 Email_2 Dept_2
1 Martin name_1#email.com Full Rick name_2#email.com "HR"
2 John name_2#email.com" Full Sales
Is there any way I can get the above output from the original dataframe?
Use:
import ast

import pandas as pd

df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:
First we use ast.literal_eval to evaluate the strings in Users column, then use DataFrame.explode on column Users to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
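As an aside, the dict-to-columns step can also be done with pd.json_normalize; a sketch equivalent to the join above, assuming Users has already been through ast.literal_eval:

d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.json_normalize(d.pop('Users').tolist()))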

Show differences at row level between columns of 2 dataframes Pandas

I have 2 dataframes containing names and some demographic information; the dataframes are not identical due to monthly changes.
I'd like to create another df showing just the names of people where there are changes in their COUNTRY, JOBCODE, or MANAGERNAME columns, and also showing what kind of changes these are.
I have tried the following code so far and am able to detect changes in the COUNTRY column for the rows common to the 2 dataframes.
But I am not sure how to capture the movement in the MOVEMENT columns. I'd appreciate any form of help.
# Merge first
dfmerge = pd.merge(df1, df2, how='inner', on='EMAIL')

# create function to get COUNTRY_CHANGE column
def change_in(dfmerge):
    if dfmerge['COUNTRY_x'] != dfmerge['COUNTRY_y']:
        return 'YES'
    else:
        return 'NO'

dfmerge['COUNTRYCHANGE'] = dfmerge.apply(change_in, axis=1)
Dataframe 1
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com USA 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 1222 Cindy Lee
Jessica Lang jessicalang#123.com AUSTRALIA 1221 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Mike Lens
Samir Bala samirbala#123.com CANADA 1221 Ricky Easton
Dataframe 2
NAME EMAIL COUNTRY JOBCODE MANAGERNAME
Jason Kelly jasonkelly#123.com VIETNAM 1221 Jon Gilman
Jon Gilman jongilman#123.com CANADA 4464 Sheldon Tracey
Jessica Lang jessicalang#123.com AUSTRALIA 2224 Esther Donato
Bob Wilder bobwilder#123.com ROMANIA 1355 Emilia Tanner
Desired Output
EMAIL COUNTRY_CHANGE COUNTRY_MOVEMENT JOBCODE_CHANGE JOBCODE_MOVEMENT MGR_CHANGE MGR_MOVEMENT
jasonkelly#123.com YES FROM USA TO VIETNAM NO NO NO NO
jongilman#123.com NO NO YES FROM 1222 to 4464 YES FROM Cindy Lee to Sheldon Tracey
jessicalang#123.com NO NO YES FROM 1221 to 2224 NO NO
bobwilder#123.com NO NO NO NO YES FROM Mike Lens to Emilia Tanner
There is no direct feature in pandas for this, but we may leverage the merge function as follows: we merge the dataframes, giving suffixes to the overlapping columns, and then report their differences with this code.
# Assuming df1 and df2 are the input data frames from your example
# (their key columns are the uppercase NAME and EMAIL).
df3 = pd.merge(df1, df2, on=['NAME', 'EMAIL'], suffixes=['past', 'present'])
dfans = pd.DataFrame()  # this is the final output data frame
for column in df1.columns:
    if not (column + 'present' in df3.columns or column + 'past' in df3.columns):
        # Here we handle those columns which were not suffixed by the merge, i.e. NAME and EMAIL.
        dfans.loc[:, column] = df1.loc[:, column]  # filling NAME and EMAIL as is
    else:
        # string manipulation to name columns correctly in the output
        newColumn1 = '{}_CHANGE'.format(column)
        newColumn2 = '{}_MOVEMENT'.format(column)
        past, present = "{}past".format(column), "{}present".format(column)
        # creating the output based on the input
        dfans.loc[:, newColumn1] = (df3[past] == df3[present]).map(lambda x: "YES" if x != 1 else "NO")
        dfans.loc[:, newColumn2] = ["FROM {} TO {}".format(x, y) if x != y else "NO" for x, y in
                                    zip(df3[past], df3[present])]
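If the YES/NO mapping above feels indirect, a more compact sketch (not from the thread) builds the same CHANGE/MOVEMENT pairs with numpy.where; the names m and changed are mine, and it assumes df1 and df2 from the question with EMAIL as the key:

import numpy as np
import pandas as pd

m = df1.merge(df2, on='EMAIL', suffixes=('_old', '_new'))
for col in ['COUNTRY', 'JOBCODE', 'MANAGERNAME']:
    changed = m[col + '_old'] != m[col + '_new']
    m[col + '_CHANGE'] = np.where(changed, 'YES', 'NO')
    # astype(str) so that integer JOBCODEs concatenate cleanly
    m[col + '_MOVEMENT'] = np.where(
        changed,
        'FROM ' + m[col + '_old'].astype(str) + ' TO ' + m[col + '_new'].astype(str),
        'NO')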

Show One to Many in One Row

I'd like to reformat a cross-reference table I am using before merging it with my data.
Certain parts have a one-to-many relationship, and I want to reformat those cases into a single row so I capture all the info when I later merge/vlookup this table against my data. Most of the data is a one-to-one relationship, so the solution has to be selective.
Currently:
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
What I want:
Marketing Number SKU
0 XXX 111; 222; 333
Use groupby to get the SKU values into a list, then join the list values.
Since the values in the list are of int type, they must be converted to strings before joining.
import pandas as pd

# data and dataframe
data = {'Marketing Number': ['XXX', 'XXX', 'XXX', 'y', 'z', 'a'],
        'SKU': [111, 222, 333, 444, 555, 666]}
df = pd.DataFrame(data)
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
3 y 444
4 z 555
5 a 666
# groupby with agg list
dfg = df.groupby('Marketing Number', as_index=False).agg({'SKU': list})
# join into string
dfg.SKU = dfg.SKU.apply(lambda x: '; '.join([str(i) for i in x]))
Marketing Number SKU
0 XXX 111; 222; 333
1 a 666
2 y 444
3 z 555
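The two steps can also be collapsed into a single aggregation; a sketch with the same df:

dfg = (df.groupby('Marketing Number', as_index=False)['SKU']
         .agg(lambda s: '; '.join(s.astype(str))))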

How can I create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

The following dataframe shows trips made by employees of different companies:
source:
import pandas as pd
emp_trips = {'Name': ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart': ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return': ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
             }
df = pd.DataFrame(emp_trips, columns=['Name', 'Company', 'Depart', 'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])
output:
Name Company Depart Return Charges
0 Bob ABC 2020-01-01 2020-01-31 10.10
1 Joe ABC 2020-01-01 2020-02-15 20.25
2 Sue ABC 2020-01-06 2020-02-20 30.32
3 Jack HIJ 2020-01-01 2020-03-01 40.00
4 Henry HIJ 2020-05-01 2020-05-05 50.01
5 Frank DEF 2020-01-13 2020-01-15 60.32
6 Lee DEF 2020-01-12 2020-01-30 70.99
7 Jack DEF 2020-01-14 2020-02-02 80.87
How can I create another dataframe based on the following aspects:
The original dataframe is based on employee names/trips.
The generated dataframe will be based on companies grouped by overlapping dates.
The 'Name' column will not be included as it is no longer needed.
The 'Company' column will remain.
The 'Depart' date will be of the earliest date of any overlapping trip dates.
The 'Return' date will be of the latest date of any overlapping trip dates.
Any company trips that do not have overlapping dates will be its own entry/row.
The 'Charges' for each trip will be totaled for the new company entry.
Here is the desired output of the new dataframe:
Company Depart Return Charges
0 ABC 01/01/2020 02/20/2020 60.67
1 HIJ 01/01/2020 03/01/2020 40.00
2 HIJ 05/01/2020 05/05/2020 50.01
3 DEF 01/12/2020 02/02/2020 212.18
I've looked into the following as possible solutions:
Create a hierarchical index based on the company and date. As I worked through this, I realized that all this really does is create a hierarchical index but that's based on the specific columns. Also, this method won't aggregate the individual rows into summary rows.
df1 = df.set_index(['Company', not exactly sure how to say overlapping dates])
I also tried using timedelta, but it resulted in True/False values in a separate column, and I'm not entirely sure how those would be combined into a single row based on overlapping dates and company. Also, I don't think groupby('Company') alone works, since a company can have separate, non-overlapping trips that need their own rows.
from datetime import timedelta

df['trips_overlap'] = (df.groupby('Company')
                         .apply(lambda x: (x['Return'].shift() - x['Depart']) > timedelta(0))
                         .reset_index(level=0, drop=True))
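For what it's worth, a sketch of one possible approach (not from the thread; the helper name merge_trips is mine): sort each company's trips by Depart and start a new group whenever a trip departs after the latest Return seen so far, then aggregate each group. Assuming the df built above:

import pandas as pd

def merge_trips(g):
    # sort trips, then flag rows that start after every earlier trip has returned
    g = g.sort_values('Depart')
    new_group = g['Depart'] > g['Return'].cummax().shift()
    # consecutive overlapping trips share a group id
    ids = new_group.cumsum()
    return g.groupby(ids).agg(Company=('Company', 'first'),
                              Depart=('Depart', 'min'),
                              Return=('Return', 'max'),
                              Charges=('Charges', 'sum'))

out = (df.groupby('Company', group_keys=False, sort=False)
         .apply(merge_trips)
         .reset_index(drop=True))
print(out)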
