Group two Dataframes with MultiIndex columns - python-3.x

I have two dataframes, and I would like to create a new one that includes all the columns unique to either source dataframe plus the aggregation of the common columns.
(The two sample dataframes and the expected result were shown as images.)
All the column index levels must match for two columns to be aggregated together.
I have written the following code:
df_all = pd.DataFrame
for dfColumn in df_1:
    if dfColumn in df_2.columns:
        df_all[dfColumn] = df_1.loc[:, dfColumn].add(df_2.loc[:, dfColumn])
    else:
        df_all[dfColumn] = df_1[dfColumn]
for dfColumn in df_2:
    if dfColumn not in df_all.columns:
        df_all[dfColumn] = df_2[dfColumn]
However, I get an error on the following line when I try to assign the value to df_all[dfColumn]:
df_all[dfColumn] = df_1.loc[:, dfColumn].add(df_2.loc[:, dfColumn])
All the different possibilities Python offers drive me crazy, but I cannot find one that makes this work.
Thanks for your help and time.

Actually, I fixed it with just the following:
df_all = pd.concat([df_1, df_2], axis=1)
df_all = df_all.groupby(level=[0, 1, 2], axis=1).sum()
Is there a way to replace level=[0, 1, 2] with something like level=df_all.columns.levels?
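One option (a minimal sketch, reusing df_all from the concat above) is to derive the level list from the column index itself, so it adapts to any number of levels:
# nlevels gives the depth of the column MultiIndex, so this works for any depth
levels = list(range(df_all.columns.nlevels))
df_all = df_all.groupby(level=levels, axis=1).sum()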

Related

Data to explode between two columns

My current dataframe looks as below:
import pandas as pd

existing_data = {'STORE_ID': ['1234','5678','9876','3456','6789'],
                 'FULFILLMENT_TYPE': ['DELIVERY','DRIVE','DELIVERY','DRIVE','DELIVERY'],
                 'FORECAST_DATE': ['2020-08-01','2020-08-02','2020-08-03','2020-08-04','2020-08-05'],
                 'DAY_OF_WEEK': ['SATURDAY','SUNDAY','MONDAY','TUESDAY','WEDNESDAY'],
                 'START_HOUR': [8,8,6,7,9],
                 'END_HOUR': [19,19,18,19,17]}
existing = pd.DataFrame(data=existing_data)
I need the data exploded between the start and end hour so that each hour becomes a separate row, like below:
needed_data = {'STORE_ID': ['1234','1234','1234','1234','1234'],
               'FULFILLMENT_TYPE': ['DELIVERY','DELIVERY','DELIVERY','DELIVERY','DELIVERY'],
               'FORECAST_DATE': ['2020-08-01','2020-08-01','2020-08-01','2020-08-01','2020-08-01'],
               'DAY_OF_WEEK': ['SATURDAY','SATURDAY','SATURDAY','SATURDAY','SATURDAY'],
               'HOUR': [8,9,10,11,12]}
required = pd.DataFrame(data=needed_data)
I'm not sure how to achieve this. I know it should be possible with explode(), but I haven't been able to get it working.
If the DataFrame is small or performance is not important, build a range from both columns and use DataFrame.explode:
# One range object per row, inclusive of END_HOUR
existing['HOUR'] = existing.apply(lambda x: range(x['START_HOUR'], x['END_HOUR']+1), axis=1)
existing = (existing.explode('HOUR')
                    .reset_index(drop=True)
                    .drop(['START_HOUR','END_HOUR'], axis=1))
If performance is important, use Index.repeat with the difference of the two columns, and then add a counter from GroupBy.cumcount to START_HOUR:
# Number of hours each row covers (inclusive range)
s = existing["END_HOUR"].sub(existing["START_HOUR"]) + 1
# Repeat each row once per hour it covers
df = existing.loc[existing.index.repeat(s)].copy()
# Counter within each original row: 0, 1, 2, ...
add = df.groupby(level=0).cumcount()
df['HOUR'] = df["START_HOUR"].add(add)
df = df.reset_index(drop=True).drop(['START_HOUR','END_HOUR'], axis=1)
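For reuse, the fast approach can be wrapped in a small helper (a sketch; explode_hours is a hypothetical name, not part of the original answer):
def explode_hours(frame, start_col='START_HOUR', end_col='END_HOUR', out_col='HOUR'):
    # Repeat each row once per hour in its inclusive [start, end] range,
    # then number the repeats within each original row
    counts = frame[end_col].sub(frame[start_col]) + 1
    out = frame.loc[frame.index.repeat(counts)].copy()
    out[out_col] = out[start_col].add(out.groupby(level=0).cumcount())
    return out.reset_index(drop=True).drop([start_col, end_col], axis=1)

result = explode_hours(existing)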

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'), ('countries','canada'), ('countries','usa'),
             ('countries','chile'), ('planets','earth'), ('planets','mars'),
             ('cities','sf'), ('cities','london')]
I'd like to create a new DataFrame with a top level consisting of 'cities', 'countries', and 'planets' and the corresponding original columns underneath. I am not concerned about order, but I do need proper alignment.
It can be assumed that coltuples will not be missing any of the columns from df, but it may contain extraneous pairs, and the ordering of the pairs can be arbitrary.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
thanks in advance!
Two things to notice: you want set_axis rather than reindex (reindex looks up the new tuple labels among the existing flat columns, finds no matches, and fills NaN), and sorting by the original column order ensures the correct label is assigned to the correct column (this is what the key= argument to sorted does).
# Keep only the tuples whose second level is an actual column of df
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
# Sort the tuples into df's current column order
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1    cities            countries            planets
level2       nyc    london     canada     chile     earth
0       0.028033  0.540977  -0.056096  1.675698 -0.328630
1       1.170465 -1.003825   0.882126  0.453294 -1.127752
2      -0.187466 -0.192546   0.269802 -1.225172 -0.548491
3       2.272900 -0.085427   0.029242 -2.258696  1.034485
4      -1.243871 -1.660432  -0.051674  2.098602 -2.098941
5      -0.820820 -0.289754   0.019348  0.176778  0.395959
6       1.346459 -0.260583   0.212008 -1.071501  0.945545
7       0.673351  1.133616   1.117379 -0.531403  1.467604
8       0.332187 -3.541103  -0.222365  1.035739 -0.485742
9      -0.605965 -1.442371  -1.628210 -0.711887 -2.104755
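An equivalent sketch (assuming the same df and coltuples) that skips the explicit sort by building the tuples directly in df's existing column order from a lookup dict:
# Map each bottom-level label to its top-level group
group_of = {bottom: top for top, bottom in coltuples}
multi_index = pd.MultiIndex.from_tuples(
    [(group_of[c], c) for c in df.columns], names=['level1', 'level2'])
df = df.set_axis(multi_index, axis=1)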

create a new pandas column based on condition in one column and assigning the value from multiple columns in the same data frame

I am trying to create a new column based on a condition on one column, taking the value from one of several other columns in the same data frame.
Below is the code that I tried.
data["Associate"]= data.apply(lambda x: np.where(x.BU=='SBS',x.SBS,x.MAS_NAS_HRO),axis=1)
BU   SBS    MAS_NAS_HRO  Associate
SBS  Ren    Sunil        Ren
MAS  Uma    Majid        Majid
NAS  Sumit  Uma          Uma
The table above is what I am trying to achieve; instead I get this error:
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
I also tried:
data['Associate'] = ''
data.loc[data['BU'] == 'SBS', 'Associate'] = data['SBS']
I tried the following as well, and it did not work either.
associate_details=['SBS']
associate1=data[data.BU.isin(associate_details)]
choice_assoc1=efile_data['SBS']
associate2=data[~data.BU.isin(associate_details)]
choice_assoc2=efile_data['MAS_NAS_HRO']
efile_data['Associate']=np.select([associate1,associate2],[choice_assoc1,choice_assoc2],default=np.nan)
I get this message: Empty DataFrame, [0 rows x 4 columns].
How do I fix these errors?
Regards,
Ren.
I'm not sure why your first piece of code isn't working because it worked for me, but you could try one of the other methods below.
I'm using numpy version 1.16.4 and pandas version 0.24.2 on Python 3.7.3.
import numpy as np
import pandas as pd

data = {
    'BU': ['SBS', 'MAS', 'NAS'],
    'SBS': ['Ren', 'Uma', 'Sumit'],
    'MAS_NAS_HRO': ['Sunil', 'Majid', 'Uma']
}
df = pd.DataFrame(data)
# The following all work, but I prefer no. 3. It's faster and more concise.
# 1. Using apply with lambda and if statement
df['Associates'] = df.apply(lambda row: row['SBS'] if row['BU'] == 'SBS' else row['MAS_NAS_HRO'], axis=1)
# 2. Using apply with lambda and np.where (op's method)
df["Associate"] = df.apply(lambda x: np.where(x.BU=='SBS', x.SBS, x.MAS_NAS_HRO), axis=1)
# 3. Using np.where only (s/o to @Erfan)
df['Associate'] = np.where(df['BU']=='SBS', df['SBS'], df['MAS_NAS_HRO'])
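As for the np.select attempt in the question: np.select expects boolean condition arrays, not filtered DataFrames, which is likely why it produced an empty result. A minimal sketch of that approach, using the df built above:
# Conditions must be boolean masks aligned with df, not subset DataFrames
conditions = [df['BU'] == 'SBS']
choices = [df['SBS']]
df['Associate'] = np.select(conditions, choices, default=df['MAS_NAS_HRO'])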

Compare column names of multiple pandas dataframes

In the code below, I create a list of dataframes. Now I want to check whether all the dataframes in the dataframes list have the same column names (I only want to compare headers, not values), and if the condition is not met, it should raise an error.
import os
import pandas as pd

dataframes = []
list_of_files = os.listdir(os.path.join(folder_location, quarter, "inputs"))
for files in list_of_files:
    df = pd.read_excel(os.path.join(folder_location, quarter, "inputs", files),
                       header=[0, 1], sheetname="Ratings Inputs",
                       parse_cols="B:AC", index_col=None).reset_index()
    df.columns = pd.MultiIndex.from_tuples([tuple(df.columns.names)]
                                           + list(df.columns)[1:])
    dataframes.append(df)
Not the most elegant solution, but it will get you there:
np.all([sorted(dataframes[0].columns) == sorted(i.columns) for i in dataframes])
sorted serves two purposes here: it converts each column index to a list, and it keeps the comparison from failing just because the columns are in a different order.
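Since the question asks for an error rather than a boolean, here is a small sketch (assuming the dataframes list from above) that raises on the first mismatch:
# Compare every frame's headers to the first one, order-insensitively
first = sorted(dataframes[0].columns)
for i, frame in enumerate(dataframes[1:], start=1):
    if sorted(frame.columns) != first:
        raise ValueError(f"dataframes[{i}] has different columns than dataframes[0]")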

assigning a list of tokens as a row to dataframe

I am attempting to create a dataframe where the first column is a list of tokens and where additional columns of information can be added. However, pandas will not allow a list of tokens to be assigned as one column value.
So the code looks as below:
import numpy as np
import pandas as pd

array1 = ['two', 'sample', 'statistical', 'inferences', 'includes']
array2 = ['references', 'please', 'see', 'next', 'page', 'the', 'material', 'of', 'these']
array3 = ['time', 'student', 'interest', 'and', 'lecturer', 'preference', 'other', 'topics']

## initialise list (renamed from 'list', which shadows the built-in)
token_lists = []
token_lists.append(array1)
token_lists.append(array2)
token_lists.append(array3)

## create dataFrame
numberOfRows = len(token_lists)
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('data', 'diversity'))
df.iloc[0] = token_lists[0]
the error message reads
ValueError: cannot copy sequence with size 6 to array axis with dimension 2
Any insight into how I can better create the dataframe and update its columns would be appreciated.
Thanks
OK, so the answer was fairly simple; posting it for posterity.
When adding lists as rows, I needed to include the column name and position, so the code looks like below.
df.data[0] = array1
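One caveat on that line: df.data[0] = array1 uses chained indexing, which can trigger SettingWithCopyWarning. A safer sketch (assuming the same df) uses .at to set the single cell:
# .at assigns one cell by label and accepts a list as the value
df.at[0, 'data'] = array1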
