Using Pandas groupby with total column and total row - python-3.x

The DataFrame used in my code lists capital and yield amounts belonging to owners. The purpose is to group the values by owner, add a TOTAL column to the grouped DataFrame, and then append a global TOTAL row.
Here's the code:
import pandas as pd
OWNER = 'OWNER'
CAPITAL = 'CAPITAL'
YIELD = 'YIELD AMT'
TOTAL = 'TOTAL'
# defining the dataframe
df = pd.DataFrame({OWNER: 2 * ['Joe'] + 3 * ['Carla'] + ['Rob'],
                   CAPITAL: [10000, 5000, 20000, 3000, -4000, 2000],
                   YIELD: [1000, 500, 2000, 300, 400, 200]})
'''
   OWNER  CAPITAL  YIELD AMT
0    Joe    10000       1000
1    Joe     5000        500
2  Carla    20000       2000
3  Carla     3000        300
4  Carla    -4000        400
5    Rob     2000        200
'''
print(df)
print()
# grouping the rows by owner
dfg = df.groupby([OWNER]).sum().reset_index()
'''
   OWNER  CAPITAL  YIELD AMT
0  Carla    19000       2700
1    Joe    15000       1500
2    Rob     2000        200
'''
print(dfg)
print()
# adding a TOTAL column
for index in range(0, len(dfg)):
    dfg.loc[index, TOTAL] = dfg.loc[index, CAPITAL] + dfg.loc[index, YIELD]
'''
   OWNER  CAPITAL  YIELD AMT    TOTAL
0  Carla    19000       2700  21700.0
1    Joe    15000       1500  16500.0
2    Rob     2000        200   2200.0
'''
print(dfg)
print()
# resetting index to OWNER column
dfg = dfg.set_index(OWNER)
'''
       CAPITAL  YIELD AMT    TOTAL
OWNER
Carla    19000       2700  21700.0
Joe      15000       1500  16500.0
Rob       2000        200   2200.0
'''
print(dfg)
print()
# finally, adding a TOTAL row
dfg.loc[TOTAL] = dfg.sum(numeric_only=True, axis=0)[[CAPITAL, YIELD, TOTAL]]
'''
       CAPITAL  YIELD AMT    TOTAL
OWNER
Carla  19000.0     2700.0  21700.0
Joe    15000.0     1500.0  16500.0
Rob     2000.0      200.0   2200.0
TOTAL  36000.0     4400.0  40400.0
'''
print(dfg.fillna(''))
My question is: is there a more concise way of coding the total column or total row computation, using Pandas agg() or aggregate() and a lambda expression?

df[TOTAL] = df[CAPITAL] + df[YIELD]
output = df.groupby(by=[OWNER]).sum()
is what you are looking for; output is the DataFrame you need.
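If the global total row is wanted as well, here is a minimal sketch without any explicit loop, agg() or lambda; just two axis-wise sums after the groupby (one possible approach, not the only one):

import pandas as pd

OWNER, CAPITAL, YIELD, TOTAL = 'OWNER', 'CAPITAL', 'YIELD AMT', 'TOTAL'
df = pd.DataFrame({OWNER: 2 * ['Joe'] + 3 * ['Carla'] + ['Rob'],
                   CAPITAL: [10000, 5000, 20000, 3000, -4000, 2000],
                   YIELD: [1000, 500, 2000, 300, 400, 200]})

dfg = df.groupby(OWNER).sum()       # one row per owner, OWNER as index
dfg[TOTAL] = dfg.sum(axis=1)        # total column: row-wise sum of CAPITAL and YIELD AMT
dfg.loc[TOTAL] = dfg.sum(axis=0)    # grand-total row: column-wise sums
print(dfg)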

Related

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below, matching each row of df1 with the combination of rows from the df2 Sales column whose sum is closest to the target value in the df1 Total Sales column; the Name & Date columns should be the same in both dataframes when joining (as shown in the expected output).
For example: df1 row 0 should be matched only with df2 rows 0 & 1, since their Name & Date values are the same, namely Name: John and Date: 2021-10-01.
df1:
df1 = pd.DataFrame({"Name": {"0": "John", "1": "John", "2": "Jack", "3": "Nancy", "4": "Ahmed"},
                    "Date": {"0": "2021-10-01", "1": "2021-11-01", "2": "2021-10-10", "3": "2021-10-12", "4": "2021-10-30"},
                    "Total Sales": {"0": 15500, "1": 5500, "2": 17600, "3": 20700, "4": 12000}})
    Name        Date  Total Sales
0   John  2021-10-01        15500
1   John  2021-11-01         5500
2   Jack  2021-10-10        17600
3  Nancy  2021-10-12        20700
4  Ahmed  2021-10-30        12000
df2:
df2 = pd.DataFrame({"ID": {"0": "JO1", "1": "JO2", "2": "JO3", "3": "JO4", "4": "JA1", "5": "JA2", "6": "NA1",
                           "7": "NA2", "8": "NA3", "9": "NA4", "10": "AH1", "11": "AH2", "12": "AH3", "13": "AH3"},
                    "Name": {"0": "John", "1": "John", "2": "John", "3": "John", "4": "Jack", "5": "Jack", "6": "Nancy", "7": "Nancy",
                             "8": "Nancy", "9": "Nancy", "10": "Ahmed", "11": "Ahmed", "12": "Ahmed", "13": "Ahmed"},
                    "Date": {"0": "2021-10-01", "1": "2021-10-01", "2": "2021-11-01", "3": "2021-11-01", "4": "2021-10-10", "5": "2021-10-10", "6": "2021-10-12", "7": "2021-10-12",
                             "8": "2021-10-12", "9": "2021-10-12", "10": "2021-10-30", "11": "2021-10-30", "12": "2021-10-30", "13": "2021-10-29"},
                    "Sales": {"0": 10000, "1": 5000, "2": 1000, "3": 5500, "4": 10000, "5": 7000, "6": 20000,
                              "7": 100, "8": 500, "9": 100, "10": 5000, "11": 7000, "12": 10000, "13": 12000}})
     ID   Name        Date  Sales
0   JO1   John  2021-10-01  10000
1   JO2   John  2021-10-01   5000
2   JO3   John  2021-11-01   1000
3   JO4   John  2021-11-01   5500
4   JA1   Jack  2021-10-10  10000
5   JA2   Jack  2021-10-10   7000
6   NA1  Nancy  2021-10-12  20000
7   NA2  Nancy  2021-10-12    100
8   NA3  Nancy  2021-10-12    500
9   NA4  Nancy  2021-10-12    100
10  AH1  Ahmed  2021-10-30   5000
11  AH2  Ahmed  2021-10-30   7000
12  AH3  Ahmed  2021-10-30  10000
13  AH3  Ahmed  2021-10-29  12000
Expected Output:
    Name        Date  Total Sales            Comb IDs  Comb Total
0   John  2021-10-01        15500            JO1, JO2     15000.0
1   John  2021-11-01         5500                 JO4      5500.0
2   Jack  2021-10-10        17600            JA1, JA2     17000.0
3  Nancy  2021-10-12        20700  NA1, NA2, NA3, NA4     20700.0
4  Ahmed  2021-10-30        12000            AH1, AH2     12000.0
What I have tried below works for only one row at a time, but I'm not sure how to apply it to the dataframes to get the expected output.
The variable numbers in the script below represents the Sales column of df2, and the variable target represents the Total Sales column of df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000
best_combination = ((None,))
best_result = math.inf
best_sum = 0
for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        sum = 0
        for number in combination:
            sum += number
        result = target - sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = sum
print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote which finds the best sum and turn it into a function (let's call it opt) that takes a target and a dataframe (a subset of df2) as parameters. It needs to return the list of IDs which correspond to the optimal combination.
Write another function (let's call it calc) which takes three arguments: name, date and target. This function filters df2 based on name and date, passes the result along with the target to opt, and returns opt's result. Finally, iterate through the rows of df1 and call calc with each row's values (or alternatively use pandas.DataFrame.apply). A sketch of this recipe follows.
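Here's a rough sketch of that recipe, assuming the df1 and df2 defined above (brute-forcing every combination is exponential, so this only suits small per-group row counts; a sketch, not a definitive implementation):

import itertools
import math
import pandas as pd

def opt(target, group):
    # group is the subset of df2 for one (Name, Date) pair; returns the IDs
    # and the sum of the Sales combination whose sum is closest to target
    rows = list(group[['ID', 'Sales']].itertuples(index=False))
    best_ids, best_sum, best_diff = [], 0, math.inf
    for r in range(len(rows) + 1):
        for comb in itertools.combinations(rows, r):
            s = sum(row.Sales for row in comb)
            if abs(target - s) < best_diff:
                best_diff = abs(target - s)
                best_ids, best_sum = [row.ID for row in comb], s
    return best_ids, best_sum

def calc(row):
    # filter df2 on Name and Date, then delegate to opt
    subset = df2[(df2['Name'] == row['Name']) & (df2['Date'] == row['Date'])]
    ids, total = opt(row['Total Sales'], subset)
    return pd.Series({'Comb IDs': ', '.join(ids), 'Comb Total': float(total)})

result = df1.join(df1.apply(calc, axis=1))
print(result)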

Adding a grouped column header to an existing dataframe

How can we add a grouped column header to an existing Pandas dataframe, on a supplementary row above two sub-column headers? The desired result is shown in the output of the working code further below.
Here's the current code which adds the CAPITAL header, but does not position it correctly.
import pandas as pd
OWNER = 'OWNER'
CAPITAL = 'CAPITAL'
USD = 'USD'
CHF = 'CHF'
YIELD = 'YIELD AMT'
df = pd.DataFrame({
    OWNER: 2 * ['JOE'] + 3 * ['ROB'],
    USD: [10000, 30000, 4000, 24000, 16000],
    CHF: [9000, 27000, 3600, 21600, 14400],
    YIELD: [100, 300, 40, 240, 160]
})
print(df)
'''
  OWNER    USD    CHF  YIELD AMT
0   JOE  10000   9000        100
1   JOE  30000  27000        300
2   ROB   4000   3600         40
3   ROB  24000  21600        240
4   ROB  16000  14400        160
'''
df.columns = pd.MultiIndex.from_product([[CAPITAL], df.columns])
print('\nUsing pd.from_product()')
print(df)
'''
  CAPITAL
    OWNER    USD    CHF YIELD AMT
0     JOE  10000   9000       100
1     JOE  30000  27000       300
2     ROB   4000   3600        40
3     ROB  24000  21600       240
4     ROB  16000  14400       160
'''
The solution is to use pd.MultiIndex.from_arrays() instead of pd.MultiIndex.from_product(): from_product() places CAPITAL above every column, whereas from_arrays() lets you pick the top-level label of each column individually (here, blanks for every column except one CAPITAL label). Here's the code:
import pandas as pd
OWNER = 'OWNER'
CAPITAL = 'CAPITAL'
USD = 'USD'
CHF = 'CHF'
YIELD = 'YIELD AMT'
df_ok = pd.DataFrame({
    OWNER: 2 * ['JOE'] + 3 * ['ROB'],
    USD: [10000, 30000, 4000, 24000, 16000],
    CHF: [9000, 27000, 3600, 21600, 14400],
    YIELD: [100, 300, 40, 240, 160]
})
df_ok.columns = pd.MultiIndex.from_arrays([[' ', ' ', CAPITAL, ' '], df_ok.columns])
print('\nUsing pd.from_arrays()')
print()
print(df_ok)
'''
                CAPITAL
  OWNER    USD      CHF YIELD AMT
0   JOE  10000     9000       100
1   JOE  30000    27000       300
2   ROB   4000     3600        40
3   ROB  24000    21600       240
4   ROB  16000    14400       160
'''

Adding total rows to a Pandas DataFrame

I define a Pandas DataFrame containing several deposit/withdrawal rows for different owners. I want to add a total row for each owner, totalizing the deposits/withdrawals as well as the yield amounts generated by each capital amount.
Here's my code, which produces the TARGET DATAFRAME printed at the end:
import pandas as pd
OWNER = 'OWNER'
DEPWITHDR = 'DEP/WITHDR'
DATEFROM = 'DATE FROM'
DATETO = 'DATE TO'
CAPITAL = 'CAPITAL'
YIELD = 'YIELD AMT'
TOTAL = 'TOTAL'
df = pd.DataFrame({
    OWNER: 2 * ['JOE'] + 3 * ['ROB'],
    DEPWITHDR: [10000, 20000, 4000, 20000, -8000],
    CAPITAL: [10000, 30000, 4000, 24000, 16000],
    DATEFROM: ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03', '2021-01-04'],
    DATETO: ['2021-01-01', '2021-01-05', '2021-01-02', '2021-01-03', '2021-01-05'],
    YIELD: [100, 1200, 80, 240, 320]
})
print('SOURCE DATAFRAME\n')
print(df)
print()
newDf = pd.DataFrame(columns=[OWNER, DEPWITHDR, CAPITAL, DATEFROM, DATETO, YIELD])
currentOwner = df.loc[1, OWNER]
# using groupby to compute the totals of the two columns
dfTotal = df.groupby([OWNER]).agg({DEPWITHDR: 'sum', YIELD: 'sum'}).reset_index()
totalIndex = 0
# deactivating SettingWithCopyWarning caused by totalRow[OWNER] += ' total'
pd.set_option('mode.chained_assignment', None)
for index, row in df.iterrows():
    if currentOwner == row[OWNER]:
        newDf = newDf.append({OWNER: row[OWNER],
                              DEPWITHDR: row[DEPWITHDR],
                              CAPITAL: row[CAPITAL],
                              DATEFROM: row[DATEFROM],
                              DATETO: row[DATETO],
                              YIELD: row[YIELD]}, ignore_index=True)
    else:
        totalRow = dfTotal.loc[totalIndex]
        totalRow[OWNER] += ' total'
        newDf = newDf.append(totalRow, ignore_index=True)
        totalIndex += 1
        newDf = newDf.append({OWNER: '',
                              DEPWITHDR: '',
                              CAPITAL: '',
                              DATEFROM: '',
                              DATETO: '',
                              YIELD: ''}, ignore_index=True)
        newDf = newDf.append({OWNER: row[OWNER],
                              DEPWITHDR: row[DEPWITHDR],
                              CAPITAL: row[CAPITAL],
                              DATEFROM: row[DATEFROM],
                              DATETO: row[DATETO],
                              YIELD: row[YIELD]}, ignore_index=True)
        currentOwner = row[OWNER]
totalRow = dfTotal.loc[totalIndex]
totalRow[OWNER] += ' total'
newDf = newDf.append(totalRow, ignore_index=True)
print('TARGET DATAFRAME\n')
print(newDf.fillna(''))
My question is: what is a better, more Pandas-friendly way to obtain the desired result?
You can use groupby and concat:
df_total = pd.concat((
    df,
    df.replace({o: o + ' total' for o in df[OWNER].unique()})
      .groupby(OWNER).agg({DEPWITHDR: sum, YIELD: sum}).reset_index()
)).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO])
In detail:
df.replace({o: o + ' total' for o in df[OWNER].unique()}): replaces each occurrence of an owner's name with the name itself plus the string ' total' (e.g., 'JOE' -> 'JOE total'), so that the result of the groupby will carry those values in the OWNER column.
groupby(OWNER).agg({DEPWITHDR: sum, YIELD: sum}): gets the sum of the DEPWITHDR and YIELD columns for each owner.
pd.concat(...).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO]): concatenates the original DataFrame with the totals, then sorts the rows by OWNER, then DATE FROM, then DATE TO, so that the total row of each owner lands right after that owner's rows (because it ends with ' total') and the remaining rows stay chronologically sorted by DATE FROM and DATE TO.
Here's df_total:
   index      OWNER DEP/WITHDR CAPITAL   DATE FROM     DATE TO YIELD AMT
0      0        JOE      10000   10000  2021-01-01  2021-01-01       100
1      1        JOE      20000   30000  2021-01-02  2021-01-05      1200
5      0  JOE total      30000                                      1300
2      2        ROB       4000    4000  2021-01-01  2021-01-02        80
3      3        ROB      20000   24000  2021-01-03  2021-01-03       240
4      4        ROB      -8000   16000  2021-01-04  2021-01-05       320
6      1  ROB total      16000                                       640
IMHO, I'd create a separate DataFrame for each owner, holding only his/her data, plus a summary DataFrame with the totals per owner; a rough sketch follows. But maybe, in your use case, this is the best solution.
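A rough sketch of that alternative, assuming the df and the column-name constants defined in the question:

# one DataFrame per owner, keyed by owner name
per_owner = {owner: grp.reset_index(drop=True)
             for owner, grp in df.groupby(OWNER)}

# summary DataFrame with one row of totals per owner
summary = df.groupby(OWNER).agg({DEPWITHDR: 'sum', YIELD: 'sum'}).reset_index()

print(per_owner['JOE'])
print()
print(summary)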

Collapse/Transpose Columns of a DataFrame Based on Repeating - pandas

I have a data frame sample_df like this,
   id   pd     pd_dt  pd_tp  pd.1   pd_dt.1  pd_tp.1  pd.2   pd_dt.2  pd_tp.2
0   1  100  per year    468   200  per year      400   300  per year      320
1   2  100  per year     60   200  per year      890   300  per year      855
I need my output like this,
id   pd     pd_dt  pd_tp
 1  100  per year    468
 1  200  per year    400
 1  300  per year    320
 2  100  per year     60
 2  200  per year    890
 2  300  per year    855
I tried the following:
sample_df.stack().reset_index().drop('level_1', axis=1)
This does not work: the pd, pd_dt and pd_tp columns repeat with .1, .2, ... suffixes. How can I achieve the desired output?
You want pd.wide_to_long, but with a tweak, since your first group of columns does not share the same suffix pattern as the rest:
# rename: give the first group of columns a '.0' suffix so all groups match
df.columns = [x + '.0' if '.' not in x and x != 'id' else x
              for x in df.columns]
pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                i='id', j='order', sep='.')
Output:
           pd     pd_dt  pd_tp
id order
1  0      100  per year    468
2  0      100  per year     60
1  1      200  per year    400
2  1      200  per year    890
1  2      300  per year    320
2  2      300  per year    855
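If you want the exact layout from the question (no order level, rows grouped by id), here's a small follow-up; it re-runs the call above and assigns it to a variable named out (a name chosen here just for illustration):

out = pd.wide_to_long(df, stubnames=['pd', 'pd_dt', 'pd_tp'],
                      i='id', j='order', sep='.')
out = (out.reset_index()
          .sort_values(['id', 'order'])
          .drop(columns='order')
          .reset_index(drop=True))
print(out)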
You can use numpy split to cut the value columns into n arrays and concatenate them back together vertically, then repeat the id column to match the number of rows in the new dataframe.
import numpy as np

# split the value columns into groups of 3 and stack the groups vertically
new_df = pd.DataFrame(np.concatenate(
    np.split(df.iloc[:, 1:].values, (df.shape[1] - 1) // 3, axis=1)))
new_df.columns = ['pd', 'pd_dt', 'pd_tp']
# repeat the id column once per group of len(df) rows (not a hardcoded 2)
new_df['id'] = pd.concat([df.id] * (new_df.shape[0] // len(df)), ignore_index=True)
new_df.sort_values('id')
Result:
    pd     pd_dt  pd_tp  id
0  100  per year    468   1
2  200  per year    400   1
4  300  per year    320   1
1  100  per year     60   2
3  200  per year    890   2
5  300  per year    855   2
You can do this:
# set id as the index first, so it is not stacked along with the value columns
df = df.set_index('id')
dt_mask = df.columns.str.contains('dt')
tp_mask = df.columns.str.contains('tp')
new_df = pd.DataFrame()
new_df['pd'] = df[df.columns[~(dt_mask | tp_mask)]].stack().reset_index(level=1, drop=True)
new_df['pd_dt'] = df[df.columns[dt_mask]].stack().reset_index(level=1, drop=True)
new_df['pd_tp'] = df[df.columns[tp_mask]].stack().reset_index(level=1, drop=True)
new_df.reset_index(inplace=True)
print(new_df)
   id   pd     pd_dt  pd_tp
0   1  100  per year    468
1   1  200  per year    400
2   1  300  per year    320
3   2  100  per year     60
4   2  200  per year    890
5   2  300  per year    855

pandas create a flag when merging two dataframes

I have two dataframes, df_a and df_b:
# df_a
 number  cur
   1000  USD
   2000  USD
   3000  USD
# df_b
 number  amount  deletion
   1000     0.0         L
   1000    10.0         X
   1000    10.0         X
   2000    20.0         X
   2000    20.0         X
   3000     0.0         L
   3000     0.0         L
I want to left merge df_a with df_b,
df_a = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df_a.fillna(value={'amount':0}, inplace=True)
but I also want to create a flag called deleted in the resulting df_a, with three possible values: full, partial and none;
full - all rows associated with a particular number value have deletion = L;
partial - some rows associated with a particular number value have deletion = L;
none - no rows associated with a particular number value have deletion = L.
Also, when doing the merge, rows from df_b with deletion = L should not be considered, so the result looks like this:
 number  amount  deletion  deleted  cur
   1000    10.0         X  partial  USD
   1000    10.0         X  partial  USD
   2000    20.0         X     none  USD
   2000    20.0         X     none  USD
   3000     0.0       NaN     full  USD
I am wondering how to achieve that.
The idea is to compare the deletion column to 'L', aggregate per number with all and
any, build a helper dictionary, and finally map it onto the new column:
g = df_b['deletion'].eq('L').groupby(df_b['number'])
m1 = g.any()
m2 = g.all()
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
# join dictionaries together
d = {**d1, **d2}
print (d)
{1000: 'partial', 3000: 'full'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d).fillna('none')
print (df)
   number  cur  amount deletion  deleted
0    1000  USD    10.0        X  partial
1    1000  USD    10.0        X  partial
2    2000  USD    20.0        X     none
3    2000  USD    20.0        X     none
4    3000  USD     NaN      NaN     full
If you want the none value to be explicit, create a dictionary for it as well:
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
d3 = dict.fromkeys(m2.index[~m1], 'none')
d = {**d1, **d2, **d3}
print (d)
{1000: 'partial', 3000: 'full', 2000: 'none'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d)
print (df)
   number  cur  amount deletion  deleted
0    1000  USD    10.0        X  partial
1    1000  USD    10.0        X  partial
2    2000  USD    20.0        X     none
3    2000  USD    20.0        X     none
4    3000  USD     NaN      NaN     full
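For reference, the same all/any idea can be written more compactly with numpy.select; a hedged sketch of that alternative formulation (not the code above):

import numpy as np

is_l = df_b['deletion'].eq('L')
m_any = is_l.groupby(df_b['number']).any()
m_all = is_l.groupby(df_b['number']).all()
# first matching condition wins: all 'L' -> full, some 'L' -> partial, else none
flags = pd.Series(np.select([m_all, m_any], ['full', 'partial'], default='none'),
                  index=m_any.index)

out = df_a.merge(df_b.loc[df_b['deletion'] != 'L'], how='left', on='number')
out['deleted'] = out['number'].map(flags)
print(out)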
