groupby with totals/subtotals - python-3.x

Say I have the following dataframe
Strategy AssetClass    Symbol  Value  Indicator
Strat1          OPT  OPT_ABC1     50       -0.3
Strat1          OPT  OPT_ABC2     50        1.5
Strat1          STK   STK_ABC     50        2.7
Strat2          STK   STK_XYZ     70       -3.8
Strat3          OPT   OPT_MNO     25         10
I would like to produce the following:
Strategy AssetClass    Symbol  Value  Indicator
Strat1                                      3.9
              OPT                           1.2
                     OPT_ABC1     50       -0.3
                     OPT_ABC2     50        1.5
              STK                           2.7
                      STK_ABC     50        2.7
Strat2                                     -3.8
              STK                          -3.8
                      STK_XYZ     70       -3.8
Strat3                                       10
              OPT                            10
                      OPT_MNO     25         10
So the idea is to rearrange the data with a total per Strategy, a subtotal per AssetClass, and then the detail per Symbol. The column "Value" is only available at the Symbol level, while "Indicator" is the sum over each subgroup.
I thought of using pd.pivot_table, but it doesn't seem to produce the totals/subtotals I am looking for. I think I should loop over a pd.groupby on Strategy, then loop over another groupby on Strategy/AssetClass, and then over a groupby on Strategy/AssetClass/Symbol.
With df being the dataframe above, I did this:
container = []
for label, _df in df.groupby(['Strategy', 'AssetClass', 'Symbol']):
    _df.loc[f'{label}'] = _df[['Indicator']].sum()
    container.append(_df)
df_res = pd.concat(container)
print(df_res.fillna(''))
My problem is that the subtotal is inserted after the corresponding rows and the label is used as the index. Besides, I can't figure out an easy/pythonic way of adding the other loops (i.e. the subtotals).

You can aggregate by different sets of columns, so for performance it is better not to use a nested groupby.apply but rather multiple aggregations: join them together with concat, restore the column order with DataFrame.reindex, and finally sort by the first two columns:
df1 = df.groupby(['Strategy', 'AssetClass', 'Symbol'], as_index=False).sum()
df2 = (df1.groupby(['Strategy', 'AssetClass'], as_index=False)['Indicator'].sum()
          .assign(Symbol=''))
df3 = (df1.groupby('Strategy', as_index=False)['Indicator'].sum()
          .assign(AssetClass=''))
df = (pd.concat([df3, df2, df1])
        .reindex(df.columns, axis=1)
        .fillna('')
        .sort_values(['Strategy', 'AssetClass'], ignore_index=True))
print(df)
   Strategy AssetClass    Symbol  Value  Indicator
0    Strat1                                    3.9
1    Strat1        OPT                         1.2
2    Strat1        OPT  OPT_ABC1   50.0       -0.3
3    Strat1        OPT  OPT_ABC2   50.0        1.5
4    Strat1        STK                         2.7
5    Strat1        STK   STK_ABC   50.0        2.7
6    Strat2                                   -3.8
7    Strat2        STK                        -3.8
8    Strat2        STK   STK_XYZ   70.0       -3.8
9    Strat3                                   10.0
10   Strat3        OPT                        10.0
11   Strat3        OPT   OPT_MNO   25.0       10.0
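The same pattern generalizes to any number of grouping levels: one aggregation per key prefix, blank the finer keys, concat coarsest-first, and sort. A minimal sketch I'm adding (the helper name and the stable-sort choice are my own, not from the answer above):
import pandas as pd

def with_subtotals(df, keys, agg_col):
    parts = []
    # one subtotal frame per key prefix (coarsest first), blanking the finer keys
    for i in range(1, len(keys)):
        sub = df.groupby(keys[:i], as_index=False)[agg_col].sum()
        for k in keys[i:]:
            sub[k] = ''
        parts.append(sub)
    # detail rows: aggregate every numeric column per full key
    parts.append(df.groupby(keys, as_index=False).sum())
    # a stable sort keeps each subtotal directly above its detail rows
    return (pd.concat(parts)
              .reindex(df.columns, axis=1)
              .fillna('')
              .sort_values(keys[:-1], kind='stable', ignore_index=True))

df_res = with_subtotals(df, ['Strategy', 'AssetClass', 'Symbol'], 'Indicator')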

Related

How can I remove repeated values in a single column?

I have a dataframe like:
shops      prod_id  atv_y1
company_b        A    56.3
company_b        B     4.3
company_b        C   136.3
company_b        D    89.3
company_c        A     7.3
company_c        B    64.0
company_c        A    34.7
For the purpose of plotting, I would like to remove the repeated company_b/company_c values so that each one appears only the first time it is referenced, like below:
shops      prod_id  atv_y1
company_b        A    56.3
                 B     4.3
                 C   136.3
                 D    89.3
company_c        A     7.3
                 B    64.0
                 A    34.7
How can I do this in pandas?
You might be able to manage this within the plotting library itself, by the way. But if you really want the df transformed as you asked, you could try something like below.
It may not be the best way, but it does the job.
import numpy as np

shops = df.groupby('shops').first().reset_index()['shops']
for i in shops:
    # row positions for shop i (positions double as labels with a RangeIndex)
    l = np.where(df['shops'] == i)[0]
    if len(l) > 1:
        df.loc[l[1]:l[-1], 'shops'] = ''
print(df)
prints
       shops prod_id  atv_y1
0  company_b       A    56.3
1                  B     4.3
2                  C   136.3
3                  D    89.3
4  company_c       A     7.3
5                  B    64.0
6                  A    34.7
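A vectorized alternative (a sketch I'm adding, assuming each shop's rows are contiguous as in the example): compare every label with the one directly above it via Series.shift and blank the repeats.
# blank a label whenever it repeats the row directly above it
df['shops'] = df['shops'].where(df['shops'] != df['shops'].shift(), '')
print(df)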

Removing outliers based on column variables or multi-index in a dataframe

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.). Not the whole row, just the cell. And I would like to do it for each of the 'red', 'yellow', 'green' columns.
Is there a way to do this without breaking the dataframe into a whole bunch of sub-dataframes with all of the conditions broken out separately? I'm not sure whether this would be easier if 'Season' and 'Treatment' were handled as columns or as indices; I'm fine with either way.
I've tried a few things with .iloc and .loc, but I can't seem to make it work.
If you need to replace outliers with missing values, use GroupBy.transform with quantile, compare against the lower and upper bounds with DataFrame.lt and DataFrame.gt, chain the masks with | for bitwise OR, and set missing values with DataFrame.mask (NaN is the default replacement, so it is not specified):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]
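If you would rather cap outliers than blank them, the same per-group bounds work with DataFrame.clip (a sketch I'm adding, reusing df1, df2 and c from above; clipping is my substitution, not part of the answer):
# pull each outlier back to its group's 5%/95% bound instead of masking it
df[c] = df[c].clip(lower=df1[c], upper=df2[c])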

Cannot relate new groupby dataframe to original dataframe

I have a dataframe called differ_vt_outer that I wanted to group and summarize on the columns 'CO_FIPS' and 'FLD_ZONE':
FLD_ZONE AWPOP_1 Area_SQMI_1 AWHU_1 AWPOP_2 Area_SQMI_2 AWHU_2 CO_FIPS
0 A 18.1 23.1 101.3 3.0 23.1 3.1 50001
1 AE 6.7 13.5 58.6 0.03 13.5 4.8 50001
2 N 1.3 1.2 23.0 7.1 1.2 8.3 50001
3 X 0.0 38.5 0.0 0.0 38.5 0.0 50001
4 X500 4.6 44.5 4.8 4.8 44.5 2.1 50001
I create a new dataframe for the grouped and summarized data:
vt_sum = differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE']).agg({'AWPOP_1': 'sum', 'Area_SQMI_1': 'sum', 'AWHU_1': 'sum', 'AWPOP_2': 'sum', 'Area_SQMI_2': 'sum', 'AWHU_2': 'sum'})
The new dataframe looks something like this:
vt_sum.head()
                  AWPOP_1  Area_SQMI_1   AWHU_1  AWPOP_2  Area_SQMI_2   AWHU_2
CO_FIPS FLD_ZONE
50001   A          2335.8         79.7   1095.1   2334.0         79.7   1094.1
        AE         2134.5         74.1   1179.5   2134.5         74.1   1179.5
        N            96.8          0.2     13.1     94.0          0.2     11.7
        X         68119.7       1333.2  30623.9  68115.5       1333.2  30621.9
        X500        339.2          4.4    149.8    339.2          4.4    149.8
50003   A          1006.9          4.8    542.7   1006.9          4.8    542.7
        AE         2441.6          2.3   1265.0   2441.6          2.3   1265.0
        AO            3.1          0.0      3.5      3.1          0.0      3.5
        X         34896.6        700.4  20075.2  34896.6        700.4  20075.2
Now, I want to relate the summarized dataframe back to the original differ_vt_outer dataframe and create new columns based on the summarized values. For example, for CO_FIPS = 50001 and FLD_ZONE = A, I want to add a column called Tot_AWPOP_1 that has a value of 2335.8.
differ_vt_outer['Tot_AWPOP_1'] = vt_sum['AWPOP_1'].values
However, when I run this, I get the error:
ValueError: Length of values does not match length of index
How can I resolve this?
You can use transform instead of agg after the groupby, and join the result to the original dataframe after applying add_prefix to the column names; try:
list_col_sum = ['AWPOP_1', 'Area_SQMI_1', 'AWHU_1', 'AWPOP_2', 'Area_SQMI_2', 'AWHU_2']
differ_vt_outer = differ_vt_outer.join(
    differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE'])[list_col_sum]
                   .transform('sum')
                   .add_prefix('Tot_'))
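An equivalent route (a sketch I'm adding, not part of the answer) is to aggregate once, prefix the column names, and merge back on the two keys; every row then carries its group total, e.g. Tot_AWPOP_1 = 2335.8 for CO_FIPS 50001 / FLD_ZONE A:
# aggregate once per (CO_FIPS, FLD_ZONE), then broadcast back via merge
tot = (differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE'], as_index=False)[list_col_sum]
                      .sum()
                      .rename(columns={c: 'Tot_' + c for c in list_col_sum}))
differ_vt_outer = differ_vt_outer.merge(tot, on=['CO_FIPS', 'FLD_ZONE'], how='left')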

Pandas rolling mean: don't change numbers to NaN in DataFrame

I'm working with a pandas DataFrame which looks like this:
(N.B. the offset is set as the index of the DataFrame)
offset X Y Z
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.126465 -1.926758 -0.414062
50 -0.137207 -1.916992 -0.404297
60 -0.130371 -3.784591 -0.987654
70 -0.125000 -1.918457 -0.403809
80 -0.123456 -1.917480 -0.413574
90 -0.126465 -1.926758 -0.333554
I have applied a rolling mean with window size = 5 to the DataFrame using the following code.
I need to keep this window size of 5, and I need values for the whole dataframe, for all of the offset values (no NaNs).
df = df.rolling(center=False, window=5).mean()
Which gives me:
offset X Y Z
0.0 NaN NaN NaN
10.0 NaN NaN NaN
20.0 NaN NaN NaN
30.0 NaN NaN NaN
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
I would like the DataFrame to keep the original values for the first rows (where the rolling mean is NaN) and have the rest of the values be the result of the rolling mean. Is there a simple way that I would be able to do this? Thanks
i.e.
offset X Y Z
0.0 -0.140137 -1.924316 -0.426758
10.0 -2.789123 -1.111212 -0.416016
20.0 -0.133789 -1.923828 -4.408691
30.0 -0.101112 -1.457891 -0.425781
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
You can fill with the original df:
df.rolling(center=False, window=5).mean().fillna(df)
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
There is also an argument, min_periods, that you can use. If you pass min_periods=1, it will take the first value as it is, the second value as the mean of the first two, etc. It might make more sense in some cases.
df.rolling(center=False, window=5, min_periods=1).mean()
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -1.464630 -1.517764 -0.421387
20 -1.021016 -1.653119 -1.750488
30 -0.791040 -1.604312 -1.419311
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
Assuming you don't have other rows that are all NaNs, you can identify which rows are all NaN in your rolling df, and replace them with the corresponding rows from the original. Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(13, 5))
df_rolling = df.rolling(center=False, window=5).mean()
# identify which rows are all NaN
idx = df_rolling.index[df_rolling.isnull().all(1)]
# replace those rows with the original data
df_rolling.loc[idx, :] = df.loc[idx, :]

Pandas Pivot and Summarize For Multiple Rows Vertically

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'x': [1, 1, 0, 1, 0, 0],
                   'y': [1, np.nan, 0, 1, 1, 0]})
df
Site y x
0 a 1.0 1
1 a NaN 1
2 a 0.0 0
3 b 1.0 1
4 b 1.0 0
5 b 0.0 0
I am looking for the most efficient way, for each numerical column (y and x), to produce a percent per group, label the column name, and stack them in one column.
Here's how I accomplish this for 'y':
df=df.loc[~np.isnan(df['y'])] #do not count non-numbers
t=pd.pivot_table(df,index='Site',values='y',aggfunc=[np.sum,len])
t['Item']='y'
t['Perc']=round(t['sum']/t['len']*100,1)
t
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
Now all I need is a way to add 2 more rows to this; the results for 'x' if I had pivoted with its values above, like this:
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1 2 x 50.0
b 1 3 x 33.3
In reality, I have 48 such numerical data columns that need to be stacked as such.
Thanks in advance!
First you can use notnull. Then omit the values parameter in pivot_table, stack, and sort_values by the new column Item. Last you can use the pandas round function:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1': 'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum'] / t['len'] * 100).round(1)
# reorder columns
t = t[['sum', 'len', 'Item', 'Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
Another solution, if it is necessary to define the values columns in pivot_table:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', values=['y', 'x'], aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1': 'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum'] / t['len'] * 100).round(1)
# reorder columns
t = t[['sum', 'len', 'Item', 'Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
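With 48 value columns, a melt-based variant may scale better (a sketch I'm adding, not from the answers); it also drops NaNs per cell rather than per row, so a missing y no longer discards the x value on the same row:
# long format: one row per (Site, Item, value); NaNs dropped per cell
long = df.melt(id_vars='Site', var_name='Item').dropna(subset=['value'])
t = long.groupby(['Site', 'Item'])['value'].agg(['sum', 'count'])
t['Perc'] = (t['sum'] / t['count'] * 100).round(1)
print(t.reset_index())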
