Cannot relate new groupby dataframe to original dataframe - python-3.x

I have a dataframe called differ_vt_outer that I want to group by the columns 'CO_FIPS' and 'FLD_ZONE' and summarize:
FLD_ZONE AWPOP_1 Area_SQMI_1 AWHU_1 AWPOP_2 Area_SQMI_2 AWHU_2 CO_FIPS
0 A 18.1 23.1 101.3 3.0 23.1 3.1 50001
1 AE 6.7 13.5 58.6 0.03 13.5 4.8 50001
2 N 1.3 1.2 23.0 7.1 1.2 8.3 50001
3 X 0.0 38.5 0.0 0.0 38.5 0.0 50001
4 X500 4.6 44.5 4.8 4.8 44.5 2.1 50001
I create a new dataframe for the grouped and summarized data:
vt_sum = differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE']).agg({
    'AWPOP_1': 'sum', 'Area_SQMI_1': 'sum', 'AWHU_1': 'sum',
    'AWPOP_2': 'sum', 'Area_SQMI_2': 'sum', 'AWHU_2': 'sum',
})
The new dataframe looks something like this:
vt_sum.head()
AWPOP_1 Area_SQMI_1 AWHU_1 AWPOP_2 Area_SQMI_2 AWHU_2
CO_FIPS FLD_ZONE
50001 A 2335.8 79.7 1095.1 2334.0 79.7 1094.1
AE 2134.5 74.1 1179.5 2134.5 74.1 1179.5
N 96.8 0.2 13.1 94.0 0.2 11.7
X 68119.7 1333.2 30623.9 68115.5 1333.2 30621.9
X500 339.2 4.4 149.8 339.2 4.4 149.8
50003 A 1006.9 4.8 542.7 1006.9 4.8 542.7
AE 2441.6 2.3 1265.0 2441.6 2.3 1265.0
AO 3.1 0.0 3.5 3.1 0.0 3.5
X 34896.6 700.4 20075.2 34896.6 700.4 20075.2
Now, I want to relate the summarized dataframe back to the original differ_vt_outer dataframe and create new columns based on the summarized values. For example, for CO_FIPS = 50001 and FLD_ZONE = A, I want to add a column called Tot_AWPOP_1 that has a value of 2335.8.
differ_vt_outer['Tot_AWPOP_1'] = vt_sum['AWPOP_1'].values
However, when I run this, I get the error:
ValueError: Length of values does not match length of index
How can I resolve this?

You can use transform instead of agg after the groupby and join the result to the original dataframe after applying add_prefix to the column names, try:
list_col_sum = ['AWPOP_1', 'Area_SQMI_1', 'AWHU_1', 'AWPOP_2', 'Area_SQMI_2', 'AWHU_2']
differ_vt_outer = differ_vt_outer.join(
    differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE'])[list_col_sum]
                   .transform('sum')
                   .add_prefix('Tot_')
)
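Alternatively, if you want to keep the vt_sum aggregate you already built, a minimal sketch (assuming the frames shown above) that merges the totals back onto the original frame on the group keys:
# prefix the aggregated columns, turn the group keys back into columns, then merge
totals = vt_sum.add_prefix('Tot_').reset_index()
differ_vt_outer = differ_vt_outer.merge(totals, on=['CO_FIPS', 'FLD_ZONE'], how='left')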

Related

how to get quartiles and classify a value according to this quartile range

I have this df:
d = pd.DataFrame({'Name': ['Andres', 'Lars', 'Paul', 'Mike'],
                  'target': ['A', 'A', 'B', 'C'],
                  'number': [10, 12.3, 11, 6]})
And I want to classify each number into a quartile. I am doing this:
(d.groupby(['Name', 'target', 'number'])['number']
   .quantile([0.25, 0.5, 0.75, 1]).unstack()
   .reset_index()
   .rename(columns={0.25: "1Q", 0.5: "2Q", 0.75: "3Q", 1: "4Q"})
)
But as you can see, the four quartiles are all equal, because the code above calculates per row, and with only one number per row all quartiles are equal.
If I run instead:
d['number'].quantile([0.25,0.5,0.75,1])
Then I have the 4 quartiles I am looking for:
0.25 9.000
0.50 10.500
0.75 11.325
1.00 12.300
What I need as output (showing only the first 2 rows):
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.30 1
1 Lars A 12.3 9.0 10.5 11.325 12.30 4
you can see all quartiles have the values computed over all values in the number column. Besides that, there is now a column named Rank that classifies each number according to its quartile, e.g. in the first row 10 is within the 1st quartile.
Here's one way that builds on the quantiles you've created by making them a DataFrame and joining it to d. It also assigns the "Rank" column using the rank method:
out = (d.join(d['number'].quantile([0.25, 0.5, 0.75, 1])
              .set_axis([f'{i}Q' for i in range(1, 5)], axis=0)
              .to_frame().T
              .pipe(lambda x: x.loc[x.index.repeat(len(d))])
              .reset_index(drop=True))
         .assign(Rank=d['number'].rank(method='dense')))
Output:
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.3 2.0
1 Lars A 12.3 9.0 10.5 11.325 12.3 4.0
2 Paul B 11.0 9.0 10.5 11.325 12.3 3.0
3 Mike C 6.0 9.0 10.5 11.325 12.3 1.0
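As a side note: if Rank is meant to be the quartile bucket (1 through 4) rather than an ordinal rank, a sketch using pd.qcut (an assumption about the intended semantics, which happens to give the same values on this data):
# cut 'number' into 4 equal-frequency bins labelled 1-4
d['Rank'] = pd.qcut(d['number'], q=4, labels=[1, 2, 3, 4])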

groupby with totals/subtotals

Say I have the following dataframe
Strategy AssetClass Symbol Value Indicator
Strat1 OPT OPT_ABC1 50 -0.3
Strat1 OPT OPT_ABC2 50 1.5
Strat1 STK STK_ABC 50 2.7
Strat2 STK STK_XYZ 70 -3.8
Strat3 OPT OPT_MNO 25 10
I would like to produce the following:
Strategy AssetClass Symbol Value Indicator
Strat1 3.9
OPT 1.2
OPT_ABC1 50 -0.3
OPT_ABC2 50 1.5
STK 2.7
STK_ABC 50 2.7
Strat2 -3.8
STK -3.8
STK_XYZ 70 -3.8
Strat3 10
OPT 10
OPT_MNO 25 10
So the idea is to rearrange the data with totals per Strategy, then AssetClass and then per Symbol. The column "Value" is available at the Symbol level, while the column "Indicator" is the sum of the subgroup.
I thought of using pd.pivot_table, but it doesn't seem to produce the totals/subtotals I am looking for. I think I should loop over pd.groupby on Strategy, then over another groupby on Strategy/AssetClass, and then over a groupby on Strategy/AssetClass/Symbol.
With df being the dataframe above, I did this:
container = []
for label, _df in df.groupby(['Strategy', 'AssetClass', 'Symbol']):
    _df.loc[f'{label}'] = _df[['Indicator']].sum()
    container.append(_df)
df_res = pd.concat(container)
print(df_res.fillna(''))
My problem is that the subtotal is inserted after the corresponding rows and the label is used as the index. Besides, I can't figure out an easy/pythonic way of adding the other loops (i.e. subtotals).
You can aggregate by different sets of columns, so for performance it is better not to use a nested groupby.apply but rather multiple aggregations; then join them together with concat, change the order of columns with DataFrame.reindex, and finally sort by the first two columns:
df1 = df.groupby(['Strategy', 'AssetClass', 'Symbol'], as_index=False).sum()
df2 = (df1.groupby(['Strategy', 'AssetClass'], as_index=False)['Indicator'].sum()
          .assign(Symbol=''))
df3 = (df1.groupby('Strategy', as_index=False)['Indicator'].sum()
          .assign(AssetClass=''))
df = (pd.concat([df3, df2, df1])
        .reindex(df.columns, axis=1)
        .fillna('')
        .sort_values(['Strategy', 'AssetClass'], ignore_index=True))
print(df)
Strategy AssetClass Symbol Value Indicator
0 Strat1 3.9
1 Strat1 OPT 1.2
2 Strat1 OPT OPT_ABC1 50.0 -0.3
3 Strat1 OPT OPT_ABC2 50.0 1.5
4 Strat1 STK 2.7
5 Strat1 STK STK_ABC 50.0 2.7
6 Strat2 -3.8
7 Strat2 STK -3.8
8 Strat2 STK STK_XYZ 70.0 -3.8
9 Strat3 10.0
10 Strat3 OPT 10.0
11 Strat3 OPT OPT_MNO 25.0 10.0
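If you need this for more grouping levels, a hypothetical helper (with_subtotals is my naming, not from the answer above, and it hard-codes the Indicator column) that generalizes the same pattern; sorting on the first two levels is stable across multiple columns, so each subtotal stays above its detail rows:
def with_subtotals(df, levels):
    pieces = []
    for i in range(1, len(levels)):
        # subtotal over the first i levels; the uncovered levels stay blank
        agg = df.groupby(levels[:i], as_index=False)['Indicator'].sum()
        for lvl in levels[i:]:
            agg[lvl] = ''
        pieces.append(agg)
    # leaf rows: full aggregation over all grouping levels
    pieces.append(df.groupby(levels, as_index=False).sum(numeric_only=True))
    return (pd.concat(pieces)
              .reindex(df.columns, axis=1)
              .fillna('')
              .sort_values(levels[:2], ignore_index=True))
df_res = with_subtotals(df, ['Strategy', 'AssetClass', 'Symbol'])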

python pandas data frame: single column to multiple columns based on values

I am new to pandas.
I am trying to split a single column into multiple columns based on an index value using groupby. Below is the program I wrote:
import pandas as pd

data = [(0, 1.1),
        (1, 1.2),
        (2, 1.3),
        (0, 2.1),
        (1, 2.2),
        (0, 3.1),
        (1, 3.2),
        (2, 3.3),
        (3, 3.4)]
df = pd.DataFrame(data, columns=['ID', 'test_data'])
df = df.groupby('ID', sort=True).apply(lambda g: pd.Series(g['test_data'].values))
print(df)
df = df.unstack(level=-1).rename(columns=lambda x: 'test_data%s' % x)
print(df)
I have to use unstack(level=-1) because when the groups have uneven sizes, the groupby and Series store the result as shown below:
ID
0 0 1.1
1 2.1
2 3.1
1 0 1.2
1 2.2
2 3.2
2 0 1.3
1 3.3
3 0 3.4
dtype: float64
The end result I am getting after unstack is like below:
test_data0 test_data1 test_data2
ID
0 1.1 2.1 3.1
1 1.2 2.2 3.2
2 1.3 3.3 NaN
3 3.4 NaN NaN
but what I am expecting is
test_data0 test_data1 test_data2
ID
0 1.1 2.1 3.1
1 1.2 2.2 3.2
2 1.3 NAN 3.3
3 NAN NAN 3.4
Let me know if there is any better way to do this other than groupby.
This will work if your dataframe is sorted as you show:
df['num_zeros_seen'] = df['ID'].eq(0).cumsum()
# reshape the table so each run of IDs becomes its own column
df = df.pivot(
    index='ID',
    columns='num_zeros_seen',
    values='test_data',
)
print(df)
Output:
num_zeros_seen 1 2 3
ID
0 1.1 2.1 3.1
1 1.2 2.2 3.2
2 1.3 NaN 3.3
3 NaN NaN 3.4
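If you also need the test_dataN column names from the question, one more step (a sketch, assuming the pivoted frame above):
# columns 1..3 become test_data0..test_data2; drop the leftover axis name
df = df.rename(columns=lambda c: f'test_data{c - 1}').rename_axis(None, axis=1)
print(df)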

Removing outliers based on column variables or multi-index in a dataframe

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.). Not the whole row, just the cell. And I would like to do it for each of the 'red', 'yellow', 'green' columns.
Is there a way to do this without breaking the dataframe into a whole bunch of sub-dataframes with all of the conditions broken out separately? I'm not sure whether this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make it work.
If you need to replace outliers with missing values, use GroupBy.transform with quantile, then compare against the lower and upper bounds with DataFrame.lt and DataFrame.gt, chain the masks with | (bitwise OR), and set missing values with DataFrame.mask (NaN is the default replacement, so it is not specified):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print(df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]
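Since the question frames this as an IQR problem, here is the same pattern with 1.5×IQR fences instead of the fixed 5%/95% cutoffs, as a sketch (an assumption about the desired rule, reusing g and c from above):
q1 = g.transform('quantile', 0.25)
q3 = g.transform('quantile', 0.75)
iqr = q3 - q1
# per group, values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] become NaN
df[c] = df[c].mask(df[c].lt(q1 - 1.5 * iqr) | df[c].gt(q3 + 1.5 * iqr))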

How to plot multiple charts using matplotlib from unstacked dataframe with Pandas

This is a sample of the dataset I have, produced using the following piece of code:
ComplaintCity = nyc_df.groupby(['City','Complaint Type']).size().sort_values().unstack()
top5CitiesByComplaints = ComplaintCity[top5Complaints].rename_axis(None, axis=1)
top5CitiesByComplaints
Blocked Driveway Illegal Parking Noise - Street/Sidewalk Noise - Commercial Derelict Vehicle
City
ARVERNE 35.0 58.0 29.0 2.0 27.0
ASTORIA 2734.0 1281.0 500.0 1554.0 363.0
BAYSIDE 377.0 514.0 15.0 40.0 198.0
BELLEROSE 95.0 106.0 13.0 37.0 89.0
BREEZY POINT 3.0 15.0 1.0 4.0 3.0
BRONX 12754.0 7859.0 8890.0 2433.0 1952.0
BROOKLYN 28147.0 27461.0 13354.0 11458.0 5179.0
CAMBRIA HEIGHTS 147.0 76.0 25.0 12.0 115.0
CENTRAL PARK NaN 2.0 95.0 NaN NaN
COLLEGE POINT 435.0 352.0 33.0 35.0 184.0
CORONA 2761.0 660.0 238.0 248.0
I want to be able to plot this data as a horizontal bar chart for each complaint. It should display the cities with the highest count of complaints, something similar to the image below. I am not sure how to go about it.
You can create a list of axis instances with subplots and plot the columns one-by-one:
import matplotlib.pyplot as plt

# df here is the unstacked frame built above (top5CitiesByComplaints)
fig, axes = plt.subplots(3, 2, figsize=(10, 6))
for c, ax in zip(df.columns, axes.ravel()):
    df[c].sort_values().plot.barh(ax=ax)
fig.tight_layout()
Then you would get a grid of horizontal bar charts, one per complaint type.
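If you specifically want only the cities with the highest counts in each chart (as the question asks), a hedged variation using Series.nlargest:
# keep only the 10 highest-count cities per complaint type (10 is an arbitrary choice)
fig, axes = plt.subplots(3, 2, figsize=(10, 6))
for c, ax in zip(df.columns, axes.ravel()):
    df[c].nlargest(10).sort_values().plot.barh(ax=ax)
fig.tight_layout()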
