How do I conditionally aggregate values in the projection part of a pandas query? - python-3.x

I currently have a csv file with this content:
ID PRODUCT_ID NAME STOCK SELL_COUNT DELIVERED_BY
1 P1 PRODUCT_P1 12 15 UPS
2 P2 PRODUCT_P2 4 3 DHL
3 P3 PRODUCT_P3 120 22 DHL
4 P1 PRODUCT_P1 423 18 UPS
5 P2 PRODUCT_P2 0 5 GLS
6 P3 PRODUCT_P3 53 10 DHL
7 P4 PRODUCT_P4 22 0 UPS
8 P1 PRODUCT_P1 94 56 GLS
9 P1 PRODUCT_P1 9 24 GLS
When I execute this SQL query:
SELECT
    PRODUCT_ID,
    MIN(CASE WHEN DELIVERED_BY = 'UPS' THEN STOCK END) as STOCK,
    SUM(CASE WHEN ID > 6 THEN SELL_COUNT END) as TOTAL_SELL_COUNT,
    SUM(CASE WHEN SELL_COUNT * 100 > 1000 THEN SELL_COUNT END) as COND_SELL_COUNT
FROM products
GROUP BY PRODUCT_ID;
I get the desired result:
PRODUCT_ID STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
P1 12 80 113
P2 null null null
P3 null null 22
P4 22 0 null
Now I'm trying to somehow get the same result on that dataset using pandas, and that's what I'm struggling with.
I imported the csv file into a DataFrame called df_products.
Then I tried this:
import numpy as np
import pandas as pd

def custom_aggregate(grouped):
    data = {
        'STOCK': np.where(grouped['DELIVERED_BY'] == 'UPS', grouped['STOCK'].min(), np.nan)  # [grouped['STOCK'].min() if grouped['DELIVERED_BY'] == 'UPS' else None]
    }
    d_series = pd.Series(data)
    return d_series
result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
print(result)
As you can see, I'm nowhere near the expected result, as I'm already having problems getting the conditional STOCK aggregation to work depending on the DELIVERED_BY values.
This outputs:
STOCK
PRODUCT_ID
P1 [9.0, 9.0, nan, nan]
P2 [nan, nan]
P3 [nan, nan]
P4 [22.0]
which is not even in the correct format, but I'd be happy if I could get the expected 12.0 instead of 9.0 for P1.
Thanks
I just wanted to add that I got close to the result by creating additional columns:
df_products['COND_STOCK'] = df_products[df_products['DELIVERED_BY'] == 'UPS']['STOCK']
df_products['SELL_COUNT_ID_GT6'] = df_products[df_products['ID'] > 6]['SELL_COUNT']
df_products['SELL_COUNT_GT1000'] = df_products[(df_products['SELL_COUNT'] * 100) > 1000]['SELL_COUNT']
The function would then look like this:
def custom_aggregate(grouped):
    data = {
        'STOCK': grouped['COND_STOCK'].min(),
        'TOTAL_SELL_COUNT': grouped['SELL_COUNT_ID_GT6'].sum(),
        'COND_SELL_COUNT': grouped['SELL_COUNT_GT1000'].sum(),
    }
    d_series = pd.Series(data)
    return d_series
result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
This is the 'almost' desired result:
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN 0.0 0.0
P3 NaN 0.0 22.0
P4 22.0 0.0 0.0

Usually we can write this in pandas as below. Note that sum(min_count=1) makes an all-NaN sum return NaN instead of 0, matching the SQL NULLs, and that SELL_COUNT * 100 > 1000 is equivalent to SELL_COUNT > 10:
df.groupby('PRODUCT_ID').apply(lambda x: pd.Series({
    'STOCK': x.loc[x.DELIVERED_BY == 'UPS', 'STOCK'].min(),
    'TOTAL_SELL_COUNT': x.loc[x.ID > 6, 'SELL_COUNT'].sum(min_count=1),
    'COND_SELL_COUNT': x.loc[x.SELL_COUNT > 10, 'SELL_COUNT'].sum(min_count=1)}))
Out[105]:
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN NaN NaN
P3 NaN NaN 22.0
P4 22.0 0.0 NaN
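An equivalent approach, as a sketch assuming the same df_products frame and pandas imported as pd: mask the columns first with Series.where, then use named aggregation instead of groupby.apply, which is usually faster:
out = (df_products
       .assign(STOCK=df_products['STOCK'].where(df_products['DELIVERED_BY'] == 'UPS'),
               TOTAL_SELL_COUNT=df_products['SELL_COUNT'].where(df_products['ID'] > 6),
               COND_SELL_COUNT=df_products['SELL_COUNT'].where(df_products['SELL_COUNT'] > 10))
       .groupby('PRODUCT_ID')
       .agg(STOCK=('STOCK', 'min'),
            TOTAL_SELL_COUNT=('TOTAL_SELL_COUNT', lambda s: s.sum(min_count=1)),
            COND_SELL_COUNT=('COND_SELL_COUNT', lambda s: s.sum(min_count=1))))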

Related

Create some features based on the average growth rate of y for the month over the past few years

Assuming we have dataset df (which can be downloaded from this link), I want to create some features based on the average growth rate of y for the month of the past several years, for example: y_agr_last2, y_agr_last3, y_agr_last4, etc.
The formula is: y_agr_lastN = (product over i = 1..N of (1 + y_shift(12*i)/100))^(1/N) - 1, where y_shift(12*i) is the value of y from 12*i months earlier.
For example, for September 2022, y_agr_last2 = ((1 + 3.85/100)*(1 + 1.81/100))^(1/2) - 1 and y_agr_last3 = ((1 + 3.85/100)*(1 + 1.81/100)*(1 + 1.6/100))^(1/3) - 1.
The code I use is as follows, which is relatively repetitive and trivial:
import math
df['y_shift12'] = df['y'].shift(12)
df['y_shift24'] = df['y'].shift(24)
df['y_shift36'] = df['y'].shift(36)
df['y_agr_last2'] = pow(((1+df['y_shift12']/100) * (1+df['y_shift24']/100)), 1/2) -1
df['y_agr_last3'] = pow(((1+df['y_shift12']/100) * (1+df['y_shift24']/100) * (1+df['y_shift36']/100)), 1/3) -1
df.drop(['y_shift12', 'y_shift24', 'y_shift36'], axis=1, inplace=True)
df
How can the desired result be achieved more concisely?
References:
Create some features based on the mean of y for the month over the past few years
Following is one way to generalise it:
import functools
import operator

num_yrs = 3
for n in range(1, num_yrs+1):
    df[f"y_shift{n*12}"] = df["y"].shift(n*12)
    df[f"y_agr_last{n}"] = pow(functools.reduce(operator.mul, [1+df[f"y_shift{i*12}"]/100 for i in range(1, n+1)], 1), 1/n) - 1
df = df.drop(["y_agr_last1"] + [f"y_shift{n*12}" for n in range(1, num_yrs+1)], axis=1)
Output:
date y x1 x2 y_agr_last2 y_agr_last3
0 2018/1/31 -13.80 1.943216 3.135839 NaN NaN
1 2018/2/28 -14.50 0.732108 0.375121 NaN NaN
...
22 2019/11/30 4.00 -0.273262 -0.021146 NaN NaN
23 2019/12/31 7.60 1.538851 1.903968 NaN NaN
24 2020/1/31 -11.34 2.858537 3.268478 -0.077615 NaN
25 2020/2/29 -34.20 -1.246915 -0.883807 -0.249940 NaN
26 2020/3/31 46.50 -4.213756 -4.670146 0.221816 NaN
...
33 2020/10/31 -1.00 1.967062 1.860070 -0.035569 NaN
34 2020/11/30 12.99 2.302166 2.092842 0.041998 NaN
35 2020/12/31 5.54 3.814303 5.611199 0.030017 NaN
36 2021/1/31 -6.41 4.205601 4.948924 -0.064546 -0.089701
37 2021/2/28 -22.38 4.185913 3.569100 -0.342000 -0.281975
38 2021/3/31 17.64 5.370519 3.130884 0.465000 0.298025
...
54 2022/7/31 0.80 -6.259455 -6.716896 0.057217 0.052793
55 2022/8/31 -5.30 1.302754 1.412277 0.015121 -0.000492
56 2022/9/30 NaN -2.876968 -3.785964 0.028249 0.024150
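A hedged alternative sketch, assuming the same df with a y column and pandas imported as pd: collect the shifted growth factors once with pd.concat, then take a per-row geometric mean:
num_yrs = 3
shifted = pd.concat({n: df["y"].shift(12*n) for n in range(1, num_yrs+1)}, axis=1)
growth = 1 + shifted/100
for n in range(2, num_yrs+1):
    # geometric mean of the last n year-over-year growth factors, minus 1;
    # skipna=False keeps NaN whenever any of the n lags is missing
    df[f"y_agr_last{n}"] = growth.iloc[:, :n].prod(axis=1, skipna=False)**(1/n) - 1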

Boolean indexing in pandas dataframes

I'm trying to apply boolean indexing to a pandas DataFrame.
nm - stores the names of players
ag- stores the player ages
sc - stores the scores
capt - stores boolean index values
import pandas as pd
nm=pd.Series(['p1','p2', 'p3', 'p4'])
ag=pd.Series([12,17,14, 19])
sc=pd.Series([120, 130, 150, 100])
capt=[True, False, True, True]
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc}, index=capt)
print(Cricket)
Output:
Name Age Score
True NaN NaN NaN
False NaN NaN NaN
True NaN NaN NaN
True NaN NaN NaN
Whenever I run the code above, I get a DataFrame filled with NaN values. The only case in which this seems to work is when capt doesn't have repeating elements.
i.e. when capt=[False, True] (and reasonable values are given for nm, ag and sc) this code works as expected.
I'm running Python 3.8.5, pandas 1.1.1. Is this deprecated functionality?
Desired output:
Name Age Score
True p1 12 120
False p2 17 130
True p3 14 150
True p4 19 100
Set index values for each Series to avoid a mismatch between the default RangeIndex of each Series and the new index values from capt:
capt=[True, False, True, True]
nm=pd.Series(['p1','p2', 'p3', 'p4'], index=capt)
ag=pd.Series([12,17,14, 19], index=capt)
sc=pd.Series([120, 130, 150, 100], index=capt)
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc})
print(Cricket)
Name Age Score
True p1 12 120
False p2 17 130
True p3 14 150
True p4 19 100
Detail:
print(pd.Series(['p1','p2', 'p3', 'p4']))
0 p1
1 p2
2 p3
3 p4
dtype: object
print(pd.Series(['p1','p2', 'p3', 'p4'], index=capt))
True p1
False p2
True p3
True p4
dtype: object
Boolean indexing, by contrast, is filtering:
capt=[True, False, True, True]
nm=pd.Series(['p1','p2', 'p3', 'p4'])
ag=pd.Series([12,17,14, 19])
sc=pd.Series([120, 130, 150, 100])
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc})
print(Cricket)
Name Age Score
0 p1 12 120
1 p2 17 130
2 p3 14 150
3 p4 19 100
print (Cricket[capt])
Name Age Score
0 p1 12 120
2 p3 14 150
3 p4 19 100
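If the intent really is to use capt as row labels, another sketch (an assumption about the goal, not part of the original answer) is to strip the Series indexes with .to_numpy() so no alignment happens:
Cricket = pd.DataFrame({"Name": nm.to_numpy(), "Age": ag.to_numpy(), "Score": sc.to_numpy()}, index=capt)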

Removing outliers based on column variables or multi-index in a dataframe

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.), removing just the offending cell rather than the whole row, and I would like to do it for each of the 'red', 'yellow' and 'green' columns.
Is there way to do this without breaking the dataframe into a whole bunch of sub dataframes with all of the conditions broken out separately? I'm not sure if this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make it work.
If you need to replace outliers with missing values, use GroupBy.transform with DataFrame.quantile, then compare against the lower and upper bounds with DataFrame.lt and DataFrame.gt, chain the masks with | for bitwise OR, and set missing values with DataFrame.mask (NaN is the default replacement, so it is not specified):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]
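Since the question mentions IQR, the same pattern works with the classic 1.5*IQR fences; a sketch reusing g and c from above:
q1 = g.transform('quantile', 0.25)
q3 = g.transform('quantile', 0.75)
iqr = q3 - q1
# flag cells outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] within each group and blank them
mask = df[c].lt(q1 - 1.5*iqr) | df[c].gt(q3 + 1.5*iqr)
df[c] = df[c].mask(mask)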

Iterate over rows in a data frame, create a new column, then add more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output that I need is to take the quantity difference between two consecutive dates, spread it evenly over 24 hours, and create 23 columns where each hourly value adds the hourly difference to the previous column, such as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate over the rows in a loop, but it's not working. My code is as below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity'))/24
    for j in range(24):
        df[i, [1+j]] = df.[i, [j]]*(1+diff)
I did some research but have not found how to create columns like this iteratively. I hope you can help me. Thank you in advance.
IIUC, using resample and interpolate, then we pivot the output:
s = df.set_index('Date').resample('1 H').interpolate()
s = pd.pivot_table(s, index=s.index.date,
                   columns=s.groupby(s.index.date).cumcount(),
                   values=['Quantity'], aggfunc='mean')
s.columns = s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly, a for loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:  # every row but the last has a following date
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty)/24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the columns required, i.e. (see the loop sketch below):
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...
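The repetition can be collapsed into a loop; a sketch assuming the 23 hourly columns from the expected output:
for h in range(1, 24):
    # each hourly column adds one more increment of the daily diff
    df[f'Hour-{h}'] = df['Quantity'] + h*df['diff']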
There are other approaches which will work way better.

crosstab pandas with condition on columns doesn't display summed values

I have a problem displaying what I want with pd.crosstab
I tried these lines:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True)
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True, aggfunc = lambda x: x.count(), values = df_temp['state'][df_temp['state'] >= 20])
And they both display this:
state 20.0 30.0 32.0 50.0 All
date
2017 303.0 327.0 6.0 118.0 754.0
2018 328.0 167.0 3.0 58.0 556.0
All 631.0 494.0 9.0 176.0 1310.0
But what I want is not, for each state, the count of values equal to that state. For example, for state 20 in each year I want the value to be the count of all values greater than or equal to 20; for 2017 that should be 754. For 30 it should be 754 - 303 = 451, and so on for the other states.
I also tried this line of command but it doesn't work either:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][(df_temp['state'] >= 20) | (df_temp['state'] == 30)], margins=True, aggfunc = lambda x: x.count(), values = df_temp['state'][(df_temp['state'] == 20) | (df_temp['state'] == 30)])
It displays the following table:
state 20.0 30.0 32.0 50.0 All
date
2017 303.0 327.0 0.0 0.0 630.0
2018 328.0 167.0 0.0 0.0 495.0
All 631.0 494.0 NaN NaN 1125.0
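One way to get those "greater than or equal" counts, as a sketch assuming the df_temp frame from the question: build the plain crosstab first, then take a reverse cumulative sum across the state columns:
mask = df_temp['state'] >= 20
years = df_temp.loc[mask, 'date'].apply(lambda x: pd.to_datetime(x).year)
ct = pd.crosstab(years, df_temp.loc[mask, 'state'])
# summing from the rightmost state backwards turns each column into the
# count of values greater than or equal to that state
ct = ct.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]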
