crosstab pandas with condition on columns doesn't display summed values - python-3.x

I have a problem displaying what I want with pd.crosstab.
I tried these lines:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True)
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True, aggfunc=lambda x: x.count(), values=df_temp['state'][df_temp['state'] >= 20])
And they both display this:
state   20.0   30.0  32.0   50.0     All
date
2017   303.0  327.0   6.0  118.0   754.0
2018   328.0  167.0   3.0   58.0   556.0
All    631.0  494.0   9.0  176.0  1310.0
But what I want is not, for each state, the count of values equal to that state. For example, for state 20 I want the value for each year to be the count of all values greater than or equal to 20, so for 2017 it should be 754. For 30 it should be 754 - 303 = 451, and so on for the other states.
I also tried this command, but it doesn't work either:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][(df_temp['state'] >= 20) | (df_temp['state'] == 30)], margins=True, aggfunc=lambda x: x.count(), values=df_temp['state'][(df_temp['state'] == 20) | (df_temp['state'] == 30)])
It displays the following table:
state   20.0   30.0  32.0  50.0     All
date
2017   303.0  327.0   0.0   0.0   630.0
2018   328.0  167.0   0.0   0.0   495.0
All    631.0  494.0   NaN   NaN  1125.0
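One way to get the counts you describe (a sketch, assuming df_temp has the 'date' and 'state' columns shown above) is to build the plain count crosstab first and then take a reverse cumulative sum across the state columns, so that each column counts all values greater than or equal to that state:
import pandas as pd

year = pd.to_datetime(df_temp['date']).dt.year
ct = pd.crosstab(year, df_temp['state'][df_temp['state'] >= 20])
# reverse the column order, accumulate, then restore the order:
# column 20 becomes the count of states >= 20, column 30 the count of states >= 30, etc.
ct = ct.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]
# optional total row, analogous to margins=True
ct.loc['All'] = ct.sum()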

Related

Create new columns by comparing the current row's values and previous in Pandas

Given a dummy dataset df as follows:
year v1 v2
0 2017 0.3 0.1
1 2018 0.1 0.1
2 2019 -0.2 0.5
3 2020 NaN -0.3
4 2021 0.8 0.0
or:
[{'year': 2017, 'v1': 0.3, 'v2': 0.1},
{'year': 2018, 'v1': 0.1, 'v2': 0.1},
{'year': 2019, 'v1': -0.2, 'v2': 0.5},
{'year': 2020, 'v1': nan, 'v2': -0.3},
{'year': 2021, 'v1': 0.8, 'v2': 0.0}]
I need to create two more columns trend_v1 and trend_v2 based on v1 and v2 respectively.
The logic to create the new columns is this: for each column, if the current value is greater than the previous one, the trend value is increase; if it is less than the previous one, decrease; if it is equal to the previous one, equal; and if the current or previous value is NaN, the trend is also NaN:
year v1 v2 trend_v1 trend_v2
0 2017 0.3 0.1 NaN NaN
1 2018 0.1 0.1 decrease equal
2 2019 -0.2 0.5 decrease increase
3 2020 NaN -0.3 NaN decrease
4 2021 0.8 0.0 NaN increase
How could I achieve that in pandas? Thanks for your help in advance.
You can test the trend for the specified columns by comparing them with their shifted values, handling missing values separately:
import numpy as np
import pandas as pd

cols = ['v1', 'v2']
arr = np.where(df[cols] < df[cols].shift(), 'decrease',
      np.where(df[cols] > df[cols].shift(), 'increase',
      np.where(df[cols].isna() | df[cols].shift().isna(), None, 'equal')))
df = df.join(pd.DataFrame(arr, columns=cols, index=df.index).add_prefix('trend_'))
print(df)
year v1 v2 trend_v1 trend_v2
0 2017 0.3 0.1 None None
1 2018 0.1 0.1 decrease equal
2 2019 -0.2 0.5 decrease increase
3 2020 NaN -0.3 None decrease
4 2021 0.8 0.0 None increase
Or:
cols = ['v1','v2']
m1 = df[cols] < df[cols].shift()
m2 = df[cols] > df[cols].shift()
m3 = df[cols].isna() | df[cols].shift().isna()
arr = np.select([m1, m2, m3],['decrease','increase', None], default='equal')
df = df.join(pd.DataFrame(arr, columns=cols, index=df.index).add_prefix('trend_'))
EDIT:
A nice improvement is to change m3, as mentioned in the comments:
cols = ['v1','v2']
m1 = df[cols] < df[cols].shift()
m2 = df[cols] > df[cols].shift()
m3 = df[cols] == df[cols].shift()
arr = np.select([m1, m2, m3],['decrease','increase', 'equal'], default=None)
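As in the first approach, arr is then wrapped in a DataFrame and joined back to df (the same join step as above, repeated here for completeness):
df = df.join(pd.DataFrame(arr, columns=cols, index=df.index).add_prefix('trend_'))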

Removing outliers based on column variables or multi-index in a dataframe

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.). Not the whole row, just the cell. And I would like to do it for each of the 'red', 'yellow', and 'green' columns.
Is there a way to do this without breaking the DataFrame into a whole bunch of sub-DataFrames with all of the conditions broken out separately? I'm not sure if this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make it work.
If you need to replace the outliers with missing values, use GroupBy.transform with quantile, compare against the lower and upper bounds with DataFrame.lt and DataFrame.gt, chain the masks with | (bitwise OR), and set missing values with DataFrame.mask (NaN is the default replacement, so it doesn't need to be specified):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]
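If you would rather keep Season and Treatment as index levels instead of columns (the question is open to either), here is a minimal sketch of the same idea with a MultiIndex, grouping by the index levels so columns.difference is no longer needed:
df_idx = df.set_index(['Season', 'Treatment'])
g = df_idx.groupby(level=['Season', 'Treatment'])
# same per-group bounds as above
low = g.transform('quantile', 0.05)
high = g.transform('quantile', 0.95)
# only the measurement columns remain, so the whole frame can be masked
df_idx = df_idx.mask(df_idx.lt(low) | df_idx.gt(high))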

How do I conditionally aggregate values in projection part of pandas query?

I currently have a csv file with this content:
ID PRODUCT_ID NAME STOCK SELL_COUNT DELIVERED_BY
1 P1 PRODUCT_P1 12 15 UPS
2 P2 PRODUCT_P2 4 3 DHL
3 P3 PRODUCT_P3 120 22 DHL
4 P1 PRODUCT_P1 423 18 UPS
5 P2 PRODUCT_P2 0 5 GLS
6 P3 PRODUCT_P3 53 10 DHL
7 P4 PRODUCT_P4 22 0 UPS
8 P1 PRODUCT_P1 94 56 GLS
9 P1 PRODUCT_P1 9 24 GLS
When I execute this SQL query:
SELECT
    PRODUCT_ID,
    MIN(CASE WHEN DELIVERED_BY = 'UPS' THEN STOCK END) as STOCK,
    SUM(CASE WHEN ID > 6 THEN SELL_COUNT END) as TOTAL_SELL_COUNT,
    SUM(CASE WHEN SELL_COUNT * 100 > 1000 THEN SELL_COUNT END) as COND_SELL_COUNT
FROM products
GROUP BY PRODUCT_ID;
I get the desired result:
PRODUCT_ID STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
P1 12 80 113
P2 null null null
P3 null null 22
P4 22 0 null
Now I'm trying to somehow get the same result on that dataset using pandas, and that's what I'm struggling with.
I imported the csv file into a DataFrame called df_products.
Then I tried this:
def custom_aggregate(grouped):
    data = {
        'STOCK': np.where(grouped['DELIVERED_BY'] == 'UPS', grouped['STOCK'].min(), np.nan)  # [grouped['STOCK'].min() if grouped['DELIVERED_BY'] == 'UPS' else None]
    }
    d_series = pd.Series(data)
    return d_series

result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
print(result)
As you can see, I'm nowhere near the expected result, as I'm already having problems getting the conditional STOCK aggregation to work depending on the DELIVERED_BY values.
This outputs:
STOCK
PRODUCT_ID
P1 [9.0, 9.0, nan, nan]
P2 [nan, nan]
P3 [nan, nan]
P4 [22.0]
which is not even in the correct format, but I'd be happy if I could get the expected 12.0 instead of 9.0 for P1.
Thanks
I just wanted to add that I got near the result by creating additional columns:
df_products['COND_STOCK'] = df_products[df_products['DELIVERED_BY'] == 'UPS']['STOCK']
df_products['SELL_COUNT_ID_GT6'] = df_products[df_products['ID'] > 6]['SELL_COUNT']
df_products['SELL_COUNT_GT1000'] = df_products[(df_products['SELL_COUNT'] * 100) > 1000]['SELL_COUNT']
The function would then look like this:
def custom_aggregate(grouped):
    data = {
        'STOCK': grouped['COND_STOCK'].min(),
        'TOTAL_SELL_COUNT': grouped['SELL_COUNT_ID_GT6'].sum(),
        'COND_SELL_COUNT': grouped['SELL_COUNT_GT1000'].sum(),
    }
    d_series = pd.Series(data)
    return d_series

result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
This is the 'almost' desired result:
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN 0.0 0.0
P3 NaN 0.0 22.0
P4 22.0 0.0 0.0
Usually we can write this in pandas as below:
df.groupby('PRODUCT_ID').apply(lambda x: pd.Series({
    'STOCK': x.loc[x.DELIVERED_BY == 'UPS', 'STOCK'].min(),
    'TOTAL_SELL_COUNT': x.loc[x.ID > 6, 'SELL_COUNT'].sum(min_count=1),
    'COND_SELL_COUNT': x.loc[x.SELL_COUNT > 10, 'SELL_COUNT'].sum(min_count=1)}))
Out[105]:
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN NaN NaN
P3 NaN NaN 22.0
P4 22.0 0.0 NaN
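For larger frames, here is a sketch of a vectorized alternative (not from the answers above) that reuses the question's idea of pre-computed conditional columns, building them with Series.where and aggregating with named aggregations; sum(min_count=1) keeps NaN for groups with no matching rows:
out = (df_products
       .assign(COND_STOCK=df_products['STOCK'].where(df_products['DELIVERED_BY'] == 'UPS'),
               SELL_COUNT_ID_GT6=df_products['SELL_COUNT'].where(df_products['ID'] > 6),
               SELL_COUNT_GT1000=df_products['SELL_COUNT'].where(df_products['SELL_COUNT'] * 100 > 1000))
       .groupby('PRODUCT_ID')
       .agg(STOCK=('COND_STOCK', 'min'),
            TOTAL_SELL_COUNT=('SELL_COUNT_ID_GT6', lambda s: s.sum(min_count=1)),
            COND_SELL_COUNT=('SELL_COUNT_GT1000', lambda s: s.sum(min_count=1))))
print(out)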

Pandas rolling mean don't change numbers to NaN in DataFrame

I'm working with a pandas DataFrame which looks like this:
(N.B. the offset is set as the index of the DataFrame)
offset X Y Z
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.126465 -1.926758 -0.414062
50 -0.137207 -1.916992 -0.404297
60 -0.130371 -3.784591 -0.987654
70 -0.125000 -1.918457 -0.403809
80 -0.123456 -1.917480 -0.413574
90 -0.126465 -1.926758 -0.333554
I have applied the rolling mean with window size = 5 to the DataFrame using the following code.
I need to keep this window size = 5, and I need values for the whole DataFrame for all of the offset values (no NaNs).
df = df.rolling(center=False, window=5).mean()
Which gives me:
offset X Y Z
0.0 NaN NaN NaN
10.0 NaN NaN NaN
20.0 NaN NaN NaN
30.0 NaN NaN NaN
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
I would like the DataFrame to keep the original values for the first rows, where the rolling mean produces NaN, and use the rolling mean result for the rest of the values. Is there a simple way to do this? Thanks
i.e.
offset X Y Z
0.0 -0.140137 -1.924316 -0.426758
10.0 -2.789123 -1.111212 -0.416016
20.0 -0.133789 -1.923828 -4.408691
30.0 -0.101112 -1.457891 -0.425781
40.0 -0.658125 -1.668801 -1.218262
50.0 -0.657539 -1.667336 -1.213769
60.0 -0.125789 -2.202012 -1.328097
70.0 -0.124031 -2.200938 -0.527121
80.0 -0.128500 -2.292856 -0.524679
90.0 -0.128500 -2.292856 -0.508578
You can fill with the original df:
df.rolling(center=False, window=5).mean().fillna(df)
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -2.789123 -1.111212 -0.416016
20 -0.133789 -1.923828 -4.408691
30 -0.101112 -1.457891 -0.425781
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
There is also an argument, min_periods, that you can use. If you pass min_periods=1, it will take the first value as it is, the second value as the mean of the first two, etc. It might make more sense in some cases.
df.rolling(center=False, window=5, min_periods=1).mean()
Out:
X Y Z
offset
0 -0.140137 -1.924316 -0.426758
10 -1.464630 -1.517764 -0.421387
20 -1.021016 -1.653119 -1.750488
30 -0.791040 -1.604312 -1.419311
40 -0.658125 -1.668801 -1.218262
50 -0.657539 -1.667336 -1.213769
60 -0.125789 -2.202012 -1.328097
70 -0.124031 -2.200938 -0.527121
80 -0.128500 -2.292856 -0.524679
90 -0.128500 -2.292856 -0.508578
Assuming you don't have other rows with all NaN's, you can identify which rows have all NaN's in your rolling_df, and replace them with the corresponding rows from the original. Example:
df = pd.DataFrame(np.random.rand(13, 5))
df_rolling = df.rolling(center=False, window=5).mean()
# identify which rows are all NaN
idx = df_rolling.index[df_rolling.isnull().all(1)]
# replace those rows with the original data
df_rolling.loc[idx, :] = df.loc[idx, :]

Pandas DataFrame Apply Efficiency

I have a dataframe to which I want to add a column with a kind of status if there is a matching value in another dataframe. I have the current code, which works:
df1['NewColumn'] = df1['ComparisonColumn'].apply(lambda x: 'Match' if any(df2.ComparisonColumn == x) else ('' if x is None else 'Missing'))
I know the line is ugly, but I get the impression that it's inefficient. Can you suggest a better way to make this comparison?
You can use np.where, isin, and isnull:
Create some dummy data:
np.random.seed(123)
df = pd.DataFrame({'ComparisonColumn':np.random.randint(10,20,20)})
df.iloc[4] = np.nan #Create missing data
df2 = pd.DataFrame({'ComparisonColumn':np.random.randint(15,30,20)})
Do matching with np.where:
df['NewColumn'] = np.where(df.ComparisonColumn.isin(df2.ComparisonColumn), 'Matched',
                  np.where(df.ComparisonColumn.isnull(), 'Missing', ''))
Output:
ComparisonColumn NewColumn
0 12.0
1 12.0
2 16.0 Matched
3 11.0
4 NaN Missing
5 19.0 Matched
6 16.0 Matched
7 11.0
8 10.0
9 11.0
10 19.0 Matched
11 10.0
12 10.0
13 19.0 Matched
14 13.0
15 14.0
16 10.0
17 10.0
18 14.0
19 11.0
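A hypothetical variant of the same logic using np.select, which lists the two conditions and the default in one place instead of nesting np.where calls:
conditions = [df.ComparisonColumn.isin(df2.ComparisonColumn),
              df.ComparisonColumn.isnull()]
df['NewColumn'] = np.select(conditions, ['Matched', 'Missing'], default='')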
