Get Proportions of One-Hot Encoded Values While Aggregating - Pandas - python-3.x

I have a df like this,
Date Value
0 2019-03-01 0
1 2019-04-01 1
2 2019-09-01 0
3 2019-10-01 1
4 2019-12-01 0
5 2019-12-20 0
6 2019-12-20 0
7 2020-01-01 0
Now I need to group them by quarter and get the proportions of 1s and 0s, so my final output looks like this:
Date Value1 Value0
0 2019-03-31 0 1
1 2019-06-30 1 0
2 2019-09-30 0 1
3 2019-12-31 0.25 0.75
4 2020-03-31 0 1
I tried the following code, but it doesn't seem to work:
def custom_resampler(array):
    import numpy as np
    return array / np.sum(array)

df.set_index('Date').resample('Q')['Value'].apply(custom_resampler)
Is there a pandastic way I can achieve my desired output?

Resample by quarter, get the value_counts, and unstack. Next, rename the columns using the name property of the columns. Finally, divide each row value by the row total:
import pandas as pd

df = pd.read_clipboard(sep=r'\s{2,}', parse_dates=['Date'])
res = (df
       .resample(rule="Q", on="Date")
       .Value
       .value_counts()
       .unstack("Value", fill_value=0)
       )
# prefix the 0/1 labels with the columns' name ("Value")
res.columns = [f"{res.columns.name}{ent}" for ent in res.columns]
# divide each row by its total to get proportions
res = res.div(res.sum(axis=1), axis=0)
res
Value0 Value1
Date
2019-03-31 1.00 0.00
2019-06-30 0.00 1.00
2019-09-30 1.00 0.00
2019-12-31 0.75 0.25
2020-03-31 1.00 0.00
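For comparison, a minimal crosstab-based sketch of the same idea, assuming the df built above: pd.crosstab with normalize="index" gives the row-wise proportions directly. Note that the index comes out as quarterly periods rather than quarter-end dates, and unlike resample it will not insert empty quarters (which does not matter for this sample).
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-03-01", "2019-04-01", "2019-09-01", "2019-10-01",
        "2019-12-01", "2019-12-20", "2019-12-20", "2020-01-01",
    ]),
    "Value": [0, 1, 0, 1, 0, 0, 0, 0],
})

# cross-tabulate quarter vs. Value and normalize each row to proportions
res = pd.crosstab(df["Date"].dt.to_period("Q"), df["Value"], normalize="index")
res.columns = [f"Value{c}" for c in res.columns]
print(res)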

Related

How to subtract X rows in a dataframe with first value from another dataframe?

I am using pandas for this work.
I have 2 datasets. The first dataset has approximately 6 million rows and 6 columns. For example, the first dataset looks something like this:
Date        Time  U  V  W  T
2020-12-30  2:34  3  4  5  7
2020-12-30  2:35  2  3  6  5
2020-12-30  2:36  1  5  8  5
2020-12-30  2:37  2  3  0  8
2020-12-30  2:38  4  4  5  7
2020-12-30  2:39  5  6  5  9
This is just the raw data collected from the machine.
The second dataset contains the average of every three rows from each column (U, V, W, T):
U     V     W     T
2     4     6.33  5.67
3.66  4.33  3.33  8
What I am trying to do is calculate the perturbation for each column per second.
U(perturbation)=U(raw)-U(avg)
U(raw)= dataset 1
U(avg)= dataset 2
Basically, take the first three rows of the first column of the first dataset and subtract the first value of the first column of the second dataset from each of them; then take the next three values from the first column of the first dataset and subtract the second value of the second dataset from each of them. Do the same for every column.
The desired final output should be as the following:
Date        Time  U      V      W      T
2020-12-30  2:34  1      0      -1.33  1.33
2020-12-30  2:35  0      -1     -0.33  -0.67
2020-12-30  2:36  -1     1      1.67   -0.67
2020-12-30  2:37  -1.66  -1.33  -3.33  0
2020-12-30  2:38  0.34   -0.33  1.67   -1
2020-12-30  2:39  1.34   1.67   1.67   1
I am new to pandas and do not know how to approach this.
I hope it makes sense.
# merge each block of 3 rows of df1 with the matching row of df2,
# then subtract the df2 columns (suffix _y) from the df1 columns (suffix _x)
a = df1.assign(index=df1.index // 3).merge(df2.reset_index(), on='index')
b = a.filter(regex='_x', axis=1) - a.filter(regex='_y', axis=1).to_numpy()
pd.concat([a.filter(regex='^[^_]+$', axis=1), b], axis=1)
Date Time index U_x V_x W_x T_x
0 2020-12-30 2:34 0 0.00 0.00 -1.33 1.33
1 2020-12-30 2:35 0 -1.00 -1.00 -0.33 -0.67
2 2020-12-30 2:36 0 -2.00 1.00 1.67 -0.67
3 2020-12-30 2:37 1 -1.66 -1.33 -3.33 0.00
4 2020-12-30 2:38 1 0.34 -0.33 1.67 -1.00
5 2020-12-30 2:39 1 1.34 1.67 1.67 1.00
You can use numpy:
import numpy as np
df1[df2.columns] -= np.repeat(df2.to_numpy(), 3, axis=0)
NB. This modifies df1 in place; if you want, you can make a copy first (df_final = df1.copy()) and apply the subtraction to that copy.
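As a small self-contained sketch of this np.repeat approach, using only the U and V columns from the sample data above so the shapes stay easy to check:
import numpy as np
import pandas as pd

# first dataset: raw readings (U and V columns from the example)
df1 = pd.DataFrame({"U": [3, 2, 1, 2, 4, 5], "V": [4, 3, 5, 3, 4, 6]})
# second dataset: one averaged row per block of three raw rows
df2 = pd.DataFrame({"U": [2, 3.66], "V": [4, 4.33]})

out = df1.copy()                                          # keep df1 untouched
out[df2.columns] -= np.repeat(df2.to_numpy(), 3, axis=0)  # repeat each average over 3 rows
print(out)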

Pandas - group by function and sum columns to extract rows where sum of other columns is 0

I have a data frame with over three million rows. I am trying to group values in the Bar_Code column and extract only those rows where the sum of all rows in SOH, Cost and Sold_Date is zero.
My dataframe is as under:
Location Bar_Code SOH Cost Sold_Date
1 00000003589823 0 0.00 NULL
2 00000003589823 0 0.00 NULL
3 00000003589823 0 0.00 NULL
1 0000000151818 -102 0.00 NULL
2 0000000151818 0 8.00 NULL
3 0000000151818 0 0.00 2020-10-06T16:35:25.000
1 0000131604108 0 0.00 NULL
2 0000131604108 0 0.00 NULL
3 0000131604108 0 0.00 NULL
1 0000141073505 -53 3.00 2020-10-06T16:35:25.000
2 0000141073505 0 0.00 NULL
3 0000141073505 -20 20.00 2020-09-25T10:11:30.000
I have tried the below code:
df.groupby(['Bar_Code','SOH','Cost','Sold_Date']).sum()
but I am getting the below output:
Bar_Code SOH Cost Sold_Date
0000000151818 -102.0 0.0000 2021-12-13T10:01:59.000
0.0 8.0000 2020-10-06T16:35:25.000
0000131604108 0.0 0.0000 NULL
0000141073505 -53.0 0.0000 2021-11-28T16:57:59.000
3.0000 2021-12-05T11:23:02.000
0.0 0.0000 2020-04-14T08:02:45.000
0000161604109 -8.0 4.1000 2020-09-25T10:11:30.000
00000003589823 0 0.00 NULL
I need to check if it is possible to get the desired output below, keeping only the specific rows where the sum of SOH, Cost & Sold_Date is 0 or NULL; it is safe for the code to ignore the first column (Location):
Bar_Code SOH Cost Sold_Date
00000003589823 0 0.00 NULL
0000131604108 0.0 0.0000 NULL
The idea is to filter out all groups where SOH, Cost and Sold_Date are not all 0 or NaN: first filter the rows that do not match, get their Bar_Code values, and last invert the isin mask to exclude those groups:
g = df.loc[df[['SOH','Cost','Sold_Date']].fillna(0).ne(0).any(axis=1), 'Bar_Code']
df1 = df[~df['Bar_Code'].isin(g)].drop_duplicates('Bar_Code').drop('Location', axis=1)
print (df1)
Bar_Code SOH Cost Sold_Date
0 00000003589823 0 0.0 NaN
6 0000131604108 0 0.0 NaN
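An alternative sketch using GroupBy.transform, assuming the same df (with NULL read in as NaN, as in the answer above): a row qualifies when SOH and Cost are 0 and Sold_Date is missing, and a barcode is kept only when every one of its rows qualifies.
# per-row condition: SOH == 0, Cost == 0 and Sold_Date missing
ok = df[['SOH', 'Cost']].eq(0).all(axis=1) & df['Sold_Date'].isna()
# keep only barcodes where the condition holds for every row of the group
keep = ok.groupby(df['Bar_Code']).transform('all')
df1 = df[keep].drop_duplicates('Bar_Code').drop('Location', axis=1)
print(df1)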

Filter rows based on one column's value and calculate percentage of sum in Pandas

Given a small dataset as follows:
value input
0 3 0
1 4 1
2 3 -1
3 2 1
4 3 -1
5 5 0
6 1 0
7 1 1
8 1 1
I have used the following code:
df['pct'] = df['value'] / df['value'].sum()
But I want to calculate pct excluding rows where input = -1, meaning that if the input value is -1, the corresponding value is not taken into account in the sum, and pct does not need to be calculated for it (rows 2 and 4 in this case).
The expected result will like this:
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 NaN
3 2 1 0.12
4 3 -1 NaN
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
How could I do that in Pandas? Thanks.
You can use Series.where to replace the values of non-matching rows with missing values before summing, divide only the rows matching the mask (selected with DataFrame.loc), and finally round with Series.round:
mask = df['input'] != -1
df.loc[mask, 'pct'] = (df.loc[mask, 'value'] / df['value'].where(mask).sum()).round(2)
print (df)
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 NaN
3 2 1 0.12
4 3 -1 NaN
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
EDIT: If you need to replace the missing values with 0 instead, you can pass 0 as the second argument of where to set those values to 0; this Series can also be summed and gives the same total as the missing-value version:
s = df['value'].where(df['input'] != -1, 0)
df['pct'] = (s / s.sum()).round(2)
print (df)
value input pct
0 3 0 0.18
1 4 1 0.24
2 3 -1 0.00
3 2 1 0.12
4 3 -1 0.00
5 5 0 0.29
6 1 0 0.06
7 1 1 0.06
8 1 1 0.06
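A slightly shorter equivalent sketch of the first approach, assuming the same df: Series.mask hides the excluded rows as NaN, so the division leaves them as NaN automatically.
s = df['value'].mask(df['input'].eq(-1))   # excluded rows become NaN
df['pct'] = (s / s.sum()).round(2)         # NaN rows stay NaN after the division
print(df)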

Pandas groupby and conditional check on multiple columns

I have a dataframe like so:
id date status value
1 2009-06-17 1 NaN
1 2009-07-17 B NaN
1 2009-08-17 A NaN
1 2009-09-17 5 NaN
1 2009-10-17 0 0.55
2 2010-07-17 B NaN
2 2010-08-17 A NaN
2 2010-09-17 0 0.00
Now I want to group by id and then check if value becomes non-zero after status changes to A. So for the group with id=1, status does change to A and after that (in terms of date) value also becomes non-zero. But for the group with id=2, even after status changes to A, value does not become non-zero. Please note that if status does not change to A then I don't even need to check value.
So finally I want a new dataframe like this:
id check
1 True
2 False
Use:
print (df)
id date status value
0 1 2009-06-17 1 NaN
1 1 2009-07-17 B NaN
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
5 2 2010-07-17 B NaN
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
8 3 2010-08-17 R NaN
9 3 2010-09-17 0 0.00
idx = df['id'].unique()
#filter A values
m = df['status'].eq('A')
#filter all rows after A per groups
df1 = df[m.groupby(df['id']).cumsum().gt(0)]
print (df1)
id date status value
2 1 2009-08-17 A NaN
3 1 2009-09-17 5 NaN
4 1 2009-10-17 0 0.55
6 2 2010-08-17 A NaN
7 2 2010-09-17 0 0.00
#compare with 0, test that no value in the group is 0, and finally add back all possible ids with reindex
df2 = (df1['value'].ne(0)
       .groupby(df1['id'])
       .all()
       .reindex(idx, fill_value=False)
       .reset_index(name='check'))
print (df2)
id check
0 1 True
1 2 False
2 3 False
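A more explicit per-group sketch of the same check, assuming the df above (the helper name check_after_a is made up for illustration): for each id, take the rows from the first 'A' onwards and require every value there to be non-zero, with NaN treated as non-zero; groups without an 'A' return False.
def check_after_a(g):
    # rows from the first 'A' status onwards (empty if 'A' never appears)
    after = g.loc[g['status'].eq('A').cummax(), 'value']
    return bool(len(after)) and bool(after.ne(0).all())

df2 = (df.groupby('id')[['status', 'value']]
         .apply(check_after_a)
         .rename('check')
         .reset_index())
print(df2)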

pandas count frequency of column value in another dataframe column

I want to calculate, for each value in a dataframe column, its frequency in a column of another dataframe. Right now, I have the code below:
df2["freq"] = df1[["col1"]].groupby(df2["col2"])["col1"].transform('count')
But it is giving freq of 1.0 for all the values in df2["col2"], even for those values that don't exist in df1["col1"].
df1:
col1
0 636
1 636
2 801
3 802
df2:
col2
0 636
1 734
2 801
3 803
df2 after adding freq column:
col2 freq
0 636 1.0
1 734 1.0
2 801 1.0
3 803 1.0
What I actually want:
col2 freq
0 636 2
1 734 0
2 801 1
3 803 0
I am new to pandas, so I am not getting what I am doing wrong. Any help is appreciated! Thanks!
Use Series.map with the Series created by Series.value_counts, and last replace the missing values with 0:
df2["freq"] = df2["col2"].map(df1["col1"].value_counts()).fillna(0).astype(int)
print (df2)
col2 freq
0 636 2
1 734 0
2 801 1
3 803 0
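An equivalent sketch without map, assuming the same df1/df2: reindex the value counts against df2['col2'] and fill the codes that never appear with 0.
counts = df1["col1"].value_counts()
df2["freq"] = counts.reindex(df2["col2"], fill_value=0).to_numpy()
print(df2)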
