I work with customers' consumption data, and sometimes there is no consumption recorded for a month or more.
The first consumption after such a gap needs to be broken down across those months.
Example:
df = pd.DataFrame({'customerId': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'month': ['2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01'],
                   'consumption': [100,130,0,0,400,140,105,500,0,0,0,0,0,3300]})
bfill() returns the same value, not the mean (value / (count of nulls + 1)).
Desired values:
'c':[100,130,133,133,133,140,105,500,550,550,550,550,550,550]
You can try something like this:
df = pd.DataFrame({'customerId': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'month': ['2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01'],
                   'consumption': [100,130,0,0,400,140,105,500,0,0,0,0,0,3300]})
df['grp'] = df['consumption'].ne(0)[::-1].cumsum()
df['c'] = df.groupby(['customerId', 'grp'])['consumption'].transform('mean')
df
Output:
customerId month consumption grp c
0 1 2021-10-01 100 7 100.000000
1 1 2021-11-01 130 6 130.000000
2 1 2021-12-01 0 5 133.333333
3 1 2022-01-01 0 5 133.333333
4 1 2022-02-01 400 5 133.333333
5 1 2022-03-01 140 4 140.000000
6 1 2022-04-01 105 3 105.000000
7 2 2021-10-01 500 2 500.000000
8 2 2021-11-01 0 1 550.000000
9 2 2021-12-01 0 1 550.000000
10 2 2022-01-01 0 1 550.000000
11 2 2022-02-01 0 1 550.000000
12 2 2022-03-01 0 1 550.000000
13 2 2022-04-01 3300 1 550.000000
Details:
Create a group key by checking for non-zero values, then do a cumsum in reverse order
to group the zeros with the next non-zero value.
Group by that key and transform with mean to distribute the non-zero
value evenly across the zeros.
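The helper column isn't required; an equivalent sketch passes the reverse-cumsum key straight to groupby (the reversed cumsum realigns to the original rows by index):
grp = df['consumption'].ne(0)[::-1].cumsum()
df['c'] = df.groupby(['customerId', grp])['consumption'].transform('mean')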
I have a dataframe that looks like
Date col_1 col_2 col_3
2022-08-20 5 B 1
2022-07-21 6 A 1
2022-07-20 2 A 1
2022-06-15 5 B 1
2022-06-11 3 C 1
2022-06-05 5 C 2
2022-06-01 3 B 2
2022-05-21 6 A 1
2022-05-13 6 A 0
2022-05-10 2 B 3
2022-04-11 2 C 3
2022-03-16 5 A 3
2022-02-20 5 B 1
and I want to add a new column col_new that cumulatively counts the number of rows with the same elements in col_1 and col_2, excluding that row itself, such that the element in col_3 is 1. So the desired output would look like:
Date col_1 col_2 col_3 col_new
2022-08-20 5 B 1 3
2022-07-21 6 A 1 2
2022-07-20 2 A 1 1
2022-06-15 5 B 1 2
2022-06-11 3 C 1 1
2022-06-05 5 C 2 0
2022-06-01 3 B 2 0
2022-05-21 6 A 1 1
2022-05-13 6 A 0 0
2022-05-10 2 B 3 0
2022-04-11 2 C 3 0
2022-03-16 5 A 3 0
2022-02-20 5 B 1 1
And here's what I have tried:
Date = pd.to_datetime(df['Date'], dayfirst=True)
list_col_3_is_1 = (df
                   .assign(Date=Date)
                   .sort_values('Date', ascending=True)
                   ['col_3'].eq(1))
df['col_new'] = (list_col_3_is_1
                 .groupby(df[['col_1','col_2']])
                 .apply(lambda g: g.shift(1, fill_value=0).cumsum()))
But then I got the following error: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Thanks in advance.
Your solution needs a small change: pass the grouping keys as a list of Series rather than a DataFrame (a DataFrame grouper is not 1-dimensional, which is exactly what the error says):
df['col_new'] = list_col_3_is_1.groupby([df['col_1'],df['col_2']]).cumsum()
print (df)
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
Assuming you already have the rows sorted in the desired order, you can use:
df['col_new'] = (df[::-1]
                 .assign(n=df['col_3'].eq(1))
                 .groupby(['col_1', 'col_2'])['n'].cumsum()
                )
Output:
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
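An equivalent sketch that avoids the positional reversal, assuming the 'YYYY-MM-DD' Date strings sort chronologically: sort ascending by Date, take the cumulative sum per group, and let index alignment restore the original row order:
df['col_new'] = (df.assign(n=df['col_3'].eq(1))
                   .sort_values('Date')
                   .groupby(['col_1', 'col_2'])['n'].cumsum())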
I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# Initialise data as a dict of lists.
data = {'month': [2,3,4,5,6,7,2,3,6,5],
        'flag': ["A","A","A","A","A","A","B","B","B","B"],
        'month1': [4,4,7,15,11,13,6,5,6,5],
        'value': [100,20,50,10,65,86,24,12,1000,200]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
month flag month1 value
0 2 A 4 100
1 3 A 4 20
2 4 A 7 50
3 5 A 15 10
4 6 A 11 65
5 7 A 13 86
6 2 B 6 24
7 3 B 5 12
8 6 B 6 1000
9 5 B 5 200
Now, for each month within each unique flag, I want to perform the logic below:
1) Create a variable "final" and set it to 0.
2) For each row, if month1 <= max(month), add the row's value to the "final" of the row where month == month1. For example:
index 0 to 5 form one group (flag = 'A');
the max of the month column for group A is 7;
for row 1 (month 2), month1 is 4, which is less than 7, so go to month 4 (row 3) and update its "final" column to 100 (0, the current "final" value, plus 100, the value from the original row).
Perform the above step for each row in the group.
Expected output:
month flag month1 value Final
0 2 A 4 100 0
1 3 A 4 20 0
2 4 A 7 50 120
3 5 A 15 10 0
4 6 A 11 65 0
5 7 A 13 86 50
6 2 B 6 24 0
7 3 B 5 12 0
8 6 B 6 1000 1024
9 5 B 5 200 212
Define the following functions:
A function to be applied to each row (in the current group):
def fn(row, tbl, maxMonth):
    # Sum the values of rows whose month1 equals this row's month.
    # maxMonth is unused: an out-of-range month1 never matches any month.
    return tbl[tbl.month1 == row.month].value.sum()
A function to be applied to each group:
def fnGrp(grp):
    # Apply fn row-wise, giving it the whole group as a lookup table.
    return grp.apply(fn, axis=1, tbl=grp, maxMonth=grp.month.max())
Then, to compute the final column, group df by flag, apply
fnGrp to each group, and save the result in the final column:
df['final'] = df.groupby('flag').apply(fnGrp).reset_index(level=0, drop=True)
The result (df with added column) is:
month flag month1 value final
0 2 A 4 100 0
1 3 A 4 20 0
2 4 A 7 50 120
3 5 A 15 10 0
4 6 A 11 65 0
5 7 A 13 86 50
6 2 B 6 24 0
7 3 B 5 12 0
8 6 B 6 1000 1024
9 5 B 5 200 212
You can group by 'flag' and 'month1' to get the sum of 'value', then merge this back onto df and fillna with 0:
new_df = df.merge(df.groupby(['flag', 'month1'])[['value']].sum(),
                  left_on=['flag', 'month'], right_index=True,
                  how='left', suffixes=('', '_final')) \
           .fillna({'value_final': 0})
print (new_df)
month flag month1 value value_final
0 2 A 4 100 0.0
1 3 A 4 20 0.0
2 4 A 7 50 120.0
3 5 A 15 10 0.0
4 6 A 11 65 0.0
5 7 A 13 86 50.0
6 2 B 6 24 0.0
7 3 B 5 12 0.0
8 6 B 6 1000 1024.0
9 5 B 5 200 212.0
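If you want integers like the expected output (a minor touch-up, assuming no NaN remains after the fillna), cast the merged column:
new_df['value_final'] = new_df['value_final'].astype(int)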
I want to calculate a return, RET, which is the cumulative product over 2 periods (the current and the next period), within each groupby('id') group.
df['RET'] = df.groupby('id')['trt1m1'].rolling(2,min_periods=2).apply(lambda x:x.prod()).reset_index(0,drop=True)
Expected Result:
id datadate trt1m1 RET
1 20051231 1 2
1 20060131 2 6
1 20060228 3 12
1 20060331 4 16
1 20060430 4 20
1 20060531 5 NaN
2 20061031 10 110
2 20061130 11 165
2 20061231 15 300
2 20070131 20 420
2 20070228 21 NaN
Actual Result:
id datadate trt1m1 RET
1 20051231 1 Nan
1 20060131 2 2
1 20060228 3 6
1 20060331 4 12
1 20060430 4 16
1 20060531 5 20
2 20061031 10 Nan
2 20061130 11 110
2 20061231 15 165
2 20070131 20 300
2 20070228 21 420
The code I used calculates the product over the trailing 2 periods instead of the forward 2 periods.
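One way to make it forward-looking (a sketch, assuming rows are already sorted by datadate within each id): multiply each value by the next one in its group, e.g. via shift(-1):
df['RET'] = df.groupby('id')['trt1m1'].transform(lambda s: s * s.shift(-1))
The last row of each group has no next period, so it gets NaN, matching the expected result.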
I have the following dataframe:
car_id time(seconds) is_charging
1 1 65 1
2 1 70 1
3 1 67 1
4 1 71 1
5 1 120 0
6 1 124 0
7 1 117 0
8 1 80 1
9 1 74 1
10 1 62 1
11 1 130 0
12 1 124 0
I want to create a new column that enumerates the charging and discharging periods of the 'is_charging' column, so that later on I can group by that new column and compute the mean, max, min, etc., of each period.
The resulting dataframe should be like this:
car_id time(seconds) is_charging periods_id
1 1 65 1 1
2 1 70 1 1
3 1 67 1 1
4 1 71 1 1
5 1 120 0 2
6 1 124 0 2
7 1 117 0 2
8 1 80 1 3
9 1 74 1 3
10 1 62 1 3
11 1 130 0 4
12 1 124 0 4
I've done this using a for statement, like this:
df['periods_id'] = 0
period_id = 1
previous_charging_state = df.at[df.index[0], 'is_charging']
for ind in df.index:
    if df.at[ind, 'is_charging'] != previous_charging_state:
        previous_charging_state = df.at[ind, 'is_charging']
        period_id = period_id + 1
    df.at[ind, 'periods_id'] = period_id
This is way too slow for the number of rows I have. I'm trying to use a vectorized function, especially apply(), but due to my lack of understanding I haven't had much success, and I cannot find a similar problem online.
Can someone help me optimize this?
Try this:
df.is_charging.diff().ne(0).cumsum()
Out[115]:
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 4
12 4
Name: is_charging, dtype: int32
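diff() is non-zero exactly where is_charging changes (the leading NaN also compares unequal to 0), so ne(0) flags each period start and cumsum() numbers the periods. If the frame holds several car_id values, a sketch that restarts the numbering per car:
df['periods_id'] = (df.groupby('car_id')['is_charging']
                      .transform(lambda s: s.diff().ne(0).cumsum()))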
I have the following test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'MONTH': [1,2,3,1,1,1,1,1,1,2,3,2,2,3,2,1,1,1,1,1,1,1],
                   'HOUR': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                   'CIGFT': [np.nan,12000,2500,73300,73300,np.nan,np.nan,np.nan,np.nan,12000,100,100,15000,2500,np.nan,15000,11000,np.nan,np.nan,np.nan,np.nan,np.nan]})
cigs = pd.DataFrame()
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
cigs['cigcount'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).count())
df.fillna(value='-', inplace=True)
cigs['cigminus'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
tfile = open('test_COUNT_manual.txt', 'a')
tfile.write(cigs.to_string())
tfile.close()
I wind up with the following results:
The dataframe:
CIGFT HOUR MONTH
0 NaN 0 1
1 12000.0 0 2
2 2500.0 0 3
3 73300.0 0 1
4 73300.0 0 1
5 NaN 0 1
6 NaN 0 1
7 NaN 0 1
8 NaN 0 1
9 12000.0 0 2
10 100.0 0 3
11 100.0 0 2
12 15000.0 0 2
13 2500.0 0 3
14 NaN 0 2
15 15000.0 0 1
16 11000.0 0 1
17 NaN 0 1
18 NaN 0 1
19 NaN 0 1
20 NaN 0 1
21 NaN 0 1
The results written to the file:
cigsum cigcount cigminus
MONTH HOUR
1 0 4 14 14
2 0 4 5 5
3 0 3 3 3
My issue is that .sum() is not summing the values; it is counting the non-null values. When I replace the null values with '-', .sum()
produces the same result as count().
So what do I use to get the sum of the values if .sum() does not do it?
Series.sum() returns the sum of the series values, excluding NA/null values by default, as mentioned in the official docs.
The problem is that (c >= 0.0) produces a boolean Series, so summing it counts the True values instead of summing CIGFT. The lambda receives the group's series each time; just apply sum directly to that series to get the correct result.
Do this:
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: c.sum())
Result of this code will be,
MONTH HOUR
1 0 172600.0
2 0 39100.0
3 0 5100.0
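Note that the lambda isn't needed at all here; the built-in groupby sum skips NaN by default and gives the same numbers:
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].sum()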