I work with customers' consumption data, and sometimes a customer has no consumption recorded for a month or more. The first consumption reported after such a gap needs to be broken down across those months.
Example:
df = pd.DataFrame({'customerId':[1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'month':['2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01'],
'consumption':[100,130,0,0,400,140,105,500,0,0,0,0,0,3300]})
bfill() just repeats the same value, not the mean (value / (count of nulls + 1)).
Desired result:
'c':[100,130,133,133,133,140,105,500,550,550,550,550,550,550]
You can try something like this:
df = pd.DataFrame({'customerId':[1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'month':['2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01','2021-10-01','2021-11-01','2021-12-01','2022-01-01','2022-02-01','2022-03-01','2022-04-01'],
'consumption':[100,130,0,0,400,140,105,500,0,0,0,0,0,3300]})
df['grp'] = df['consumption'].ne(0)[::-1].cumsum()
df['c'] = df.groupby(['customerId', 'grp'])['consumption'].transform('mean')
df
Output:
customerId month consumption grp c
0 1 2021-10-01 100 7 100.000000
1 1 2021-11-01 130 6 130.000000
2 1 2021-12-01 0 5 133.333333
3 1 2022-01-01 0 5 133.333333
4 1 2022-02-01 400 5 133.333333
5 1 2022-03-01 140 4 140.000000
6 1 2022-04-01 105 3 105.000000
7 2 2021-10-01 500 2 500.000000
8 2 2021-11-01 0 1 550.000000
9 2 2021-12-01 0 1 550.000000
10 2 2022-01-01 0 1 550.000000
11 2 2022-02-01 0 1 550.000000
12 2 2022-03-01 0 1 550.000000
13 2 2022-04-01 3300 1 550.000000
Details:
Create a group key by checking for non-zero values, then take the cumulative sum in reverse order
so that each run of zeroes is grouped with the next non-zero value.
Group by customerId and that key, and transform with 'mean' to spread the non-zero
value evenly across the zeroes.
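For illustration, here is the reversed cumulative sum on a tiny Series (a minimal sketch, not part of the answer above):
import pandas as pd

s = pd.Series([100, 0, 0, 400])
grp = s.ne(0)[::-1].cumsum()                       # cumsum computed back to front
print(grp.tolist())                                # [2, 1, 1, 1] -> the zeroes share a group with 400
print(s.groupby(grp).transform('mean').tolist())   # [100.0, 133.33..., 133.33..., 133.33...]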
I am a python beginner.
I have the following pandas DataFrame, with only two columns: "Time" and "Input".
I want to loop over the "Input" column with a window size w = 3 (three consecutive values). For every window, if all the items within it are 1s, the first item should stay 1 and the remaining values should be changed to 0.
index Time Input
0 11 0
1 22 0
2 33 0
3 44 1
4 55 1
5 66 1
6 77 0
7 88 0
8 99 0
9 1010 0
10 1111 1
11 1212 1
12 1313 1
13 1414 0
14 1515 0
My intended output is as follows
index Time Input What_I_got What_I_Want
0 11 0 0 0
1 22 0 0 0
2 33 0 0 0
3 44 1 1 1
4 55 1 1 0
5 66 1 1 0
6 77 1 1 1
7 88 1 0 0
8 99 1 0 0
9 1010 0 0 0
10 1111 1 1 1
11 1212 1 0 0
12 1313 1 0 0
13 1414 0 0 0
14 1515 0 0 0
What should I do to get the desired output? Am I missing something in my code?
import pandas as pd
import re
# join the Input column into one string of 0s and 1s, replace every
# (non-overlapping) '111' with '100', then split back into a Series
pd.Series(list(re.sub('111', '100', ''.join(df.Input.astype(str))))).astype(int)
Out[23]:
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 0
10 1
11 0
12 0
13 0
14 0
dtype: int32
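To store the result back in the DataFrame, you could assign it to a new column (a small follow-up sketch; the column name What_I_Want is taken from the question):
df['What_I_Want'] = pd.Series(
    list(re.sub('111', '100', ''.join(df.Input.astype(str)))),
    index=df.index,
).astype(int)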
I have a time series dataset of products, given below:
date product price amount
11/17/2019 A 10 20
11/19/2019 A 15 20
11/24/2019 A 20 30
12/01/2019 C 40 50
12/05/2019 C 45 35
This data has missing days ("MM/dd/YYYY") between the start and end date for each product. I am trying to fill in the missing dates with zero rows, converting the previous table into the table given below:
date product price amount
11/17/2019 A 10 20
11/18/2019 A 0 0
11/19/2019 A 15 20
11/20/2019 A 0 0
11/21/2019 A 0 0
11/22/2019 A 0 0
11/23/2019 A 0 0
11/24/2019 A 20 30
12/01/2019 C 40 50
12/02/2019 C 0 0
12/03/2019 C 0 0
12/04/2019 C 0 0
12/05/2019 C 45 35
To get this conversion, I used the code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
data=data.set_index(["date", "product"])
start=data.first_valid_index()[0]
end=data.last_valid_index()[0]
df=data.set_index("date").reindex(pd.date_range(start,end, freq="1D"), fill_values=0)
However, the code gives an error. Is there any way to do this conversion efficiently?
If you need to add 0 for the missing datetimes for each product separately, use a custom function in GroupBy.apply with DataFrame.reindex over the range from the minimal to the maximal datetime:
df = pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
f = lambda x: x.reindex(pd.date_range(x.index.min(),
x.index.max(), name='date'), fill_value=0)
df = (df.set_index('date')
.groupby('product')
.apply(f)
.drop('product', axis=1)
.reset_index())
print (df)
product date price amount
0 A 2019-11-17 10 20
1 A 2019-11-18 0 0
2 A 2019-11-19 15 20
3 A 2019-11-20 0 0
4 A 2019-11-21 0 0
5 A 2019-11-22 0 0
6 A 2019-11-23 0 0
7 A 2019-11-24 20 30
8 C 2019-12-01 40 50
9 C 2019-12-02 0 0
10 C 2019-12-03 0 0
11 C 2019-12-04 0 0
12 C 2019-12-05 45 35
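If you want the same layout as in the question (date first, formatted as MM/dd/YYYY), a small follow-up like this should work (column names assumed from the question):
df = df[['date', 'product', 'price', 'amount']]
df['date'] = df['date'].dt.strftime('%m/%d/%Y')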
One option is to use the complete function from pyjanitor to expose the missing rows per group:
#pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
# build the dates to be applied per group
dates = dict(date = lambda df: pd.date_range(df.min(), df.max(), freq='1D'))
df.complete(dates, by='product', sort = True).fillna(0, downcast='infer')
date product price amount
0 2019-11-17 00:00:00 A 10 20
1 2019-11-18 00:00:00 A 0 0
2 2019-11-19 00:00:00 A 15 20
3 2019-11-20 00:00:00 A 0 0
4 2019-11-21 00:00:00 A 0 0
5 2019-11-22 00:00:00 A 0 0
6 2019-11-23 00:00:00 A 0 0
7 2019-11-24 00:00:00 A 20 30
8 2019-12-01 00:00:00 C 40 50
9 2019-12-02 00:00:00 C 0 0
10 2019-12-03 00:00:00 C 0 0
11 2019-12-04 00:00:00 C 0 0
12 2019-12-05 00:00:00 C 45 35
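This assumes df was already loaded with the date column parsed as datetimes, for example with the same read as in the previous answer:
df = pd.read_csv("test.txt", sep="\t", parse_dates=['date'])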
There's an easier method for this case:
from datetime import timedelta

#create the full date range, then build a DataFrame from it
#if needed, you can expand the range a bit using timedelta()
alldates = pd.DataFrame(pd.date_range(data.index.min()-timedelta(1), data.index.max()+timedelta(4), freq="1D", name="newdate"))
#make 'newdate' the index, and you no longer need it as a column
alldates.index = alldates.newdate
alldates.drop(columns="newdate", inplace=True)
#now, join the tables; missing dates in the original table will be filled with NaN
data = alldates.join(data)
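To match the zero-filled output in the question, the NaNs introduced by the join can then be replaced (a small sketch; column names assumed from the question):
data[['price', 'amount']] = data[['price', 'amount']].fillna(0)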
I have a df as shown below
df:
ID Limit N_30 N_31_90 N_91_180 N_180_365
1 500 60 15 30 1
2 300 0 15 5 10
3 800 0 0 10 6
4 100 0 0 0 370
5 600 0 6 5 10
6 800 0 0 15 6
7 500 10 10 30 9
8 200 0 0 0 0
About the data
ID - customer ID
Limit - Limit
N_30 - Number of transactions in the last 30 days.
N_31_90 - Number of transactions in the last 31 to 90 days.
N_91_180 - Number of transactions in the last 91 to 180 days.
N_180_365 - Number of transactions in the last 181 to 365 days.
From the above df I would like to extract a column called Recency.
Explanation:
if df['N_30'] != 0, then Recency = (30/df['N_30'])
elif df['N_31_90'] != 0 then Recency = 30 + (60/df['N_31_90'])
elif df['N_91_180'] != 0 then Recency = 90 + (90/df['N_91_180'])
elif df['N_180_365'] != 0 then Recency = 180 + (185/df['N_180_365'])
else Recency = 730
Expected output:
ID Limit N_30 N_31_90 N_91_180 N_180_365 Recency
1 500 60 15 30 1 (30/60) = 0.5
2 300 0 15 5 10 30+(60/15) = 34
3 800 0 0 10 6 90+(90/10) = 99
4 100 0 0 0 370 180+(185/370) = 180.5
5 600 0 6 5 10 30+(60/6) = 40
6 800 0 0 15 6 90+(90/15) = 96
7 500 10 10 30 9 30/10 = 3
8 200 0 0 0 0 730
IIUC, using boolean masking with bfill:
pd.set_option("use_inf_as_na", True)    # treat the inf from division by zero as NaN
df2 = df.filter(like="N_")              # just the N_* columns
# add 30/60/90/180 for each zero bucket, then add the first non-null ratio (left to right)
df["Recency"] = (df2.eq(0) * [30, 60, 90, 180]).sum(1) + ([30, 60, 90, 185] / df2).bfill(1).iloc[:, 0]
print(df)
Output:
ID Limit N_30 N_31_90 N_91_180 N_180_365 Recency
0 1 500 60 15 30 1 0.5
1 2 300 0 15 5 10 34.0
2 3 800 0 0 10 6 99.0
3 4 100 0 0 0 370 180.5
4 5 600 0 6 5 10 40.0
5 6 800 0 0 15 6 96.0
6 7 500 10 10 30 9 3.0
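Note that the all-zero row (ID 8) comes out as NaN with this expression, since every ratio is null; to get the 730 fallback from the question you could fill it afterwards (a small addition, not shown in the output above):
df['Recency'] = df['Recency'].fillna(730)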
A flag column in a pandas DataFrame is populated with 1s and 0s.
The problem is to identify runs of continuous 1s.
Let t be the threshold number of days.
There are two types of transformations required:
i) If there are more than t 1s together, turn the (t+1)th 1 onwards to 0.
ii) If there are more than t 1s together, turn all of the 1s to 0.
My approach is to create 2 columns called result1 and result2, and filter using these columns:
Please see image here
I have not been able to come up with anything yet, so I am not posting any code.
A nudge or hint in the right direction would be appreciated.
Use:
import numpy as np

#mask of rows equal to 0
m = df['Value'].eq(0)
#get the cumulative sum of the mask and keep only the rows equal to 1
g = m.cumsum()[~m]
#set by condition - 0, or a running counter per group
df['Result1'] = np.where(m, 0, df.groupby(g).cumcount().add(1))
#get the maximum per group with transform for a new Series
df['Result2'] = np.where(m, 0, df.groupby(g)['Result1'].transform('max')).astype(int)
print (df)
Value Result1 Result2
0 1 1 1
1 0 0 0
2 0 0 0
3 1 1 2
4 1 2 2
5 0 0 0
6 1 1 4
7 1 2 4
8 1 3 4
9 1 4 4
10 0 0 0
11 0 0 0
12 1 1 1
13 0 0 0
14 1 1 1
15 0 0 0
16 0 0 0
17 1 1 6
18 1 2 6
19 1 3 6
20 1 4 6
21 1 5 6
22 1 6 6
23 0 0 0
24 1 1 1
25 0 0 0
26 0 0 0
27 1 1 1
28 0 0 0
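With these helper columns, the two transformations from the question can be expressed as simple conditions (a sketch, assuming a threshold t and the Value column above):
t = 3
# i) keep at most t consecutive 1s: zero out the (t+1)th 1 onwards in each run
df['Flag_i'] = np.where(df['Result1'] > t, 0, df['Value'])
# ii) if a run is longer than t, zero out every 1 in that run
df['Flag_ii'] = np.where(df['Result2'] > t, 0, df['Value'])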