Counting number of cases available from datetime in pandas - python-3.x

I have a dataframe of start date and closed date of cases. I want to do a count of how many cases are available at the start of each case.
caseNo startDate closedDate
1 2019-01-01 2019-01-03
2 2019-01-02 2019-01-10
3 2019-01-03 2019-01-04
4 2019-01-05 2019-01-10
5 2019-01-06 2019-01-10
6 2019-01-07 2019-01-12
7 2019-01-11 2019-01-15
Output will be:
caseNo startDate closedDate numCases
1 2019-01-01 2019-01-03 0
2 2019-01-02 2019-01-10 1
3 2019-01-03 2019-01-04 1
4 2019-01-05 2019-01-10 1
5 2019-01-06 2019-01-10 2
6 2019-01-07 2019-01-12 3
7 2019-01-11 2019-01-15 1
For example, for case 6, cases 2,4,5 still have not been closed. So there are 3 cases outstanding.
Also, the dates are actually datetimes rather than just date. I have only included the date here for brevity.

Solution in numba should increase performance (best test in real data):
from numba import jit
#jit(nopython=True)
def nb_func(x, y):
res = np.empty(x.size, dtype=np.int64)
for i in range(x.size):
res[i] = np.sum(x[:i] > y[i])
return res
df['case'] = nb_func(df['closedDate'].to_numpy(), df['startDate'].to_numpy())
print (df)
caseNo startDate closedDate case
0 1 2019-01-01 2019-01-03 0
1 2 2019-01-02 2019-01-10 1
2 3 2019-01-03 2019-01-04 1
3 4 2019-01-05 2019-01-10 1
4 5 2019-01-06 2019-01-10 2
5 6 2019-01-07 2019-01-12 3
6 7 2019-01-11 2019-01-15 1

Use:
res = []
temp = pd.to_datetime(df['closedDate'])
for i, row in df.iterrows():
temp_res = np.sum(row['startDate']<temp.iloc[:i])
print(temp_res)
res.append(temp_res)
output:
Then you can add the result as a df column:

Related

How to get the max value from previous N rows of a record in Pandas?

I have the following Pandas DataFrame:
date value
2021-01-01 10
2021-01-02 5
2021-01-03 7
2021-01-04 1
2021-01-05 12
2021-01-06 8
2021-01-07 9
2021-01-08 8
2021-01-09 4
2021-01-10 3
I need to get the max value from the previous N-1 rows (counting the current record) and make an operation. For example:
For N=3 and the operation = current_row / MAX (previous_N-1_rows_and_current), this should be the result:
date value Operation
2021-01-01 10 10/10
2021-01-02 5 5/10
2021-01-03 7 7/10
2021-01-04 1 1/7
2021-01-05 12 12/12
2021-01-06 8 8/12
2021-01-07 9 9/12
2021-01-08 8 8/9
2021-01-09 4 4/9
2021-01-10 3 3/8
If it's possible, in the spirit of the pythonic way.
Thanks and regards.
We can calculate rolling max over the value column then divide value column by this rolling max to get the result
df['op'] = df['value'] / df.rolling(3, min_periods=1)['value'].max()
date value op
0 2021-01-01 10 1.000000
1 2021-01-02 5 0.500000
2 2021-01-03 7 0.700000
3 2021-01-04 1 0.142857
4 2021-01-05 12 1.000000
5 2021-01-06 8 0.666667
6 2021-01-07 9 0.750000
7 2021-01-08 8 0.888889
8 2021-01-09 4 0.444444
9 2021-01-10 3 0.375000
You can use .rolling:
df["Operation"] = df.rolling(3, min_periods=1)["value"].apply(
lambda x: x.iat[-1] / x.max()
)
print(df)
Prints:
date value Operation
0 2021-01-01 10 1.000000
1 2021-01-02 5 0.500000
2 2021-01-03 7 0.700000
3 2021-01-04 1 0.142857
4 2021-01-05 12 1.000000
5 2021-01-06 8 0.666667
6 2021-01-07 9 0.750000
7 2021-01-08 8 0.888889
8 2021-01-09 4 0.444444
9 2021-01-10 3 0.375000

Replace all dates with the minimum date corresponding to the month

I have a dataframe df with columns : Date, Value1, Value. I want to replace all the dates in 'Date' with minimum date corresponding to their month.
The dataframe df that I have:
Date Value1 Value1
1/1/2019 1 4
19/1/2019 3 6
30/1/2019 3 1
5/5/2020 2 10
10/5/2020 6 4
The output that I want:
Date Value1 Value1
1/1/2019 1 4
1/1/2019 3 6
1/1/2019 3 1
5/5/2020 2 10
5/5/2020 6 4
use to_datetime()+groupby()+transform():
df['Date']=df['Date']=pd.to_datetime(df['Date'],format='%d/%m/%Y')
df['Date']=df.groupby(df['Date'].dt.month)['Date'].transform('min')
output of df:
Date Value Value
0 2019-01-01 1 4
1 2019-01-01 3 6
2 2019-01-01 3 1
3 2020-05-05 2 10
4 2020-05-05 6 4
Note: If you want the initial format then use:
df['Date']=df['Date'].dt.strftime('%d/%m/%Y')

Better approach on getting the point on datetime just before the current datetime

Giving the following code where df is item sold table, df1 is user point table, I want to get the user point that with datetime just before the datetime in df table, and the vin value must match
import pandas as pd
import numpy as np
df=pd.DataFrame({
"vin":[1,1,2,2],
"date":['1/1/2021 13:55','1/6/2021 13:55','1/8/2021 13:55','1/10/2021 13:55'],
"quantity_sold":[1,2,3,4]})
df1=pd.DataFrame({
"vin":[1,1,2,2],
"date":['12/1/2020 12:55','1/3/2021 15:55','1/8/2021 14:55','1/10/2021 12:55'],
"user_point":[1,3,5,7]})
df['date']=pd.to_datetime(df['date'])
df1['date']=pd.to_datetime(df1['date'])
print(df)
print(df1)
Table df
vin date quantity_sold
0 1 2021-01-01 13:55:00 1
1 1 2021-01-06 13:55:00 2
2 2 2021-01-08 13:55:00 3
3 2 2021-01-10 13:55:00 4
Table df1
vin date user_point
0 1 2020-12-01 12:55:00 1
1 1 2021-01-03 15:55:00 3
2 2 2021-01-08 14:55:00 5
3 2 2021-01-10 12:55:00 7
expected output:
vin date quantity_sold user_point
0 1 2021-01-01 13:55:00 1 1.0
1 1 2021-01-06 13:55:00 2 3.0
2 2 2021-01-08 13:55:00 3 NaN
3 2 2021-01-10 13:55:00 4 7.0
explanation:
for first date 2021-01-01 13:55:00, the nearest date that happens before is 2020-12-01 12:55:00, so it takes value 1
for third date 2021-01-08 13:55:00, the nearest date that happens before is 2021-01-03 15:55:00 (but its vin 1) (date 2021-01-08 14:55:00 is vin 2 but its happening after), so it takes value nan
Other can reason similarly.
This is my approach, pretty ugly
df=pd.merge(df,df1,on="vin",how="left")
df['time_diff']=df['date_x']-df['date_y']
df.loc[df.time_diff.dt.days<0,'time_diff']=np.nan
df=df.sort_values(['vin','date_x','time_diff'])
df=df.groupby(['vin','date_x']).first().reset_index()
df.loc[df['time_diff'].isnull(),"user_point"]=np.nan
print(df[['vin','date_x','quantity_sold','user_point']].rename(columns={"date_x":"date"}))
Is there a more elegant approach (like merge_asof) to merge df and df1 to product expected outcome?
You can do:
df['nearestbefore']=\
df.date.apply(lambda row: max(df1.date[df1.date < row]))
whichi will give you:
date quantity_sold nearestbefore
0 2021-01-01 13:55:00 1 2020-12-01 12:55:00
1 2021-01-06 13:55:00 2 2021-01-03 15:55:00
2 2021-01-08 13:55:00 3 2021-01-03 15:55:00
3 2021-01-10 13:55:00 4 2021-01-10 12:55:00
Then you can use nearestbefore column to merge with df1.
Then you can do:
df1.rename(columns={'date':'key'}).merge(df,on='key').drop(columns=['key'])
Which will give you:
user_point date quantity_sold
0 1 2021-01-01 13:55:00 1
1 3 2021-01-06 13:55:00 2
2 3 2021-01-08 13:55:00 3
3 7 2021-01-10 13:55:00 4

Month and Date messed up in pandas dataframe

I have a situation where month and date are messed up for few dates in my dataframe. For e.g here is the input:
df['work_date'].head(15)
0 2018-01-01
1 2018-02-01
2 2018-03-01
3 2018-04-01
4 2018-05-01
5 2018-06-01
6 2018-07-01
7 2018-08-01
8 2018-09-01
9 2018-10-01
10 2018-11-01
11 2018-12-01
12 2018-01-13
13 2018-01-14
14 2018-01-15
The date is stored as a string. As you can see, the date is in the format yyyy-dd-mm till 12th of Jan and then becomes yyyy-mm-dd. The dataframe consists of 3 years worth data and this pattern repeats for all months for all years.
My expected output is to standardize the date to format dddd-mm-yy like below.
0 2018-01-01
1 2018-01-02
2 2018-01-03
3 2018-01-04
4 2018-01-05
5 2018-01-06
6 2018-01-07
7 2018-01-08
8 2018-01-09
9 2018-01-10
10 2018-01-11
11 2018-01-12
12 2018-01-13
13 2018-01-14
14 2018-01-15
Below is the code that I wrote and it gets the job done. Basically, I split the date string and do some string manipulations. However, as you can see its not too pretty. I am checking to see if there could be some other elegant solution to this other than doing the df.apply and the loops.
def func(x):
d = x.split('-')
print(d)
if (int(d[1]) <= 12) & (int(d[2]) <= 12) :
d = [d[0],d[2],d[1]]
x = '-'.join(d)
return x
else:
return x
df['work_date'] = df['work_date'].apply(lambda x:func(x))
You could just update the column based on the fact that it is in order and there is only one date and all days of the year are included consecutively:
df['Date'] = pd.date_range(df['work_date'].min(), '2018-01-12', freq='1D')
# you can specify df['work_date'].min() OR df['work_date'].max) OR A STRING. It really depends on what format your minimum and your maximum is
df
Out[1]:
work_date date
0 2018-01-01 2018-01-01
1 2018-02-01 2018-01-02
2 2018-03-01 2018-01-03
3 2018-04-01 2018-01-04
4 2018-05-01 2018-01-05
5 2018-06-01 2018-01-06
6 2018-07-01 2018-01-07
7 2018-08-01 2018-01-08
8 2018-09-01 2018-01-09
9 2018-10-01 2018-01-10
10 2018-11-01 2018-01-11
11 2018-12-01 2018-01-12
12 2018-01-13 2018-01-13
13 2018-01-14 2018-01-14
14 2018-01-15 2018-01-15
To make this more dynamic, you could also do some try / except shown below:
minn = df['work_date'].min()
maxx = df['work_date'].max()
try:
df['Date'] = pd.date_range(minn, maxx, freq='1D')
except ValueError:
s = maxx.split('-')
df['Date'] = pd.date_range(minn, f'{s[0]}-{s[2]}-{s[1]}', freq='1D')
except ValueError:
s = minn.split('-')
df['Date'] = pd.date_range(f'{s[0]}-{s[2]}-{s[1]}', maxx, freq='1D')
df

Pandas - Filling missing dates within groups with different time ranges

I'm working with a dataset which has monthly information about several users. And each user has a different time range. There is also missing "time" data for each user. What I would like to do is fill in the missing month data for each user based on the time range for each user(from min.time to max.time in months)
I've read approaches to similar situation using re-sample, re-index from here, but I'm not getting the desired output/there is row mismatch after filling the missing months.
Any help/pointers would be much appreciated.
-Luc
Tried using re-sample, re-index, but not getting desired output
x = pd.DataFrame({'user': ['a','a','b','b','c','a','a','b','a','c','c','b'], 'dt': ['2015-01-01','2015-02-01', '2016-01-01','2016-02-01','2017-01-01','2015-05-01','2015-07-01','2016-05-01','2015-08-01','2017-03-01','2017-08-01','2016-09-01'], 'val': [1,33,2,1,5,4,2,5,66,7,5,1]})
date id value
0 2015-01-01 a 1
1 2015-02-01 a 33
2 2016-01-01 b 2
3 2016-02-01 b 1
4 2017-01-01 c 5
5 2015-05-01 a 4
6 2015-07-01 a 2
7 2016-05-01 b 5
8 2015-08-01 a 66
9 2017-03-01 c 7
10 2017-08-01 c 5
11 2016-09-01 b 1
What I would like to see is - for each 'id' generate missing months based on min.date and max.date for that id and fill 'val' for those months with 0.
Create DatetimeIndex, so possible use groupby with custom lambda function and Series.asfreq:
x['dt'] = pd.to_datetime(x['dt'])
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.asfreq('MS', fill_value=0))
.reset_index())
print (x)
user dt val
0 a 2015-01-01 1
1 a 2015-02-01 33
2 a 2015-03-01 0
3 a 2015-04-01 0
4 a 2015-05-01 4
5 a 2015-06-01 0
6 a 2015-07-01 2
7 a 2015-08-01 66
8 b 2016-01-01 2
9 b 2016-02-01 1
10 b 2016-03-01 0
11 b 2016-04-01 0
12 b 2016-05-01 5
13 b 2016-06-01 0
14 b 2016-07-01 0
15 b 2016-08-01 0
16 b 2016-09-01 1
17 c 2017-01-01 5
18 c 2017-02-01 0
19 c 2017-03-01 7
20 c 2017-04-01 0
21 c 2017-05-01 0
22 c 2017-06-01 0
23 c 2017-07-01 0
24 c 2017-08-01 5
Or use Series.reindex with min and max datetimes per groups:
x = (x.set_index('dt')
.groupby('user')['val']
.apply(lambda x: x.reindex(pd.date_range(x.index.min(),
x.index.max(), freq='MS'), fill_value=0))
.rename_axis(('user','dt'))
.reset_index())

Resources