Airflow catchup weekdays only - cron

Running a historical import each day in December 2018 requires a DAG that catches up with the cron expression '0 0 12 * * MON-FRI'.
Why does the scheduler run on weekends when the DAG starts up with catchup=True?
Does the catchup parameter respect the schedule interval?

Your expression doesn't work: Airflow expects a standard five-field cron expression (minute, hour, day of month, month, day of week), and yours has six fields, with the 12 in the day-of-month slot instead of the month slot. Either 0 0 * 12 MON-FRI or 0 0 * 12 1-5 would work, meaning midnight on every weekday in December.
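For context, here is a minimal DAG sketch (mine, not from the thread; dag_id and task_id are hypothetical) showing where the corrected expression would go:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="december_2018_import",          # hypothetical name
    start_date=datetime(2018, 12, 1),
    schedule_interval="0 0 * 12 MON-FRI",   # midnight, weekdays, December only
    catchup=True,                           # backfill every interval since start_date
)

import_task = DummyOperator(task_id="historical_import", dag=dag)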
Airflow uses croniter, and you can play along at home with:
$ cal 12 2018
   December 2018
Su Mo Tu We Th Fr Sa
                   1
 2  3  4  5  6  7  8
 9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31
$ python -c '
from croniter import croniter as cr
from datetime import datetime as dt
c = cr("0 0 * 12 MON-FRI", dt(2018, 12, 1))
for i in range(1, 31):
    print(f"{i:>2}: ", c.get_next(dt))'
1: 2018-12-03 00:00:00
2: 2018-12-04 00:00:00
3: 2018-12-05 00:00:00
4: 2018-12-06 00:00:00
5: 2018-12-07 00:00:00
6: 2018-12-10 00:00:00
7: 2018-12-11 00:00:00
8: 2018-12-12 00:00:00
9: 2018-12-13 00:00:00
...
21: 2018-12-31 00:00:00
22: 2019-12-02 00:00:00
23: 2019-12-03 00:00:00
24: 2019-12-04 00:00:00
25: 2019-12-05 00:00:00
26: 2019-12-06 00:00:00
27: 2019-12-09 00:00:00
28: 2019-12-10 00:00:00
29: 2019-12-11 00:00:00
30: 2019-12-12 00:00:00
It should not "run weekends" but you may find it confusing that the execution_date (determined by the start_date and schedule_interval) is not the date when the the DAG is run. E.G. The dag_run scheduled for #1 above is going to start running when #2 is past, etc. Also, by default these would be UTC, so run #5 there would start at #6 UTC, which in NYC would be: 2018-12-09 19:00:00-05:00
See:
python -c '
from croniter import croniter as cr
from datetime import datetime as dt
from pendulum import datetime as pdt, timezone as ptz
c = cr("0 0 * 12 MON-FRI", pdt(2018, 12, 1))
for i in range(1, 31):
    print(f"{i:>2}: ", ptz("America/New_York").convert(c.get_next(dt)))'
1: 2018-12-02 19:00:00-05:00
2: 2018-12-03 19:00:00-05:00
3: 2018-12-04 19:00:00-05:00
4: 2018-12-05 19:00:00-05:00
5: 2018-12-06 19:00:00-05:00
6: 2018-12-09 19:00:00-05:00
7: 2018-12-10 19:00:00-05:00
8: 2018-12-11 19:00:00-05:00
9: 2018-12-12 19:00:00-05:00
10: 2018-12-13 19:00:00-05:00
11: 2018-12-16 19:00:00-05:00
12: 2018-12-17 19:00:00-05:00
13: 2018-12-18 19:00:00-05:00
14: 2018-12-19 19:00:00-05:00
15: 2018-12-20 19:00:00-05:00
16: 2018-12-23 19:00:00-05:00
17: 2018-12-24 19:00:00-05:00
18: 2018-12-25 19:00:00-05:00
19: 2018-12-26 19:00:00-05:00
20: 2018-12-27 19:00:00-05:00
21: 2018-12-30 19:00:00-05:00
22: 2019-12-01 19:00:00-05:00
23: 2019-12-02 19:00:00-05:00
24: 2019-12-03 19:00:00-05:00
25: 2019-12-04 19:00:00-05:00
26: 2019-12-05 19:00:00-05:00
27: 2019-12-08 19:00:00-05:00
28: 2019-12-09 19:00:00-05:00
29: 2019-12-10 19:00:00-05:00
30: 2019-12-11 19:00:00-05:00

Related

Using logical comparison together with groupby in pandas

I have the following dataframe:
{'item': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'B',
6: 'B',
7: 'B',
8: 'B',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'C',
14: 'C',
15: 'C',
16: 'C',
17: 'D',
18: 'D'},
'Date': {0: Timestamp('2021-05-02 00:00:00'),
1: Timestamp('2021-05-02 00:00:00'),
2: Timestamp('2021-05-02 00:00:00'),
3: Timestamp('2021-05-03 00:00:00'),
4: Timestamp('2021-06-13 00:00:00'),
5: Timestamp('2021-05-03 00:00:00'),
6: Timestamp('2021-05-04 00:00:00'),
7: Timestamp('2021-05-05 00:00:00'),
8: Timestamp('2021-05-06 00:00:00'),
9: Timestamp('2021-05-07 00:00:00'),
10: Timestamp('2021-05-08 00:00:00'),
11: Timestamp('2021-05-09 00:00:00'),
12: Timestamp('2021-05-10 00:00:00'),
13: Timestamp('2021-06-14 00:00:00'),
14: Timestamp('2021-06-15 00:00:00'),
15: Timestamp('2021-06-16 00:00:00'),
16: Timestamp('2021-07-23 00:00:00'),
17: Timestamp('2021-07-07 00:00:00'),
18: Timestamp('2021-07-08 00:00:00')},
'price': {0: 249,
1: 249,
2: 253,
3: 260,
4: 260,
5: 13,
6: 13,
7: 13,
8: 13,
9: 17,
10: 17,
11: 17,
12: 17,
13: 123,
14: 123,
15: 123,
16: 123,
17: 12,
18: 12}}
which looks like this:
item Date price
0 A 2021-05-02 249
1 A 2021-05-02 249
2 A 2021-05-02 253
3 A 2021-05-03 260
4 A 2021-06-13 260
5 B 2021-05-03 13
6 B 2021-05-04 13
7 B 2021-05-05 13
8 B 2021-05-06 13
9 B 2021-05-07 17
10 B 2021-05-08 17
11 B 2021-05-09 17
12 B 2021-05-10 17
13 C 2021-06-14 123
14 C 2021-06-15 123
15 C 2021-06-16 123
16 C 2021-07-23 123
17 D 2021-07-07 12
18 D 2021-07-08 12
As you can see, the price of an item changes over time. What I want is a column that indicates when the price changes for each item. My first idea was to check whether the price in the previous row is the same as in the current row, within a group.
I was convinced that I could do something like this:
df_changes['changed'] = df_changes.groupby(['item'])['price'].eq(df_changes['price'])
to compare row values within a group (returning a boolean) and then translate this to integers to get:
change_item_num diffsum Step
0 0 0 0
1 1 0 0
2 1 1 1
3 1 1 2
4 1 0 2
5 0 0 0
6 1 0 0
7 1 0 0
8 1 0 0
9 1 1 1
10 1 0 1
11 1 0 1
12 1 0 1
13 0 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 0 0 0
18 1 0 0
where the step column marks changes.
But I was wrong. Whatever I do, I get the error:
AttributeError: 'SeriesGroupBy' object has no attribute 'eq'
Instead, I found a workaround that I am very unhappy with:
j = df_changes.price
k = df_changes.item_num
df_changes['change_price'] = j.eq(j.shift()).astype(int)
df_changes['change_item_num'] = k.eq(k.shift()).astype(int)
df_changes['diffsum'] = abs(df_changes['change_price']-df_changes['change_item_num'])
df_changes['Step'] = df_changes.groupby('item')['diffsum'].cumsum()+1
which returns:
item Date price item_num change_price change_item_num diffsum \
0 A 2021-05-02 249 1 0 0 0
1 A 2021-05-02 249 1 1 1 0
2 A 2021-05-02 253 1 0 1 1
3 A 2021-05-03 260 1 0 1 1
4 A 2021-06-13 260 1 1 1 0
5 B 2021-05-03 13 2 0 0 0
6 B 2021-05-04 13 2 1 1 0
7 B 2021-05-05 13 2 1 1 0
8 B 2021-05-06 13 2 1 1 0
9 B 2021-05-07 17 2 0 1 1
10 B 2021-05-08 17 2 1 1 0
11 B 2021-05-09 17 2 1 1 0
12 B 2021-05-10 17 2 1 1 0
13 C 2021-06-14 123 3 0 0 0
14 C 2021-06-15 123 3 1 1 0
15 C 2021-06-16 123 3 1 1 0
16 C 2021-07-23 123 3 1 1 0
17 D 2021-07-07 12 4 0 0 0
18 D 2021-07-08 12 4 1 1 0
Step
0 1
1 1
2 2
3 3
4 3
5 1
6 1
7 1
8 1
9 2
10 2
11 2
12 2
13 1
14 1
15 1
16 1
17 1
18 1
Surely, there must be an easier way. If not, can anyone explain WHY I cannot use eq or any other logical comparison within a groupby?
Thankful for any new knowledge!
Compare the current row with the previous row in the price column to identify where the price changes, then group the resulting mask by the item column and take a cumulative sum to number the price changes within each item:
m = df['price'] != df['price'].shift()
df['step'] = m.groupby(df['item']).cumsum()
print(df)
item Date price step
0 A 2021-05-02 249 1
1 A 2021-05-02 249 1
2 A 2021-05-02 253 2
3 A 2021-05-03 260 3
4 A 2021-06-13 260 3
5 B 2021-05-03 13 1
6 B 2021-05-04 13 1
7 B 2021-05-05 13 1
8 B 2021-05-06 13 1
9 B 2021-05-07 17 2
10 B 2021-05-08 17 2
11 B 2021-05-09 17 2
12 B 2021-05-10 17 2
13 C 2021-06-14 123 1
14 C 2021-06-15 123 1
15 C 2021-06-16 123 1
16 C 2021-07-23 123 1
17 D 2021-07-07 12 1
18 D 2021-07-08 12 1
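As for the WHY: a SeriesGroupBy object only exposes split-apply-combine methods (aggregations, transforms, shift, cumsum, and so on); element-wise comparisons like eq are plain Series methods, hence the AttributeError. One caveat worth adding (my note, not part of the answer above): the plain shift() compares across group boundaries, so if the last price of one item ever equalled the first price of the next item, that first row would not be flagged as a change. Shifting within each group avoids that edge case:
m = df['price'] != df.groupby('item')['price'].shift()
df['step'] = m.groupby(df['item']).cumsum()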

How to iterate over the entire rows of the dataset and fill the additional column?

I am a beginner in Python. I want to iterate over all rows of the dataset and fill the columns below appropriately. I need a for loop that starts from the first line of the dataset and repeats the same process on every line. Could you give me enough instruction to write this code using a for loop?
date
#0 2016-01-01 05:00:00
#1 2016-01-01 06:00:00
#2 2016-01-01 07:00:00
#3 2016-01-01 08:00:00
#4 2016-01-01 09:00:00
#5 2016-01-01 10:00:00
#6 2016-01-01 11:00:00
#7 2016-01-01 12:00:00
#8 2016-01-01 13:00:00
#9 2016-01-01 14:00:00
#10 2016-01-01 15:00:00
#11 2016-01-01 16:00:00
#12 2016-01-01 17:00:00
#13 2016-01-01 18:00:00
#14 2016-01-01 19:00:00
#15 2016-01-01 20:00:00
#16 2016-01-01 21:00:00
#17 2016-01-01 22:00:00
#18 2016-01-01 23:00:00
#19 2016-01-02 00:00:00
#20 2016-01-02 01:00:00
#21 2016-01-02 02:00:00
#22 2016-01-02 03:00:00
#23 2016-01-02 04:00:00
#24 2016-01-02 05:00:00
#25 2016-01-02 06:00:00
#26 2016-01-02 07:00:00
#27 2016-01-02 08:00:00
#28 2016-01-02 09:00:00
#29 2016-01-02 10:00:00
#30 2016-01-02 11:00:00
#31 2016-01-02 12:00:00
#32 2016-01-02 13:00:00
#33 2016-01-02 14:00:00
#34 2016-01-02 15:00:00
#35 2016-01-02 16:00:00
#36 2016-01-02 17:00:00
#37 2016-01-02 18:00:00
#38 2016-01-02 19:00:00
#39 2016-01-02 20:00:00
#40 2016-01-02 21:00:00
#41 2016-01-02 22:00:00
#42 2016-01-02 23:00:00
#43 2016-01-03 00:00:00
#44 2016-01-03 01:00:00
#45 2016-01-03 02:00:00
#46 2016-01-03 03:00:00
#47 2016-01-03 04:00:00
#48 2016-01-03 05:00:00
#49 2016-01-03 06:00:00
#50 2016-01-03 07:00:00
#51 2016-01-03 08:00:00
#52 2016-01-03 09:00:00
#53 2016-01-03 10:00:00
#54 2016-01-03 11:00:00
#55 2016-01-03 12:00:00
#56 2016-01-03 13:00:00
#57 2016-01-03 14:00:00
#58 2016-01-03 15:00:00
#59 2016-01-03 16:00:00
#60 2016-01-03 17:00:00
#61 2016-01-03 18:00:00
#62 2016-01-03 19:00:00
#63 2016-01-03 20:00:00
#64 2016-01-03 21:00:00
#65 2016-01-03 22:00:00
#66 2016-01-03 23:00:00
#Column name
#day1
#day2
#day3
#day4
#day5
#day6
#day7
i=0
date_cell = dataset['date'][i]
day_cell = date_cell.dayofweek
dataset.iloc[i,day_cell] = 1
You don't need (or want) a for loop here; instead you can use get_dummies and join the result back to your original dataframe, e.g.:
Starting with:
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2016-01-01 05:00:00', end='2016-01-03 23:00:00', freq='1H')})
Apply pd.get_dummies to the dayofweek datetime accessor, re-index the columns so that all 7 days are present regardless of whether each day actually occurs in your dates (filling anything missing with 0), add a prefix to the column names, and join back to your original DF:
new_df = df.join(
    pd.get_dummies(df['date'].dt.dayofweek + 1)
      .reindex(range(1, 8), axis=1, fill_value=0)
      .add_prefix('day')
)
Gives you:
date day1 day2 day3 day4 day5 day6 day7
0 2016-01-01 05:00:00 0 0 0 0 1 0 0
1 2016-01-01 06:00:00 0 0 0 0 1 0 0
2 2016-01-01 07:00:00 0 0 0 0 1 0 0
3 2016-01-01 08:00:00 0 0 0 0 1 0 0
4 2016-01-01 09:00:00 0 0 0 0 1 0 0
.. ... ... ... ... ... ... ... ...
62 2016-01-03 19:00:00 0 0 0 0 0 0 1
63 2016-01-03 20:00:00 0 0 0 0 0 0 1
64 2016-01-03 21:00:00 0 0 0 0 0 0 1
65 2016-01-03 22:00:00 0 0 0 0 0 0 1
66 2016-01-03 23:00:00 0 0 0 0 0 0 1
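If the chained one-liner is hard to follow, the same pipeline unrolled into intermediate steps (a sketch of the identical logic) looks like:
dows = df['date'].dt.dayofweek + 1                            # 1 = Monday ... 7 = Sunday
dummies = pd.get_dummies(dows)                                # one column per day present
dummies = dummies.reindex(range(1, 8), axis=1, fill_value=0)  # force all 7 day columns
new_df = df.join(dummies.add_prefix('day'))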
If you absolutely must use a for loop (not recommended), you can initialise your columns to 0 in one loop, then loop over the date column and set that row's day column to 1 for its day of the week, e.g.:
for n in range(1, 8):
    df.loc[:, f'day{n}'] = 0
for idx, date in df['date'].items():  # .iteritems() was removed in pandas 2.0
    df.loc[idx, f'day{date.dayofweek + 1}'] = 1

How to specify the special time in one column using python

Here I have a dataset with an input column plus date and time. I want to convert the time to 00:00:00 for rows that contain a specific value in the input column; all other times should display as they are.
Then I wrote the code for that.
Next I want to select only that 00:00:00 time. I wrote the code for it and got the error 'RangeIndex' object has no attribute 'strftime'.
Can anyone help me to solve this?
My code:
import numpy as np
import pandas as pd

df['time_diff'] = pd.to_datetime(df['date'] + " " + df['time'],
                                 format='%d/%m/%Y %H:%M:%S', dayfirst=True)
mask = df['x3'].eq(5)
df['Duration'] = np.where(mask, np.timedelta64(0), pd.to_timedelta(df['time']))
Then I got the output:
date time x3 Duration
0 10/3/2018 6:15:00 0 06:15:00
1 10/3/2018 6:45:00 5 00:00:00
2 10/3/2018 7:45:00 0 07:45:00
3 10/3/2018 9:00:00 0 09:00:00
4 10/3/2018 9:25:00 0 09:25:00
5 10/3/2018 9:30:00 0 09:30:00
6 10/3/2018 11:00:00 0 11:00:00
7 10/3/2018 11:30:00 0 11:30:00
8 10/3/2018 13:30:00 0 13:30:00
9 10/3/2018 13:50:00 5 00:00:00
10 10/3/2018 15:00:00 0 15:00:00
11 10/3/2018 15:25:00 0 15:25:00
12 10/3/2018 16:25:00 0 16:25:00
13 10/3/2018 18:00:00 0 18:00:00
14 10/3/2018 19:00:00 0 19:00:00
15 10/3/2018 19:30:00 0 19:30:00
16 10/3/2018 20:00:00 0 20:00:00
17 10/3/2018 22:05:00 0 22:05:00
18 10/3/2018 22:15:00 5 00:00:00
19 10/3/2018 23:40:00 0 23:40:00
20 10/4/2018 6:58:00 5 00:00:00
21 10/4/2018 13:00:00 0 13:00:00
22 10/4/2018 16:00:00 0 16:00:00
23 10/4/2018 17:00:00 0 17:00:00
Then I want to select only the rows with this 00:00:00 time:
match_time="00:00:00"
time = data['duration'].loc[data.index.strftime("%H:%M:%S") == match_time]
Got the error shown above.
Expected output:
time
00:00:00
00:00:00
I just want to read only the 00:00:00 times.
A subset of my csv:
date time x3
10/3/2018 6:15:00 0
10/3/2018 6:45:00 5
10/3/2018 7:45:00 0
10/3/2018 9:00:00 0
10/3/2018 9:25:00 0
10/3/2018 9:30:00 0
10/3/2018 11:00:00 0
10/3/2018 11:30:00 0
10/3/2018 13:30:00 0
10/3/2018 13:50:00 5
10/3/2018 15:00:00 0
10/3/2018 15:25:00 0
10/3/2018 16:25:00 0
10/3/2018 18:00:00 0
10/3/2018 19:00:00 0
10/3/2018 19:30:00 0
10/3/2018 20:00:00 0
10/3/2018 22:05:00 0
10/3/2018 22:15:00 5
10/3/2018 23:40:00 0
10/4/2018 6:58:00 5
10/4/2018 13:00:00 0
10/4/2018 16:00:00 0
10/4/2018 17:00:00 0
Because the values in the Duration column are timedeltas, convert the match string to a timedelta too before comparing:
print (data['Duration'].dtype)
#timedelta64[ns]
match_time="00:00:00"
time = data[data['Duration'] == pd.to_timedelta(match_time)]
print (time)
date time x3 Duration
1 10/3/2018 6:45:00 5 0 days
9 10/3/2018 13:50:00 5 0 days
18 10/3/2018 22:15:00 5 0 days
20 10/4/2018 6:58:00 5 0 days
EDIT: If the timedeltas are always less than 1 day:
First convert the timedeltas to strings; note the added 0 days prefix:
print (df['Duration'].astype(str))
#0 0 days 06:15:00.000000000
#1 0 days 00:00:00.000000000
#2 0 days 07:45:00.000000000
#3 0 days 09:00:00.000000000
#4 0 days 09:25:00.000000000
#5 0 days 09:30:00.000000000
#6 0 days 11:00:00.000000000
#7 0 days 11:30:00.000000000
#8 0 days 13:30:00.000000000
#9 0 days 00:00:00.000000000
#10 0 days 15:00:00.000000000
#11 0 days 15:25:00.000000000
#12 0 days 16:25:00.000000000
#13 0 days 18:00:00.000000000
#14 0 days 19:00:00.000000000
#15 0 days 19:30:00.000000000
#16 0 days 20:00:00.000000000
#17 0 days 22:05:00.000000000
#18 0 days 00:00:00.000000000
#19 0 days 23:40:00.000000000
#20 0 days 00:00:00.000000000
#21 0 days 13:00:00.000000000
#22 0 days 16:00:00.000000000
#23 0 days 17:00:00.000000000
#Name: Duration, dtype: object
Then remove the first and last parts of the strings by slicing:
print (df['Duration'].astype(str).str[-18:-10])
#0 06:15:00
#1 00:00:00
#2 07:45:00
#3 09:00:00
#4 09:25:00
#5 09:30:00
#6 11:00:00
#7 11:30:00
#8 13:30:00
#9 00:00:00
#10 15:00:00
#11 15:25:00
#12 16:25:00
#13 18:00:00
#14 19:00:00
#15 19:30:00
#16 20:00:00
#17 22:05:00
#18 00:00:00
#19 23:40:00
#20 00:00:00
#21 13:00:00
#22 16:00:00
#23 17:00:00
#Name: Duration, dtype: object
df['Duration'] = df['Duration'].astype(str).str[-18:-10]
match_time="00:00:00"
time = df[df['Duration'] == match_time]
print (time)
date time x3 Duration
1 10/3/2018 6:45:00 5 00:00:00
9 10/3/2018 13:50:00 5 00:00:00
18 10/3/2018 22:15:00 5 00:00:00
20 10/4/2018 6:58:00 5 00:00:00
Solution for all timedeltas:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))
df['Duration'] = df['Duration'].apply(f)
match_time="00:00:00"
time = df[df['Duration'] == match_time]
You are trying to convert the dataframe index (0, 1, 2, ..., 23) to a string time format, not the contents of the 'Duration' column.
First convert each item in the 'Duration' column, then compare it to match_time, and finally save the resulting sliced frame, all at once:
match_time="00:00:00"
df=data.loc[data['Duration'].apply(lambda x: x.strftime("%H:%M:%S"))==match_time]
Then you get all rows whose 'Duration' matches your desired 'match_time':
date time x3 Duration
1 2018-10-03 2018-10-03 00:00:00 5 00:00:00
9 2018-10-03 2018-10-03 00:00:00 5 00:00:00
18 2018-10-03 2018-10-03 00:00:00 5 00:00:00
20 2018-10-04 2018-10-04 00:00:00 5 00:00:00
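A simpler route (my suggestion, not from either answer): since the rows forced to 00:00:00 are exactly those where x3 equals 5, the mask already built in the question selects them without any string formatting:
zero_rows = df.loc[df['x3'].eq(5), 'Duration']  # only the 00:00:00 rows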

Python 3 - how to handle a 53 week years when using timedelta()

I am trying to pull the last 12 full (Monday to Sunday) weeks, but it fails because Monday 2018-12-31 falls in week 53 of 2018.
I derive the start and end dates of the last 12 full weeks:
### determine local time and day of week
### (assumes: import datetime; from datetime import datetime as dt, timezone;
###  from dateutil import tz)
today = dt.utcnow()
today = today.replace(tzinfo=timezone.utc).astimezone(tz.gettz(timezone_id))
### get the last 12 full Monday to Sunday weeks
timeKey1 = (today - datetime.timedelta(days=today.weekday()))- datetime.timedelta(weeks=12)
timeKey2 = (today - datetime.timedelta(days=today.weekday()))- datetime.timedelta(days=1)
timeKey1 = datetime.datetime.strptime(''.join(str(timeKey1).rsplit(':', 1)), '%Y-%m-%d %H:%M:%S.%f%z').strftime('%Y-%m-%d')
timeKey2 = datetime.datetime.strptime(''.join(str(timeKey2).rsplit(':', 1)), '%Y-%m-%d %H:%M:%S.%f%z').strftime('%Y-%m-%d')
print(timeKey1)
print(timeKey2)
Which returns the date range 2018-12-03 to 2019-02-24, which is great:
2018-12-03
2019-02-24
So when I use this to pull the data for that time period, I group the weeks together:
### Convert timekey to week of year
df['week'] = df['timekey'].astype(str).apply(lambda x: dt.strptime(x, "%Y%m%d").strftime("%W"))
### group the weeks of year together
df['weekCumulative'] = df['week'].ne(df['week'].shift()).cumsum()
Then I want my function to continue only if df['weekCumulative'].max() == 12:
###Check that 12 weeks is available
if df['weekCumulative'].max() == 12:
But it fails here because Monday 2018-12-31 turns out to be week 53 of 2018. The below table shows the following:
weekCumulative = week of year grouped by weeks 1 to 12
week = week of year
startDate = date of the Monday in each week
endDate = date of the Sunday in each week
Table:
weekCumulative week startDate endDate
1 49 2018-12-03 2018-12-09
2 50 2018-12-10 2018-12-16
3 51 2018-12-17 2018-12-23
4 52 2018-12-24 2018-12-30
5 53 2018-12-31 2018-12-31
6 00 2019-01-01 2019-01-06
7 01 2019-01-07 2019-01-13
8 02 2019-01-14 2019-01-20
9 03 2019-01-21 2019-01-27
10 04 2019-01-28 2019-02-03
11 05 2019-02-04 2019-02-10
12 06 2019-02-11 2019-02-17
13 07 2019-02-18 2019-02-24
Now what we can see is that df['weekCumulative'].max() actually equals 13: because Monday 2018-12-31 turns out to be week 53 of 2018, it has been grouped into its own group where weekCumulative = 5. What I actually want to see is this:
weekCumulative week startDate endDate
1 49 2018-12-03 2018-12-09
2 50 2018-12-10 2018-12-16
3 51 2018-12-17 2018-12-23
4 52 2018-12-24 2018-12-30
5 00 2018-12-31 2019-01-06
6 01 2019-01-07 2019-01-13
7 02 2019-01-14 2019-01-20
8 03 2019-01-21 2019-01-27
9 04 2019-01-28 2019-02-03
10 05 2019-02-04 2019-02-10
11 06 2019-02-11 2019-02-17
12 07 2019-02-18 2019-02-24
Where Monday 2018-12-31 is grouped into week 0 of 2019.
So my question is: how can this be handled in a way where I don't have to pull the data and then replace week 53 with 00? It would be more efficient to handle it programmatically.
Any suggestions would be greatly appreciated.
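One way to handle this programmatically (a sketch, not from the thread): use ISO-8601 week numbers, which roll Monday 2018-12-31 into week 1 of 2019 instead of week 53 of 2018:
from datetime import date

print(date(2018, 12, 31).strftime("%W"))  # '53' -- the problematic numbering
print(date(2018, 12, 31).isocalendar())   # ISO year 2019, week 1, weekday 1

# In pandas (1.1+), the equivalent for the week column would be something like:
# df['week'] = pd.to_datetime(df['timekey'], format='%Y%m%d').dt.isocalendar().week
Keying the groups on (ISO year, ISO week) keeps 2018-12-31 through 2019-01-06 together as a single week.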

add rows for all dates between two columns?

ID Initiation_Date Step Start_Date End_Date Days
P-03 29-11-2018 3 2018-11-29 2018-12-10 11.0
P-04 29-11-2018 4 2018-12-03 2018-12-07 4.0
P-05 29-11-2018 5 2018-12-07 2018-12-07 0.0
Use:
mydata = [{'ID' : '10', 'Entry Date': '10/10/2016', 'Exit Date': '15/10/2016'},
{'ID' : '20', 'Entry Date': '10/10/2016', 'Exit Date': '18/10/2016'}]
df = pd.DataFrame(mydata)
#convert columns to datetimes
df[['Entry Date','Exit Date']] = df[['Entry Date','Exit Date']].apply(pd.to_datetime)
#repeat index by difference of dates
df = df.loc[df.index.repeat((df['Exit Date'] - df['Entry Date']).dt.days + 1)]
#add counter duplicated rows to day timedeltas to new column
df['Date'] = df['Entry Date'] + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
#default RangeIndex
df = df.reset_index(drop=True)
print (df)
Entry Date Exit Date ID Date
0 2016-10-10 2016-10-15 10 2016-10-10
1 2016-10-10 2016-10-15 10 2016-10-11
2 2016-10-10 2016-10-15 10 2016-10-12
3 2016-10-10 2016-10-15 10 2016-10-13
4 2016-10-10 2016-10-15 10 2016-10-14
5 2016-10-10 2016-10-15 10 2016-10-15
6 2016-10-10 2016-10-18 20 2016-10-10
7 2016-10-10 2016-10-18 20 2016-10-11
8 2016-10-10 2016-10-18 20 2016-10-12
9 2016-10-10 2016-10-18 20 2016-10-13
10 2016-10-10 2016-10-18 20 2016-10-14
11 2016-10-10 2016-10-18 20 2016-10-15
12 2016-10-10 2016-10-18 20 2016-10-16
13 2016-10-10 2016-10-18 20 2016-10-17
14 2016-10-10 2016-10-18 20 2016-10-18
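Applied to the question's own columns, the same repeat-and-cumcount recipe would look something like this (a sketch, assuming Start_Date and End_Date are already datetimes):
df = df.loc[df.index.repeat((df['End_Date'] - df['Start_Date']).dt.days + 1)]
df['Date'] = df['Start_Date'] + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df = df.reset_index(drop=True)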
