Airflow schedule a task to run on the Monday before the 15th of the month - cron

Is it possible to schedule an Airflow DAG to run at a specific time on the Monday directly before the 15th of each month? I think this cron string might do it, but I'm not sure I've understood it correctly:
0 10 8-14 * MON
So I think that this should run at 10:00 on a Monday only between the 8th and the 14th of each month. As there can only be one Monday between the 8th and the 14th, this should run only once a month and it will be the Monday preceding the 15th of the month.
Is that correct?

The croniter module (which Airflow uses for its execution date/time calculations) supports the hash symbol in the day-of-week field, which lets you schedule the nth occurrence of a weekday in the month. That covers your case: the 1st–7th of a month always contain exactly one Monday, so the Monday that falls between the 8th and the 14th (the Monday directly before the 15th) is always the second Monday of the month.
For example, "30 7 * * 1#2" says to run at 7:30 AM, every month, on the second Monday. Using this code to test it:
from croniter import croniter
from datetime import datetime

cron = croniter("30 7 * * 1#2")
for i in range(10):
    print(cron.get_next(datetime))
yields:
datetime.datetime(2018, 10, 8, 7, 30)
datetime.datetime(2018, 11, 12, 7, 30)
datetime.datetime(2018, 12, 10, 7, 30)
datetime.datetime(2019, 1, 14, 7, 30)
datetime.datetime(2019, 2, 11, 7, 30)
datetime.datetime(2019, 3, 11, 7, 30)
datetime.datetime(2019, 4, 8, 7, 30)
datetime.datetime(2019, 5, 13, 7, 30)
datetime.datetime(2019, 6, 10, 7, 30)
datetime.datetime(2019, 7, 8, 7, 30)
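As a sanity check that "second Monday" really is the same thing as "the Monday directly before the 15th", here is a small stdlib-only sketch (the helper names `second_monday` and `monday_before_15th` are mine) verifying the equivalence over a range of years:

```python
import calendar
from datetime import date

def second_monday(year, month):
    # monthcalendar() lists each week Monday-first; 0 marks days outside the month
    mondays = [week[calendar.MONDAY]
               for week in calendar.monthcalendar(year, month)
               if week[calendar.MONDAY] != 0]
    return date(year, month, mondays[1])

def monday_before_15th(year, month):
    # walk back from the 14th until we hit a Monday
    d = date(year, month, 14)
    while d.weekday() != calendar.MONDAY:
        d = d.replace(day=d.day - 1)
    return d

for year in range(2018, 2030):
    for month in range(1, 13):
        assert second_monday(year, month) == monday_before_15th(year, month)
```

The first date croniter printed above, 2018-10-08, is indeed both the second Monday of October 2018 and the last Monday before the 15th.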

Related

Print all possible routes of a conditional binary tree in Python 3

I want to print all routes through a conditional binary tree.
Take as an example five different lists a, b, c, d, e:
1
2, 4
3, 5, 7
6, 8
9
The condition is that each following number must be larger than the previous number, so printing 1, 4, 3, 6, 9 is wrong.
The desired result is:
1, 2, 3, 6, 9
1, 2, 5, 6, 9
1, 4, 5, 8, 9
1, 4, 7, 8, 9
How can I get those lists with Python 3?
Thank you very much.
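One way to read the diagram is as a binary tree in which each row lists the children left to right and repeated labels (5, 6, 8) are distinct nodes. Under that assumption, a depth-first traversal that drops any path where a value fails to increase yields exactly the four desired routes; the names `Node` and `routes` below are my own:

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

# The diagram read as a binary tree; repeated labels are separate nodes.
root = Node(1,
            Node(2,
                 Node(3, Node(6, Node(9))),
                 Node(5, Node(6, Node(9)))),
            Node(4,
                 Node(5, Node(8, Node(9))),
                 Node(7, Node(8, Node(9)))))

def routes(node, prev=float("-inf"), path=()):
    """Yield root-to-leaf paths whose values strictly increase."""
    if node is None or node.value <= prev:
        return  # dead end: missing child, or the value did not increase
    path += (node.value,)
    if node.left is None and node.right is None:
        yield list(path)
    else:
        yield from routes(node.left, node.value, path)
        yield from routes(node.right, node.value, path)

for route in routes(root):
    print(", ".join(map(str, route)))
```

This prints the four routes listed above, in order. In this particular tree every branch already increases, but the `node.value <= prev` check is what would prune a step like 4 followed by 3.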

Compute a new pandas column for the number of times a date intersects a list of date ranges

I have actually solved the problem, but I am looking for advice for a more elegant / pandas-orientated solution.
I have a pandas dataframe of linkedin followers with a date field. The data looks like this:
Date Sponsored followers Organic followers Total followers
0 2021-05-30 0 105 105
1 2021-05-31 0 128 128
2 2021-06-01 0 157 157
3 2021-06-02 0 171 171
4 2021-06-03 0 133 133
I have a second dataframe that contains the start and end dates for paid social campaigns. From this dataframe I created a list of tuples, where the first element of each tuple is the start date and the second is the end date, converted to datetime.date objects:
[(datetime.date(2021, 7, 8), datetime.date(2021, 7, 9)),
(datetime.date(2021, 7, 12), datetime.date(2021, 7, 13)),
(datetime.date(2021, 7, 13), datetime.date(2021, 7, 14)),
(datetime.date(2021, 7, 14), datetime.date(2021, 7, 15)),
(datetime.date(2021, 7, 16), datetime.date(2021, 7, 18)),
(datetime.date(2021, 7, 19), datetime.date(2021, 7, 21)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 8, 9), datetime.date(2021, 8, 12)),
(datetime.date(2021, 8, 12), datetime.date(2021, 8, 15)),
(datetime.date(2021, 9, 3), datetime.date(2021, 9, 7)),
(datetime.date(2021, 10, 22), datetime.date(2021, 11, 21)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 10)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 2)),
(datetime.date(2021, 11, 3), datetime.date(2021, 11, 4)),
(datetime.date(2021, 11, 5), datetime.date(2021, 11, 8)),
(datetime.date(2021, 11, 9), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 12), datetime.date(2021, 11, 16)),
(datetime.date(2021, 11, 11), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 25), datetime.date(2021, 11, 27)),
(datetime.date(2021, 11, 26), datetime.date(2021, 11, 28)),
(datetime.date(2021, 12, 8), datetime.date(2021, 12, 11))]
In order to create a new column in my main dataframe (a count of how many campaigns fall on any given day), I loop through each row in my dataframe, and then through each element in my list, using the following code:
is_campaign = []
for date in df['Date']:
    count = 0
    for date_range in campaign_dates:
        if date_range[0] <= date <= date_range[1]:
            count += 1
    is_campaign.append(count)
df['campaign'] = is_campaign
Which gives the following result:
df[df['campaign']!=0]
Date Sponsored followers Organic followers Total followers campaign
39 2021-07-08 0 160 160 1
40 2021-07-09 17 166 183 1
43 2021-07-12 0 124 124 1
44 2021-07-13 16 138 154 2
45 2021-07-14 22 158 180 2
... ... ... ... ... ...
182 2021-11-28 31 202 233 1
192 2021-12-08 28 357 385 1
193 2021-12-09 29 299 328 1
194 2021-12-10 23 253 276 1
195 2021-12-11 25 163 188 1
Any advice on how this could be done in a more efficient way, and specifically using pandas functionality would be appreciated.
My idea would be to use your second DataFrame alone to count the number of campaigns by date, and finally put the numbers into your first DataFrame. In this way you only go through your list of date-ranges once (or twice if you also take the counting step into account).
Expand your list of date ranges into a list of dates. Note that a date occurring N times represents N campaigns on that date.
dates = [
    start_date + datetime.timedelta(day)
    for start_date, end_date in date_ranges
    for day in range((end_date - start_date).days + 1)
]
Then do the counting.
from collections import Counter
date_counts = Counter(dates)
Finally, put the numbers in, using fillna(0) so days without any campaign get a count of zero rather than NaN:
df1['campaign'] = df1['Date'].map(pd.Series(date_counts)).fillna(0).astype(int)
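Putting the three steps together, a self-contained sketch with a couple of made-up date ranges standing in for the real data:

```python
import datetime
from collections import Counter

import pandas as pd

# Hypothetical stand-ins for the dataframes in the question.
df1 = pd.DataFrame({
    "Date": [datetime.date(2021, 7, 7) + datetime.timedelta(d) for d in range(5)]
})
date_ranges = [
    (datetime.date(2021, 7, 8), datetime.date(2021, 7, 9)),
    (datetime.date(2021, 7, 9), datetime.date(2021, 7, 10)),
]

# 1. Expand each (start, end) range into its individual dates.
dates = [
    start + datetime.timedelta(day)
    for start, end in date_ranges
    for day in range((end - start).days + 1)
]

# 2. Count how many campaigns cover each date.
date_counts = Counter(dates)

# 3. Map the counts onto the main frame; days with no campaign become 0.
df1["campaign"] = df1["Date"].map(pd.Series(date_counts)).fillna(0).astype(int)
print(df1["campaign"].tolist())  # [0, 1, 2, 1, 0]
```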

How to spread time values across date pairs using pandas

I am trying to figure out how to spread time values across pairs of dates. My date values look like this:
date_list = ['2017-01-07',
             '2017-01-08',
             '2017-01-04',
             '2017-01-05',
             '2017-01-03',
             '2017-01-04',
             .... ]
Here, as you can see, the dates come in ordered pairs. For example:
'2017-01-07' and '2017-01-08', or '2017-01-04' and '2017-01-05', etc.
Basically, the two dates in each pair are one day apart.
I also have a time value:
time_list = [
    datetime.time(23, 0),
    datetime.time(0, 0),
    datetime.time(1, 0),
    .... ]
What I am looking to do is spread the times from 23 to 1, i.e. from 11 PM to 1 AM, across each pair of dates ('2017-01-07' and '2017-01-08', '2017-01-04' and '2017-01-05', etc.), preserving the original order of date_list with the corresponding time_list.
So the new df will look like this:
DateTimeList
2017-01-07 23:00:00
2017-01-08 00:00:00
2017-01-08 01:00:00
2017-01-04 23:00:00
2017-01-05 00:00:00
2017-01-05 01:00:00
2017-01-03 23:00:00
2017-01-04 00:00:00
2017-01-04 01:00:00
What did I do?
I put the time in between using:
time = df.between_time('23:00:00','01:00:00')
and then time[time.index.normalize().isin(date_list)]
however, this does not work: it does not spread the time_list across the two dates of a pair after midnight. Instead it spreads the entire 23:00 to 01:00 range onto a single day, and it also sorts the data.
But what I want is to spread the time value into two date pair by preserving the original order of date_list with corresponding time_list. Can you please help solve it?
How about using datetime.datetime.combine() with some modulo logic?
import datetime

def combine_pairs(date_list, time_list):
    for i, x in enumerate(date_list):
        dt = datetime.date.fromisoformat(x)
        if not i % 2:
            # even index: first date of a pair gets the 23:00 time
            yield datetime.datetime.combine(dt, time_list[0])
        else:
            # odd index: second date of a pair gets the 00:00 and 01:00 times
            yield datetime.datetime.combine(dt, time_list[1])
            yield datetime.datetime.combine(dt, time_list[2])
Demo:
>>> from pprint import pprint
>>> date_list = ['2017-01-07',
... '2017-01-08',
... '2017-01-04',
... '2017-01-05',
... '2017-01-03',
... '2017-01-04',]
>>> time_list = [
... datetime.time(23, 0),
... datetime.time(0, 0),
... datetime.time(1, 0),]
>>> pprint(list(combine_pairs(date_list, time_list)))
[datetime.datetime(2017, 1, 7, 23, 0),
datetime.datetime(2017, 1, 8, 0, 0),
datetime.datetime(2017, 1, 8, 1, 0),
datetime.datetime(2017, 1, 4, 23, 0),
datetime.datetime(2017, 1, 5, 0, 0),
datetime.datetime(2017, 1, 5, 1, 0),
datetime.datetime(2017, 1, 3, 23, 0),
datetime.datetime(2017, 1, 4, 0, 0),
datetime.datetime(2017, 1, 4, 1, 0)]
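To get the DateTimeList column shown in the question rather than a plain list, the generator's output can be fed straight into a DataFrame (a sketch; combine_pairs is repeated here to keep the snippet self-contained):

```python
import datetime

import pandas as pd

def combine_pairs(date_list, time_list):
    # same generator as in the answer above
    for i, x in enumerate(date_list):
        dt = datetime.date.fromisoformat(x)
        if not i % 2:
            yield datetime.datetime.combine(dt, time_list[0])
        else:
            yield datetime.datetime.combine(dt, time_list[1])
            yield datetime.datetime.combine(dt, time_list[2])

date_list = ['2017-01-07', '2017-01-08', '2017-01-04', '2017-01-05']
time_list = [datetime.time(23, 0), datetime.time(0, 0), datetime.time(1, 0)]

# Each date pair expands to three rows: 23:00, 00:00, 01:00, in original order.
df = pd.DataFrame({'DateTimeList': list(combine_pairs(date_list, time_list))})
print(df)
```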

Creating a vector containing the previous 10 row-column values for each pandas row

I am trying to create a vector of the previous 10 values from a pandas column and insert it back into the pandas data frame as a list in a cell.
The below code works but I need to do this for a dataframe of over 30 million rows so it will take too long to do it in a loop.
Can someone please help me convert this to a numpy function that I can apply? I would also like to be able to apply this function in a groupby.
import pandas as pd

df = pd.DataFrame(list(range(1, 20)), columns=['A'])
df.insert(0, 'Vector', '')
df['Vector'] = df['Vector'].astype(object)
for index, row in df.iterrows():
    # note: for index < 10, (index - 10) is negative, so this slice comes up empty
    df['Vector'].iloc[index] = list(df['A'].iloc[(index - 10):index])
I have tried in multiple ways but have not been able to get it to work. Any help would be appreciated.
IIUC
df['New']=[df.A.tolist()[max(0,x-10):x] for x in range(len(df))]
df
Out[123]:
A New
0 1 []
1 2 [1]
2 3 [1, 2]
3 4 [1, 2, 3]
4 5 [1, 2, 3, 4]
5 6 [1, 2, 3, 4, 5]
6 7 [1, 2, 3, 4, 5, 6]
7 8 [1, 2, 3, 4, 5, 6, 7]
8 9 [1, 2, 3, 4, 5, 6, 7, 8]
9 10 [1, 2, 3, 4, 5, 6, 7, 8, 9]
10 11 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
11 12 [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
12 13 [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
13 14 [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
14 15 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
15 16 [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
16 17 [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
17 18 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
18 19 [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
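One caveat with the one-liner: df.A.tolist() is re-evaluated on every iteration, rebuilding the full list each time, which matters at 30 million rows. Hoisting it out keeps the same result at a fraction of the cost; `trailing_windows` is a hypothetical helper name:

```python
import pandas as pd

def trailing_windows(s, n=10):
    """Previous n values (exclusive of the current row) for each position."""
    a = s.tolist()  # materialize the column once, not once per row
    return [a[max(0, i - n):i] for i in range(len(a))]

df = pd.DataFrame(list(range(1, 20)), columns=['A'])
df['New'] = trailing_windows(df['A'])
```

For the groupby case, the same helper can be applied per group (e.g. via groupby().apply), then reassembled by index; note that transform cannot be used here because it can't return list-valued cells.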

Reverse a rolling total based on historic data

Say I have a list of rolling x-day page view totals. That is, each data point is the sum of the previous x days of page views, but I do not have each individual day's page view total. Would it be possible to get the individual values?
For example, say someone gathers the following page view metrics:
{4 days before Day 1: {1,2,3,8}, Day 1: 4, Day 2: 2, Day 3: 5, Day 4: 2, Day 5: 9, Day 6: 8, Day 7: 10, Day 8: 10, Day 9: 7, Day 10: 6}
They provide me with the following list of 5-day running totals:
{Day 1: 18 (1+2+3+8+4), Day 2: 19 (2+3+8+4+2), Day 3: 22 (3+8+4+2+5), Day 4: 21 (etc.), Day 5: 22, Day 6: 26, Day 7: 34, Day 8: 39, Day 9: 44, Day 10: 41}
Would it be possible for me to take only the second dataset and determine at least some of the values in the first dataset?
In your example, the history
{1, 2, 3, 8, 4, 2, 5, 2, 9, 8, 10, 10, 7, 6}
gives the following 5-day running totals:
{18, 19, 22, 21, 22, 26, 34, 39, 44, 41}
But so would the history:
{3, 8, 1, 3, 3, 4, 11, 0, 4, 7, 12, 16, 5, 1}
So no, in general you can't reconstruct any of the values.
...Unless you have five days in a row with no views, giving you a zero in the list of running totals: a zero total pins all five days in that window to zero. From there you can reconstruct the entire history before and after, because each one-day change in the running total equals the newly added day minus the day that dropped out of the window, so every other day can be solved once five consecutive days are known.
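The two histories above are easy to check mechanically; both produce the same ten running totals, confirming the reconstruction is not unique:

```python
def running_totals(history, window=5):
    """Sum of each trailing `window`-day slice, once a full window exists."""
    return [sum(history[i - window:i]) for i in range(window, len(history) + 1)]

h1 = [1, 2, 3, 8, 4, 2, 5, 2, 9, 8, 10, 10, 7, 6]
h2 = [3, 8, 1, 3, 3, 4, 11, 0, 4, 7, 12, 16, 5, 1]

print(running_totals(h1))  # [18, 19, 22, 21, 22, 26, 34, 39, 44, 41]
print(running_totals(h2))  # identical, despite a completely different history
```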
