With an Azure Data Factory "Tumbling Window" trigger, is it possible to limit the hours of each day that it triggers during (adding a window you might say)?
For example I have a Tumbling Window trigger that runs a pipeline every 15 minutes. This is currently running 24/7 but I'd like it to only run during business hours (0700-1900) to reduce costs.
Edit:
I played around with this, and found another option which isn't ideal from a monitoring perspective, but it appears to work:
Create a new pipeline with a single "If Condition" step with a dynamic Expression like this:
#and(greater(int(formatDateTime(utcnow(),'HH')),6),less(int(formatDateTime(utcnow(),'HH')),20))
In the true case activity, add an Execute Pipeline step executing your original pipeline (with "Wait on completion" ticked)
In the false case activity, add a wait step which sleeps for X minutes
The longer you sleep for, the longer you can possibly encroach on your window, so adjust that to match.
I need to give it a couple of days before I check the billing on the portal to see if it has reduced costs. At the moment I'm assuming a job which just sleeps for 15 minutes won't incur the costs that one running and processing data would.
there is no easy way but you can create two deployment pipelines for the same job in Azure devops and as soon as your winodw 0700 to 1900 expires you replace that job with a dummy job using azure dev ops pipeline.
Related
Context
I have created a streaming job using Azure portal which aggregates data using a day wise TUMBLINGWINDOW. Have attached a code snippet below, modified from the docs, which shows similar logic.
SELECT
DATEADD(day, -1, System.Timestamp()) AS WindowStart
System.Timestamp() AS WindowEnd,
TollId,
COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(day, 1), TollId
Now that the job has been running and can see it producing output I want to be able to reduce the costs ideally by setting some sort of time scheduling so that the job can run and still produce the same output without being on all the time.
The only real constraint being that the aggregated output at the end of each TUMBLINGWINDOW has to remain the same as if it were running all the time (no impact of stop-starting on output).
This then brings me to my question.
Update: 2021-02-28
Before going into the question another thing that drove me was that through Azure portal you can manually start and stop a job. When you start/restart a job you can set a custom start time for the job/query. With this level of control say I start a job (or have a job running) and then decide to stop it for majority of the day and then turn it on at say 11:30pm each day with a custom start time of midnight of the current day then it would be able to be on for approx 30min before it would output the results (yet still to my understanding produce the same aggregation results/effect compared to if it was on the whole day up until that point). This job could then be paused again at 00:30am ( the next day for which it stays paused for the majority of the day (1380min total until 11:30pm again) upon which the same above logic is applied.
This way it remains off the majority of the day yet still can produce the same output for each day wise window (correct me if I am wrong in my thinking). The only issue with this to me seems to be the fact someone would have to manually perform this. Thus I was driven to the docs looking for a way to automate this.
Question
How can I start and stop a job in an automated fashion such that the required output would still remain intact but so that the job doesn't have to remain on all the time (like it currently is)?
Does the documentation linked above suffice given the context above, if so what are some possible arrangements for the N minutes (on) and M minutes (off) time variables for this to work?
Is this possible given the scenario that I want to aggregate on a one day TUMBLINGWINDOW window (whereby I want each window to start and end at midnight of each day, as per its default behaviour.)?
Eg
Window start: 2022-02-20 00:00:00 Window end: 2022-02-21 00:00:00 (aggregation performed),
Window start: 2022-02-21 00:00:00 Window end: 2022-02-22 00:00:00 (aggregation performed),
Window start: 2022-02-22 00:00:00 Window end: 2022-02-23 00:00:00 (aggregation performed),
....so on
Thoughts
I found this documentation from Microsoft regarding auto-pausing jobs using a few methods
However came across a paragraph (quoted below) which made me doubtful whether it seems reasonable in my particular use case (TUMBLING 1 day window as described in my question section).
Note
There are downsides to auto-pausing a job. The main ones being the loss of the low latency /real time capabilities, and the potential risks from allowing the input event backlog to grow unsupervised while a job is paused. Auto-pausing should not be considered for most production scenarios running at scale.
Could this method
There are 3 ways to lower costs:
downscale your job, you will have higher latency but for a lower cost, up to a point where your job crashes because it runs out of memory over time and/or can't catch up with its backlog. Here you need to keep an eye on your metrics to make sure you can react before it's too late
going further, you can regroup multiple queries into a single job. This job most likely won't be aligned in partitions, so it won't be able to scale linearly (adding SUs is not guaranteed to give you better performance). Same comment as above, plus you need to remember that when you need to scale back up, you probably will have to break down that job into multiple jobs to again be able to scale in a linear fashion
finally you can auto-pause a job, one way to implement that being explained in the doc you linked. I wrote that doc, and what I meant by that comment is that here again you are taking the risk of overloading the job if it can't run long enough to process the backlog of events. This is a risky proposition for most production scenarios
But if you know what you are doing, and are monitoring closely the appropriate metrics (as explained in the doc), this is definitely something you should explore.
Finally, all of these approaches, including the auto-pause one, will deal with tumbling windows transparently for you.
Update: 2022-03-03 following comments here
Update: 2022-03-04 following comments there
There are 3 time dimensions here:
When the job is running or not: the wall clock
When the time window is expected to output results: Tumbling(day,1) -> 00:00AM every day, this is absolute (on the day, on the hour, on the minute...) and independent of the job start time below
What output you want produced from the job, via the job start time
Let's say you have the job running 24/7 for multiple months, and decide to stop it at noon (12:00PM) on the 1st day of March.
It already has generated an output for the last day of February, at 00:00AM Mar1.
You won't see a difference in output until the following day, 00:00AM Mar2, when you expect to see the daily window of Mar1, but it's not output because the job is stopped.
Let's start the job at 01:00AM Mar2 wall clock time. If you want the missing time window, you should either pick a start time at 'when last stopped' (noon the day before), or a custom time any time before 23:59PM Mar1. What you are driving is the output window you want. Here you are telling ASA you want all the windows from that point onward.
ASA will then reload all the data it needs to generate that window (make sure the event hub has enough retention for that, we don't cache data between restarts in the job): Azure Stream Analytics will automatically look back at the data in the input source. For instance, if you start a job “Now” and if your query uses a 5-minutes Tumbling Window, Azure Stream Analytics will seek data from 5 minutes ago in the input. The first possible output event would have a timestamp equal to or greater than the current time, and ASA guarantees that all input events that may logically contribute to the output has been accounted for.
I need to run a GitLab pipeline at four specific times each day, which I have solved by setting up four schedules, one for each desired point in time. All pipelines run on the same branch, master.
In the list of pipelines, I get the following information for each pipeline:
status (success or not)
pipeline ID, a label indicating the pipeline was triggered by a schedule, and a label indicating the pipeline was run on the latest commit on that branch
the user that triggered the pipeline
branch and commit on which the pipeline was run
status (success, warning, failure) for each stage
duration and time at which the pipeline was run (X hours/days/... ago)
This seems optimized to pipelines which typically run no more than once after each commit: in such a scenario, it is relatively easy to identify a particular pipeline.
In my case, however, the code itself has relatively little changes (the main purpose of the pipeline is to verify against external data which changes several times a day). As a result, I end up with a list of near-identical entries. In fact, the only difference is the time at which the pipeline was run, though for anything older than 24 hours I will get 4 pipelines that ran “2 days ago”.
Is there any way I can customize these entries? For a scheduled pipeline, I would like to have an indicator of the schedule which triggered the pipeline or the time of day (even for pipelines older than 24 hours), optionally a date (e.g. “August 16” rather than “5 days ago”).
To enable the use of absolute times in GitLab:
Click your Avatar in the top right corner.
Click Preferences.
Scroll to Time preferences and uncheck the box next to Use relative times.
Your pipelines will now show the actual date and time at which they were triggered rather than a relative time.
More info here: https://gitlab.com/help/user/profile/preferences#time-preferences
Lets assume a scenario where pipeline A runs every day and pipeline B runs once in every month and it is dependent on pipeline A (pipeline B should trigger after successful completion of pipeline A).
Using scheduled trigger, we cannot have hard dependencies between 2 pipelines, where as with tumbling window, we cannot exactly specify the day which the pipeline B should run(it has only two options, minutes and hours where as scheduled trigger has months and weeks also)
Both the triggers has its disadvantages with respect to this scenario.
What could be the best possible solution for this scenario?
You can Run Pipeline A everyday, and have an IF check that checks if its a specific date today, then run Pipeline B if TRUE and nothing if FALSE.
For the settings of If Condition, you can use this as variable, if you want to run it every 1st of every month:
#Contains('01',Substring(formatDateTime(utcnow()),8,2))
Doing some tests, I could see that having an Azure Integration Runtime (AIR) allowed us to reduce considerably the amount of time required to finish a pipeline.
To fully understand the use of this configuration and its billing as well, I've got these questions. Let's assume I've got two independent pipelines, all of their Data Flow activities use the same AIR with a TTL = 10 minutes.
The first pipeline takes 7 minutes to finish. The billing will be (if I understand well):
billing: time to acquire cluster + job execution time + TTL (7 + 10)
Five minutes later, I trigger the second pipeline. It will take only 3 minutes to finish (I understand it will also use the same pool as the first one). After it concludes, the TTL is setting up to 10 minutes again or is equal to 2 minutes
10 - 5 -3 (original TTL - start time second pipe - runtime second pipe), in this case, what will happen if I trigger a third pipeline that could take more than 2 minutes?
What about the billing, how is it going to be calculated?
Regards.
Look at the ADF pipeline monitoring view and find all of your data flow activity executions.
Add up that total data flow activity execution time.
Now add the TTL value for that Azure IR you were using to that total.
That is the total time you will be billed.
An ADF pipeline needs to be executed on a daily basis, lets say at 03:00 h AM.
But prior execution we also need to check if the data sources are available.
Data is provided by an external agent, it periodically loads the corresponding data into each source table and let us know when this process is completed using a flag-table: if data source 1 is ready it set flag to 1.
I don't find a way to implement this logic with ADF.
We would need something that, for instance, at 03.00 h would trigger an 'element' that checks the flags, if the flags are not up don't launch the pipeline. Past, lets say, 10 minutes, check again the flags, and be like this for at most X times OR until the flags are up.
If the flags are up, launch the pipeline execution and stop trying to launch the pipeline any further.
How would you do it?
The logic per se is not complicated in any way, but I wouldn't know where to implement it. Should I develop an Azure Funtions that launches the Pipeline or is there a way to achieve it with an out-of-the-box AZDF activity?
There is a UNTIL iteration activity where you can check if your clause.
Example:
Your azure function (AF) checking the flag and returns 0 or 1.
Build ADF pipeline with UNTIL activity where you check the output of AF (if its 1 do something). In UNTIL activity you can have your process step. For example, you have a variable flag that will before until activity is 0. In your until you check if it's 1. if it is do your processing step, if its not, put WAIT activity on 10 min or so.
So you have the ability in ADF to iterate until something it's not satisfied.
Hope that this will help you :)