We would like to set up Azure auto-scaling based on specific time of the day. E.g. on 7:00 we would like to increase number of instances and at 17:00 we would like to decrease them.
We are aware that we can set to scale up by some other metrics (CPU, number of messages in queue, etc.), but this has some negative impacts for us - starting of a new instance takes some time and also w3wp warm-up takes some time too. And we need to have instances ready immediately when high load comes.
Is there any way to set auto-scaling on specific time of the day (from 7:00 to 17:00) and specific days of week (working days).
You could inculcate the following general guidelines based on your requirement:
Scale based on a schedule
In addition to scale based on CPU, you can set your scale differently for specific days of the week.
Click Add a scale condition.
Setting the scale mode and the rules is the same as the default condition.
Select Repeat specific days for the schedule.
Select the days and the start/end time for when the scale condition should be applied.
Scale differently on specific dates
In addition to scale based on CPU, you can set your scale differently for specific dates.
Click Add a scale condition.
Setting the scale mode and the rules is the same as the default condition.
Select Specify start/end dates for the schedule.
Select the start/end dates and the start/end time for when the scale condition should be applied.
Refer Get started with Autoscale in Azure for more details.
As general Autoscaling guidelines:
When you can predict the load on the application well enough to use scheduled autoscaling, adding and removing instances to meet anticipated peaks in demand. If this isn't possible, use reactive autoscaling based on runtime metrics, in order to handle unpredictable changes in demand. Typically, you can combine these approaches. For example, create a strategy that adds resources based on a schedule of the times when you know the application is most busy. This helps to ensure that capacity is available when required, without any delay from starting new instances. For each scheduled rule, define metrics that allow reactive autoscaling during that period to ensure that the application can handle sustained but unpredictable peaks in demand.
Related
I created a pool in Azure Batch (from the Azure portal) with Auto scale activated.
I also defined a formula where the initial number of node is set to 0. This number will ramp up according to number of active tasks and will go back to 0 if there no task remaining.
My problem is that the minimum evaluation interval for the formula is 5 minutes, which means that in the worst case I have to wait at least 5 minutes (plus the time it takes for the nodes to boot and execute the start task) before a task can be assigned to a node.
I would like to apply the formula on the pool on demand by using the REST API (for instance right after adding a job).
According to the API documentation:
https://learn.microsoft.com/en-us/rest/api/batchservice/pool/evaluate-auto-scale
You can evaluate a formula but it will not be applied on the pool.
https://learn.microsoft.com/en-us/rest/api/batchservice/pool/enable-auto-scale
You can enable automatic scaling for a pool but if it is already enabled you have to specify a new auto scale formula and/or a new evaluation interval.
If you specify a new interval, then the existing auto scale evaluation schedule will be stopped and a new auto scale evaluation schedule will be started, with its starting time being the time when this request was issued.
You can disable and then re-enable the autoscale formula, but note the call limits on the enable API. However note that if you are trying to do this frequently on the order of less than the minimum evaluation period that thrashing a pool faster than the underlying infrastructure can allocate resources does not provide any benefits.
Context
I have created a streaming job using Azure portal which aggregates data using a day wise TUMBLINGWINDOW. Have attached a code snippet below, modified from the docs, which shows similar logic.
SELECT
DATEADD(day, -1, System.Timestamp()) AS WindowStart
System.Timestamp() AS WindowEnd,
TollId,
COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(day, 1), TollId
Now that the job has been running and can see it producing output I want to be able to reduce the costs ideally by setting some sort of time scheduling so that the job can run and still produce the same output without being on all the time.
The only real constraint being that the aggregated output at the end of each TUMBLINGWINDOW has to remain the same as if it were running all the time (no impact of stop-starting on output).
This then brings me to my question.
Update: 2021-02-28
Before going into the question another thing that drove me was that through Azure portal you can manually start and stop a job. When you start/restart a job you can set a custom start time for the job/query. With this level of control say I start a job (or have a job running) and then decide to stop it for majority of the day and then turn it on at say 11:30pm each day with a custom start time of midnight of the current day then it would be able to be on for approx 30min before it would output the results (yet still to my understanding produce the same aggregation results/effect compared to if it was on the whole day up until that point). This job could then be paused again at 00:30am ( the next day for which it stays paused for the majority of the day (1380min total until 11:30pm again) upon which the same above logic is applied.
This way it remains off the majority of the day yet still can produce the same output for each day wise window (correct me if I am wrong in my thinking). The only issue with this to me seems to be the fact someone would have to manually perform this. Thus I was driven to the docs looking for a way to automate this.
Question
How can I start and stop a job in an automated fashion such that the required output would still remain intact but so that the job doesn't have to remain on all the time (like it currently is)?
Does the documentation linked above suffice given the context above, if so what are some possible arrangements for the N minutes (on) and M minutes (off) time variables for this to work?
Is this possible given the scenario that I want to aggregate on a one day TUMBLINGWINDOW window (whereby I want each window to start and end at midnight of each day, as per its default behaviour.)?
Eg
Window start: 2022-02-20 00:00:00 Window end: 2022-02-21 00:00:00 (aggregation performed),
Window start: 2022-02-21 00:00:00 Window end: 2022-02-22 00:00:00 (aggregation performed),
Window start: 2022-02-22 00:00:00 Window end: 2022-02-23 00:00:00 (aggregation performed),
....so on
Thoughts
I found this documentation from Microsoft regarding auto-pausing jobs using a few methods
However came across a paragraph (quoted below) which made me doubtful whether it seems reasonable in my particular use case (TUMBLING 1 day window as described in my question section).
Note
There are downsides to auto-pausing a job. The main ones being the loss of the low latency /real time capabilities, and the potential risks from allowing the input event backlog to grow unsupervised while a job is paused. Auto-pausing should not be considered for most production scenarios running at scale.
Could this method
There are 3 ways to lower costs:
downscale your job, you will have higher latency but for a lower cost, up to a point where your job crashes because it runs out of memory over time and/or can't catch up with its backlog. Here you need to keep an eye on your metrics to make sure you can react before it's too late
going further, you can regroup multiple queries into a single job. This job most likely won't be aligned in partitions, so it won't be able to scale linearly (adding SUs is not guaranteed to give you better performance). Same comment as above, plus you need to remember that when you need to scale back up, you probably will have to break down that job into multiple jobs to again be able to scale in a linear fashion
finally you can auto-pause a job, one way to implement that being explained in the doc you linked. I wrote that doc, and what I meant by that comment is that here again you are taking the risk of overloading the job if it can't run long enough to process the backlog of events. This is a risky proposition for most production scenarios
But if you know what you are doing, and are monitoring closely the appropriate metrics (as explained in the doc), this is definitely something you should explore.
Finally, all of these approaches, including the auto-pause one, will deal with tumbling windows transparently for you.
Update: 2022-03-03 following comments here
Update: 2022-03-04 following comments there
There are 3 time dimensions here:
When the job is running or not: the wall clock
When the time window is expected to output results: Tumbling(day,1) -> 00:00AM every day, this is absolute (on the day, on the hour, on the minute...) and independent of the job start time below
What output you want produced from the job, via the job start time
Let's say you have the job running 24/7 for multiple months, and decide to stop it at noon (12:00PM) on the 1st day of March.
It already has generated an output for the last day of February, at 00:00AM Mar1.
You won't see a difference in output until the following day, 00:00AM Mar2, when you expect to see the daily window of Mar1, but it's not output because the job is stopped.
Let's start the job at 01:00AM Mar2 wall clock time. If you want the missing time window, you should either pick a start time at 'when last stopped' (noon the day before), or a custom time any time before 23:59PM Mar1. What you are driving is the output window you want. Here you are telling ASA you want all the windows from that point onward.
ASA will then reload all the data it needs to generate that window (make sure the event hub has enough retention for that, we don't cache data between restarts in the job): Azure Stream Analytics will automatically look back at the data in the input source. For instance, if you start a job “Now” and if your query uses a 5-minutes Tumbling Window, Azure Stream Analytics will seek data from 5 minutes ago in the input. The first possible output event would have a timestamp equal to or greater than the current time, and ASA guarantees that all input events that may logically contribute to the output has been accounted for.
I have an Azure Batch job where each node has a single task to run. Some of the tasks finish quickly, some run for much longer. I am using an auto-pool with around 100 nodes, so I want to give back the unneeded nodes ASAP.
Is there a built-in way to accomplish this pattern of node usage, or should I use auto-scaling where I set the number of dedicated/low-pri nodes to the number of pending tasks? If I do use auto-scaling, will Azure Batch always remove the idle nodes first? I don't want the active nodes to be disturbed. Thanks.
Use the auto scaling rules, this scenario is exactly what they're for. There is an option to only remove the node once it's tasks are complete - see NodeDeallocationOption. Note that the minimum autoscale rule evaluation interval is 5 minutes so you will have a slight delay between adding your tasks and the scaling to kick in.
Check out https://learn.microsoft.com/en-us/azure/batch/batch-automatic-scaling for some of the autoscale formula.
In the new Azure portal under
Cloud services > Scaling blade > Scale Profile > choosing Recurrence
There is a range of instances to choose, if I understood correctly.
Can someone explain the meaning of a "range", since in certain hour I would expect to have a fixed number of instances.
New portal
The range defines the minimum and maximum number of instances that will be running at any given time. For example, if you're configuring a scale profile based on CPU load you will have CPU percentage rules in place that will add/remove instances based on CPU thresholds you define. The scale profile will add/remove instances as needed within the range you specified.
The Range in the new Portal has 2 sliders Minimum and Max Number of Instances
Depending on the trigger you will set for this to Happen
Manual Scale : The Nummber of Instances will be set based on your Choice
CPU : You set the Minimum and the Max for Instances and Besed on the CPU Untalization Azure will Set the Number of instances
Other Measures : Same as CPU but for Queuing, Traffic, Disk I/O Etc.
There is "Scale Profile" and also a sub scaling call "rule".
The rule determine the scale within the range of the "Scale Profile".
target range for the metric. For example, if you chose CPU percentage, you can set a target for the average CPU across all of the instances in your service. A scale out will happen when the average CPU exceeds the maximum you define, likewise, a scale in will happen whenever the average CPU drops below the minimum
https://learn.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-how-to-scale
Is it anyhow possible to get the count, how often DynamoDB throughput (write units/read units) was downscaled within the last 24 hours?
My idea is to downscale as soon as an hugo drop e.g. 50% in the needed provisioned write units occur. I have really peaky traffic. Thus it is interessting to me to downscale after every peak. However I have a analytics jobs running at night which is provisioning a huge amount of read units making it necessary to be able to downscale after it. Thus I need to limit downscales to 3 times within 24 hours.
The number of decreases is returned in a DescribeTable result as part of the ProvisionedThroughputDescription.