AWS SageMaker inference endpoint doesn't scale in with autoscaling

I have an AWS SageMaker inference endpoint with autoscaling enabled on the SageMakerVariantInvocationsPerInstance target metric. When I send a lot of requests to the endpoint, the number of instances correctly scales out to the maximum instance count. But after I stop sending requests, the number of instances doesn't scale in to the minimum instance count of 1. I waited for many hours. Is there a reason for this behaviour?
Thanks

Auto scaling requires a CloudWatch alarm to trigger the scale-in action. SageMaker doesn't push zero-value metrics when there's no activity (it just doesn't push anything). This leads to the alarm being put into the INSUFFICIENT_DATA state and never triggering the scale-in action when your workload suddenly ends.
Workarounds include:
Have a step scaling policy using the CloudWatch metric math FILL() function for your scale-in (see the sketch after this list). This way you can tell CloudWatch "if there's no data, pretend this was the metric value" when evaluating the alarm. This is only possible with step scaling, since target tracking creates the alarms for you (and auto scaling will periodically recreate them, so if you make manual changes they'll get deleted).
Have scheduled scaling set the size back down to 1 every evening
Make sure traffic continues at a low level for some time
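For the FILL() approach, a minimal sketch (TypeScript, AWS SDK v3) of what the scale-in alarm could look like is below. The endpoint name, variant name, threshold, and the step scaling policy ARN are all assumptions for illustration; the step scaling policy itself would be created separately with Application Auto Scaling (PolicyType StepScaling) and its ARN passed in here.

import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

// Creates the scale-in alarm on FILL(m1, 0): missing datapoints are treated as 0,
// so the alarm can go into ALARM (instead of INSUFFICIENT_DATA) when traffic stops.
export async function createScaleInAlarm(scaleInPolicyArn: string): Promise<void> {
  await cloudwatch.send(
    new PutMetricAlarmCommand({
      AlarmName: "sagemaker-endpoint-scale-in", // hypothetical name
      ComparisonOperator: "LessThanOrEqualToThreshold",
      Threshold: 10, // invocations per instance below which to scale in
      EvaluationPeriods: 15, // 15 consecutive 1-minute periods of low/no traffic
      Metrics: [
        {
          Id: "e1",
          Label: "InvocationsPerInstanceFilled",
          Expression: "FILL(m1, 0)", // "if there's no data, pretend the value was 0"
          ReturnData: true,
        },
        {
          Id: "m1",
          ReturnData: false,
          MetricStat: {
            Metric: {
              Namespace: "AWS/SageMaker",
              MetricName: "InvocationsPerInstance",
              Dimensions: [
                { Name: "EndpointName", Value: "my-endpoint" }, // hypothetical endpoint
                { Name: "VariantName", Value: "AllTraffic" },
              ],
            },
            Period: 60,
            Stat: "Sum",
          },
        },
      ],
      // ARN of the scale-in step scaling policy registered with Application Auto Scaling
      AlarmActions: [scaleInPolicyArn],
    })
  );
}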

Related

How to configure commands to be executed automatically when an AWS EC2 instance's CPU utilization goes above a certain percentage

I am running an application on an AWS EC2 instance (Linux). Its CPU utilization goes to 100% and the instance becomes unhealthy. I am not sure which request is causing the spike in CPU utilization. I want to configure some commands to be executed automatically whenever the instance's CPU utilization goes above 80%. Is there any way to do it?
Yes, you can do this with a combination of CloudWatch Alarm, CloudWatch Event, and Lambda.
First, create a CloudWatch Alarm that will trigger whenever the EC2's CPUUtilization metric satisfies your condition.
From here, you can configure your Alarm to send an SNS message, perform an Auto Scaling action, or perform an EC2 action (you can choose to stop/reboot/terminate the instance).
If you need to execute custom code when the Alarm triggers, then you can create a CloudWatch Event rule that matches CloudWatch Alarm State Change to trigger a Lambda function, and finally specify your custom action in your Lambda function.
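As a rough illustration of that last step, here is a minimal sketch (TypeScript, AWS SDK v3) of a Lambda handler reacting to the alarm state change and running a shell command on the instance via SSM Run Command. The instance id and the command are hypothetical, and the instance needs the SSM agent plus an instance profile that allows Systems Manager.

import { SSMClient, SendCommandCommand } from "@aws-sdk/client-ssm";

const ssm = new SSMClient({});

// Lambda handler for an EventBridge/CloudWatch Events rule matching
// "CloudWatch Alarm State Change". Runs a shell command on the instance via SSM.
export const handler = async (event: any): Promise<void> => {
  // only act when the alarm actually enters the ALARM state
  if (event?.detail?.state?.value !== "ALARM") {
    return;
  }
  await ssm.send(
    new SendCommandCommand({
      DocumentName: "AWS-RunShellScript",
      InstanceIds: ["i-0123456789abcdef0"], // hypothetical instance id
      Parameters: {
        // e.g. snapshot what is consuming CPU at the moment the alarm fired
        commands: ["top -b -n 1 > /tmp/top-$(date +%s).log"],
      },
    })
  );
};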

One Instance-One Request at a time App Engine Flexible

I am using:
App Engine Flexible, custom runtime.
nodejs as the base image.
express
Cloud Tasks for queuing the requests
a Puppeteer job
My Requirements
20GB RAM
long-running process
Because of my unique requirement, I want 1 request to be handled by only 1 instance. When it gets free or the request times out, only then should it get a new request.
I have managed to reject other requests while the instance is processing 1 request, but I am not able to figure out the appropriate automatic scaling settings.
Please suggest the best way to achieve this.
Thanks in advance!
In your app.yaml try restricting the max_instances and max_concurrent_requests.
I also recommend looking into rate limiting your Cloud Tasks queue in order to reduce unnecessary attempts to send requests. You may also want to increase your MIN_INTERVAL for retry attempts to spread out requests.
Your task queue will continue to process and send tasks at the rate you have set, so if your instance rejects a request it will go into a retry pattern. It seems like you're focused on the scaling of App Engine, but your issue is with Cloud Tasks. You may want to schedule your tasks so they fire at the interval you want (see the sketch below).
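As a rough sketch of spreading tasks out (TypeScript, using the @google-cloud/tasks client), you can set an explicit scheduleTime on each task; the project, location, queue name, and handler path below are assumptions for illustration.

import { CloudTasksClient } from "@google-cloud/tasks";

const client = new CloudTasksClient();

// Enqueue a job with an explicit scheduleTime so dispatches are spread out
// instead of all firing as soon as the queue allows.
export async function enqueueJob(payload: object, delaySeconds: number): Promise<string> {
  const parent = client.queuePath("my-project", "us-central1", "puppeteer-jobs"); // hypothetical queue
  const [task] = await client.createTask({
    parent,
    task: {
      appEngineHttpRequest: {
        httpMethod: "POST",
        relativeUri: "/run-job", // hypothetical App Engine handler
        body: Buffer.from(JSON.stringify(payload)).toString("base64"),
      },
      // push the task into the future so dispatches are spaced apart
      scheduleTime: { seconds: Math.floor(Date.now() / 1000) + delaySeconds },
    },
  });
  return task.name ?? "";
}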
You could set readiness checks on your app.
When an instance is handling a request, set the readiness check to return a non-ready status. 429 (too many requests) seems like a good option.
This should avoid traffic to that specific instance.
Once the request is finished, return a 200 from the readiness endpoint to signal that the instance is ready to accept a new request.
However, I'm not sure how this will work with the auto-scaling options. Since the app will only scale up once the average CPU is over the defined threshold, if all instances are occupied but do not reach that threshold, the load balancer won't know where to route requests (no instances are ready), and it won't scale up.
You could play around a little bit with this idea and manual scaling, or by programmatically changing min_instances (in automatic scaling) through the GAE admin API.
Be sure to always return a 200 for the liveness check, or the instance will be killed as it will be considered unhealthy.
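A minimal sketch (TypeScript, express) of the readiness/liveness idea described above, assuming the default App Engine Flexible health-check paths and a hypothetical runPuppeteerJob function:

import express from "express";

declare function runPuppeteerJob(body: unknown): Promise<void>; // hypothetical long-running job

const app = express();
app.use(express.json());

let busy = false;

// Readiness check: report 429 while a job is in flight so the load balancer
// stops routing new work to this instance.
app.get("/readiness_check", (_req, res) => {
  res.status(busy ? 429 : 200).send(busy ? "busy" : "ok");
});

// Liveness check must always return 200, or the instance will be restarted.
app.get("/liveness_check", (_req, res) => {
  res.status(200).send("ok");
});

app.post("/run-job", async (req, res) => {
  if (busy) {
    return res.status(429).send("instance already processing a request");
  }
  busy = true;
  try {
    await runPuppeteerJob(req.body);
    res.status(200).send("done");
  } finally {
    busy = false; // instance becomes ready again
  }
});

app.listen(Number(process.env.PORT) || 8080);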

Can I use a retry policy in an azure function?

I'm using Event Hubs to temporarily store data which will first be saved to Azure Table Storage and then indexed to Elasticsearch.
I was thinking that I should do the storage saving calls in an Azure Function, and do the same for the Elasticsearch indexing using NEST.
It is important that the data is processed, so I was thinking that I'll use Polly as a retry policy in case the Elasticsearch server is failing. However, won't a retry policy potentially make the Azure Function expensive?
Is azure functions even the right way to go?
Yes, you can use Polly for retries inside your Azure Functions. Some further considerations:
Yes, you will pay for the retry time. But given that your Elasticsearch is "mostly up", the extra price for occasional retries should not be too high.
If you want to retry saving to Table Storage too, you will have to write calls decorated with Polly yourself instead of using the otherwise preferred output binding.
Make sure to check whether the order of writes is important to you and whether you should retry Table Storage writes to completion before you start writing to Elasticsearch, or vice versa. Otherwise you can do them in parallel with async calls and then Task.WaitAll.
The maximum execution time of a Function is 5 minutes by default; you can configure it up to 10 minutes max. If you need to handle outages longer than that, you probably need a plan B, e.g. start copying the events that are failing for longer than 4 (or 9) minutes to a dedicated queue and retry from there, or disable the Function for such periods of downtime.
Yes, it is. You could use a library, or better, just write a simple linear backoff strategy (like try 5 times with a 5-second sleep in between) and do something like
context.log.error({
    message: `Transient failure. This is Retry number ${retryCount}.`,
    errorCode: errorCodeFromCallingElasticSearch,
    errorDetails: moreContextMaybeSomeStack
});
every time you hit the retry logic, so it goes to App Insights (make sure you integrate with App Insights, else you have no ops, or it's completely dark ops).
You can then query for how often it really is a miss and get an idea of how well things go at the 95th percentile.
Occasionally running 10 seconds over the normal 1 second execution time for your function is going to cost extra, but probably nowhere near a full dedicated App Service Plan. If it comes close, just switch to that, it means your function is mostly on rather than off - which is still a perfectly good case for running a function.
App Insights can also trigger alerts if some metric goes haywire, like your retry count going up to 11 for 24 hours; you probably want to know about that deviation. You'll need to send the retry count as a custom metric to trigger an alert off of it:
context.log.metric("CallElasticSearchRetryCount", retryCount);
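Putting the pieces together, here is a minimal sketch (TypeScript, Node.js Azure Functions) of such a linear backoff loop; callElasticSearch is a hypothetical wrapper around your indexing call, and the error fields mirror the logging shown above.

declare function callElasticSearch(document: unknown): Promise<void>; // hypothetical indexing call

const MAX_ATTEMPTS = 5;
const DELAY_MS = 5000; // linear backoff: fixed 5-second wait between attempts

export async function indexWithRetry(context: any, document: unknown): Promise<void> {
  for (let retryCount = 1; retryCount <= MAX_ATTEMPTS; retryCount++) {
    try {
      await callElasticSearch(document);
      return; // success, stop retrying
    } catch (err: any) {
      context.log.error({
        message: `Transient failure. This is Retry number ${retryCount}.`,
        errorCode: err?.statusCode,
        errorDetails: err?.stack,
      });
      // also emit the custom retry-count metric shown above if you alert on it
      if (retryCount === MAX_ATTEMPTS) {
        throw err; // give up and let the Function invocation fail
      }
      await new Promise((resolve) => setTimeout(resolve, DELAY_MS));
    }
  }
}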

Azure Functions Event Hub trigger bindings

Just have a couple of questions regarding the usage of Azure Functions with an EventHub in an IoT scenario.
EventHub has partitions. Typically messages from a specific device go to the same partition. How are the instances of an Azure Function distributed across EventHub partitions? Is it based on performance? If one instance of an Azure Function manages to process events from all partitions, is that enough, or might one end up with one instance of an Azure Function per EventHub partition?
What about the read offset? Does this binding somehow record where it stopped reading the event stream? I thought the functions are meant to be stateless, and here we have some state.
Thanks
Each instance of an Event Hub-triggered Function is backed by only 1 EventProcessorHost (EPH) instance. Event Hub ensures that only 1 EPH can get a lease on a given partition.
Answer to Question 1:
Let's elaborate on this with a contrived example. Suppose we begin with the following setup and assumptions for an EventHub:
10 partitions.
1000 events distributed evenly across all partitions => 100 messages in each partition.
When your Function is first enabled, there is only 1 instance of the Function. Let's call this Function instance Function_0. Function_0 will have 1 EPH that manages to get a lease on all 10 partitions. Let this EPH be called EPH_0, and it will start reading events from partitions 0-9. From this point forward, one of the following will happen:
Only 1 Function instance is needed - Function_0 is able to process all 1000 messages before the Azure Functions' scaling logic kicks in. Hence, all 1000 messages are processed by Function_0.
Add 1 more Function instance - Azure Functions' scaling logic determines that Function_0 seems sluggish, so a new instance Function_1 is created, resulting in EPH_1. Event Hub detects that a new EPH instance is trying to read messages and starts load balancing the partitions across the EPH instances, e.g., partitions 0-4 are assigned to EPH_0 and partitions 5-9 are assigned to EPH_1.
If all Function executions succeed without errors, both EPH_0 and EPH_1 checkpoint successfully and all 1000 messages are processed. Once check-pointing succeeds, those 1000 messages should never be retrieved again.
Add N more Function instances - Azure Functions' scaling logic determines that both Function_0 and Function_1 are still sluggish and will repeat workflow 2 for Function_2...N, where N > 9. Event Hub will load balance the partitions across the Function_0...9 instances.
Unique to Azure Functions' current scaling logic is the fact that N can be greater than the number of partitions. This is done to ensure that there are always EPH instances readily available to quickly get a lock on the partition(s). As a customer, you are only charged for the resources used when your Function instance executes; you are not charged for this over-provisioning.
Answer to Question 2:
EPH uses a check-pointing mechanism to mark the last known successfully read message. An EventHub-triggered Function can be set up to process 1 message or a batch of messages at a time. The option you choose needs to consider the following:
1. Speed of message processing - Processing messages in batches instead of a single message at a time is one of the factors that will speed up the ability of your Azure Function workflow to keep up with the incoming messages in your Event Hub.
2. Tolerance for duplicates - If check-pointing fails due to errors in your Function code, a timeout, or a lost partition lease, then the next EPH that gets a lease on that partition will start retrieving messages from the last known checkpoint. Event Hub guarantees at-least-once delivery, but not at-most-once delivery. Azure Functions will not attempt to change that behavior. If not having duplicate messages is a priority, then you will need to mitigate it in your workflow. As such, when check-pointing fails, there are more duplicate messages to manage if your Function is processing messages at the batch level.
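For illustration, a minimal sketch (TypeScript, Node.js worker) of an Event Hub-triggered Function that processes a batch at a time; it assumes the function.json binding sets cardinality to "many". Checkpointing happens only after the function completes successfully, so the per-message processing should be idempotent to tolerate redelivered batches.

import { AzureFunction, Context } from "@azure/functions";

// Event Hub trigger that receives a batch of messages per invocation
// (cardinality "many" in function.json). The checkpoint for the partition is
// written only after this function returns successfully; throwing prevents the
// checkpoint, so the same batch may be delivered again (at-least-once).
const eventHubTrigger: AzureFunction = async function (
  context: Context,
  eventHubMessages: unknown[]
): Promise<void> {
  for (const message of eventHubMessages) {
    // process each event idempotently so redelivered batches are harmless
    context.log(`Processing event: ${JSON.stringify(message)}`);
  }
};

export default eventHubTrigger;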
Function Apps are based on the WebJobs SDK, which uses EventProcessorHost to consume events from Event Hubs. So you can look up information about EventProcessorHost and it will be applicable to your Function App.
Particularly, you can find the implementation of IEventProcessor
here.
To your questions:
Not sure what you mean by "one instance". One listener will be created per partition, but they can all be hosted inside a single App Plan instance if the load is low. On the high level, you should not care much: in the Consumption Plan you pay per execution time, no matter how many servers/processes/threads are running. Of course, you should care whether the auto-scaling works well enough for high load, but that needs to be tested anyway.
Functions are stateless in the sense that you can't save anything in-memory between two function executions. You are totally fine to save state in external storage. The Function App will use PartitionContext.CheckpointAsync() for checkpointing of the current offset. Azure Storage is used internally; again, you can read more about how it works in the Event Hubs and EventProcessorHost docs, e.g. here.

AWS AutoScaling Not Scaling Up

I've set up an AWS Auto Scaling group. I have 2 alarms to increase the number of servers if the average load is above 65% and decrease it if it's less than 35%. Not sure what the final numbers will be, but this is what I initially used. I ran a yes >& /dev/null command on the Linux server and the load very quickly went up to 100% (as reported by the Linux top command), but no new instances were being launched, because I think the alarms were not triggering. How exactly is the CPU load average computed/retrieved by the Auto Scaler?
I also, as an experiment, stopped responding to the AWS health-check pings from the server, and thus it was deemed not healthy by AWS. The server was terminated and a new one was started up. So, I know that launching/terminating of servers is working in the Auto Scaler for "health" reasons.
What else should I look at to diagnose the problem?
Is my way of stressing the server not the "right" way as far as the Auto Scaler is concerned?
Is it using a different benchmark?
[This is a comment not an answer]
You can use set-alarm-state in the AWS CLI to trigger your alarms:
aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
This way you can easily test them out. If you still have problems then maybe you can post the output of
aws cloudwatch describe-alarms --alarm-names "myalarm"
NOTE: The average load across both instances must cross 65%; only then is a new instance launched. So, in your case, the load on both instances must cross 65%. Only then does the Auto Scaling group launch a new instance.
You can use tools such as Bees with Machine Guns, LoadRunner, and other load testing tools to increase the load on your server so that it goes above 65%.
Suggestion: Check your server load in CloudWatch metrics rather than from inside the server (using top). This will give you a clear picture of how AWS is calculating your instance load.
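To compare what top reports with what CloudWatch (and therefore the scaling alarm) sees, here is a minimal sketch (TypeScript, AWS SDK v3) of pulling the recent CPUUtilization datapoints; the instance id is hypothetical, and with basic monitoring you only get one datapoint every 5 minutes.

import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

// Fetch the last hour of CPUUtilization exactly as CloudWatch stores it,
// which is what the scaling alarm evaluates (not what top shows inside the box).
export async function recentCpuUtilization(instanceId: string) {
  const now = new Date();
  const { Datapoints } = await cloudwatch.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/EC2",
      MetricName: "CPUUtilization",
      Dimensions: [{ Name: "InstanceId", Value: instanceId }],
      StartTime: new Date(now.getTime() - 60 * 60 * 1000),
      EndTime: now,
      Period: 300, // basic monitoring publishes 5-minute datapoints
      Statistics: ["Average"],
    })
  );
  return Datapoints ?? [];
}

// recentCpuUtilization("i-0123456789abcdef0").then(console.log); // hypothetical instance id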
