How to run commands automatically when AWS EC2 instance CPU utilization goes above a certain percentage - Linux

I am running an application on an AWS EC2 instance (Linux). Its CPU utilization is hitting 100% and the instance is becoming unhealthy. I am not sure which request is causing the spike in CPU utilization. I want to configure some commands to be executed automatically whenever the instance's CPU utilization goes above 80%. Is there any way to do this?

Yes, you can do this with a combination of CloudWatch Alarm, CloudWatch Event, and Lambda.
First, create a CloudWatch Alarm that will trigger whenever the EC2's CPUUtilization metric satisfies your condition.
From here, you can configure your alarm to send an SNS message, perform an Auto Scaling action, or perform an EC2 action (stop, reboot, or terminate the instance).
If you need to execute custom code when the alarm triggers, you can create a CloudWatch Event rule that matches the CloudWatch Alarm State Change and triggers a Lambda function, then put your custom action in that Lambda function.
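For reference, a minimal boto3 sketch of the first step (creating the alarm) could look like the following; the alarm name, instance ID and SNS topic ARN are placeholders you would replace with your own:

    # Minimal sketch: alarm when average CPUUtilization exceeds 80% over 5 minutes.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="high-cpu-above-80",                                        # hypothetical name
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
        Statistic="Average",
        Period=300,                                                           # 5-minute window
        EvaluationPeriods=1,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:cpu-alerts"],       # placeholder SNS topic
    )

The alarm action can equally be an Auto Scaling policy or an EC2 action instead of an SNS topic, as described above.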

Related

AWS Sagemaker inference endpoint doesn't scale in with autoscaling

I have an AWS Sagemaker inference endpoint with autoscaling enabled on the SageMakerVariantInvocationsPerInstance target metric. When I send a lot of requests to the endpoint, the number of instances correctly scales out to the maximum instance count. But after I stop sending requests, the number of instances doesn't scale in to the minimum instance count of 1, even after waiting many hours. Is there a reason for this behaviour?
Thanks
Auto scaling requires a CloudWatch alarm to trigger in order to scale in. SageMaker doesn't push zero-value metrics when there's no activity (it just doesn't push anything). This leaves the alarm in the INSUFFICIENT_DATA state, so it never triggers the scale-in action when your workload suddenly ends.
Workarounds are either:
Have a step scaling policy for your scale-in that uses the CloudWatch metric math FILL() function (see the sketch after this list). This way you can tell CloudWatch "if there's no data, pretend this was the metric value" when evaluating the alarm. This is only possible with step scaling, since target tracking creates the alarms for you (and auto scaling will periodically recreate them, so if you make manual changes they'll get deleted).
Have scheduled scaling set the size back down to 1 every evening.
Make sure traffic continues at a low level for some time.
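A rough boto3 sketch of the first workaround, assuming a hypothetical endpoint name, variant name and step scaling policy ARN (none of these come from the original answer):

    # Alarm on invocations per instance, with FILL() turning missing data into 0
    # so the alarm can go into ALARM state and drive the scale-in step policy.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="sagemaker-scale-in-low-invocations",           # hypothetical name
        EvaluationPeriods=15,
        Threshold=1.0,
        ComparisonOperator="LessThanThreshold",
        Metrics=[
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/SageMaker",
                        "MetricName": "InvocationsPerInstance",
                        "Dimensions": [
                            {"Name": "EndpointName", "Value": "my-endpoint"},  # placeholder
                            {"Name": "VariantName", "Value": "AllTraffic"},    # placeholder
                        ],
                    },
                    "Period": 60,
                    "Stat": "Sum",
                },
                "ReturnData": False,
            },
            {
                "Id": "e1",
                "Expression": "FILL(m1, 0)",      # treat missing datapoints as 0 invocations
                "Label": "invocations (missing data filled with 0)",
                "ReturnData": True,
            },
        ],
        AlarmActions=["arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:example"],  # placeholder policy ARN
    )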

AWS EC2 boots via scheduled Lambda, how to alert of errors?

My EC2 instance boots daily for 5 minutes before shutting down.
On bootup, a NodeJS script is executed. Usually this script will complete long before the 5 minutes are up, but I'd like to be notified (SMS/email) whenever it doesn't.
What is the correct approach? I can try to send a notification within my NodeJS code after 5 minutes if execution wasn't finished, but Lambda could shut down the instance before this occurs.
I'm quite new to AWS so I apologize if this is rather basic, I haven't had luck on Google with this issue.
Can you check whether whatever the Node script does while the EC2 instance is up could be replicated with one or more Lambda functions?
Think about serverless and microservices architecture. In theory, any workflow that needs servers could be achieved with AWS Lambda functions and various triggers. In your case I can think of the following:
SES to send out email messages
API Gateway to expose your Lambda function as a trigger
CloudWatch Events to trigger a Lambda function like a cron job (a rough sketch follows below)
I would be surprised to learn that serverless won't work here. Please do share the details so that I can brainstorm more and share a solution.
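As a hedged sketch only: a CloudWatch Events scheduled rule could invoke a small Lambda shortly before the instance shuts down, which checks for some completion signal left by the Node script (here assumed to be an S3 marker object, which is not something from the original question) and emails you via SES if it is missing. All bucket, prefix and address names are hypothetical.

    import boto3

    s3 = boto3.client("s3")
    ses = boto3.client("ses")

    def handler(event, context):
        # Check for a completion marker the Node script would write when it finishes.
        done = s3.list_objects_v2(Bucket="my-job-bucket", Prefix="done/")  # placeholders
        if done.get("KeyCount", 0) > 0:
            return {"status": "completed"}

        # No marker found: alert by email (SMS would go through SNS instead).
        ses.send_email(
            Source="alerts@example.com",                        # must be an SES-verified sender
            Destination={"ToAddresses": ["me@example.com"]},
            Message={
                "Subject": {"Data": "Daily EC2 job did not finish"},
                "Body": {"Text": {"Data": "The NodeJS script had not completed when checked."}},
            },
        )
        return {"status": "alerted"}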

Notify Lambda on CloudFront Distribution Creation End

At the moment, we are calling cloudfront.listDistributions() every minute to identify a change in the status of the distribution we are deploying. This causes Lambda to time out, because CloudFront never deploys in under 30 minutes (while Lambda times out after 15 minutes).
I would like to notify a Lambda function after a CloudFront distribution is successfully created. This would allow us to execute the post-creation actions while saving valuable Lambda execution time.
Creating a rule on CloudWatch does not offer the option to choose CloudFront. Nevertheless, it seems to accept creating a Custom Event Pattern with the source aws.cloudformation.
Options I am considering:
Trigger a Lambda every 5 minutes to list distributions and compare states with previous states stored in DynamoDB.
Anybody with an idea to overcome this lack of feature from AWS?
If you have the time, there's a trickier and slightly more complex solution that leverages CloudTrail.
Disclaimer
CloudTrail is not a real-time log system, but it ensures that all API calls are reported in the console within 15 minutes (as stated in the CloudTrail FAQs). Because of this, what follows only makes sense for long-running tasks like creating a CloudFront distribution, spinning up an Aurora DB and so on.
You can create a CloudWatch Events rule (let's call it CW-r1) on a specific pattern like CreateDistribution or UpdateDistribution.
CW-r1 triggers a Lambda (LM-1), which enables another CloudWatch Events rule (CW-r2).
CW-r2, on a schedule, triggers a Lambda (LM-2), which requests the state of the specific distribution via the API. Once the distribution is "Deployed", LM-2 can send a notification via SNS, for example (you can send email, SMS, push notifications, whatever is supported by SNS).
Once everything is finished, LM-2 can disable the CW-r2 rule in order to stop processing.
In this way you can have an automatic notification system based on whichever API call you care about.
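A rough sketch of what LM-2 could look like, assuming the distribution ID, SNS topic ARN and CW-r2 rule name are hard-coded placeholders (in practice you might pass them in via the event or environment variables):

    import boto3

    cloudfront = boto3.client("cloudfront")
    sns = boto3.client("sns")
    events = boto3.client("events")

    def handler(event, context):
        dist = cloudfront.get_distribution(Id="E1ABCDEF234567")         # placeholder ID
        status = dist["Distribution"]["Status"]                         # "InProgress" or "Deployed"

        if status != "Deployed":
            return {"status": status}                                   # try again on the next scheduled run

        # Distribution is deployed: notify, then stop the scheduled polling rule (CW-r2).
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:cf-deployed",  # placeholder topic
            Subject="CloudFront distribution deployed",
            Message="Distribution E1ABCDEF234567 reached the Deployed state.",
        )
        events.disable_rule(Name="CW-r2")
        return {"status": status}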

Schedule a task to run at some point in the future (architecture)

So we have a Python flask app running making use of Celery and AWS SQS for our async task needs.
One tricky problem that we've been facing recently is creating a task to run in x days, or in 3 hours for example. We've had several needs for something like this.
For now we create events in the database with timestamps that store the time they should be triggered. Then we use celery beat to run a scheduled task every second that checks whether there are any events to process (based on the trigger timestamp) and processes them. However, this queries the database every second, which we feel could be improved.
We looked into using the eta parameter in celery (http://docs.celeryproject.org/en/latest/userguide/calling.html) that lets you schedule a task to run in x amount of time. However it seems to be bad practice to have large etas and also AWS SQS has a visibility timeout of about two hours and so anything more than this time would cause a conflict.
I'm scratching my head right now. On the one hand this works, and pretty decently, in that things have been separated out with SNS, SQS etc. to ensure scaling tolerance. However, it just doesn't feel right to query the database every second for events to process. Surely there's an easier way, or a service provided by Google/AWS, to schedule some event (pub/sub) to occur at some time in the future (x hours, minutes etc.)?
Any ideas?
Have you taken a look at AWS Step Functions, specifically the Wait state? You might be able to put together a couple of Lambda functions, with the first one passing a timestamp or the number of seconds to wait into the Wait state, and the last one adding the message to SQS after the Wait completes.
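A minimal sketch of that idea: a Wait state that pauses until a timestamp supplied in the execution input, then invokes a Lambda that enqueues the SQS message. The Lambda ARN, role ARN and state machine name are placeholders, not anything from the original answer.

    import json
    import boto3

    definition = {
        "StartAt": "WaitUntilDue",
        "States": {
            "WaitUntilDue": {
                "Type": "Wait",
                "TimestampPath": "$.runAt",    # e.g. "2020-01-01T12:00:00Z" in the execution input
                "Next": "EnqueueTask",
            },
            "EnqueueTask": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:enqueue-job",  # placeholder
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="delayed-job",                                    # hypothetical name
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/sfn-role",     # placeholder role
    )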
Amazon's scheduling solution is the use of CloudWatch to trigger events. Those events can be placing a message in an SQS/SNS endpoint, triggering an ECS task, running a Lambda, etc. A lot of folks use the trick of executing a Lambda that then does something else to trigger something in your system. For example, you could trigger a Lambda that pushes a job onto Redis for a Celery worker to pick up.
When creating a CloudWatch rule, you can specify either a "Rate" (i.e., every 5 minutes) or an arbitrary time in cron syntax.
So my suggestion for your use case would be to create a CloudWatch rule that runs at the time your job needs to kick off (or a minute before, depending on how time-sensitive you are). That rule would then interact with your application to kick off your job. You'll only pay for the resources when CloudWatch triggers.
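A hedged boto3 sketch of such a one-off rule, with a made-up date, rule name and target Lambda ARN. Note that AWS cron expressions have six fields including a year, so pinning the year makes the rule effectively fire once; you would also need to grant CloudWatch Events permission to invoke the Lambda (lambda add-permission for the events.amazonaws.com principal), which is omitted here.

    import boto3

    events = boto3.client("events")

    # Fires at 09:30 UTC on 15 June 2021 (placeholder date).
    events.put_rule(
        Name="run-job-2021-06-15",
        ScheduleExpression="cron(30 9 15 6 ? 2021)",
    )
    events.put_targets(
        Rule="run-job-2021-06-15",
        Targets=[{"Id": "1", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:kick-off-job"}],  # placeholder
    )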
Have you looked into Amazon Simple Notification Service? It sounds like it would serve your needs...
https://aws.amazon.com/sns/
From that page:
Amazon SNS is a fully managed pub/sub messaging service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications. With SNS, you can use topics to decouple message publishers from subscribers, fan-out messages to multiple recipients at once, and eliminate polling in your applications. SNS supports a variety of subscription types, allowing you to push messages directly to Amazon Simple Queue Service (SQS) queues, AWS Lambda functions, and HTTP endpoints. AWS services, such as Amazon EC2, Amazon S3 and Amazon CloudWatch, can publish messages to your SNS topics to trigger event-driven computing and workflows. SNS works with SQS to provide a powerful messaging solution for building cloud applications that are fault tolerant and easy to scale.
You could start the job with apply_async, and then use a countdown, like:
xxx.apply_async(..., countdown=TTT)
It is not guaranteed that the job starts exactly at that time, depending on how busy the queue is, but that does not seem to be an issue in your use case.
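A concrete version of the snippet above, assuming a hypothetical Celery task called process_event; countdown is expressed in seconds, and eta is the equivalent absolute-time form mentioned in the question:

    from datetime import datetime, timedelta

    from celery import Celery

    app = Celery("tasks", broker="sqs://")    # placeholder broker URL

    @app.task
    def process_event(event_id):
        print(f"processing event {event_id}")

    # Run roughly 3 hours from now:
    process_event.apply_async(args=[42], countdown=3 * 60 * 60)

    # Equivalent using an absolute time (eta):
    process_event.apply_async(args=[42], eta=datetime.utcnow() + timedelta(hours=3))

As noted in the question, very long delays can run into the SQS visibility timeout, so this works best for shorter countdowns.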

AWS AutoScaling Not Scaling Up

I've set up an AWS Auto Scaling group with two alarms: increase the number of servers if the average load is above 65%, and decrease it if it's below 35%. I'm not sure what the final numbers will be, but this is what I initially used. I ran a yes >& /dev/null command on the Linux server and the load very quickly went up to 100% (as reported by the Linux top command), but no new instances were launched, because I think the alarms were not triggering. How exactly is the CPU load average computed/retrieved by the auto scaler?
As an experiment, I also stopped the server from responding to AWS health checks, so it was deemed unhealthy by AWS. The server was terminated and a new one was started up. So I know that launching/terminating of servers for "health" reasons is working in the auto scaler.
What else should I look at to diagnose the problem?
Is my way of stressing the server not the "right" way as far as the Auto Scaler is concerned?
Is it using a different benchmark?
[This is a comment not an answer]
You can use set-alarm-state in the AWS CLI to trigger your alarms:
aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
This way you can easily test them out. If you still have problems then maybe you can post the output of
aws cloudwatch describe-alarms --alarm-names "myalarm"
NOTE: The average load across both instances must cross 65% before a new instance is launched. So in your case the average load across both instances must exceed 65%; only then will the Auto Scaling group launch a new instance.
You can use tools such as Bees with Machine Guns, LoadRunner and other load testing tools to push your server load above 65%.
Suggestion: Check your server load in the CloudWatch metrics rather than from inside the server (using top). This will give you a clear picture of how AWS is calculating your instance load.
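A quick boto3 sketch for pulling what CloudWatch actually records for an instance, to compare against what top reports; the instance ID is a placeholder, and note that the scaling alarm normally evaluates the group-level metric (the AutoScalingGroupName dimension), which is the average across all instances in the group:

    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1), "%")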
