How to find the reason for 100% CPU utilization on AWS EC2? - linux

I have two fleets of EC2 instances, A and B (both in the same AWS account, with the same Linux OS version and the same region, but in different AZs and behind different load balancers).
When I send the same load to fleets A and B, they behave differently.
Fleet A works normally, with average CPU utilization up to 60%. Fleet B, on the other hand, spikes to 100% CPU utilization, then drops back to 0 and climbs again, and the same effect is seen on the other instances in that fleet.
Has anyone experienced this before?

SSH to host B, watch the system activity via "top", and look for the process consuming the most CPU.
You can also inspect that process with the "lsof" command, or with
ps -fp <PID of the process>
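For example, a quick investigation session could look like this (the PID 1234 is just a placeholder for whatever process tops the list):

top                  # press Shift+P inside top to sort by CPU usage
ps -fp 1234          # full command line and parent of the busy process
sudo lsof -p 1234    # open files, sockets, and libraries of that process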

After analysis it turned out that a couple of security patches were being applied automatically, which was causing these spikes.
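For anyone hitting the same thing: you can usually confirm whether scheduled patching lines up with the spikes. A sketch, assuming an Ubuntu host (on Amazon Linux / RHEL, "yum history" is the rough equivalent):

systemctl list-timers                                       # apt-daily-upgrade and similar scheduled jobs
less /var/log/unattended-upgrades/unattended-upgrades.log   # what was patched, and when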

This has happened to me twice now with a Windows Server instance running on EC2. In both cases it was the Windows Update service, which had taken 100% of the CPU and burnt through all my CPU credits.
The only way to get back in and fix it was to enable "T2/T3 Unlimited" on the instance and stop/disable the Windows Update service.

Related

Debugging "mem_used_percent" on EC2 instance

We are facing a peculiar issue in our AWS service running on EC2 instances. In CloudWatch metrics, "mem_used_percent" gradually climbs over time and eventually reaches 90%, followed by system failure. We have verified that the failure is caused by an OOM error, and that restarting the hosts fixes it, bringing "mem_used_percent" back down to around 20%.
While investigating on both new and long-running EC2 instances, we see that only around 20% of the RAM usage is accounted for in the "top" output (sorted by %MEM). We are unable to pinpoint the processes using the rest of the physical memory.
Is there a better way to do memory-usage analysis on Linux EC2 instances that will add up to the "mem_used_percent" reported in CloudWatch metrics?
Please let me know if any other details are required.
Thanks!!
Take a look at the procstat plugin; it might help your investigation:
"The procstat plugin enables you to collect metrics from individual processes."
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-procstat-process-metrics.html#CloudWatch-Agent-procstat-process-metrics-collected
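A minimal sketch of an agent config that does this (the "java" pattern and the file path are assumptions; point the pattern at whatever your service actually runs):

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "java",
          "measurement": ["cpu_usage", "memory_rss", "memory_vms"]
        }
      ]
    }
  }
}

Save it (e.g. as /opt/aws/amazon-cloudwatch-agent/etc/procstat.json) and load it into the running agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/procstat.json

The per-process RSS numbers it emits should let you compare against "mem_used_percent" directly.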

DigitalOcean Server CPU 100% without app running

The htop command shows the CPU at 100% even though I do not have the app running, or anything else. The DigitalOcean dashboard metric shows the same data (100% usage) as well.
The top tasks in the htop list each take less than 10% CPU; the biggest is pm2, at ~5.2%.
Is it possible that there are hidden tasks that are not displayed in the list, and, in general, how can I start investigating what's going on?
My droplet used this one-click installation:
https://marketplace.digitalocean.com/apps/nodejs
Thanks in advance!
Update 1)
The droplet has plenty of free disk space.
I ran pm2 save --force to sync the running processes, and CPU usage went back to normal.
I guess an app was stuck, or something else was eating all the CPU.
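In case it helps the next person, these are the pm2 commands I'd reach for when a managed app misbehaves (nothing here is specific to the one-click image):

pm2 list          # status and restart count of each managed process
pm2 monit         # live per-process CPU and memory view
pm2 logs          # tail logs; a crash-restart loop shows up here immediately
pm2 save --force  # overwrite the saved process list with what is actually running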

Reserve CPU and Memory for Linux Host in Docker

I am running several Docker containers via docker-compose on a server.
The problem is that, for some reason, the load from the containers always crashes my server after a while...
I can only find resources and answered questions on how to limit a single container's CPU/memory usage, but what I want is to give all containers combined, say, 85% of the CPU/memory and reserve the rest for the Linux host, so that the server itself doesn't crash.
Does anyone have an idea how to achieve this?
You could use docker-machine, I guess... You would then define a VM inside which all the containers run, and limit the VM's total memory, leaving the rest for the host.
Otherwise, Docker runs as a native process on the machine, and there is no way to place a total limit on "all Docker processes".
The best idea I have right now is to set a CPU limit on each service/container so that the sum never exceeds 85% (see the sketch below), but in the long run you should investigate why the server crashes. Maybe it is a cooling or PSU issue?
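A sketch of that per-container approach (the numbers assume a 4-core, 8 GB host and "web"/"my-image" are hypothetical names; scale the limits so their sum leaves ~15% for the host):

docker run --cpus="1.5" --memory="2g" --name web my-image        # cap at 1.5 CPUs / 2 GB
docker update --cpus="1.5" --memory="2g" --memory-swap="2g" web  # tighten an already-running container

In a version 2.2+ docker-compose file, the equivalent per-service keys are cpus: and mem_limit:.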

Amazon EC2 boot time

Our web app performs a variable number of tasks for a user-initiated action. We have built a small system where a master server calculates the number of worker servers needed to complete the tasks, and that number of EC2 instances are "turned on" to pick up and perform them.
"Turned on", because the time taken to spin up an instance from an AMI is extremely high. So the idea is to have a pool of worker instances and start and stop them as required.
There is also the way Amazon charges to consider (you are billed for one hour every time you turn an instance on). Once spawned, the workers therefore stay active for an hour and accept other tasks during that period.
We have managed to get this architecture up and running; however, the boot time still bothers us, as it fluctuates between 40 and 80 seconds. Is there some way we can reduce it?
Below is the stack running on the worker instances:
Ubuntu AMI
Node JS (using forever-service for auto startup on boot)
Docker (the tasks are performed inside individual docker containers)
Have you taken a look at AWS Lambda? (https://aws.amazon.com/lambda)
Lambda supports Node.js and automatically manages the scaling of the required worker infrastructure depending on the number of requests. This avoids your "one hour bill" problem: you pay only for the processing time you use.
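A minimal sketch of what the move would look like from the CLI (the function name, role ARN, and zip file are all hypothetical; AWS CLI v2 syntax):

aws lambda create-function \
    --function-name task-worker \
    --runtime nodejs18.x \
    --handler index.handler \
    --zip-file fileb://worker.zip \
    --role arn:aws:iam::123456789012:role/task-worker-role

aws lambda invoke \
    --function-name task-worker \
    --cli-binary-format raw-in-base64-out \
    --payload '{"taskId": 42}' response.json

Each task becomes one invocation, so there are no instances to boot and no pre-warmed pool to manage.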

"Role cannot be reached by the host system" - Azure WorkerRole

I'm using Worker Role machines (Medium: 2 cores with 3.5 GB of RAM) to do heavy work, and I'm able to use 100% of the CPU (both cores) and 85% of the RAM.
During this work, which takes around 20 to 40 minutes, Azure decides the machine is unhealthy and stops all my work.
In the portal I see my worker instances getting the message "Waiting for the status (Role cannot be reached by the host system)".
Does anyone know a workaround that doesn't involve:
1) Using a more powerful role with cores that I will not use
2) Reducing my application's CPU usage (100% CPU usage is exactly what we want)
Thanks in advance
Rui
Try this:
Thread.CurrentThread.Priority = ThreadPriority.BelowNormal
Some other things (processes, threads) may need lower priorities as well, but this should keep CPU utilization at 100% while still letting the host agent get scheduled.
For (external) processes, start them with the following code (this is VB, but you should be able to convert it to your language):
' Launch the demanding work as a separate process...
Dim myprocess As New System.Diagnostics.Process()
myprocess.StartInfo.FileName = "C:\the\path\to\the\process.exe"
myprocess.Start()
' ...then drop its scheduling priority so the health probe is never starved
myprocess.PriorityClass = ProcessPriorityClass.BelowNormal
You could instead raise the priority of the worker role's own process, but that may interact with other processes, so watch out; it's better to lower the priority of the demanding process, which won't slow it down unless there is other work to be performed.
' Alternative: raise the worker role process itself above the busy workers
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.AboveNormal
This is affecting a service I'm running in Windows Azure as well.
I have just tried manually setting the priority of WaAppAgent to High. Hopefully that helps.
But really, this shouldn't be my problem. Sometimes my database runs at 100% CPU, and that is the worst possible time for a restart.
I really don't want to over-provision resources just so some heartbeat will be happy. Do the VM instances have a heartbeat event as well? Maybe the solution is to switch to a VM instead of a PaaS role?
