We are facing a peculiar issue in our AWS service running on EC2 instances. In CloudWatch metrics, "mem_used_percent" gradually climbs over time until it reaches about 90%, followed by system failure. We have verified that the failure is caused by an OOM error, and restarting the hosts fixes it by bringing "mem_used_percent" back down to around 20%.
While investigating both new and long-running EC2 instances, we see that only around 20% of the RAM usage is accounted for in the "top" output (sorted by %MEM). We cannot pinpoint the processes using the rest of the unaccounted physical memory.
Is there a better way to do memory usage analysis on EC2 instances (Linux) that sums up to the "mem_used_percent" CloudWatch metric?
Please let me know if any other details are required.
Thanks!!
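For comparison, here is a rough sketch of setting the per-process numbers against the system-wide ones (the RSS column double-counts shared pages, so the sum is only an approximation):

# Sum the resident set sizes of all processes, in MB (shared pages are counted repeatedly)
ps -eo rss --no-headers | awk '{sum += $1} END {printf "total RSS: %.1f MB\n", sum/1024}'

# System-wide counters that never appear as per-process RSS in top:
# page cache, tmpfs (Shmem) and kernel slab memory
grep -E 'MemTotal|MemFree|MemAvailable|Buffers|^Cached|Shmem|Slab|SReclaimable' /proc/meminfo

# Kernel slab detail, often the culprit when no process appears to own the memory
sudo slabtop -o | head -n 20

If the missing memory shows up under Slab or Shmem rather than in any process, that would explain why top sorted by %MEM only accounts for ~20%.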
Take a look at the procstat plugin; it might help your investigation:
"The procstat plugin enables you to collect metrics from individual processes."
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-procstat-process-metrics.html#CloudWatch-Agent-procstat-process-metrics-collected
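If you go the procstat route, the agent picks up a procstat section under metrics_collected. A minimal sketch, with a placeholder process pattern and measurement names taken from the linked docs page:

"procstat": [
  {
    "pattern": "nginx",
    "measurement": ["cpu_usage", "memory_rss", "memory_vms"]
  }
]

After adding that to the CloudWatch agent config, reload it, e.g. (config path is a placeholder):

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/path/to/amazon-cloudwatch-agent.json -s

The resulting per-process procstat_memory_rss metrics can then be compared against mem_used_percent in CloudWatch.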
Related
I have an e2-medium GCP Compute Engine instance that has been running my application for a while now. It has worked well, but recently I have experienced frequent automatic restarts (3 times in the space of 2 weeks). Nothing in the logs points me to the reason, yet it keeps happening. Can someone please tell me what the problem could be?
VMs in GCP are not guaranteed to be up 100% of the time. Restarts could be caused by anything from hardware failures to Google performing maintenance on its physical servers (hardware repairs or software patching).
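If you want to check whether a restart came from the platform side, one thing to look at (a sketch, assuming the gcloud CLI is configured for the project; instance and zone names are placeholders) is the recent operations on the instance and its maintenance/restart policy:

# Recent zone operations that touched the instance, newest first
gcloud compute operations list --filter="targetLink~my-instance" --sort-by=~insertTime --limit=10

# How the instance is set to handle host maintenance and unexpected crashes
gcloud compute instances describe my-instance --zone=us-central1-a \
  --format="value(scheduling.onHostMaintenance, scheduling.automaticRestart)"

A host error or automatic restart event in that list would point at hardware or maintenance rather than anything in your application.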
I've got an App Service Plan with 14 GB of memory, which should be plenty for my application's needs. There are two identical App Services running on it; their private memory consumption hovers around 1 GB but can spike to 4 GB during periods of high usage. One app has a heavier usage pattern than the other.
Lately, during periods of high usage, I've noticed that the heavily used service can become unresponsive, and memory usage stays at 100% in the App Service Plan.
The high-traffic service is using 4 GB of private memory and starting to slow down massively. When I head over to the /scm.../ProcessExplorer/ page, I can see that the low-traffic service has 1 GB of private memory used and 10 GB of 'Working Set'.
As I understand it, on a single machine at least, the working set should be freed up when that memory is needed on another process. Does this happen naturally when two App Services share a single Plan?
It looks to me like the working set on the low-traffic instance is not being freed up to supply the needs of the high-traffic App Service.
If this is indeed the case, the simple fix is to move them to separate App Service Plans, each with 7 GB of memory. However, this seems like it might just be shifting the problem around. Has anyone else noticed similar issues with multiple apps on a single App Service Plan? As far as I understand it, they shouldn't interfere with one another to the extent that they all need to be separated. Or have I got the wrong diagnosis?
In some high memory-consumption scenarios, your app might truly require more computing resources. In that case, consider scaling to a higher service tier so the application gets all the resources it needs. Other times, a bug in the code might cause a memory leak, or a coding practice might increase memory consumption. Getting insight into what's triggering high memory consumption is a two-part process: first create a process dump, then analyze it. Crash Diagnoser from the Azure Site Extension Gallery can efficiently perform both of these steps. For more information, refer to Capture and analyze a dump file for intermittent high memory for Web Apps.
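If you want to grab a dump by hand before involving the extension, the Kudu Process API can, as far as I recall, return one directly; a sketch with placeholder site name, credentials and process id:

# List the processes the site is running (Kudu API, deployment credentials)
curl -u '<user>:<password>' https://<site>.scm.azurewebsites.net/api/processes

# Download a dump of a specific process for offline analysis, e.g. in WinDbg
curl -u '<user>:<password>' -o process.dmp \
  "https://<site>.scm.azurewebsites.net/api/processes/<pid>/dump?dumpType=2"

The dumpType value and endpoint shape are from memory, so double-check them against the Kudu documentation for your site.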
In the end we solved this one via mitigation, rather than getting to the root cause.
We found a mitigation strategy for our previous memory issues several months ago, which was simply to restart the server each night using a PowerShell script. This seems to prevent the memory from building up over time, and it only costs us a few seconds of downtime. Our system doesn't have much overnight traffic, as our users are all based in the same geographic location.
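The script itself isn't shown here; as a rough equivalent, the same nightly restart could be done with the Azure CLI from any scheduler (resource names below are placeholders):

# Restart both App Services in the plan each night
az webapp restart --name my-high-traffic-app --resource-group my-rg
az webapp restart --name my-low-traffic-app --resource-group my-rg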
However, we recently found that the overnight restart was reporting 'success' but was actually failing each night due to expired credentials, which meant that the memory issues described in my question were being exacerbated by server uptimes of several weeks. Restoring the overnight restart resolved the memory issues we were seeing, and we certainly don't see our system using 10 GB+ any more.
We'll investigate the memory issues further if they rear their heads again. KetanChawda-MSFT's suggestion of using memory dumps to analyse the memory usage will be employed when that investigation is needed.
I have a fleet of EC2 instances, A and B (both in the same AWS account, with the same Linux OS version, in the same region, but in different AZs and behind different load balancers).
When I send the same load to instances A and B, they behave differently.
Instance A works normally, with average CPU utilization up to 60%. Instance B, on the other hand, shows CPU utilization spiking up to 100%, then dropping back to 0 and climbing again, and the same behaviour is found on other instances in the fleet.
Has anyone experienced this before?
SSH to host B, watch the system activity via top, and look for the process consuming most of the CPU.
You can also inspect that process with the "lsof" command, or with:
ps -fp <PID of the process>
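A quick non-interactive way to rank processes by CPU, and then drill into a suspect, might look like this:

# Top 10 processes by CPU usage (one-off snapshot instead of interactive top)
ps -eo pid,user,%cpu,%mem,cmd --sort=-%cpu | head -n 11

# Then inspect the suspicious process in more detail
sudo lsof -p <PID> | head -n 20
ps -fp <PID>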
After analysis it was found that a couple of security patches were being executed, which was causing these spikes.
This has happened to me twice now with an MS Server instance running on EC2. In both cases it was the MS Update Service that had taken 100% of the CPU and burnt through all my CPU credits.
The only way to get back on and fix it was to enable "T2/T3 Unlimited" on the instance and stop/disable the MS Update Service.
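For reference, switching a burstable instance to unlimited credits can also be done from the AWS CLI; the flag name below is from memory, so verify it with aws ec2 modify-instance-credit-specification help (the instance id is a placeholder):

aws ec2 modify-instance-credit-specification \
  --instance-credit-specification "InstanceId=i-0123456789abcdef0,CpuCredits=unlimited"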
I have set up an Azure WebApp (Linux) to run WordPress and another hand-made PHP app. Everything works fine, but I get this weird CPU usage graph (see below).
Both apps are PHP 7.0 containers.
SSHing into the two containers and running top, I see no unusual CPU-hogging processes.
When I restart both apps, the CPU goes back to normal and then slowly starts to rise again, as shown below.
The number of HTTP requests to the apps has no relation to the CPU usage at all.
I tried to use apache2ctl to see if there were any pending requests, but that does not seem possible inside a Docker container.
Does anybody have an idea how to track down the cause of this?
This is the top output. The instance has 2 cores. There is lots of idle time, yet the load is still over 100% and none of the processes appear to be using the CPU ...
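One pattern that produces exactly this picture (high load, idle CPU, nothing hot in %CPU) is processes stuck in uninterruptible sleep, because Linux counts those in the load average even though they burn no CPU. A quick check from inside the container:

# Processes in uninterruptible sleep ("D" state) inflate the load average without using CPU
ps -eo state,pid,ppid,wchan:32,cmd | awk '$1 == "D"'

# Run queue (r) and blocked (b) tasks sampled every second
vmstat 1 5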
After dealing with MS Support on this issue, it seems to have boiled down to the WordPress theme being too slow or inefficient. Each request took very long and hogged CPU resources, so all subsequent requests started queuing up, thereby increasing the CPU load.
Why that would not show up as %CPU in top was not explained to me.
They proposed using a different theme or scaling up to a multi-core instance.
I am unsatisfied with that solution and will monitor further and try to find the real culprit.
I had almost exactly the same CPU Percentage graph as you did, although with a Node.js app instead of PHP. Disabling Diagnostic Logs > Docker Container Logging seems to have solved the problem for me.
I do not need those logs because I am logging to application insights.
But in your case you might need those logs. I have no solution for that, but I am guessing that more aggressive log rotation, or reducing the size of the logs by other means, might help.
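If it helps, the same logging toggle can also be flipped from the Azure CLI (app and resource group names are placeholders):

# Turn off Docker container logging for the web app
az webapp log config --name my-app --resource-group my-rg --docker-container-logging off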
I see very unusual simultaneous peaks in my Amazon EC2 CPU usage and network in/out.
This is the CPU usage:
This is the network in:
And my network out:
How can I know what process, file or command is doing this?
How can I recognize the problem and solve it?
This is something I would use top/htop and/or ps aux for. However, with your issue you will likely have to be watching the server at the times these spikes occur, so top or htop are probably the better options. My guess is that a cron job is causing this, as it appears that something runs every hour. Take a look in /etc/cron.hourly and, once you see the spike, check whether it references anything from that directory.
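A concrete way to line the spikes up with cron might look like this (paths are the usual Debian/Ubuntu ones and may differ on other distros):

# What is scheduled hourly and system-wide, plus per-user crontabs
ls -l /etc/cron.hourly/
cat /etc/crontab
sudo ls /var/spool/cron/

# What actually ran around the time of the spike
grep CRON /var/log/syslog | tail -n 50
# or, on systemd distros: sudo journalctl -u cron --since "2 hours ago"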
References:
https://superuser.com/questions/117913/ps-aux-output-meaning
http://linux.die.net/man/1/top
http://linux.die.net/man/1/ps
http://linux.die.net/man/1/htop