We are facing a peculiar issue in our AWS service running on EC2 instances. In CW metrics "mem_used_percent" is gradually going up with time and eventually goes to 90% followed by system failure. We have verified that failure is happening due to OOM error and restarting the hosts is fixing it by bringing the "mem_used_percent" down to around 20%.
While doing the investigation on new and old running EC2 instances, we are seeing that only around 20% of the RAM usage is accounted for in "top" command output(sort by %MEM). We are not able to actually pin-point the processes using rest of the unaccounted physical memory.
Is there a better way to do memory usage analysis on EC2 instances(Linux) that will sum up to "mem_used_percent" in CW metrics?
Please let me know if any other details are required.
Thanks!!
Take a look at "procstat plugin". It might help your investigation
"The procstat plugin enables you to collect metrics from individual processes".
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-procstat-process-metrics.html#CloudWatch-Agent-procstat-process-metrics-collected
I've been struggling to run multiple instances of Puppeteer on DigitalOcean for quite some time with little luck. I'm able to run ~5 concurrently using tools like puppeteer-cluster, but for some reason the whole thing just chokes with little helpful messaging. So, I switched to spawning ~5 child processes without any additional library -- just Puppeteer itself. Same issue. Chokes with no helpful errors.
I'm able to run all of these jobs just fine locally, but after I deploy, I hit these walls. So, my hunch is that it's a resource/performance issue, but I can't say for sure.
I'm running a droplet with 1GB and 3CPUs on Digital Ocean.
Basically, I'm just looking for ways to start troubleshooting something like this. is there a way I can know for sure that I'm hitting resource walls? I've tried pm2 and the DO dashboard graphs, but I feel like those are all leaving a lot of information out, or else I'm missing something else altogether.
Author of puppeteer-cluster here. You are right, 1 GB of memory is likely not enough for running 5 browser windows (or tabs) in addition to your operating system and maybe even other background tasks.
Here is a list of resources you should check:
Memory: Use a tool like htop to check your memory usage while your application is running.
CPU: Again, you can use htop for that, 3 vCPUs should be more than enough for 5 windows.
Disk space: Use a tool like df to check if there is enough space on the disk. I know of multiple cases in which there was not enough space on the disk (like some old kernels filling the disk), and Chrome needs at least some space to run.
Network throughput: Rarely the problem, but sometimes the network just does not have the bandwidth to support many open browser. Use a tool like nload to check the network throughput.
To use htop or nload, you start your script in the background (node script.js &) or use a terminal multiplexer (like tmux). Resource problems should then be easy to spot.
Most probably you're running out of memory, 5 puppeteer processes are a lot for a 1GB VM.
You can run
grep -i 'killed process' /var/log/messages
to confirm that the OOM killer terminated your processes.
I have set up an Azure WebApp (Linux) to run a WordPress and an other handmade PHP app on it. All works fine but I get this weird CPU usage graph (see below).
Both apps are PHP7.0 containers.
SSHing in to the two containers and using top I see no unusual CPU hogging processes.
When I reset both apps the CPU goes back to normal and then starts to raise slowly as shown below.
The amount of HTTP requests to the apps has not relation to the CPU usage at all.
I tried to use apache2ctl to see if there are any pending requests but that seems not possible to do inside a docker container.
Anybody got an idea how to track down the cause of this?
This is the top output. The instance has 2 cores. Lots of idle time but still over 100% load and none of the processes use the CPU ...
After handling with MS Support on that issue it seems to have boiled down to the WordPress theme being to slow or inefficient. Each request took very long and hogged CPU resources. All following requests started queuing up and thus increasing the CPU load.
Why that would not show as %CPU in top I was not explained.
They proposed to use a different theme or upscale to a multi core instance.
I am unsatisfied with that solution and will monitor further and try to find the real culprit.
I had almost exactly the same CPU Percentage graph as you did, although a Node.JS app instead of PHP. Disabling Diagnostic Logs > Docker Container Logging seems to have solved the problem for me.
I do not need those logs because I am logging to application insights.
But, in your case you might need more of those logs. I have no solution for that, but I am guessing that heavier log rotation or reducing the sizes of the logs by other means might help
I am working on a project which consists of a kernel distributed network file system.
I reached the point where I am testing my implementation. For that I would like to monitor the CPU load of it and by this I am referring to the kernel load of my module.
As I understood from a similar post there is no way of monitoring the load of a kernel module, therefore I was wondering which would be the best way to do it?
An example of testing my app would be to run the dd command in parallel.
At the moment I am using pidstat -c "dd" -p ALL to monitor the system load of command dd. At the same time I am looking at the top tool (top -d 0.2 to see more 'accurate' values).
With all these I do not feel confident that my way of monitoring is pretty accurate.
Any advice is highly appreciated.
Thank you.
You could use something like collectd to monitor all sorts of metrics, possibly showing it with Graphite for a simple overview with some processing tools (like averages over time).
That said, rather than monitoring the CPU load you could measure the throughput. By loading the system as much as you can you should be easily able to pinpoint which resource is the bottleneck: Disk I/O, network I/O, CPU, memory, or something else. And for a distributed network file system, you'll want to ensure that the bottleneck is very clearly disk or network I/O.
When running any kind of server under load there are several resources that one would like to monitor to make sure that the server is healthy. This is specifically true when testing the system under load.
Some examples for this would be CPU utilization, memory usage, and perhaps disk space.
What other resource should I be monitoring, and what tools are available to do so?
As many as you can afford to, and can then graph/understand/look at the results. Monitoring resources is useful for not only capacity planning, but anomaly detection, and anomaly detection significantly helps your ability to detect security events.
You have a decent start with your basic graphs. I'd want to also monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably this is related to memory usage), context switches.
I really like munin for graphing things related to hosts.
I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:
Network usage
CPU usage (% user,system,nice times)
Load averages (1m, 5m, 15m)
RAM usage (real, swap, shm)
Disc throughput
Active connections (by port number)
Number of processes (by process type)
Ping time from remote location
Time to SSL certificate expiry
MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)
Anything you can monitor with Zabbix, you can also attach triggers to - so it can restart failed services; or page you to alert about problems.
Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and the fact you'll be able to show what date and time problems started happening for when you need to hunt down and punish exactly which developer made bad changes :)
I ended up using dstat which is vmstat's nicer looking cousin.
This will show most everything you need to know about a machine's health,
including:
CPU
Disk
Memory
Network
Swap
"df -h" to make sure that no partition runs full which can lead to all kinds of funky problems, watching the syslog is of course also useful, for that I recommend installing "logwatch" (Logwatch Website) on your server which sends you an email if weird things start showing up in your syslog.
Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.
If you want more 'alerting' and less 'graphing', check out nagios.
As for 'what to monitor', you want to monitor systems at both the system and application level, so yes: network/memory/disk i/o, interrupts and such over the system level. The application level gets more specific, so a webserver might measure hits/second, errors/second (non-200 responses), etc and a database might measure queries/second, average query fulfillment time, etc.
Beware the afore-mentioned slowquerylog in mysql. It should only be used when trying to figure out why some queries are slow. It has the side-effect of making ALL your queries slow while it's enabled. :P It's intended for debugging, not logging.
Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitor it from your server -- have another machine watch the packets fly back and forth and record statistics about them.
(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
In addition to top and auth.log, I often look at mtop, and enable mysql's slowquerylog and watch mysqldumpslow.
I also use Nagios to monitor CPU, Memory, and logged in users (on a VPS or dedicated server). That last lets me know when someone other than me has logged in.
network of course :) Use MRTG to get some nice bandwidth graphs, they're just pretty most of the time.. until a spammer finds a hole in your security and it suddenly increases.
Nagios is good for alerting as mentioned, and is easy to get setup. You can then use the mrtg plugin to get alerts for your network traffic too.
I also recommend ntop as it shows where your network traffic is going.
A good link to get you going with Munin and Monit: link text
I typically watch top and tail -f /var/log/auth.log.