Node process suddenly hits 100% CPU on an EC2 instance - node.js

I am running a Node application (Socket.io) on AWS EC2 (2 CPU, 4 GB RAM) behind a load balancer with auto scaling. I've set up systemd for the Node service, and it runs fine for a day at around 1% CPU, but the next day it stops working.
When I check the details with top, the Node process's CPU utilization is 100%, but AWS monitoring shows only 50% usage for the instance, and it stays there.
From various references, I'm not able to find the area of high consumption or the root cause.
Please help me figure out why Node maxes out only one CPU and what is causing the high consumption.
From cat /proc/PID/status
Thanks in advance!
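
If the process can be caught while it is pinned at 100%, a CPU profile will show which functions are consuming the time. Below is a minimal sketch using Node's built-in inspector module; the output path and the 30-second window are placeholders, not part of the original setup:

```js
// Minimal sketch: capture a 30-second CPU profile from inside the running
// process using Node's built-in inspector module, then write it to a file
// that can be loaded in Chrome DevTools (JavaScript Profiler).
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

session.post('Profiler.enable', () => {
  session.post('Profiler.start', () => {
    setTimeout(() => {
      session.post('Profiler.stop', (err, result) => {
        if (!err) {
          fs.writeFileSync('/tmp/node-cpu.cpuprofile', JSON.stringify(result.profile));
        }
        session.disconnect();
      });
    }, 30000); // profile for 30 seconds
  });
});
```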

Related

AWS-EC2 100% CPU Utilization

I am curious about two things regarding AWS EC2 CPU utilization and CPU credits.
1. What will happen if my EC2 instance's CPU utilization is constantly at 100% and I run out of CPU credits?
2. Assume my burstable (t-series) EC2 instance is on a shared host with an 8-core CPU, shared by two VMs of 4 cores each. Both VMs are utilizing full CPU, i.e. 400% each, at the same time, and both have enough CPU credits to keep using more CPU.
What should the expected result be here, and what will the actual result be?
Please comment if more clarification is required.
I am trying to understand how AWS CPU credits work in the above scenario.
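
One way to see where the credits stand in practice is to read the CPUCreditBalance metric from CloudWatch. A minimal sketch with the AWS SDK for JavaScript (v2); the region and instance id are placeholders:

```js
// Minimal sketch (AWS SDK for JavaScript v2): read the CPU credit balance
// of a burstable instance over the last hour. The region and instance id
// below are placeholders.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

cloudwatch.getMetricStatistics({
  Namespace: 'AWS/EC2',
  MetricName: 'CPUCreditBalance',
  Dimensions: [{ Name: 'InstanceId', Value: 'i-0123456789abcdef0' }],
  StartTime: new Date(Date.now() - 60 * 60 * 1000),
  EndTime: new Date(),
  Period: 300,               // 5-minute datapoints
  Statistics: ['Average']
}, (err, data) => {
  if (err) return console.error(err);
  for (const point of data.Datapoints) {
    console.log(point.Timestamp, point.Average);
  }
});
```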

Node web app running in Fargate crashes under load with memory and CPU relatively untaxed

We are running a Koa web app in 5 Fargate containers. They are pretty straightforward CRUD/REST APIs built with Koa over Mongo Atlas. We started doing capacity testing and noticed that the Node servers started to slow down significantly with plenty of headroom left on CPU (sitting at 30%), memory (sitting at or below 20%), and Mongo (still returning in < 10 ms).
To test this further, we removed the Mongo operations and just hammered our health-check endpoints. We did see a lot of throughput, but significant degradation occurred at 25% CPU and Node actually crashed at 40% CPU.
Our Fargate tasks (containers) are CPU: 2048 (2 "virtual CPUs") and memory: 4096 (4 GB).
We raised our ulimit nofile to 64000 and also set --max-old-space-size to 3.5 GB. Neither made a significant difference.
We also don't see significant latency at our load balancer.
My expectation was that CPU or memory would climb much higher before the system began experiencing issues.
Any ideas where a bottleneck might exist?
The main issue here was that we were running containers with 2 CPUs. Since Node effectively uses only 1 CPU, a certain amount of the CPU allocation was never used; the ancillary overhead never got the container to 100%. So Node would be overwhelmed on its one CPU while the other sat basically idle, which meant our autoscaling alarms never triggered.
We adjusted to 1-CPU containers with more horizontal scale-out (i.e. more instances).
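
For reference, the alternative to dropping to 1-CPU tasks would be to keep the 2 vCPUs and run one Node worker per CPU with the built-in cluster module. A minimal sketch, where './server' is a placeholder for the Koa entry point:

```js
// Minimal sketch: fork one worker per available CPU so a multi-vCPU task
// is not limited to a single event loop. './server' is a placeholder for
// the actual Koa app entry point.
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} exited, starting a replacement`);
    cluster.fork();
  });
} else {
  require('./server'); // each worker runs its own copy of the app
}
```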

Docker containers freezing

I'm currently trying to deploy a Node.js app in Docker containers. I need to deploy 30 of them, but at some point they start to behave strangely: some of them freeze.
I am currently running Docker for Windows 18.03.0-ce, build 0520e24302. My computer specs (CPU and memory):
i5 4670K
24 GB of RAM
My Docker default machine resource allocation is the following:
Allocated RAM: 10 GB
Allocated vCPUs: 4
My Node application runs on Alpine 3.8 and Node.js 11.4 and mostly makes HTTP requests every 2-3 seconds.
When I deploy 20 containers, everything runs like a charm: my application does its job and I can see there is activity on every one of my containers through the logs and activity stats.
The problem comes when I try to deploy more than 20 containers: I notice that some of the previously deployed containers stop their activity (0% CPU usage, logs frozen). When everything is deployed (30 containers), Docker starts to block the activity of some of them and unblocks them at some point to block others (the blocking/unblocking is random); it seems to be sequential. I waited to see what happened, and the result is that some of the containers are able to resume their activity while others are stuck forever (still running, but with no more activity).
It's important to note that I used the following resource restrictions on each of my containers:
MemoryReservation: 160 MB
Memory soft limit: 160 MB
NanoCPUs: 250000000 (0.25 CPUs)
I had to increase my Docker default machine resource allocation and decrease the containers' resource allocation because Docker was using almost 100% of my CPU; maybe I made a mistake in my configuration. I tried to tweak those values, but with no success: I still have some containers freezing.
I'm kind of lost right now.
Any help would be appreciated, even a small one. Thank you in advance!
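
For what it's worth, here is how those limits map onto the Docker Engine API when containers are created programmatically. A minimal sketch assuming the dockerode client library; the image name is a placeholder:

```js
// Minimal sketch (assumes the dockerode library): create one container with
// the limits described above -- 0.25 CPU and a 160 MB soft memory limit.
const Docker = require('dockerode');
const docker = new Docker(); // defaults to the local Docker socket

async function run() {
  const container = await docker.createContainer({
    Image: 'my-node-app:latest',            // placeholder image name
    HostConfig: {
      NanoCpus: 250000000,                  // 0.25 CPU
      MemoryReservation: 160 * 1024 * 1024  // 160 MB soft limit
    }
  });
  await container.start();
}

run().catch(console.error);
```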

PM2 NodeJs Cluster Mode

I have 4 EC2 instances running on AWS. PM2 is running in cluster mode on all instances. When I get 5K+ concurrent requests, the response time of the app increases significantly.
All requests fetch a Redis key; a fetch that takes only 50 ms without so many concurrent requests takes up to 10 seconds under this load. What can the issue be here?
We need to pinpoint the bottleneck. Let's do some diagnostics:
Are the EC2 instances multicore, so PM2's clustering can actually be taken advantage of?
When you execute pm2 start app.js -i X, are you sure X equals the number of vCPUs of the EC2 instance? (An equivalent ecosystem file is sketched below.)
When you execute pm2 monit, do you see all instances of the cluster sharing CPU and memory usage equally?
When you run htop, what is your total CPU and memory usage %?
When you execute iftop, what is your total RX and TX traffic compared to the maximum available on your machine?
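
To take the guesswork out of the -i flag, the same thing can be expressed in a PM2 ecosystem file. A minimal sketch, assuming the entry point is app.js (the app name is a placeholder):

```js
// Minimal sketch of ecosystem.config.js: run one instance per vCPU in
// cluster mode. 'app.js' and the app name are placeholders.
module.exports = {
  apps: [{
    name: 'api',
    script: 'app.js',
    instances: 'max',      // or the exact vCPU count of the instance
    exec_mode: 'cluster'
  }]
};
```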

ELK stack performance tuning

I am new to the ELK stack. I just installed it to give it a test drive for our production systems' log management, and started pushing logs (IIS & Event) from 10 Windows VMs using nxlog.
Since the installation, I have been receiving around 25K hits per 15 minutes according to my Kibana dashboard, and the size of /var/lib/elasticsearch/ has grown to around 15 GB in just 4 days.
I am facing serious performance issues: the Elasticsearch process is eating up all my CPU and around 90% of memory.
The Elasticsearch service previously got stuck, and /etc/init.d/elasticsearch stop/start/restart wasn't even working; the process kept running even after trying to kill it with the kill command, and a system reboot brought the machine back to the same condition. I deleted all the indices with a curl command and am now able to restart Elasticsearch.
I am using a standard A3 Azure instance (7 GB RAM, 4 cores) for this ELK setup.
Please guide me on tuning my ELK stack to achieve good performance. Thanks.
You are using 7 GB of RAM, so your JVM heap size for Elasticsearch should be less than 3.5 GB (no more than half of the available memory).
For more information you can read the Elasticsearch documentation on heap sizing.
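
To verify what the node is actually using, Elasticsearch exposes JVM heap usage through the _nodes/stats/jvm API. A minimal sketch for checking it from Node, assuming Elasticsearch is listening on the default http://localhost:9200:

```js
// Minimal sketch: print JVM heap usage per Elasticsearch node.
// Assumes Elasticsearch listens on the default http://localhost:9200.
const http = require('http');

http.get('http://localhost:9200/_nodes/stats/jvm', (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const stats = JSON.parse(body);
    for (const [id, node] of Object.entries(stats.nodes)) {
      console.log(id, `heap used: ${node.jvm.mem.heap_used_percent}%`);
    }
  });
}).on('error', console.error);
```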
