Resident (RES) memory bloat in 1 of 15 sidekiq workers - linux

I have a Rails app running on an Ubuntu EC2 instance.
It has 15 Sidekiq workers, and it appears that 1 of those Sidekiq processes always ends up consuming a lot of memory, even when it runs out of jobs to run.
After a fresh Sidekiq restart, the Sidekiq processes each use approx. 500MB of Resident memory. Over the course of a day, I'll end up with one of them up to about 2GB of Resident memory, even when I then go clear the Sidekiq queue.
This eventually results in the machine running out of memory, and restarting Sidekiq resolves the issue.
Is it normal for 1 of the Sidekiq processes to climb higher than the rest? If not, how can I figure out what's causing this?
Thanks

Related

Node process gets killed when Memory Cgroup reports OOM, when running on instances with a high RAM and CPU cores, but works with small instances

When running a job as a pipeline in Gitlab Runner's K8s pod, the job gets completed successfully only when running on a small instance like m5*.large which offers 2 vCPUs and 8GB of RAM. We set a limit for the build, helper, and services containers mentioned below. Still, the job fails with an Out Of Memory (OOM) error, getting the process node killed by cgroup when running on an instance way more powerful like m5d*.2xlarge for example which offers 8 vCPUs and 32GB of RAM.
Note that we tried to dedicate high resources to the containers, especially the build one in which the node process is a child process of this and nothing changed when running on powerful instances; the node process still got killed because of OOM, each time we give it more memory, the node process consumed higher memory and so on.
Also, regarding the CPU usage, in powerful instances, the more vCPUs we gave it, the more is consumed and we noticed that it has CPU Throtelling at ~100% almost all the time, however, in the small instances like m5*.large, the CPU throttling didn't pass the 3%.
Note that we specified a maximum of memory that be used by the node process but it looks like it does not take any effect. We tried to set it to 1GB, 1.5GB and 3GB.
NODE_OPTIONS: "--max-old-space-size=1536"
Node Version
v16.19.0
Platform
amzn2.x86_64
Logs of the host where the job runs
"message": "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=....
....
"message": "Memory cgroup out of memory: Killed process 16828 (node) total-vm:1667604kB
resources request/limits configuration
memory_request = "1Gi"
memory_limit = "4Gi"
service_cpu_request = "100m"
service_cpu_limit = "500m"
service_memory_request = "250Mi"
service_memory_limit = "2Gi"
helper_cpu_request = "100m"
helper_cpu_limit = "250m"
helper_memory_request = "250Mi"
helper_memory_limit = "1Gi"
Resource consumption of a successful job running on m5d.large
Resource consumption of a failing job running on m5d.2xlarge
When a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation, with an out of memory (OOM) error.
Check did you enable persistent journaling in your container(s)?
One way: mkdir /var/log/journal && systemctl restart systemd-journald
Other way: in ystemd/man/journald.conf.html
If not and your container uses systemd, it will log to memory with limits derived from the host RAM which can lead to unexpected OOM situations..
Also if possible you can increase the amount of RAM (clamav does use quite a bit).
If the node experiences an out of memory (OOM) event prior to the kubelet being able to reclaim memory, the node depends on the oom_killer to respond.
Node out of memory behavior is well described in Kubernetes best practices: Resource requests and limits. Adjust memory requests (minimal threshold) and memory limits (maximal threshold) in your containers.
Pods crash and OS Syslog shows the OOM killer kills the container process, Pod memory limit and cgroup memory settings. Kubernetes manages the Pod memory limit with cgroup and OOM killer. We need to be careful to separate the OS OOM and the pods OOM.
Try to use the --oom-score-adj option to docker run or even --oom-kill-disable. Refer to Runtime constraints on resources for more info.
Also refer to the similar SO for more related information.

Docker doesn't kill containers on OOM

I made two containers, both malloc in a loop until the server runs out of memory on a remote server running Debian 9 with enabled swap (4 GB RAM 1 GB swap). When running a single one (the host doesn't have any other services running, pretty much only dockerd), it gets killed in a minute or so, and everything is fine. Running 2/3 at the same time cause the server to lock out, making SSH unresponsive. Why don't these containers (I suppose they have really high OOM scores) get killed by OOM?

Possible OOM in GCP container – how to debug?

I have celery running in a docker container on GCP with Kubernetes. Its workers have recently started to get kill -9'd – this looks like it has something to do with OOMKiller. There are no OOM events in kubectl get events, which is something to be expected if these events only appear when a pod has trespassed resources.limits.memory value.
So, my theory is that celery process getting killed is a work of linux' own OOMKiller. This doesn't make sense though: if so much memory is consumed that OOMKiller enters the stage, how is it possible that this pod was scheduled in the first place? (assuming that Kubernetes does not allow scheduling of new pods if the sum of resources.limits.memory exceeds the amount of memory available to the system).
However, I am not aware of any other plausible reason for these SIGKILLs than OOMKiller.
An example of celery error (there is one for every worker):
[2017-08-12 07:00:12,124: ERROR/MainProcess] Process 'ForkPoolWorker-7' pid:16 exited with 'signal 9 (SIGKILL)'
[2017-08-12 07:00:12,208: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Containers can be OOMKilled for two reasons.
If they exceed the memory limits of set for them. Limits are specified on a per container basis and if the container uses more memory than the limit it will be OOMKilled. From the process's point of view this is the same as if the system ran out of memory.
If the system runs out of memory. There are two kinds of resource specifications in Kubernetes: requests and limits. Limits specify the maximum amount of memory the container can use before being OOMKilled. Requests are used to schedule Pods and default to the limits if not specified. Requests must be less than or equal to container limits. That means that containers could be overcommitted on nodes and OOMKilled if multiple containers are using more memory than their respective requests at the same time.
For instance, if both process A and process B have request of 1GB and limit of 2GB, they can both be scheduled on a node that has 2GB of memory because requests are what is used for scheduling. Having requests less than the limit generally means that the container can burst up to 2GB but will usually use less than 1GB. Now, if both burst above 1GB at the same time the system can run out of memory and one container will get OOMKilled while still being below the limit set on the container.
You can debug whether the container is being OOMKilled by examining the containerStatuses field on the Pod.
$ kubectl get pod X -o json | jq '.status.containerStatuses'
If the pod was OOMKilled it will usually say something to that effect in the lastState field. In your case it looks like it may have been an OOM error based on issues filed against celery (like this one).

PM2 NodeJs Cluster Mode

I have 4 ec2 instances running on AWS. PM2 is running in cluster mode on all instances. When I get 5K+ Concurrent request, response time of app increases significantly.
All requests fetch redis key, and a normal fetch takes upto 10 seconds which without so many concurrent requests takes only 50ms. What can be issue here?
We need to pinpoint the bottleneck. Let's do some diagnostics:
Are the EC2 instances multicore to take advantage of PM2's clustering?
When you execute pm2 start app.js -i X are you sure X=number_of_vCPUs of EC2 instance?
When you execute pm2 monit do you see all instances of the cluster sharing the equal CPU and memory usage?
When you run htop what is your total CPU and memory usage %?
When you execute iftop what is your total of your RX and TX traffic compared to the maximum available in your machine?

Sidekiq running multiple jobs in a single heroku dyno?

I have noticed that on some of my sidekiq workers, they appear to be running multiple processes (Working multiple jobs concurrently) in a single dyno (The logs would suggest this).
How many processes could be/are running separate jobs within a single dyno concurrently without using swarming (The enterprise feature)?
I have everything set up to defaults without using swarms, so each sidekiq worker is using 25 threads. What exactly all these threads are used for, however, I have no idea. Can anyone help me understand how this translates into concurrent workers working jobs inside a single Heroku dyno?
You are seeing a single Sidekiq process with 25 threads running jobs concurrently. Each thread will execute a job so you can have up to 25 jobs running at once.
Without swarm, you can only run one process per dyno.
You can run multiple processes in a dyno using swarm but how many depends on the memory requirements of your app and how many cores in the dyno.
This will get you 100 worker threads: 4*25.
SIDEKIQ_COUNT=4 bundle exec sidekiqswarm -e production -c 25

Resources