Kubernetes Pods Terminated - Exit Code 137 - linux

I need some advice on an issue I am facing with k8s 1.14 running GitLab pipelines. Many jobs are failing with exit code 137 errors, and I found that it means the container was terminated abruptly.
Cluster information:
Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: c5.4xlarge
After digging in, I found the below logs:
**kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
**kubelet: E0114 03:37:08.653132** 4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes
**kubelet: W0114 03:37:23.240990** 4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up
**kubelet: W0114 00:15:51.106881** 4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.106907** 4781 container_gc.go:85] attempting to delete unused containers
**kubelet: I0114 00:15:51.116286** 4781 image_gc_manager.go:317] attempting to delete unused images
**kubelet: I0114 00:15:51.130499** 4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.130648** 4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:
1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)
And then the pods get terminated resulting in the exit code 137s.
Can anyone help me understand the reason and a possible solution to overcome this?
Thank you :)

Exit code 137 does not necessarily mean OOMKilled. It indicates that the container received a SIGKILL (some external interrupt, or the kernel's 'oom-killer' [OUT-OF-MEMORY]).
If the pod was OOMKilled, you will see the lines below when you describe the pod:
State: Terminated
Reason: OOMKilled
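A quick way to check this (pod name and namespace below are placeholders):
```
# Show the last terminated state of the container
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 'Last State'

# Or pull just the termination reason
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```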
Edit on 2/2/2022
I see that you added **kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%). and must evict pod(s) to reclaim ephemeral-storage to the log. This usually happens when application pods write a lot to disk, e.g. log files. Admins can configure at what disk-usage percentage image garbage collection and eviction kick in.
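If you manage the kubelet configuration yourself, this is roughly where those knobs live. A minimal KubeletConfiguration sketch, assuming the config-file route; the thresholds are illustrative (the ones from your log), not recommendations, and the same values can be passed as the --image-gc-high-threshold, --image-gc-low-threshold and --eviction-hard kubelet flags:
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Image GC starts above the high threshold and frees images down to the low threshold
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
# Hard eviction thresholds; illustrative values only
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
```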

137 means that k8s killed the container for some reason (maybe it didn't pass a liveness probe).
Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.

The typical causes for this exit code are the system running out of RAM, or a failed health check.

Was able to solve the problem.
The nodes initially had a 20G EBS volume on a c5.4xlarge instance type. I increased the EBS volume to 50G and then 100G, but that did not help, as I kept seeing the error below:
"Disk usage on image filesystem is at 95% which is over the high
threshold (85%). Trying to free 3022784921 bytes down to the low
threshold (80%). "
I then changed the instance type to c5d.4xlarge, which has 400GB of local NVMe storage, and gave it 300GB of EBS. This solved the error.
Some of the GitLab jobs were for Java applications that were eating up a lot of cache space and writing a lot of logs.
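For anyone hitting the same thing, a few commands that help confirm it really is disk and not memory (node name is a placeholder; the path depends on your container runtime):
```
# On the affected node: what is filling the image filesystem?
df -h /var/lib/docker        # or /var/lib/containerd
docker system df             # breakdown of images / containers / build cache

# From the cluster: is the node reporting DiskPressure?
kubectl describe node <node-name> | grep -i -A 2 pressure
```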

Detailed Exit code 137
It denotes that the process was terminated by an external signal.
The number 137 is a sum of two numbers: 128 + x, where x is the number of the signal sent to the process that caused it to terminate.
In the example, x equals 9, which is the number of the SIGKILL signal, meaning the process was killed forcibly.
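You can reproduce the arithmetic in a shell:
```
$ sleep 300 &
$ kill -9 $!      # send SIGKILL (signal 9) to the background job
$ wait $!
$ echo $?         # 128 + 9
137
```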
Hope this helps better.

Check the Jenkins master node's memory and CPU profile. In my case, the master was under high memory and CPU utilization, and the slaves were getting restarted with 137.

Related

Slurm memory resource management

Despite a thorough read of https://slurm.schedmd.com/slurm.conf.html, there are several things I don't understand regarding how Slurm manages the memory resource. My slurm.conf contains
DefMemPerCPU=1000
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=orion Gres=gpu:RTX2080Ti:4 MemSpecLimit=2048 RealMemory=95232
When not specifying --mem, jobs are launched with MIN_MEMORY=0 and don't seem to fail when allocating memory. What is the maximum memory a job can use, and how can I display it?
When specifying --mem=0, jobs are left pending, waiting for resources. How can this be?
The value provided in DefMemPerCPU=1000 doesn't seem to have an effect. Is it related to SelectTypeParameters=CR_Core_Memory? If so, what is the equivalent for CPU cores?
Ultimately, what should be the configuration for having a default memory limit?
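For reference, a few commands that show what limits Slurm is actually applying (the job id is a placeholder):
```
# Memory-related scheduler settings currently in effect
scontrol show config | grep -i -E 'DefMemPer|MaxMemPer|SelectTypeParameters'

# What a specific job requested and actually used
scontrol show job <jobid> | grep -i mem
sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,State
```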

Create Pressure on Kubernetes Nodes

I want to test the Pod eviction events caused by MemoryPressure, for taint-based eviction of my pods. To do that, I created a memory load on my instance, which has 2 vCPUs and 8GB of RAM.
To create the load, I ran this command:
stress-ng --vm 2 --vm-bytes 10G --timeout 60s
Output of memory usage:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       2.7Gi       1.0Gi       3.9Gi       4.1Gi       984Mi
Swap:            0B          0B          0B
But my node's status shows no MemoryPressure, even though I have updated the kubelet eviction parameters as below:
evictionHard:
memory.available: "200Mi"
In summary, how can I create memory pressure on my worker nodes to test taint-based eviction?
Thanks
You could invoke the stress command multiple times. Check the script here.
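Another option is to generate the load from inside a pod, so the usage is attributed to the node's allocatable memory. A sketch, assuming the commonly used polinux/stress image; --vm-bytes is a guess you would tune against your 8GB node and 200Mi threshold:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-pressure-test   # hypothetical name
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    # allocate ~6G and keep it allocated (--vm-hang 0 = hang forever)
    args: ["--vm", "1", "--vm-bytes", "6G", "--vm-hang", "0"]
    # no memory limit on purpose, so the usage counts towards node pressure
    # instead of the container being OOM-killed at its own limit
```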
The value for memory.available is derived from the cgroupfs instead of tools like free -m. This is important because free -m does not work in a container, and if users use the node allocatable feature, out-of-resource decisions are made local to the end-user Pod part of the cgroup hierarchy as well as the root node. This script reproduces the same set of steps that the kubelet performs to calculate memory.available. The kubelet excludes inactive_file (i.e. the number of bytes of file-backed memory on the inactive LRU list) from its calculation, as it assumes that memory is reclaimable under pressure.
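A sketch of that calculation, assuming cgroup v1 paths (on cgroup v2 the filenames differ):
```bash
#!/usr/bin/env bash
# Reproduces roughly what the kubelet does to compute memory.available
# for the root cgroup: capacity - (usage - inactive_file).
memory_capacity_in_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(grep total_inactive_file /sys/fs/cgroup/memory/memory.stat | awk '{print $2}')

memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
if [ "$memory_working_set" -lt 0 ]; then
  memory_working_set=0
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
echo "memory.available: $((memory_available_in_bytes / 1024 / 1024)) Mi"
```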

Multiple Cassandra node goes down

We have a 12-node Cassandra cluster across 2 different datacenters. We are migrating data from a SQL DB to Cassandra through a .NET application, and there is another .NET app that reads data from Cassandra. Recently we are seeing one node or another going down (nodetool status shows DN and the service is stopped on it). Below is the output of nodetool status. We have to start the service again to get it working, but it stops again.
https://ibb.co/4P1T453
Path to the log: https://pastebin.com/FeN6uDGv
So in looking through your pastebin, I see a few things that can be adjusted.
First I'm reasonably sure that this is your primary issue:
Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out,
especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
From GNU Error Codes:
Macro: int ENOMEM
“Cannot allocate memory.” The system cannot allocate more virtual
memory because its capacity is full.
-Xms12G, -Xmx12G, -Xmn3000M,
How much RAM is on your instance? From what I'm seeing your node is dying from an OOM (Out of Memory error). My guess is that you're designating too much RAM to the heap, and there isn't enough for the OS/page-cache. In fact, I wouldn't designate much more than 50%-60% of RAM to the heap.
For example, I mostly build instances on 16GB of RAM, and I've found that a 10GB max heap is about as high as you'd want to go on that.
-XX:+UseParNewGC, -XX:+UseConcMarkSweepGC
In fact, as you're using CMS GC, I wouldn't go higher than 8GB for max heap size.
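As a sketch of how that looks in practice on 3.11, in cassandra-env.sh (the same effect can be had by editing -Xms/-Xmx in jvm.options):
```
# cassandra-env.sh: pin the heap explicitly instead of letting it auto-size.
# 8G max heap as discussed above; the existing 3000M new-gen size is fine.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="3000M"
```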
Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low,
recommended value: 1048575, you can change it with sysctl.
This means you haven't adjusted your limits.conf or sysctl.conf. Check through the guide (DSE 6.0 - Recommended Production Settings), but generally it's a good idea to add the following to these files:
/etc/limits.conf
* - memlock unlimited
* - nofile 100000
* - nproc 32768
* - as unlimited
/etc/sysctl.conf
vm.max_map_count = 1048575
Note: After adjusting sysctl.conf, you'll want to run a sudo sysctl -p or reboot.
Is swap disabled? : false,
You will want to disable swap. If Cassandra starts swapping contents of RAM to disk, things will get really slow. Run a swapoff -a and then edit /etc/fstab and remove any swap entries.
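Something along these lines (double-check /etc/fstab by hand rather than trusting the sed blindly):
```
sudo swapoff -a                                   # turn swap off now
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab    # comment out swap entries so it stays off after reboot
```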
tl;dr; Summary
Set your initial and max heap sizes to 8GB (heap new size is fine).
Modify your limits.conf and sysctl.conf files appropriately.
Disable swap.
It's also a good idea to get on the latest version of 3.11 (3.11.4).
Hope this helps!

Filebeat - Failed to publish events caused by: read tcp x.x.x.x:36196->x.x.x.x:5045: i/o timeout

Hi, I'm running into a problem while sending logs via Filebeat to Logstash.
In short: I can't see the logs in Kibana, and when tailing the Filebeat log I see a lot of these:
ERROR logstash/async.go:235 Failed to publish events caused by: read tcp x.x.x.x:36246->y.y.y.y:5045: i/o timeout (where y.y.y.y is the Logstash address and 5045 is the open Beats port)
More details:
I have ~60 machines with filebeat 6.1.1 installed and one logstash machine with logstash 6.2.3 installed.
Some Filebeats successfully send their logs while some throw the error I mentioned above.
The Filebeats that don't error are sending old logs - that is, I can see in the Logstash debug logs that some log timestamps are 2 or 3 days old.
Logstash memory usage is 35% and CPU usage is near 75% at peaks.
In the netstat -tupn output on the Filebeat machines I can see the established connections from Filebeat to Logstash.
Can someone help me find the problem?
It looks like a Logstash performance issue. The CPU usage is probably too high; memory could be higher. Increase the minimum (Xms) and maximum (Xmx) heap allocation size to the total amount of RAM in the host minus 1 (leave 1G to the OS), and set them equal (Xms = Xmx).
You can also run another Logstash instance and balance the Filebeat output across the two, and see what happens.
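For the heap and load-balancing suggestions above, a sketch (host names are placeholders; 5045 matches the port used in the question):
```
# logstash config/jvm.options -- e.g. on an 8 GB Logstash host, leaving ~1 GB for the OS
-Xms7g
-Xmx7g
```
```yaml
# filebeat.yml -- spread the output across two Logstash instances
output.logstash:
  hosts: ["logstash-1.example.com:5045", "logstash-2.example.com:5045"]
  loadbalance: true
```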
More things to consider:
Performance Checklist
Check the performance of input sources and output destinations:
Logstash is only as fast as the services it connects to. Logstash can only consume and produce data as fast as its input and output destinations can!
Check system statistics:
CPU
Note whether the CPU is being heavily used. On Linux/Unix, you can run top -H to see process statistics broken out by thread, as well as total CPU statistics.
If CPU usage is high, skip forward to the section about checking the JVM heap and then read the section about tuning Logstash worker settings.
Memory
Be aware of the fact that Logstash runs on the Java VM. This means that Logstash will always use the maximum amount of memory you allocate to it.
Look for other applications that use large amounts of memory and may be causing Logstash to swap to disk. This can happen if the total memory used by applications exceeds physical memory.
I/O Utilization
Monitor disk I/O to check for disk saturation.
Disk saturation can happen if you’re using Logstash plugins (such as the file output) that may saturate your storage.
Disk saturation can also happen if you’re encountering a lot of errors that force Logstash to generate large error logs.
On Linux, you can use iostat, dstat, or something similar to monitor disk I/O.
Monitor network I/O for network saturation.
Network saturation can happen if you’re using inputs/outputs that perform a lot of network operations.
On Linux, you can use a tool like dstat or iftop to monitor your network.
Check the JVM heap:
Oftentimes CPU utilization can go through the roof if the heap size is too low, resulting in the JVM constantly garbage collecting.
A quick way to check for this issue is to double the heap size and see if performance improves. Do not increase the heap size past the amount of physical memory. Leave at least 1GB free for the OS and other processes.
You can make more accurate measurements of the JVM heap by using either the jmap command line utility distributed with Java or by using VisualVM. For more info, see Profiling the Heap.
Always make sure to set the minimum (Xms) and maximum (Xmx) heap allocation size to the same value to prevent the heap from resizing at runtime, which is a very costly process.
Tune Logstash worker settings:
Begin by scaling up the number of pipeline workers by using the -w flag. This will increase the number of threads available for filters and outputs. It is safe to scale this up to a multiple of CPU cores, if need be, as the threads can become idle on I/O.
You may also tune the output batch size. For many outputs, such as the Elasticsearch output, this setting will correspond to the size of I/O operations; in the case of the Elasticsearch output, it is the bulk batch size. A sketch of these settings follows below.
More info here.
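Tying the heap and worker advice together, the settings mentioned in the checklist live in logstash.yml; the values here are starting points to tune while watching CPU and heap, not recommendations:
```yaml
# logstash.yml
pipeline.workers: 8        # defaults to the number of CPU cores; safe to raise if workers idle on I/O
pipeline.batch.size: 250   # events each worker collects before running filters/outputs (default 125)
```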

Docker not reporting memory usage correctly?

Through some longevity testing with Docker (docker 1.5 and 1.6 with no memory limit) on CentOS 7 / RHEL 7, and observing the systemd-cgtop stats for the running containers, I noticed what appeared to be very high memory use. Typically the particular application, running in a non-containerized state, only utilizes around 200-300MB of memory. Over a 3-day period I ended up seeing systemd-cgtop reporting that my container was up to 13G of memory used. While I am not an expert Linux admin by any means, I started digging into this, which pointed me to the following articles:
https://unix.stackexchange.com/questions/34795/correctly-determining-memory-usage-in-linux
http://corlewsolutions.com/articles/article-6-understanding-the-free-command-in-ubuntu-and-linux
So basically what I am understanding is that to determine the actual free memory within the system, one should look at the -/+ buffers/cache: line within "free -m" and not the top line. I also noticed that the top line of "free -m" constantly shows increasing memory used and decreasing free memory, just like what I am observing with my container through systemd-cgtop. If I look at the -/+ buffers/cache: line, I see stable amounts of memory being used/free. Also, if I observe the actual process within top on the host, I can see the process itself is only ever using less than 1% of memory (0.8% of 32G).
I am a bit confused as to what's going on here. If I set a memory limit of 500-1000M for a container (I believe it would effectively be twice as much due to swap), would my process eventually be stopped when I reach the memory limit, even though the process itself is not using anywhere near that much memory? If anybody out there has any feedback on this, that would be great. Thanks!
I used Docker on CentOS 7 for a while and was confused by the same thing. Looking at the GitHub issue linked below, it looks like docker stats in this release is kind of misleading.
https://github.com/docker/docker/issues/10824
So I just ignored the memory usage reported by docker stats.
It's been a year since you asked, but I'm adding an answer here for anyone else interested. If you set a memory limit, I think the container would not be killed unless it fails to reclaim unused memory. The cgroups metrics, and consequently docker stats, show page cache + RSS. You could look at the detailed cgroups metrics to see the breakdown.
I had a similar issue, and when I tested with a memory limit I saw that the container is not killed; rather, the memory is reclaimed and reused.
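To see that breakdown yourself, assuming cgroup v1 with the cgroupfs driver (as on CentOS 7); the container name is a placeholder:
```
CID=$(docker inspect -f '{{.Id}}' my-container)
# cache vs. RSS as the kernel accounts them for this container's cgroup
# (with the systemd cgroup driver the path is system.slice/docker-<id>.scope instead)
grep -E 'total_cache|total_rss' /sys/fs/cgroup/memory/docker/"$CID"/memory.stat
# compare with what docker reports
docker stats --no-stream my-container
```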
