What does `--oom-kill-disable` do for a Docker container? - linux

I have understood that docker run -m 256m --memory-swap 256m will limit a container so that it can use at most 256 MB of memory and no swap. If it allocates more, then a process in the container (not "the container") will be killed. For example:
$ sudo docker run -it --rm -m 256m --memory-swap 256m \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1] (415) <-- worker 7 got signal 9
stress: WARN: [1] (417) now reaping child worker processes
stress: FAIL: [1] (421) kill error: No such process
stress: FAIL: [1] (451) failed run completed in 1s
Apparently one of the workers allocates more memory than is allowed and receives a SIGKILL. Note that the parent process stays alive.
Now if the effect of -m is to invoke the OOM killer if a process allocates too much memory, then what happens when specifying -m and --oom-kill-disable? Trying it like above has the following result:
$ sudo docker run -it --rm -m 256m --memory-swap 256m --oom-kill-disable \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
(waits here)
In a different shell:
$ docker stats
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
f5e4c30d75c9 0.00% 256 MiB / 256 MiB 100.00% 0 B / 508 B 0 B / 0 B 2
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19391 root 20 0 2055904 262352 340 D 0.0 0.1 0:00.05 stress
I see that docker stats shows a memory consumption of 256 MB, and top shows a RES of 256 MB and a VIRT of 2000 MB. But what does that actually mean? What will happen to a process inside the container that tries to use more memory than allowed? In what sense is it constrained by -m?

As I understand the docs, --oom-kill-disable is not constrained by -m but actually requires it:
By default, kernel kills processes in a container if an out-of-memory
(OOM) error occurs. To change this behaviour, use the
--oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option. If the -m flag is not
set, this can result in the host running out of memory and require
killing the host’s system processes to free memory.
A developer stated back in 2015 that
The host can run out of memory with or without the -m flag set. But
it's also irrelevant as --oom-kill-disable does nothing unless -m is
passed.
In regard to your update, what happens when the OOM killer is disabled and yet the memory limit is hit (interesting OOM article): I'd say that new calls to malloc and such will just fail as described here, but it also depends on the swap configuration and the host's available memory. If your -m limit is above the actually available memory, the host will start killing processes, one of which might be the Docker daemon (which they try to avoid by changing its OOM priority).
The kernel docs (cgroup/memory.txt) say
If OOM-killer is disabled, tasks under cgroup will hang/sleep in
memory cgroup's OOM-waitqueue when they request accountable memory
For the actual implementation of cgroups (which Docker uses as well), you'd have to check the source code.
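If you want to observe this state directly, you can read the container's memory cgroup on the host. A minimal sketch, assuming cgroup v1 and the default /sys/fs/cgroup/memory/docker/<id> layout (paths differ on cgroup-v2 hosts); the container name is a placeholder:
$ CID=$(docker inspect --format '{{.Id}}' my_container)
$ cat /sys/fs/cgroup/memory/docker/$CID/memory.oom_control
oom_kill_disable 1
under_oom 1
While the stress workers hang, under_oom should read 1; with the OOM killer enabled it stays 0 because the offending process gets killed instead.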

The job of the 'oom killer' in Linux is to sacrifice one or more processes in order to free up memory for the system when all else fails. The OOM killer is only enabled if the host has memory overcommit enabled.
Setting --oom-kill-disable sets the cgroup parameter that disables the OOM killer for this specific container when the condition specified by -m is met. Without the -m flag, the OOM killer is irrelevant.
The -m flag doesn't mean "stop the process when it uses more than X MB of RAM"; it ensures that the container cannot consume all host memory, which could force the kernel to kill host processes. With the -m flag, the container is not allowed to use more than a given amount of user or system memory.
When the container hits its OOM condition with the killer disabled, it won't be killed, but it can hang, with its processes stuck in the memory cgroup's OOM wait queue, so they can't respond until you manually intervene and restart or kill the container. Hope this helps clear up your questions.
For more details on how the kernel acts on OOM, check the Linux OOM management and Docker memory limitations pages.
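As a concrete illustration of the manual intervention mentioned above, a small sketch (the container name is a placeholder):
$ docker stats --no-stream my_container    # memory pinned at the -m limit, processes unresponsive
$ docker restart my_container              # bounce the container...
$ docker kill my_container                 # ...or stop it outright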

Related

Is there a way to read the memory counter used by cgroups to kill processes?

I am running a process under a cgroup with an OOM Killer. When it performs a kill, dmesg outputs messages such as the following.
[9515117.055227] Call Trace:
[9515117.058018] [<ffffffffbb325154>] dump_stack+0x63/0x8f
[9515117.063506] [<ffffffffbb1b2e24>] dump_header+0x65/0x1d4
[9515117.069113] [<ffffffffbb5c8727>] ? _raw_spin_unlock_irqrestore+0x17/0x20
[9515117.076193] [<ffffffffbb14af9d>] oom_kill_process+0x28d/0x430
[9515117.082366] [<ffffffffbb1ae03b>] ? mem_cgroup_iter+0x1db/0x3c0
[9515117.088578] [<ffffffffbb1b0504>] mem_cgroup_out_of_memory+0x284/0x2d0
[9515117.095395] [<ffffffffbb1b0f95>] mem_cgroup_oom_synchronize+0x305/0x320
[9515117.102383] [<ffffffffbb1abf50>] ? memory_high_write+0xc0/0xc0
[9515117.108591] [<ffffffffbb14b678>] pagefault_out_of_memory+0x38/0xa0
[9515117.115168] [<ffffffffbb0477b7>] mm_fault_error+0x77/0x150
[9515117.121027] [<ffffffffbb047ff4>] __do_page_fault+0x414/0x420
[9515117.127058] [<ffffffffbb048022>] do_page_fault+0x22/0x30
[9515117.132823] [<ffffffffbb5ca8b8>] page_fault+0x28/0x30
[9515117.330756] Memory cgroup out of memory: Kill process 13030 (java) score 1631 or sacrifice child
[9515117.340375] Killed process 13030 (java) total-vm:18259139756kB, anon-rss:2243072kB, file-rss:30004132kB
I would like to be able to tell how much memory the cgroups OOM Killer believes the process is using at any given time.
Is there a way to query for this quantity?
I found the following in the official documentation for cgroup-v1, which shows how to query the current memory usage as well as how to alter limits:
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
3.2. Make the new group and move bash into it
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks
Since now we're in the 0 cgroup, we can alter the memory limit:
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
NOTE: We cannot set limits on the root cgroup any more.
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
We can check the usage:
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
1216512
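For an already running process (such as the java process from the dmesg output), a sketch of the same queries against its existing cgroup, assuming cgroup v1; the hierarchy number and the /my-cgroup path are illustrative:
# Find which memory cgroup the process belongs to
$ grep memory /proc/13030/cgroup
9:memory:/my-cgroup
# Current usage, historical peak, and a per-type breakdown as the memory controller accounts them
$ cat /sys/fs/cgroup/memory/my-cgroup/memory.usage_in_bytes
$ cat /sys/fs/cgroup/memory/my-cgroup/memory.max_usage_in_bytes
$ cat /sys/fs/cgroup/memory/my-cgroup/memory.stat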

Elasticsearch process memory locking failed

I have set bootstrap.memory_lock=true
Updated /etc/security/limits.conf and added memlock unlimited for the elasticsearch user
My Elasticsearch was running fine for many months. Suddenly it failed a day ago. In the logs I can see the error below and the process never starts:
ERROR: bootstrap checks failed
memory locking requested for elasticsearch process but memory is not locked
I ran ulimit -as and I can see max locked memory set to unlimited. What is going wrong here? I have been trying for hours, but all in vain. Please help.
OS is RHEL 7.2
Elasticsearch 5.1.2
ulimit -as output
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 83552
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Here is what I have done to lock the memory on my ES nodes on RedHat/Centos 7 (it will work on other distributions if they use systemd).
You must make the change in 4 different places:
1) /etc/sysconfig/elasticsearch
On sysconfig: /etc/sysconfig/elasticsearch you should have:
ES_JAVA_OPTS="-Xms4g -Xmx4g"
MAX_LOCKED_MEMORY=unlimited
(replace 4g with HALF your available RAM as recommended here)
2) /etc/security/limits.conf
On security limits config: /etc/security/limits.conf you should have
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
3) /usr/lib/systemd/system/elasticsearch.service
On the service script: /usr/lib/systemd/system/elasticsearch.service you should uncomment:
LimitMEMLOCK=infinity
You should run systemctl daemon-reload after changing the service script.
4) /etc/elasticsearch/elasticsearch.yml
On elasticsearch config finally: /etc/elasticsearch/elasticsearch.yml you should add:
bootstrap.memory_lock: true
That's it. Restart your node and the RAM will be locked; you should notice a major performance improvement.
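To verify, a quick sketch (the _nodes filter is what the Elasticsearch docs suggest for checking mlockall; adjust host and port if yours differ):
$ systemctl show elasticsearch | grep -i limitmemlock
LimitMEMLOCK=infinity
$ curl -s 'localhost:9200/_nodes?filter_path=**.mlockall'
{"nodes":{"<node-id>":{"process":{"mlockall":true}}}}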
OS = Ubuntu 16
ElasticSearch = 5.6.3
I also used to have the same problem.
I set in elasticsearch.yml
bootstrap.memory_lock: true
and I got in my logs:
memory locking requested for elasticsearch process but memory is not locked
I tried several things, but actually you need to do only one thing (according to https://www.elastic.co/guide/en/elasticsearch/reference/master/setting-system-settings.html):
file:
/etc/systemd/system/elasticsearch.service.d/override.conf
add
[Service]
LimitMEMLOCK=infinity
A little bit of explanation.
The really funny thing is that systemd does not care about ulimit settings at all (https://fredrikaverpil.github.io/2016/04/27/systemd-and-resource-limits/). You can easily check this fact.
Set in /etc/security/limits.conf
elasticsearch - memlock unlimited
Check that max locked memory is unlimited for the elasticsearch user:
$ sudo su elasticsearch -s /bin/bash
$ ulimit -l
disable bootstrap.memory_lock: true in /etc/elasticsearch/elasticsearch.yml
# bootstrap.memory_lock: true
start service elasticsearch via systemd
# service elasticsearch start
Check what max memory lock setting the elasticsearch service has after it is started:
# systemctl show elasticsearch | grep -i limitmemlock
OMG! Even though we have set an unlimited max memlock size via ulimit, systemd completely ignores it.
LimitMEMLOCK=65536
So we come to the conclusion: to start Elasticsearch via systemd with bootstrap.memory_lock: true enabled, we don't need to care about the ulimit settings, but we do need to set the limit explicitly in the systemd config file. End of story.
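A minimal sketch of putting that override in place without creating the file by hand (systemctl edit opens the drop-in for you):
$ sudo systemctl edit elasticsearch      # creates /etc/systemd/system/elasticsearch.service.d/override.conf
  [Service]
  LimitMEMLOCK=infinity
$ sudo systemctl daemon-reload
$ sudo systemctl restart elasticsearch
$ systemctl show elasticsearch | grep -i limitmemlock
LimitMEMLOCK=infinity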
Try setting
MAX_LOCKED_MEMORY=unlimited in the /etc/sysconfig/elasticsearch file, and
LimitMEMLOCK=infinity in /usr/lib/systemd/system/elasticsearch.service.
Make sure that the process that starts Elasticsearch is configured with an unlimited memlock. If, for example, you start Elasticsearch as a different user than the one configured in /etc/security/limits.conf, or as root while limits.conf only defines a wildcard entry (which does not apply to root), it won't work.
Test it to be sure:
You could, for example, put ulimit -a ; exit just after the "# Start Daemon" line in /etc/init.d/elasticsearch and start it with bash /etc/init.d/elasticsearch start (adapt accordingly to your start mechanism).
check for the actual limit when the process is running (albeit short) with:
cat /proc/<pid>/limits
You will find lines similar to this:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
<truncated>
Then, depending on the runner or container (in my case it was supervisord's minfds value), you can lift the actual limit in its configuration.
I hope it gives a little hint for more general cases.
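For the memlock case specifically, the same check against the running Elasticsearch process looks roughly like this (the PID is made up; the main class name is the usual one for ES 5+):
$ pgrep -f org.elasticsearch.bootstrap.Elasticsearch
12345
$ grep 'Max locked memory' /proc/12345/limits
Max locked memory         unlimited            unlimited            bytes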
Followed this post
On Ubuntu 18.04 with Elasticsearch 6.x, there was no LimitMEMLOCK=infinity entry in /usr/lib/systemd/system/elasticsearch.service.
So adding it to that file and setting MAX_LOCKED_MEMORY=unlimited in /etc/default/elasticsearch did the trick.
The JVM options can be set in the /etc/elasticsearch/jvm.options file.
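For completeness, the relevant jvm.options entries look like this (the 4g values are placeholders; keep the heap at no more than half your RAM, as recommended above):
# /etc/elasticsearch/jvm.options
-Xms4g
-Xmx4g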
If you use the tar distribution and want to monitor it with monit, you have to tell monit to use unlimited; all other places for this configuration are ignored.
Add ulimit -s unlimited at the beginning of /etc/init.d/monit, then run systemctl daemon-reload, followed by service monit restart and monit start $yourMonitLabel.
One thing it "can" be is that your /tmp is mounted with noexec https://discuss.elastic.co/t/not-able-to-start-elasticsearch-due-to-failed-memory-lock/158009/6 check your logs and see if it complains about .UnsatisfiedLinkError: Native library
especially CentOS/RedHat but maybe others? Might be fixed in ES 7?

How to limit CPU and RAM resources for mongodump?

I have a mongod server running. Each day, I execute mongodump in order to have a backup. The problem is that mongodump takes a lot of resources and slows down the server (which, by the way, already runs some other heavy tasks).
My goal is to somehow limit mongodump, which is called from a shell script.
Thanks.
You should use cgroups. Mount points and details differ between distros and kernels. E.g., Debian 7.0 with the stock kernel doesn't mount cgroupfs by default and has the memory subsystem disabled (folks advise rebooting with cgroup_enable=memory), while openSUSE 13.1 ships with all of that out of the box (mostly due to systemd).
So first of all, create mount points and mount cgroupfs if not yet done by your distro:
mkdir /sys/fs/cgroup/cpu
mount -t cgroup -o cpuacct,cpu cgroup /sys/fs/cgroup/cpu
mkdir /sys/fs/cgroup/memory
mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory
Create a cgroup:
mkdir /sys/fs/cgroup/cpu/shell
mkdir /sys/fs/cgroup/memory/shell
Set up the cgroup. I decided to alter the CPU shares. The default value is 1024, so setting it to 128 limits the cgroup to about 11% of all CPU resources when there are competitors. If there are still free CPU resources, they will be given to mongodump. You may also use cpuset to limit the number of cores available to it.
echo 128 > /sys/fs/cgroup/cpu/shell/cpu.shares
echo 50331648 > /sys/fs/cgroup/memory/shell/memory.limit_in_bytes
Now add PIDs to the cgroup; this will also affect all their children.
echo 13065 > /sys/fs/cgroup/cpu/shell/tasks
echo 13065 > /sys/fs/cgroup/memory/shell/tasks
I ran a couple of tests. A Python process that tried to allocate a bunch of memory was killed by the OOM killer:
myaut#zenbook:~$ python -c 'l = range(3000000)'
Killed
I also ran four infinite loops outside the cgroup and a fifth one inside it. As expected, the loop that ran in the cgroup got only about 45% of the CPU time, while the rest of them got 355% (I have 4 cores).
Note that these changes do not survive a reboot!
You may add this code to the script that runs mongodump, or use some permanent solution.
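A minimal wrapper sketch along those lines (the cgroup names match the ones created above; the backup path is a placeholder, not a tested script):
#!/bin/sh
# Put this shell (and therefore mongodump and its children) into the cgroups
echo $$ > /sys/fs/cgroup/cpu/shell/tasks
echo $$ > /sys/fs/cgroup/memory/shell/tasks
# Run the backup inside the limited cgroup
mongodump --out /backup/$(date +%F)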

How to release hugepages from the crashed application

I have an application that uses hugepages, and the application suddenly crashed due to some bug.
After crashing, since the application did not release the hugepages properly, the number of free hugepages does not increase in the sysfs filesystem.
$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
0
$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
1024
Is there a way to release the hugepages by force?
Sometimes you need to check every directory where hugetlbfs has been mounted.
So:
find the mounted directories with the command mount | grep huge,
check every such directory, not only /dev/hugepages,
and delete all the 2M-sized files there (2M is the size of a hugepage), as in the sketch below.
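For example (the second mount point and its contents are hypothetical; only /dev/hugepages is mounted by default):
$ mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
nodev on /mnt/huge type hugetlbfs (rw,relatime)
$ sudo rm /mnt/huge/*
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
1024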
Use ipcs -m to list the shared memory segments.
Use ipcrm to remove the left over shared memory segments.
Edit on 06/24/2019:
Ok, so, the above answer, while correct as far as it goes, was a bit brief. In particular, if you have a host with multiple DB instances and only one has crashed, how can you determine which (if any) memory segments should be cleaned up?
Well, this too, can be done. For each running instance, connect w/ / as sysdba, then do oradebug setmypid (any pid will do, as all Oracle PIDs connect to the SGA). Then do oradebug ipc. That will (hopefully) return IPC information written to the trace file. So, go to the udump (or diag_dest) directory, and look for your trace file. It will contain all the IPC information for the instance. This will include ShmId. Look through the file for the ShmId(s) that this instance is using. Now look at the output of ipcs -m.
When you have done that for all the running instances, any memory segment output by ipcs -m that shows non-zero memory allocation, and that you cannot account for in the oradebug ipc information from any running instance, must be the left over memory segments from the crashed instance. Use ipcrm to remove it/them.
When doing this on a host with multiple running instances, this can be a bit fraught. Please proceed with caution. You don't want to remove the SGA of a running instance!
Hope that helps....
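A bare-bones sketch of that cleanup (the shmid is made up; remove a segment only after confirming it does not appear in any running instance's oradebug ipc trace):
$ ipcs -m
------ Shared Memory Segments --------
key        shmid   owner   perms  bytes       nattch  status
0x00000000 98304   oracle  640    2147483648  0
$ ipcrm -m 98304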
HugeTLB can either be used for shared memory (and Mark J. Bobak's answer would deal with that) or the app mmaps files created in a hugetlb filesystem. If the app crashes without removing those files they survive and keep corresponding memory 'allocated'.
Check hugeTLB filesystem and see if there are any leftover files from the app. Removing them would release the memory.
If you follow the instructions below, you can get rid of the allocated hugepages:
1) Let's check the hugepages which were free after a restart:
dpdk@dpdkvm:~$ ls /mnt/huge/
empty
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 256
...
2) Starting a DPDK application with wrong parameters, producing an error:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo ./build/kni -c 0x03 -n 2 -- -P -p 0x03 --config="(0,0,1),(1,0,1)"
...
EAL: Error - exiting with code: 1
Cause: No supported Ethernet device found
3) When I check the hugepages, there are none free:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 0
...
4) Now, when I check the mounted hugepage directory, I can see the files which were not given back to the OS by the DPDK application:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ ls /mnt/huge/
...
rtemap_0 rtemap_137 rtemap_176 rtemap_214 rtemap_253 rtemap_62
...
5) Finally, if you remove the files starting with rtemap, you can give the hugepages back:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo rm /mnt/huge/*
[sudo] password for dpdk:
dpdk@dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 256
...
Your hugetlb pages may be used by shared memory or mmapped files.
Try removing the shared memory segments or unmounting the hugetlbfs filesystem.

Why is the system CPU time (% sy) high?

I am running a script that loads big files. I ran the same script on a single-core OpenSuSE server and on a quad-core PC. As expected, it is much faster on my PC than on the server. But the script slows down the server and makes it impossible to do anything else.
My script is
for 100 iterations
Load saved data (about 10 mb)
time myscript (in PC)
real 0m52.564s
user 0m51.768s
sys 0m0.524s
time myscript (in server)
real 32m32.810s
user 4m37.677s
sys 12m51.524s
I wonder why "sys" is so high when i run the code in server. I used top command to check the memory and cpu usage.
It seems there is still free memory, so swapping is not the reason. % sy is so high, its probably the reason for the speed of server but I dont know what is causing % sy so high. The process that is using highest percent of CPU (99%) is "myscript". %wa is zero in the screenshot but sometimes it gets very high (50 %).
When the script is running, load average is greater than 1 but have never seen to be as high as 2.
I also checked my disc:
strt:~ # hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 16480 MB in 2.00 seconds = 8247.94 MB/sec
Timing buffered disk reads: 20 MB in 3.44 seconds = 5.81 MB/sec
john#strt:~> df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 245G 102G 131G 44% /
udev 4.0G 152K 4.0G 1% /dev
tmpfs 4.0G 76K 4.0G 1% /dev/shm
I have checked these things, but I am still not sure what the real problem on my server is or how to fix it. Can anyone identify a probable reason for the slowness? What could be the solution?
Or is there anything else I should check?
Thanks!
You're getting high sys activity because loading the data requires system calls that execute in the kernel. It might be possible to resolve your slowness problem without upgrading the server: you can modify the scheduling priority. See the man pages for nice and renice. See here, and especially:
Niceness values range from -20 (the highest priority, lowest niceness) to 19 (the lowest priority, highest niceness).
$ ps -lp 941
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 941 1 0 70 -10 - 1713 poll_s ? 00:00:00 sshd
$ nice -n 19 ./test.sh
My niceness value is 19
$ renice -n 10 -p 941
941 (process ID) old priority -10, new priority 10
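If the loads also hammer the disk (the %wa spikes suggest they might), nice can be combined with ionice; a hedged one-liner sketch, assuming a scheduler that honors the idle I/O class:
$ nice -n 19 ionice -c 3 ./myscript    # lowest CPU priority, idle I/O class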
