What is docker --kernel-memory - linux

Good day , I know that Docker containers are using the host's kernel (which is why containers are considered as lightweight vms) Here the the source . However, after reading Runtime Options part of a docker documentation I met an option called --kernel-memory. The doc says
The maximum amount of kernel memory the container can use.
I didn't understand what it does. My guess is every container will allocate some memory in host's kernel space .If so then what is the reason , isn't it vulnerable for a user process to allocate memory in kernel space ?

The whole CPU/Memory Limitation stuff is using cgroups.
You can find all settings performed by docker run (either per args or per default) under /sys/fs/cgroup/memory/docker/<container ID> for memory or /sys/fs/cgroup/cpu/docker/<container ID> for cpu.
So the --kernel-memory:
Reading: cat memory.kmem.limit_in_bytes
Writing: sudo -s echo 2167483648 > memory.kmem.limit_in_bytes
And also the benchmarking memory.kmem.max_usage_in_bytes and memory.kmem.usage_in_bytes which shows (rather selfexplaining) the current usage and the highest usage overall.
CGroup docs about Kernel Memory
For the functionality I will recommend reading Kernel Docs for CGroups V1 instead of the docker docs:
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
With the Kernel memory extension, the Memory Controller is able to
limit the amount of kernel memory used by the system. Kernel memory is
fundamentally different than user memory, since it can't be swapped
out, which makes it possible to DoS the system by consuming too much
of this precious resource.
[..]
The memory used is
accumulated into memory.kmem.usage_in_bytes, or in a separate counter
when it makes sense. (currently only for tcp). The main "kmem" counter
is fed into the main counter, so kmem charges will also be visible
from the user counter.
Currently no soft limit is implemented for kernel memory. It is future
work to trigger slab reclaim when those limits are reached.
and
2.7.2 Common use cases
Because the "kmem" counter is fed to the main user counter, kernel
memory can never be limited completely independently of user memory.
Say "U" is the user limit, and "K" the kernel limit. There are three
possible ways limits can be set:
U != 0, K = unlimited:
This is the standard memcg limitation mechanism already present before kmem
accounting. Kernel memory is completely ignored.
U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
deployments where the total amount of memory per-cgroup is overcommited.
Overcommiting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of his
QoS.
WARNING: In the current implementation, memory reclaim will NOT be
triggered for a cgroup when it hits K while staying below U, which makes
this setup impractical.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
triggered for the cgroup for both kinds of memory. This setup gives the
admin a unified view of memory, and it is also useful for people who just
want to track kernel memory usage.
Clumsy Attempt of a Conclusion
Given a running container with --memory="2g" --memory-swap="2g" --oom-kill-disable using
cat memory.kmem.max_usage_in_bytes
10747904
10 MB of Kernel-Memory in normal state. Would make sense to me to limit it, let's say to 20 MB of Kernel-Memory. Then it should kill or limit the container to protect the host. But due to the fact that there is - according to the docs - no possibility to reclaim the memory and the OOM Killer is starting to kill processes on host then even with a plenty of free memory (according to this: https://github.com/docker/for-linux/issues/1001) for me it is rather unpractical to use that.
The quoted option to set it >= memory.limit_in_bytes is not really helpful in that scenario either.
Deprecated
--kernel-memory is deprecated in v20.10, due to the fact someone (=Linux Kernel) realized all that as well..
What we can do then?
ULimit
Docker API exposes HostConfig|Ulimit which writes to /etc/security/limits.conf. For docker run should be --ulimit <type>=<soft>:<hard>. Use cat /etc/security/limits.conf or man setrlimit to see the categories and you can try to protect your system from filling kernel memory by e.g. generate unlimited processes with --ulimit nproc=500:500, but be careful, nproc works for users and not for containers, so count together..
To prevent DDoS (intentionally or unintentionally) i would suggest to limit at least nofile and nproc. Maybe someone can elaborate further..
sysctl:
docker run --sysctl can change kernel variables on message queue and shared memory, also network, e.g. docker run --sysctl net.ipv4.tcp_max_orphans= for orphan tcp connections which defaults on my system to 131072, and by a kernel memory usage of 64 kB each: Bang 8 GB on malfunction or dos. Maybe someone can elaborate further..

Related

What cause kernel to eat CPU on page_fault?

hw/os: linux 4.9, 64G RAM.
16 daemons running. Each reading random short (100 bytes) pieces of 5GiB file accessing it as a memory mapped via mmap() at daemon startup. Each daemon reads its own file, so 16 5GiB files total.
Each daemon making maybe 10 reads per second. Not too much, disk load is rather small.
Sometimes (1 event in 5 minutes, no period, totally random) some random daemon stuck in kernel code with the followind stack (see picture) for 300 milliseconds. This does not corellate with major-faults: the major-faults go at constant rate about 100...200 per second. Disk reads are also constant.
What can cause this?
Text of the image: __list_del_entry isolate_lru_pages.isra.48 shrink_inactive_list shrink_node_memcg shrink_node node_reclaim get_page_from_freelist enqueue_task_fair sched_clock __alloc_pages_nodemask alloc_pages_vma handle_mm_fault __do_page_fault page_fault
You have shrink_node and node_reclaim functions in your stack. They are called to free memory (which is shown as buff/cache by free command-line tool): https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#reclaim
The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (high watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
So your 64 GB RAM system has situation when there is no free memory left. This amount of memory is enough to hold a copy of 12 files of 5 GB each, and your daemons uses 16 files. Linux may read more data from files than it was required by application with Readahead technique ("Linux readahead: less tricks for more", ols 2007 pp273-284, man 2 readahead). MADV_SEQUENTIAL may also turn on this mechanism, https://man7.org/linux/man-pages/man2/madvise.2.html
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in
the given range can be aggressively read ahead, and may be
freed soon after they are accessed.)
MADV_RANDOM
Expect page references in random order. (Hence, read ahead
may be less useful than normally.)
Not sure how your daemons did open and read files, was MADV_SEQUENTIAL active for them or not (or was this flag added by glibc or any other library). Also there can be some effect from THP - Transpartent huge pages https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html. Normal 4.9 kernel is from 2016 and thp expansion for filesystems was planned in 2019 https://lwn.net/Articles/789159/, but if you use RHEL/CentOS, some features may be backported into fork of 4.9 kernel.
You should check free and cat /proc/meminfo output periodically to check how your daemons and linux kernel readahead uses memory.

Allocate memory to a process that other process can not use in linux

To limit memory resource for particular process we can use ulimit as well as cgroup.
I want to understand that if using cgroup, I have allocated say ~700 MB of memory to process A, on system having 1 GB of RAM, and some other process say B, requires ~400 MB of memory. What will happen in this case?
If process A is allocated ~750 MB of memory but using only 200 MB of memory, will process B can use memory that allocated to A?
If no then how to achieve the scenario when "fix amount of memory is assigned to a process that other process can not use"?
EDIT
Is it possible to lock physical memory for process? Or only VM can be locked so that no other process can access it?
There is one multimedia application that must remain alive and can use maximum system resource in terms of memory, I need to achieve this.
Thanks.
Processes are using virtual memory (not RAM) so they have a virtual address space. See also setrlimit(2) (called by ulimit shell builtin). Perhaps RLIMIT_RSS & RLIMIT_MEMLOCK are relevant. Of course, you could limit some other process e.g. using RLIMIT_AS or RLIMIT_DATA, perhaps thru pam_limits(8) & limits.conf(5)
You could lock some virtual memory into RAM using mlock(2), this ensures that the RAM is kept for the calling process.
If you want to improve performance, you might also use madvise(2) & posix_fadvise(2).
See also ionice(1) & renice(1)
BTW, you might consider using hypervisors like Xen, they are able to reserve RAM.
At last, you might be wrong in believing that your manual tuning could do better than a carefully configured kernel scheduler.
What other processes will run on the same system, and what kind of thing do you want to happen if the other multimedia program needs memory that other processes are using?
You could weight the multimedia process so the OOM killer only picks it as a last choice after every other non-essential process. You might see a dropped frame if the kernel takes some time killing something to free up memory.
According to this article, adjust the oom-killer weight of a process by writing to /proc/pid/oom_adj. e.g. with
echo -17 > /proc/2592/oom_adj

vm/min_free_kbytes - Why Keep Minimum Reserved Memory?

According to this article:
/proc/sys/vm/min_free_kbytes: This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim)
My question is that what does it mean by "those which cannot wait for reclaim"? In other words, I would like to understand why there's a need to tell the system to always keep a certain minimum amount of memory free and under what circumstances will this memory be used? [It must be used by something; don't see the need otherwise]
My second question: does setting this memory to something higher than 4MB (on my system) leads to better performance? We have a server which occasionally exhibit very poor shell performance (e.g. ls -l takes 10-15 seconds to execute) when certain processes get going and if setting this number to something higher will lead to better shell performance?
(link is dead, looks like it's now here)
That text is referring to atomic allocations, which are requests for memory that must be satisfied without giving up control (i.e. the current thread can not be suspended). This happens most often in interrupt routines, but it applies to all cases where memory is needed while holding an essential lock. These allocations must be immediate, as you can't afford to wait for the swapper to free up memory.
See Linux-MM for a more thorough explanation, but here is the memory allocation process in short:
_alloc_pages first iterates over each memory zone looking for the first one that contains eligible free pages
_alloc_pages then wakes up the kswapd task [..to..] tap into the reserve memory pools maintained for each zone.
If the memory allocation still does not succeed, _alloc pages will either give up [..] In this process _alloc_pages executes a cond_resched() which may cause a sleep, which is why this branch is forbidden to allocations with GFP_ATOMIC.
min_free_kbytes is unlikely to help much with the described "ls -l takes 10-15 seconds to execute"; that is likely caused by general memory pressure and swapping rather than zone exhaustion. The min_free_kbytes setting only needs to allow enough free pages to handle immediate requests. As soon as normal operation is resumed, the swapper process can be run to rebalance the memory zones. The only time I've had to increase min_free_kbytes is after enabling jumbo frames on a network card that didn't support dma scattering.
To expand on your second question a bit, you will have better results tuning vm.swappiness and the dirty ratios mentioned in the linked article. However, be aware that optimizing for "ls -l" performance may cause other processes to become slower. Never optimize for a non-primary usecase.
All linux systems will attempt to make use of all physical memory available to the system, often through the creation of a filesystem buffer cache, which put simply is an I/O buffer to help improve system performance. Technically this memory is not in use, even though it is allocated for caching.
"wait for reclaim", in your question, refers to the process of reclaiming that cache memory that is "not in use" so that it can be allocated to a process. This is supposed to be transparent but in the real world there are many processes that do not wait for this memory to become available. Java is a good example, especially where a large minimum heap size has been set. The process tries to allocate the memory and if it is not instantly available in one large contiguous (atomic?) chunk, the process dies.
Reserving a certain amount of memory with min_free_kbytes allows this memory to be instantly available and reduces the memory pressure when new processes need to start, run and finish while there is a high memory load and a full buffer cache.
4MB does seem rather low because if the buffer cache is full, any process that wants an immediate allocation of more than 4MB will likely fail. The setting is very tunable and system-specific, but if you have a few GB of memory available it can't hurt to bump up the reserve memory to 128MB. I'm not sure what effect it will have on shell interactivity, but likely positive.
This memory is kept free from use by normal processes. As #Arno mentioned, the special processes that can run include interrupt routines, which must be run now (as it's an interrupt), and finish before any other processes can run (atomic). This can include things like swapping out memory to disk when memory is full.
If the memory is filled an interrupt (memory management) process runs to swap some memory into disk so it can free some memory for use by normal processes. But if vm.min_free_kbytes is too small for it to run, then it locks up the system. This is because this interrupt process must run first to free memory so others can run, but then it's stuck because it doesn't have enough reserved memory vm.min_free_kbytes to do its task resulting in a deadlock.
Also see:
https://www.linbit.com/en/kernel-min_free_kbytes/ and
https://askubuntu.com/questions/41778/computer-freezing-on-almost-full-ram-possibly-disk-cache-problem (where the memory management process has so little memory to work with it takes so long to swap little by little that it feels like a freeze.)

Different between memory usage in WHM/Cpanel and Linux

If I go to WHM and see my server's memory usage, it says that only 16% of memory is in use.
But when I connect to server using SSH and run command "free -m" then it shows that 80% is in use. Why is that? I want to know exact memory usage of all applications running like MySQL, Apache e.t.c.
How do I view that?
Thanks
As they say, "It's Complicated".
Linux uses unused memory for disk buffering and caching. It speeds things up. But you may need to look at the -/+ buffers/cache line of free.
'ps' can show you, for any given process, or for all processes, the %cpu, %mem, cumulative cpu-time, rss (resident set size, the non-swapped physical memory that a process is using), size (very approximate amount of swap space that would be required if the process were to dirty all writable pages and then be swapped out), vsize (virtual memory usage of entire process (vm_lib + vm_exe + vm_data + vm_stack)), and much much more.
For any given process, you can cat /proc/$PID/status -- it's human readable -- and check out the VmSize, VmLck, VmRSS, VmData, VmStk, VmExe, VmLib, and VmPTE values, along with others...
But that's just for starters... Processes can allocate memory but not use it. (Memory can be allocated, but the memory pages are not created/issued until they're actually used. That whole on-demand thing.)
Processes can map in hardware space, showing up as using a large quantity of memory that's not actually coming from system RAM. (X-servers are known to sometimes do this. It's some wonky stuff involved kernel drivers...)
There's the executable, which is usually a memory-mapped file. Meaning that parts that are swapped-in are taking up RAM, but when swapped out it never takes up swapfile space.
Processes can have other memory-mapped files...
There's shared-memory libraries, where the same RAM pages are used by multiple programs concurrently.
So we have to ask, as far as memory goes, what exactly counts and what doesn't?

Allocating memory for process in linux

Dear all, I am using Redhat linux ,How to set maximum memory for particular process. For eg i have to allocate maximum memory usage to eclipse alone .Is it possible to allocate like this.Give me some solutions.
ulimit -v 102400
eclipse
...gives eclipse 100MiB of memory.
You can't control memory usage; you can only control virtual memory size, not the amount of actual memory used, as that is extremely complicated (perhaps impossible) to know for a single process on an operating system which supports virtual memory.
Not all memory used appears in the process's virtual address space at a given instant, for example kernel usage, and disc caching. A process can change which pages it has mapped in as often as it likes (e.g. via mmap() ). Some of a process's address space is also mapped in, but not actually used, or is shared with one or more other processes. This makes measuring per-process memory usage a fairly unachievable goal in practice.
And putting a cap on the VM size is not a good idea either, as that will result in the process being killed if it attempts to use more.
The right way of doing this in this case (for a Java process) is to set the heap maximum size (via various well-documented JVM startup options). However, experience suggests that you should not set it less than 1Gb.

Resources