I want to test Pod eviction events caused by MemoryPressure (taint-based eviction) on my pods. To do that, I created a memory load on my instance, which has 2 vCPUs and 8 GB of RAM.
To create the load I ran this command:
stress-ng --vm 2 --vm-bytes 10G --timeout 60s
Output of memory usage
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       2.7Gi       1.0Gi       3.9Gi       4.1Gi       984Mi
Swap:            0B          0B          0B
But my node status shows no MemoryPressure. I have updated the kubelet eviction parameters as below:
evictionHard:
memory.available: "200Mi"
In summary, how can I create memory pressure on my worker nodes to test taint-based eviction?
Thanks
You could invoke the stress command multiple times. Check the script here.
The value for memory.available is derived from the cgroupfs instead of tools like free -m. This is important because free -m does not work in a container, and if users use the node allocatable feature, out-of-resource decisions are made local to the end-user Pod part of the cgroup hierarchy as well as the root node. This script reproduces the same set of steps that the kubelet performs to calculate memory.available. The kubelet excludes inactive_file (i.e. the number of bytes of file-backed memory on the inactive LRU list) from its calculation, as it assumes that memory is reclaimable under pressure.
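For reference, a minimal sketch along the lines of that script (assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory):

#!/bin/bash
# capacity from /proc/meminfo, usage and inactive_file from the root memory cgroup
memory_capacity_in_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(awk '/^total_inactive_file/ {print $2}' /sys/fs/cgroup/memory/memory.stat)

# working set = usage minus reclaimable inactive file-backed pages
memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
if [ "$memory_working_set" -lt 0 ]; then memory_working_set=0; fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
echo "memory.available: $((memory_available_in_bytes / 1024 / 1024))Mi"

If that number drops below your evictionHard threshold (200Mi here), the kubelet should report MemoryPressure and start evicting.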
Related
Good day. I know that Docker containers use the host's kernel (which is why containers are considered lightweight VMs); here is the source. However, after reading the Runtime options part of the Docker documentation I came across an option called --kernel-memory. The docs say:
The maximum amount of kernel memory the container can use.
I don't understand what it does. My guess is that every container allocates some memory in the host's kernel space. If so, what is the reason for limiting it, and isn't it a vulnerability for a user process to be able to allocate memory in kernel space?
The whole CPU/memory limitation mechanism is built on cgroups.
You can find all settings applied by docker run (either via arguments or by default) under /sys/fs/cgroup/memory/docker/<container ID> for memory and /sys/fs/cgroup/cpu/docker/<container ID> for CPU.
So, for --kernel-memory:
Reading: cat memory.kmem.limit_in_bytes
Writing: echo 2167483648 | sudo tee memory.kmem.limit_in_bytes
There are also memory.kmem.usage_in_bytes and memory.kmem.max_usage_in_bytes, which (rather self-explanatorily) show the current usage and the highest usage seen so far.
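For example, a rough way to look at these counters for a given container (the container name is a placeholder; the path assumes the default cgroupfs driver as above):

CID=$(docker inspect -f '{{.Id}}' my_container)
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.limit_in_bytes      # configured limit (a huge number means unlimited)
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.usage_in_bytes      # current kernel memory usage
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.max_usage_in_bytes  # high-water mark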
CGroup docs about Kernel Memory
For the functionality itself I recommend reading the kernel docs for cgroups v1 instead of the Docker docs:
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
With the Kernel memory extension, the Memory Controller is able to
limit the amount of kernel memory used by the system. Kernel memory is
fundamentally different than user memory, since it can't be swapped
out, which makes it possible to DoS the system by consuming too much
of this precious resource.
[..]
The memory used is
accumulated into memory.kmem.usage_in_bytes, or in a separate counter
when it makes sense. (currently only for tcp). The main "kmem" counter
is fed into the main counter, so kmem charges will also be visible
from the user counter.
Currently no soft limit is implemented for kernel memory. It is future
work to trigger slab reclaim when those limits are reached.
and
2.7.2 Common use cases
Because the "kmem" counter is fed to the main user counter, kernel
memory can never be limited completely independently of user memory.
Say "U" is the user limit, and "K" the kernel limit. There are three
possible ways limits can be set:
U != 0, K = unlimited:
This is the standard memcg limitation mechanism already present before kmem
accounting. Kernel memory is completely ignored.
U != 0, K < U:
Kernel memory is a subset of the user memory. This setup is useful in
deployments where the total amount of memory per-cgroup is overcommited.
Overcommiting kernel memory limits is definitely not recommended, since the
box can still run out of non-reclaimable memory.
In this case, the admin could set up K so that the sum of all groups is
never greater than the total memory, and freely set U at the cost of his
QoS.
WARNING: In the current implementation, memory reclaim will NOT be
triggered for a cgroup when it hits K while staying below U, which makes
this setup impractical.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
triggered for the cgroup for both kinds of memory. This setup gives the
admin a unified view of memory, and it is also useful for people who just
want to track kernel memory usage.
Clumsy Attempt of a Conclusion
Given a running container started with --memory="2g" --memory-swap="2g" --oom-kill-disable, running
cat memory.kmem.max_usage_in_bytes
10747904
that is roughly 10 MB of kernel memory in a normal state. It would make sense to me to limit it, say to 20 MB, so that the container is killed or throttled to protect the host. But given that, according to the docs, this memory cannot be reclaimed, and that the OOM killer starts killing processes on the host even with plenty of free memory (see https://github.com/docker/for-linux/issues/1001), it is rather impractical to use.
The quoted option to set it >= memory.limit_in_bytes is not really helpful in that scenario either.
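For completeness, a quick sketch of how such a K < U setup would be created and observed (image name and values are only illustrative, and the cgroup path assumes the default cgroupfs driver on cgroup v1):

docker run -d --name kmem-test \
  --memory=2g --memory-swap=2g --kernel-memory=20m \
  nginx

CID=$(docker inspect -f '{{.Id}}' kmem-test)
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.limit_in_bytes   # configured K
cat /sys/fs/cgroup/memory/docker/$CID/memory.kmem.usage_in_bytes   # current kernel memory usage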
Deprecated
--kernel-memory is deprecated as of Docker v20.10, since someone (namely the Linux kernel) came to the same realization.
What can we do then?
ULimit
The Docker API exposes Ulimits in HostConfig, which sets resource limits for the container's processes; with docker run it is --ulimit <type>=<soft>:<hard>. Use cat /etc/security/limits.conf or man setrlimit to see the available categories. You can try to protect your system from kernel memory exhaustion by, e.g., limiting the number of processes with --ulimit nproc=500:500, but be careful: nproc applies per user, not per container, so the count is shared across containers running as the same user.
To prevent a DoS (intentional or not) I would suggest limiting at least nofile and nproc. Maybe someone can elaborate further.
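For example, a rough sketch (the values and the nginx image are only illustrative, not recommendations):

docker run -d --name limited-app \
  --ulimit nofile=1024:2048 \
  --ulimit nproc=500:500 \
  nginx

# verify from inside the container
docker exec limited-app bash -c 'ulimit -n; ulimit -u'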
sysctl:
docker run --sysctl can change kernel variables for message queues and shared memory, and also networking, e.g. docker run --sysctl net.ipv4.tcp_max_orphans= for orphaned TCP connections, which defaults on my system to 131072. At a kernel memory cost of roughly 64 kB each, that is about 8 GB gone on a malfunction or DoS. Maybe someone can elaborate further.
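A hedged sketch of what that could look like (the value is purely illustrative; whether this particular sysctl is namespaced depends on your kernel, and Docker will refuse to set it if it is not):

docker run -d --name tuned-app \
  --sysctl net.ipv4.tcp_max_orphans=4096 \
  nginx

# confirm the value inside the container's network namespace
docker exec tuned-app cat /proc/sys/net/ipv4/tcp_max_orphans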
I've been looking all over StackOverflow for this, but can't find a satisfactory answer.
When running kubectl top nodes <node name> I get a memory utilisation of approx. 69% (Kubernetes showing roughly 21Gi of 32Gi being used). But if I go into the system itself and run the free command, as well as the top command, I see a total of 6GB of used memory (i.e. 20% - this is the information under the used column in the output of free) - way less than 69% of the total system memory of 32GB.
Even accounting for the differences in Gi and GB, there's still more than 40% difference unaccounted for. I know that Kubernetes uses the stats reported by /sys/fs/cgroup/memory/memory.usage_in_bytes to report on memory utilisation, but why would this be different than the utilisation reported by other processes on the system (especially sometimes higher)? Which one should I take as the source of truth?
Found the answer to my question here: https://serverfault.com/questions/902009/the-memory-usage-reported-in-cgroup-differs-from-the-free-command. In summary, Kubernetes uses the cgroup memory utilisation, which is reported in /sys/fs/cgroup/memory/memory.usage_in_bytes. The cgroup memory utilisation counts not only the memory currently in use by applications, but also "cached" memory (i.e. memory no longer required by apps, which is therefore free to be reclaimed by the OS but hasn't been reclaimed yet). The Linux system commands see "cached" memory as "free", but Kubernetes does not (I'm not sure why).
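A quick way to see the gap on a node (a rough sketch assuming cgroup v1 and the root memory cgroup; the numbers will obviously differ per system):

cgroup_usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
cgroup_cache=$(awk '/^total_cache/ {print $2}' /sys/fs/cgroup/memory/memory.stat)
free_used_kb=$(free -k | awk '/^Mem:/ {print $3}')
echo "cgroup usage:          $((cgroup_usage / 1024 / 1024)) MiB"
echo "  of which page cache: $((cgroup_cache / 1024 / 1024)) MiB"
echo "free 'used':           $((free_used_kb / 1024)) MiB"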
I have a one-node Kubernetes cluster, and the memory usage reported by the metrics server does not seem to be the same as the memory usage shown by the free command:
# kubectl top nodes
NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
<node_ip>    1631m        10%    13477Mi         43%
# free -m
              total        used        free      shared  buff/cache   available
Mem:          32010       10794         488          81       20727       19133
Swap:         16127        1735       14392
And the difference is significant: ~3 GB.
I have also tested this on a 3 node cluster, and the issue is present there too:
# kubectl top nodes
NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
<node_ip1>    1254m        8%     26211Mi         84%
<node_ip2>    221m         1%     5021Mi          16%
<node_ip3>    363m         2%     8731Mi          28%
<node_ip4>    1860m        11%    20399Mi         66%
# free -m (this is on node 1)
              total        used        free      shared  buff/cache   available
Mem:          32010        5787         369        1676       25853       24128
Swap:         16127           0       16127
Why is there a difference?
The answer to your question can be found here; it is a duplicate, so you may want to remove this post from Stack Overflow.
The metrics exposed by the Metrics Server are collected by an instance of cAdvisor on each node. What you see in the output of kubectl top node is how cAdvisor determines the current resource usage.
So, apparently cAdvisor and free determine resource usage in different ways. To find out why, you would need to dig into the internals of how cAdvisor and free work.
I am aware that I can limit the resources allocated to a container while provisioning using docker with the -c and -m flags for CPU and memory.
However, is there a way I can change these allocated resources to containers dynamically (after they have been provisioned) and without redeploying the same container with changed resources?
Docker (v1.11.1 at the time of writing) has the docker update command (see the docs). With it you can change allocated resources on the fly.
Usage: docker update [OPTIONS] CONTAINER [CONTAINER...]

Update configuration of one or more containers

  --blkio-weight          Block IO (relative weight), between 10 and 1000
  --cpu-shares            CPU shares (relative weight)
  --cpu-period            Limit CPU CFS (Completely Fair Scheduler) period
  --cpu-quota             Limit CPU CFS (Completely Fair Scheduler) quota
  --cpuset-cpus           CPUs in which to allow execution (0-3, 0,1)
  --cpuset-mems           MEMs in which to allow execution (0-3, 0,1)
  --help                  Print usage
  --kernel-memory         Kernel memory limit
  -m, --memory            Memory limit
  --memory-reservation    Memory soft limit
  --memory-swap           Swap limit equal to memory plus swap: '-1' to enable unlimited swap
  --restart               Restart policy to apply when a container exits
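For example, a minimal sketch (the container name and values are placeholders; when setting or lowering --memory it is safest to pass --memory-swap in the same call so the swap limit stays greater than or equal to the memory limit):

docker update --cpu-shares 512 --memory 1g --memory-swap 1g my_container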
Not at present, no. There is a desire to see someone implement it though: https://github.com/docker/docker/issues/6323
That could be coming in Docker 1.10 or 1.11 (Q1 2016): PR 15078 is implementing (Dec. 2015) support for changing resources (including CPU) both for stopped and running containers.
Update 2016: it is part of docker 1.10 and documented in docker update (PR 15078).
We decided to allow to set what we called resources, which consists of cgroup thingies for now, hence the following PR #18073.
The only allowed mutable elements of a container are in HostConfig and precisely in Resources (see the struct).
resources := runconfig.Resources{
BlkioWeight: *flBlkioWeight,
CpusetCpus: *flCpusetCpus, <====
CpusetMems: *flCpusetMems, <====
CPUShares: *flCPUShares, <====
Memory: flMemory,
MemoryReservation: memoryReservation,
MemorySwap: memorySwap,
KernelMemory: kernelMemory,
CPUPeriod: *flCPUPeriod,
CPUQuota: *flCPUQuota,
}
The command was going to be called set (in the end it became update).
The allowed changes are passed as flags, e.g. --memory=1Gb --cpushare=… (as this PR does).
There is one flag for each attribute of the Resources struct (no more, no less).
Note that changes made via docker set should persist, i.e. they are permanent (updated in the container's JSON).
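To double-check that the values were indeed persisted in the container's configuration (the container name is a placeholder):

docker inspect -f 'Memory={{.HostConfig.Memory}} CpuShares={{.HostConfig.CpuShares}}' my_container

The reported values survive a restart of the container.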
When I run an MPI job over InfiniBand, I get the following warning. We use the Torque resource manager.
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel module
parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: host1
Registerable memory: 65536 MiB
Total memory: 196598 MiB
Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
I've read the link in the warning message, and what I've done so far is:
1. Appended options mlx4_core log_num_mtt=20 log_mtts_per_seg=4 to /etc/modprobe.d/mlx4_en.conf.
2. Made sure the following lines are present in /etc/security/limits.conf:
* soft memlock unlimited
* hard memlock unlimited
3. Appended session required pam_limits.so to /etc/pam.d/sshd.
4. Made sure ulimit -c unlimited is uncommented in /etc/init.d/pbs_mom.
Can anyone help me find out what I'm missing?
Your mlx4_core parameters allow for the registration of only 2^20 * 2^4 * 4 KiB = 64 GiB. With 192 GiB of physical memory per node, and given that it is recommended to have at least twice as much registerable memory as RAM, you should set log_num_mtt to 23, which raises the limit to 512 GiB, the closest power of two greater than or equal to twice the amount of RAM. Be sure to reboot the node(s), or unload and then reload the kernel module.
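For instance, a sketch of the change and a quick check (the file name is whatever you already use under /etc/modprobe.d/, e.g. the mlx4_en.conf mentioned above):

# in /etc/modprobe.d/mlx4_en.conf (or mlx4_core.conf, depending on your setup)
options mlx4_core log_num_mtt=23 log_mtts_per_seg=4

# after rebooting or reloading the module, verify the values actually took effect
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg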
You should also submit a simple Torque job script that executes ulimit -l in order to verify the limits on locked memory and make sure there is no such limit. Note that ulimit -c unlimited does not remove the limit on the amount of locked memory but rather the limit on the size of core dump files.
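A minimal sketch of such a job script (the job name and resource requests are placeholders):

#!/bin/bash
#PBS -N memlock-check
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00
# print the locked-memory limit as seen inside the job;
# it should say "unlimited" if the limits.conf/PAM settings are effective for pbs_mom sessions
ulimit -l

Submit it with qsub and inspect the job's standard output file for "unlimited".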