I have a mongod server running. Each day, I am executing the mongodump in order to have a backup. The problem is that the mongodump will take a lot of resources and it will slow down the server (that by the way already runs some other heavy tasks).
My goal is to somehow limit the mongodump which is called in a shell script.
Thanks.
You should use cgroups. Mount points and details are different on distros and a kernels. I.e. Debian 7.0 with stock kernel doesn't mount cgroupfs by default and have memory subsystem disabled (folks advise to reboot with cgroup_enabled=memory) while openSUSE 13.1 shipped with all that out of box (due to systemd mostly).
So first of all, create mount points and mount cgroupfs if not yet done by your distro:
mkdir /sys/fs/cgroup/cpu
mount -t cgroup -o cpuacct,cpu cgroup /sys/fs/cgroup/cpu
mkdir /sys/fs/cgroup/memory
mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory
Create a cgroup:
mkdir /sys/fs/cgroup/cpu/shell
mkdir /sys/fs/cgroup/memory/shell
Set up a cgroup. I decided to alter cpu shares. Default value for it is 1024, so setting it to 128 will limit cgroup to 11% of all CPU resources, if there are competitors. If there are still free cpu resources they would be given to mongodump. You may also use cpuset to limit numver of cores available to it.
echo 128 > /sys/fs/cgroup/cpu/shell/cpu.shares
echo 50331648 > /sys/fs/cgroup/memory/shell/memory.limit_in_bytes
Now add PIDs to the cgroup it will also affect all their children.
echo 13065 > /sys/fs/cgroup/cpu/shell/tasks
echo 13065 > /sys/fs/cgroup/memory/shell/tasks
I run couple of tests. Python that tries to allocate bunch of mem was Killed by OOM:
myaut#zenbook:~$ python -c 'l = range(3000000)'
Killed
I've also run four infinite loops and fifth in cgroup. As expected, loop that was run in cgroup got only about 45% of CPU time, while the rest of them got 355% (I have 4 cores).
All that changes do not survive reboot!
You may add this code to a script that runs mongodump, or use some permanent solution.
Related
I am running a process under a cgroup with an OOM Killer. When it performs a kill, dmesg outputs messages such as the following.
[9515117.055227] Call Trace:
[9515117.058018] [<ffffffffbb325154>] dump_stack+0x63/0x8f
[9515117.063506] [<ffffffffbb1b2e24>] dump_header+0x65/0x1d4
[9515117.069113] [<ffffffffbb5c8727>] ? _raw_spin_unlock_irqrestore+0x17/0x20
[9515117.076193] [<ffffffffbb14af9d>] oom_kill_process+0x28d/0x430
[9515117.082366] [<ffffffffbb1ae03b>] ? mem_cgroup_iter+0x1db/0x3c0
[9515117.088578] [<ffffffffbb1b0504>] mem_cgroup_out_of_memory+0x284/0x2d0
[9515117.095395] [<ffffffffbb1b0f95>] mem_cgroup_oom_synchronize+0x305/0x320
[9515117.102383] [<ffffffffbb1abf50>] ? memory_high_write+0xc0/0xc0
[9515117.108591] [<ffffffffbb14b678>] pagefault_out_of_memory+0x38/0xa0
[9515117.115168] [<ffffffffbb0477b7>] mm_fault_error+0x77/0x150
[9515117.121027] [<ffffffffbb047ff4>] __do_page_fault+0x414/0x420
[9515117.127058] [<ffffffffbb048022>] do_page_fault+0x22/0x30
[9515117.132823] [<ffffffffbb5ca8b8>] page_fault+0x28/0x30
[9515117.330756] Memory cgroup out of memory: Kill process 13030 (java) score 1631 or sacrifice child
[9515117.340375] Killed process 13030 (java) total-vm:18259139756kB, anon-rss:2243072kB, file-rss:30004132kB
I would like to be able to tell how much memory the cgroups OOM Killer believes the process is using at any given time.
Is there a way to query for this quantity?
I found the following in the official documentation for cgroup-v1, which shows how to query current memory usage, as well as altering limits:
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
3.2. Make the new group and move bash into it
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks
Since now we're in the 0 cgroup, we can alter the memory limit:
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
NOTE: We cannot set limits on the root cgroup any more.
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
We can check the usage:
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
1216512
I have understood that docker run -m 256m --memory-swap 256m will limit a container so that it can use at most 256 MB of memory and no swap. If it allocates more, then a process in the container (not "the container") will be killed. For example:
$ sudo docker run -it --rm -m 256m --memory-swap 256m \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1] (415) <-- worker 7 got signal 9
stress: WARN: [1] (417) now reaping child worker processes
stress: FAIL: [1] (421) kill error: No such process
stress: FAIL: [1] (451) failed run completed in 1s
Apparently one of the workers allocates more memory than is allowed and receives a SIGKILL. Note that the parent process stays alive.
Now if the effect of -m is to invoke the OOM killer if a process allocates too much memory, then what happens when specifying -m and --oom-kill-disable? Trying it like above has the following result:
$ sudo docker run -it --rm -m 256m --memory-swap 256m --oom-kill-disable \
stress --vm 1 --vm-bytes 2000M --vm-hang 0
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
(waits here)
In a different shell:
$ docker stats
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
f5e4c30d75c9 0.00% 256 MiB / 256 MiB 100.00% 0 B / 508 B 0 B / 0 B 2
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19391 root 20 0 2055904 262352 340 D 0.0 0.1 0:00.05 stress
I see the docker stats shows a memory consumption of 256 MB, and top shows a RES of 256 MB and a VIRT of 2000 MB. But, what does that actually mean? What will happen to a process inside the container that tries to use more memory than allowed? In which sense it is constrained by -m?
As i understand the docs --oom-kill-disable is not constrained by -m but actually requires it:
By default, kernel kills processes in a container if an out-of-memory
(OOM) error occurs. To change this behaviour, use the
--oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option. If the -m flag is not
set, this can result in the host running out of memory and require
killing the host’s system processes to free memory.
A developer stated back in 2015 that
The host can run out of memory with or without the -m flag set. But
it's also irrelevant as --oom-kill-disable does nothing unless -m is
passed.
In regard to your update, what happens when OOM-killer is disabled and yet the memory limit is hit (intresting OOM article), id say that new calls to malloc and such will just fail as described here but it also depends on the swap configuration and the hosts available memory. If your -m limit is above the actual available memory, the host will start killing processes, one of which might be the docker daemon (which they try to avoid by changing its OOM priority).
The kernel docs (cgroup/memory.txt) say
If OOM-killer is disabled, tasks under cgroup will hang/sleep in
memory cgroup's OOM-waitqueue when they request accountable memory
For the actual implementation (which docker utilizes as well) of cgroups, youd have to check the sourcecode.
The job of the 'oom killer' in Linux is to sacrifice one or more processes in order to free up memory for the system when all else fails. OOM killer is only enabled if the host has memory overcommit enabled
The setting of --oom-kill-disable will set the cgroup parameter to disable the oom killer for this specific container when a condition specified by -m is met. Without the -m flag, oom killer will be irrelevant.
The -m flag doesn’t mean stop the process when it uses more than xmb of ram, it’s only that you’re ensuring that docker container doesn’t consume all host memory, which can force the kernel to kill its process. With -m flag, the container is not allowed to use more than a given amount of user or system memory.
When container hits OOM, it won’t be killed but it can hang and stay in defunct state hence processes inside the container can’t respond until you manually intervene and do a restart or kill the container. Hope this helps clear your questions.
For more details on how kernel act on OOM, check Linux OOM management and Docker memory Limitations page.
I'm trying to control I/O bandwidth by using cgroup blkio controller.
Cgroup has been setup and mounted successfully, i.e. calling grep cgroup /proc/mounts
returns:
....
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
...
I then make a new folder in the blkio folder and write to the file blkio.throttle.read_bps_device, as follows:
1. mkdir user1; cd user1
2. echo "8:5 10485760" > blkio.throtlle.read_bps_device
----> echo: write error: Invalid argument
My device major:minor number is correct from using df -h and ls -l /dev/sda5 for the storage device.
And I can still write to file that requires no device major:minor number, such as blkio.weight (but the same error is thrown for blkio.weigth_device)
Any idea why I got that error?
Not sure which flavour/version of Linux you are using, on RHEL 6.x kernels, this was did not work for some reason, however it worked when I compiled on a custom kernel on RHEL and on other Fedora versions without any issues.
To check if supported on your kernel, run lssubsys -am | grep blkio. Check the path if you can file the file blkio.throttle.read_bps_device
However, here is an example of how you can do it persistently, set a cgroups to limit the program not to exceed more than 1 Mibi/s:
Get the MARJOR:MINOR device number from /proc/partitions
`cat /proc/partitions | grep vda`
major minor #blocks name
252 0 12582912 vda --> this is the primary disk (with MAJOR:MINOR -> 8:0)
Now if you want to limit your program to 1mib/s (convert the value to bytes/s) as follows. => 1MiB/s => 1024 kiB/1MiB * 1024 B/s = 1048576 Bytes/sec
Edit /etc/cgconfig.conf and add the following entry
group ioload {
blkio.throttle.read_bps_device = "252:0 1048576"
}
}
Edit /etc/cgrules.conf
*: blkio ioload
Restart the required services
`chkconfig {cgred,cgconfig} on;`
`service {cgred,cgconfig} restart`
Refer: blkio-controller.txt
hope this helps!
I have an application that uses hugepage and the application suddenly crashed due to some bug.
After crashing, since the application does not release the hugepage properly, the free hugepage number is not increased in sys filesystem.
$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
0
$ sudo cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
1024
Is there a way to release the hugepages by force?
Sometimes need to check all directory that hugetlbfs has been mounted.
So,
find mounted directory by command mount | grep huge.
check every directory except especially /dev/hugepages.
delete all 2M-sized files. (2M is the size of hugepage)
Use ipcs -m to list the shared memory segments.
Use ipcrm to remove the left over shared memory segments.
Edit on 06/24/2019:
Ok, so, the above answer, while correct as far as it goes, was a bit brief. In particular, if you have a host with multiple DB instances, and only one is crashed how can you determine which (if any) memory segments should be cleaned up?
Well, this too, can be done. For each running instance, connect w/ / as sysdba, then do oradebug setmypid (any pid will do, as all Oracle PIDs connect to the SGA). Then do oradebug ipc. That will (hopefully) return IPC information written to the trace file. So, go to the udump (or diag_dest) directory, and look for your trace file. It will contain all the IPC information for the instance. This will include ShmId. Look through the file for the ShmId(s) that this instance is using. Now look at the output of ipcs -m.
When you have done that for all the running instances, any memory segment output by ipcs -m that shows non-zero memory allocation, and that you cannot account for in the oradebug ipc information from any running instance, must be the left over memory segments from the crashed instance. Use ipcrm to remove it/them.
When doing this on a host with multiple running instances, this can be a bit fraught. Please proceed with caution. You don't want to remove the SGA of a running instance!
Hope that helps....
HugeTLB can either be used for shared memory (and Mark J. Bobak's answer would deal with that) or the app mmaps files created in a hugetlb filesystem. If the app crashes without removing those files they survive and keep corresponding memory 'allocated'.
Check hugeTLB filesystem and see if there are any leftover files from the app. Removing them would release the memory.
If you follow the instruction below, you can get rid of the allocated hugepages:
1) Let's check the hugepages which were free at restart
dpdk#dpdkvm:~$ ls /mnt/huge/
empty
dpdk#dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 256
...
2) Starting a dpdk application with wrong parameters, producing an error
dpdk#dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo ./build/kni -c 0x03 -n 2 -- -P -p 0x03 --config="(0,0,1),(1,0,1)"
...
EAL: Error - exiting with code: 1
Cause: No supported Ethernet device found
3) When I check hugepages, there is not any free
dpdk#dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 0
...
4) Now, when I check the mounted hugepage directory, I can see the files which are not given back to OS by dpdk application.
dpdk#dpdkvm:~/dpdk-1.8.0/examples/kni$ ls /mnt/huge/
...
rtemap_0 rtemap_137 rtemap_176 rtemap_214 rtemap_253 rtemap_62
...
5) Finally, if you remove the files starting with rtemap, you can give the hugepages back
dpdk#dpdkvm:~/dpdk-1.8.0/examples/kni$ sudo rm /mnt/huge/*
[sudo] password for dpdk:
dpdk#dpdkvm:~/dpdk-1.8.0/examples/kni$ cat /proc/meminfo
...
HugePages_Total: 256
HugePages_Free: 256
...
your hugetlb may be used by shared memory or mmap files.
try to remove the shared memories or umount the hugetlb fs
I'm using /dev/shm tmpfs for writing lots of temporary files.
A set of 8-10 files/second, with each set containing files which range from 70kB .. 750kB. The file sets are all approximately the same sizes and arrive to be written regularly about once per second.
The code which writes these files is python calling a library which uses fwrite() to do the writing.
When the application starts, the write times take from 30 to over 400ms. Usually it's the largest (700k) files which can take 400ms to write but then it varies.
Here is an example of a set of 8, given in ms: 42,30,320,76,66,72,102,440.
It seems that the standard deviation of writing to /dev/shm is quite large.
After the application runs for a couple of minutes, the time to write plummets and the variance is much smaller (e.g. 7,8,15,23,24,32,51,71) - this behavior is stable and I have run the application for several hours.
There are no other applications of consequence running concurrently and there is plenty of room on /dev/shm.
It seems that the Linux kernel is dynamically adjusting to the application's use of /dev/shm. My question is: is my suspicion about the linux kernel correct? If so, is there any way to configure or notify the kernel ahead of time to use the desired behavior when my application starts? (faster writes to /dev/shm)
I'm using Ubuntu 12.04 LTS
$ uname -a
Linux devsb02 3.2.0-24-generic #39-Ubuntu SMP Mon May 21 16:52:17 UTC 2012 x86_64 x86_64
x86_64 GNU/Linux
$ ls -l /dev/shm
lrwxrwxrwx 1 root root 8 Aug 23 12:19 /dev/shm -> /run/shm
$ mount
....
none on /run/shm type tmpfs (rw,nosuid,nodev)