cgroup blkio files cannot be written

I'm trying to control I/O bandwidth by using cgroup blkio controller.
The cgroup hierarchy has been set up and mounted successfully, i.e. grep cgroup /proc/mounts
returns:
....
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
...
I then make a new folder in the blkio folder and write to the file blkio.throttle.read_bps_device, as follows:
1. mkdir user1; cd user1
2. echo "8:5 10485760" > blkio.throttle.read_bps_device
----> echo: write error: Invalid argument
The device major:minor numbers are correct; I confirmed them with ls -l /dev/sda5 (and df -h to identify the storage device).
I can still write to files that don't take a device major:minor number, such as blkio.weight (but the same error is thrown for blkio.weight_device).
Any idea why I get that error?

I'm not sure which flavour/version of Linux you are using. On RHEL 6.x kernels this did not work for some reason; however, it worked without any issues when I compiled a custom kernel on RHEL, and on other Fedora versions.
To check whether it is supported on your kernel, run lssubsys -am | grep blkio, then check that path to see if the file blkio.throttle.read_bps_device exists.
Here is an example of how you can do it persistently: set up a cgroup that limits the program to no more than 1 MiB/s.
Get the MAJOR:MINOR device number from /proc/partitions
`cat /proc/partitions | grep vda`
major minor #blocks name
252 0 12582912 vda --> this is the primary disk (MAJOR:MINOR -> 252:0)
Now convert the 1 MiB/s limit to bytes/s: 1 MiB/s = 1024 KiB/s × 1024 B/KiB = 1048576 bytes/s.
Edit /etc/cgconfig.conf and add the following entry
group ioload {
    blkio {
        blkio.throttle.read_bps_device = "252:0 1048576";
    }
}
Edit /etc/cgrules.conf
*	blkio	ioload
Enable and restart the required services
`chkconfig cgconfig on; chkconfig cgred on`
`service cgconfig restart; service cgred restart`
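For a quick, non-persistent test of the same limit (my own sketch, not from the original answer; it assumes a cgroup v1 blkio hierarchy mounted at /sys/fs/cgroup/blkio, a root shell, and the user1 group from the question), you can compute the bytes/s value and echo it in directly:

```shell
# Sketch: compute bytes/s from a MiB/s figure, then write it into the
# throttle file directly (does not survive a reboot).
mib=1
bps=$(( mib * 1024 * 1024 ))
echo "252:0 $bps"    # -> 252:0 1048576
# echo "252:0 $bps" > /sys/fs/cgroup/blkio/user1/blkio.throttle.read_bps_device
```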
Refer: blkio-controller.txt
hope this helps!

Related

How to change the cpu scaling_governor value via udev at the booting time

Hi everyone. I'm trying to use udev to set my CPUs' scaling_governor value from powersave to performance. Here is my udev rule file:
[root@node1 ~]$ cat /etc/udev/rules.d/50-scaling-governor.rules
SUBSYSTEM=="cpu", KERNEL=="cpu[0-9]|cpu[0-9][0-9]", ACTION=="add", ATTR{cpufreq/scaling_governor}="performance"
Before testing my 50-scaling-governor.rules, let's see what the value of scaling_governor is first.
[root@node1 ~]$ cat /sys/devices/system/cpu/cpu16/cpufreq/scaling_governor
powersave
Then I use the udevadm command to execute my 50-scaling-governor.rules:
[root@node1 ~]$ udevadm test --action="add" /devices/system/cpu/cpu16
calling: test
version 219
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.
=== trie on-disk ===
tool version: 219
file size: 8873994 bytes
header size 80 bytes
strings 2300642 bytes
nodes 6573272 bytes
Load module index
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
# some irrelevant messages omitted
...
Reading rules file: /etc/udev/rules.d/50-scaling-governor.rules
...
rules contain 49152 bytes tokens (4096 * 12 bytes), 21456 bytes strings
3908 strings (46431 bytes), 2777 de-duplicated (26107 bytes), 1132 trie nodes used
no db file to read /run/udev/data/+cpu:cpu16: No such file or directory
ATTR '/sys/devices/system/cpu/cpu16/cpufreq/scaling_governor' writing 'performance' /etc/udev/rules.d/50-scaling-governor.rules:1
IMPORT builtin 'hwdb' /usr/lib/udev/rules.d/50-udev-default.rules:11
IMPORT builtin 'hwdb' returned non-zero
RUN 'kmod load $env{MODALIAS}' /usr/lib/udev/rules.d/80-drivers.rules:5
RUN '/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --no-block /usr/lib/udev/kdump-udev-throttler'' /usr/lib/udev
/rules.d/98-kexec.rules:14
ACTION=add
DEVPATH=/devices/system/cpu/cpu16
DRIVER=processor
MODALIAS=cpu:type:x86,ven0000fam0006mod004F:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,
0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003A,003B,003D,0068,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,0079,007C,0080,0081,0082,0083,008
4,0085,0086,0087,0088,0089,008B,008C,008D,008E,008F,0091,0092,0093,0094,0095,0096,0097,0098,0099,009A,009B,009C,009D,009E,00C0,00C5,00C8,00E1,00E3,00E4,0
0E6,00E7,00EB,00EC,00F0,00F1,00F3,00F5,00F6,00F9,00FA,00FB,00FD,0100,0101,0102,0103,0104,0111,0120,0121,0123,0124,0125,0127,0128,0129,012A,012B,012C,012D
,012F,0132,0133,0134,0139,0140,0160,0161,0162,0163,0165,01C0,01C1,01C2,01C4,01C5,01C6,024A,025A,025B,025C,025F
SUBSYSTEM=cpu
USEC_INITIALIZED=184210753630
run: 'kmod load cpu:type:x86,ven0000fam0006mod004F:feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,001
5,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003A,003B,003D,0068,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,0079,007C,0080,0081,0082,0
083,0084,0085,0086,0087,0088,0089,008B,008C,008D,008E,008F,0091,0092,0093,0094,0095,0096,0097,0098,0099,009A,009B,009C,009D,009E,00C0,00C5,00C8,00E1,00E3
,00E4,00E6,00E7,00EB,00EC,00F0,00F1,00F3,00F5,00F6,00F9,00FA,00FB,00FD,0100,0101,0102,0103,0104,0111,0120,0121,0123,0124,0125,0127,0128,0129,012A,012B,01
2C,012D,012F,0132,0133,0134,0139,0140,0160,0161,0162,0163,0165,01C0,01C1,01C2,01C4,01C5,01C6,024A,025A,025B,025C,025F'
run: '/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --no-block /usr/lib/udev/kdump-udev-throttler''
Unload module index
Unloaded link configuration context.
And now the value of cpu16's scaling_governor has changed, so there is nothing wrong with my udev rule itself.
[root@node1 ~]$ cat /sys/devices/system/cpu/cpu16/cpufreq/scaling_governor
performance
But after rebooting the server, I find that the scaling_governor value of cpu16 is still powersave. I have no idea why my udev rule works when triggered via udevadm but fails across a reboot.
Some environment information about my machine is as follows:
OS: CentOS Linux release 7.9.2009 (Core)
kernel version: 5.4.154-1.el7.elrepo.x86_64
udev version: 219
Can anyone give me a hint or some advice? Thanks in advance.
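Not part of the original post, but a common workaround when a udev rule fires before the cpufreq sysfs attributes are ready is to set the governor from a oneshot systemd unit at boot instead. The unit name and paths below are my own choices; it is written to /tmp here for illustration and would be installed to /etc/systemd/system on a real machine:

```shell
# Hypothetical workaround: set the governor for all CPUs from a oneshot
# systemd unit rather than a udev rule.
cat > /tmp/set-governor.service <<'EOF'
[Unit]
Description=Set CPU scaling governor to performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done'

[Install]
WantedBy=multi-user.target
EOF
# cp /tmp/set-governor.service /etc/systemd/system/
# systemctl daemon-reload && systemctl enable --now set-governor.service
```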

Is there a way to read the memory counter used by cgroups to kill processes?

I am running a process under a cgroup with an OOM Killer. When it performs a kill, dmesg outputs messages such as the following.
[9515117.055227] Call Trace:
[9515117.058018] [<ffffffffbb325154>] dump_stack+0x63/0x8f
[9515117.063506] [<ffffffffbb1b2e24>] dump_header+0x65/0x1d4
[9515117.069113] [<ffffffffbb5c8727>] ? _raw_spin_unlock_irqrestore+0x17/0x20
[9515117.076193] [<ffffffffbb14af9d>] oom_kill_process+0x28d/0x430
[9515117.082366] [<ffffffffbb1ae03b>] ? mem_cgroup_iter+0x1db/0x3c0
[9515117.088578] [<ffffffffbb1b0504>] mem_cgroup_out_of_memory+0x284/0x2d0
[9515117.095395] [<ffffffffbb1b0f95>] mem_cgroup_oom_synchronize+0x305/0x320
[9515117.102383] [<ffffffffbb1abf50>] ? memory_high_write+0xc0/0xc0
[9515117.108591] [<ffffffffbb14b678>] pagefault_out_of_memory+0x38/0xa0
[9515117.115168] [<ffffffffbb0477b7>] mm_fault_error+0x77/0x150
[9515117.121027] [<ffffffffbb047ff4>] __do_page_fault+0x414/0x420
[9515117.127058] [<ffffffffbb048022>] do_page_fault+0x22/0x30
[9515117.132823] [<ffffffffbb5ca8b8>] page_fault+0x28/0x30
[9515117.330756] Memory cgroup out of memory: Kill process 13030 (java) score 1631 or sacrifice child
[9515117.340375] Killed process 13030 (java) total-vm:18259139756kB, anon-rss:2243072kB, file-rss:30004132kB
I would like to be able to tell how much memory the cgroups OOM Killer believes the process is using at any given time.
Is there a way to query for this quantity?
I found the following in the official documentation for cgroup-v1, which shows how to query current memory usage, as well as how to alter limits:
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
3.2. Make the new group and move bash into it
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks
Since now we're in the 0 cgroup, we can alter the memory limit:
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
NOTE: We cannot set limits on the root cgroup any more.
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
We can check the usage:
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
1216512
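To answer the question directly: under cgroup v1, the counter the OOM killer compares against memory.limit_in_bytes is memory.usage_in_bytes in the process's memory cgroup. A sketch of how to read it for a given PID (using the current shell as the example PID, and guarded because the memory controller may not be mounted as a v1 hierarchy on every system):

```shell
# Find the memory cgroup of a PID from /proc/<pid>/cgroup, then read its
# current usage and high-water mark. Substitute the PID you care about
# (e.g. 13030, the java process from the dmesg output above).
pid=$$
cgpath=$(awk -F: '$2 == "memory" {print $3}' "/proc/$pid/cgroup")
echo "memory cgroup: ${cgpath:-<not found>}"
m="/sys/fs/cgroup/memory${cgpath}"
[ -r "$m/memory.usage_in_bytes" ] && cat "$m/memory.usage_in_bytes" || true
[ -r "$m/memory.max_usage_in_bytes" ] && cat "$m/memory.max_usage_in_bytes" || true
```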

Kernel Panic with ramfs on embedded device: No filesystem could mount root

I'm working on an embedded ARM device running Linux (kernel 3.10), with NAND memory for storage. I'm trying to build a minimal Linux image which will reside on its own partition and carry out updates of the main firmware.
The kernel uses a very minimal root fs which is stored in a ramfs. However, I can't get it to boot. I get the following error:
[ 0.794113] List of all partitions:
[ 0.797600] 1f00 128 mtdblock0 (driver?)
[ 0.802669] 1f01 1280 mtdblock1 (driver?)
[ 0.807697] 1f02 1280 mtdblock2 (driver?)
[ 0.812735] 1f03 8192 mtdblock3 (driver?)
[ 0.817761] 1f04 8192 mtdblock4 (driver?)
[ 0.822794] 1f05 8192 mtdblock5 (driver?)
[ 0.827820] 1f06 82944 mtdblock6 (driver?)
[ 0.832850] 1f07 82944 mtdblock7 (driver?)
[ 0.837876] 1f08 12288 mtdblock8 (driver?)
[ 0.842906] 1f09 49152 mtdblock9 (driver?)
[ 0.847928] No filesystem could mount root, tried: squashfs
[ 0.853569] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(1,0)
[ 0.861806] CPU: 0 PID: 1 Comm: swapper Not tainted 3.10.73 #11
[ 0.867732] [<800133ec>] (unwind_backtrace+0x0/0x12c) from [<80011a50>] (show_stack+0x10/0x14)
(...etc)
The root fs is built by the build process, using the following (simplified for clarity):
# [Copy some things to $(ROOTFS_OUT_DIR)/mini_rootfs]
cd $(ROOTFS_OUT_DIR)/mini_rootfs && find . | cpio --quiet -o -H newc > $(ROOTFS_OUT_DIR)/backup.cpio
gzip -f -9 $(ROOTFS_OUT_DIR)/backup.cpio
This creates $(ROOTFS_OUT_DIR)/backup.cpio.gz
The kernel is then built like this:
#$(MAKE) -C $(LINUX_SRC_DIR) O=$(LINUX_OUT_DIR) \
CONFIG_INITRAMFS_SOURCE="$(ROOTFS_OUT_DIR)/backup.cpio.gz" \
CONFIG_INITRAMFS_ROOT_UID=0 CONFIG_INITRAMFS_ROOT_GID=0
I think this means it uses the same config as the main firmware (built elsewhere), but supplies the minimal ramfs image using CONFIG_INITRAMFS_SOURCE.
From Kernel.Org, the ramfs is always built anyway, and CONFIG_INITRAMFS_SOURCE is all that is needed to specify a pre-made root fs to use. There are no build errors to indicate that there is a problem creating the ramfs, and the size of the resulting kernel looks about right. backup.cpio.gz is about 3.6 MB; the final zImage is 6.1 MB; the image is written to a partition which is 8 MB in size.
To use this image, I set some flags used by the (custom) boot loader which tell it to boot from the minimal partition, and also set a different command line for the kernel. Here is the command line used to boot:
console=ttyS0 rootfs=ramfs root=/dev/ram rw rdinit=/linuxrc mem=220M
Note that the minimal root fs contains "/linuxrc", which is actually a link to /bin/busybox:
lrwxrwxrwx 1 root root 11 Nov 5 2015 linuxrc -> bin/busybox
Why doesn't this boot? Why is it trying "squashfs" filesystem, and is this wrong?
SOLVED! It turned out that a file name used by the (custom) build system had changed as part of an update, so it was not putting the correct kernel image into the firmware package. I was actually booting the wrong kernel with the "rootfs=ramfs" parameter, one which didn't have a ramfs built in.
So, for future reference: this error occurs if you specify "rootfs=ramfs" but your kernel wasn't built with any root fs embedded (i.e. CONFIG_INITRAMFS_SOURCE was not specified).

mount: you must specify the filesystem type

I was trying to execute qemu while following the qemu/linaro tutorial,
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Virtual_ARM_Linux_environment
I was executing the command,
sudo mount -o loop,offset=106496 -t auto vexpress.img /mnt/tmp
mount: you must specify the filesystem type
so I ran fdisk on the img file and got the following:
Device Boot Start End Blocks Id System
vexpress.img1 * 63 106494 53216 e W95 FAT16 (LBA)
vexpress.img2 106496 6291455 3092480 83 Linux
According to fdisk, the file system is "Linux". But I get an error:
sudo mount -o loop,offset=106496 -t Linux vexpress.img /mnt/tmp
mount: unknown filesystem type 'Linux'
Kindly help.
You correctly decided to mount the particular partition by specifying its offset, but the offset parameter is in bytes, while fdisk shows offsets in sectors (the sector size is shown above the partition list, usually 512). For a sector size of 512 the command would be:
sudo mount -o loop,offset=$((106496*512)) -t auto vexpress.img /mnt/tmp
If automatic file-system type detection still does not work, there is another problem. Note that "Linux" is not really a file system type: in the partition table it is a collective ID used for multiple possible file systems. For mount you must specify the actual file system; in Linux you can list the supported ones with cat /proc/filesystems.
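The byte-offset arithmetic from the answer can be checked quickly (here assuming 512-byte sectors, as above):

```shell
# Convert the fdisk start sector to a byte offset for mount's offset= option.
start_sector=106496
sector_size=512
offset=$(( start_sector * sector_size ))
echo "$offset"    # -> 54525952
# sudo mount -o loop,offset=$offset -t auto vexpress.img /mnt/tmp
```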

How to limit CPU and RAM resources for mongodump?

I have a mongod server running. Each day I execute mongodump in order to have a backup. The problem is that mongodump takes a lot of resources and slows down the server (which, by the way, already runs some other heavy tasks).
My goal is to somehow limit the mongodump which is called in a shell script.
Thanks.
You should use cgroups. Mount points and details differ across distros and kernels, e.g. Debian 7.0 with the stock kernel doesn't mount cgroupfs by default and has the memory subsystem disabled (folks advise rebooting with cgroup_enable=memory), while openSUSE 13.1 ships with all of that out of the box (mostly thanks to systemd).
So first of all, create mount points and mount cgroupfs if not yet done by your distro:
mkdir /sys/fs/cgroup/cpu
mount -t cgroup -o cpuacct,cpu cgroup /sys/fs/cgroup/cpu
mkdir /sys/fs/cgroup/memory
mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory
Create a cgroup:
mkdir /sys/fs/cgroup/cpu/shell
mkdir /sys/fs/cgroup/memory/shell
Set up the cgroup. I decided to alter the CPU shares. The default value is 1024, so setting it to 128 limits the cgroup to about 11% of CPU resources when there are competitors; if there are still free CPU resources, they will be given to mongodump. You may also use cpuset to limit the number of cores available to it.
echo 128 > /sys/fs/cgroup/cpu/shell/cpu.shares
echo 50331648 > /sys/fs/cgroup/memory/shell/memory.limit_in_bytes
Now add PIDs to the cgroup; this will also affect all their children.
echo 13065 > /sys/fs/cgroup/cpu/shell/tasks
echo 13065 > /sys/fs/cgroup/memory/shell/tasks
I ran a couple of tests. A Python process that tried to allocate a bunch of memory was killed by the OOM killer:
myaut@zenbook:~$ python -c 'l = range(3000000)'
Killed
I also ran four infinite loops outside the cgroup and a fifth one inside it. As expected, the loop running in the cgroup got only about 45% of CPU time, while the rest of them got 355% (I have 4 cores).
Note that these changes do not survive a reboot!
You may add this code to the script that runs mongodump, or use some permanent solution.
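A sketch of how the pieces might be combined in the backup script (the cgroup paths are the ones created above; the mongodump arguments and the helper function are my own example, not from the original answer):

```shell
# Move a PID into the "shell" cgroup, then exec mongodump so it (and all
# its children) inherit the cpu and memory limits set earlier.
move_into_cgroup() {
    echo "$1" > /sys/fs/cgroup/cpu/shell/tasks
    echo "$1" > /sys/fs/cgroup/memory/shell/tasks
}
# move_into_cgroup $$ && exec mongodump --out "/backup/$(date +%F)"
echo "would run: mongodump --out /backup/$(date +%F)"
```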
