ionice 'idle' not having the expected effects


We're working with a reasonably busy web server. We wanted to use rsync to do some data-moving which was clearly going to hammer the magnetic disk, so we used ionice to put the rsync process in the idle class. The queues for both disks on the system (SSD+HDD) are set to use the CFQ scheduler.
The result... was that the disk was absolutely hammered and the website performance was appalling.
I've done some digging to see if any tuning might help with this.
The man page for ionice says:
Idle: A program running with idle I/O priority will only get disk time
when no other program has asked for disk I/O for a defined grace period.
The impact of an idle I/O process on normal system activity should be zero.
This "defined grace period" is not clearly explained anywhere I can find with the help of Google. One posting suggests that it's the value of fifo_expire_async, but I can't find any real support for this.
However, on our system, both fifo_expire_async and fifo_expire_sync are set sufficiently long (250ms, 125ms, which are the defaults) that the idle class should actually get NO disk bandwidth at all. Even if the person who believes that the grace period is set by fifo_expire_async is plain wrong, there's not a lot of wiggle-room in the statement "The impact of an idle I/O process on normal system activity should be zero".
Clearly this is not what's happening on our machine so I am wondering if CFQ+idle is simply broken.
Has anyone managed to get it to work? Tips greatly appreciated!
Update:
I've done some more testing today. I wrote a small Python app to read random sectors from all over the disk with short sleeps in between. I ran a copy of this without ionice and set it up to perform around 30 reads per second. I then ran a second copy of the app with various ionice classes to see if the idle class did what it said on the box. I saw no difference at all between the results when I used classes 1, 2, 3 (real-time, best-effort, idle). This, despite the fact that I'm now absolutely certain that the disk was busy.
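The original test script isn't shown, so the following is only a minimal sketch of that kind of load generator. The function name, the 4096-byte block size and the default rates are assumptions. Note that without O_DIRECT some reads may be served from the page cache rather than the disk.

```python
import os
import random
import time


def random_reads(path, reads_per_second=30, duration=1.0, block_size=4096):
    """Read random aligned blocks from `path`; return per-read latencies in seconds."""
    latencies = []
    interval = 1.0 / reads_per_second
    fd = os.open(path, os.O_RDONLY)
    try:
        # lseek to the end reports the size of block devices as well as files
        size = os.lseek(fd, 0, os.SEEK_END)
        blocks = max(size // block_size, 1)
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            offset = random.randrange(blocks) * block_size
            start = time.monotonic()
            os.pread(fd, block_size, offset)
            latencies.append(time.monotonic() - start)
            time.sleep(interval)  # short sleep between reads
    finally:
        os.close(fd)
    return latencies
```

Running one copy normally and a second under `ionice -c 3` (each calling, say, `random_reads("/dev/sdb", duration=60)`) and comparing the latency lists would reproduce the comparison described above.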
Thus, I'm now certain that - at least for our setup - CFQ+idle does not work. [see Update 2 below - it's not so much "does not work" as "does not work as expected"...]
Comments still very welcome!
Update 2:
More poking about today. Discovered that when I push the I/O rate up dramatically, the idle-class processes DO in fact start to become starved. In my testing, this happened at I/O rates hugely higher than I had expected - basically hundreds of I/Os per second. I'm still trying to work out what the tuning parameters do...
I also discovered the rather important fact that async disk writes aren't included at all in the I/O prioritisation system! The ionice manpage I quoted above makes no reference to that fact, but the manpage for the syscall ioprio_set() helpfully states:
I/O priorities are supported for reads and for synchronous (O_DIRECT,
O_SYNC) writes. I/O priorities are not supported for asynchronous
writes because they are issued outside the context of the program
dirtying the memory, and thus program-specific priorities do not
apply.
This pretty significantly changes the way I was approaching the performance issues and I will be proposing an update for the ionice manpage.
Some more info on kernel and iosched settings (sdb is the HDD):
Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
/etc/debian_version = 9.3
(cd /sys/block/sdb/queue/iosched; grep . *)
back_seek_max:16384
back_seek_penalty:2
fifo_expire_async:250
fifo_expire_sync:125
group_idle:8
group_idle_us:8000
low_latency:1
quantum:8
slice_async:40
slice_async_rq:2
slice_async_us:40000
slice_idle:8
slice_idle_us:8000
slice_sync:100
slice_sync_us:100000
target_latency:300
target_latency_us:300000

As far as I know, the only way to solve your problem is to use cgroup v2 (kernel 4.5 or newer). See the following article:
https://andrestc.com/post/cgroups-io/
Note also that you can use systemd's wrappers to configure cgroup limits on a per-service basis:
http://0pointer.de/blog/projects/resources.html

Add nocache to that and you're set (you can join it with ionice and nice):
https://github.com/Feh/nocache
On Ubuntu install with:
apt install nocache
It keeps the command's I/O out of the page cache, so other processes don't starve when the cache is flushed.
The effect is similar to running the command with O_DIRECT, so now you can limit the I/O, for example with:
systemd-run --scope -q --nice=19 -p BlockIOAccounting=true -p BlockIOWeight=10 -p "BlockIOWriteBandwidth=/dev/sda 10M" nocache youroperation_here
I usually use it with:
nice -n 19 ionice -c 3 nocache youroperation_here
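As far as I understand it, nocache works by wrapping open and close and calling posix_fadvise with POSIX_FADV_DONTNEED. If you control the copying code yourself, the same idea can be sketched directly. The function name and chunk size below are assumptions, and unlike nocache this simplified version does not check which pages were already cached before the copy.

```python
import os


def copy_without_caching(src, dst, chunk_size=1 << 20):
    """Copy src to dst, then ask the kernel to drop both files' cached pages."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
        fout.flush()
        os.fsync(fout.fileno())  # dirty pages cannot be dropped until written
        for f in (fin, fout):
            # offset 0 with length 0 means "the whole file"
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
```

The advice is only a hint; the kernel is free to keep pages that are still in use elsewhere.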


Cyclictest for RT patched Linux Kernel

Hello, I patched the Linux kernel with the RT patch and tested it with cyclictest, which monitors latencies. The kernel isn't doing well: it is no better than the vanilla kernel.
https://rt.wiki.kernel.org/index.php/Cyclictest
I checked the uname for RT, which looks fine.
So I checked the requirements for cyclictest, and it states that I have to make sure the following is configured in the kernel config:
CONFIG_PREEMPT_RT=y
CONFIG_WAKEUP_TIMING=y
CONFIG_LATENCY_TRACE=y
CONFIG_CRITICAL_PREEMPT_TIMING=y
CONFIG_CRITICAL_IRQSOFF_TIMING=y
The problem is that the config doesn't contain these entries. Maybe they are old and have been renamed in the newer patch versions (3.8.14)?
I found options like:
CONFIG_PREEMPT_RT_FULL=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RT_BASE=y
CONFIG_HIGH_RES_TIMERS=y
Is that enough in the 3.x kernel to provide what is required above? Does anyone have a hint?

There's a lot that must be done to get hard realtime performance under PREEMPT_RT. Here are the things I am aware of. Entries marked with an asterisk apply to your current position.
Patch the kernel with PREEMPT_RT (as you already did), and enable CONFIG_PREEMPT_RT_FULL (which used to be called CONFIG_PREEMPT_RT, as you correctly derived).
Disable processor frequency scaling (either by removing it from the kernel configuration or by changing the governor or its settings). (*)
Reasoning: Changing a core's frequency takes a while, during which the core does no useful work. This causes high latencies.
To remove this, look under the ACPI options in the kernel settings.
If you don't want to remove this capability from the kernel, you can set the cpufreq governor to "performance" to lock it into its highest frequency.
Disable deep CPU sleep states
Reasoning: Like switching frequencies, waking the CPU from a deep sleep can take a while.
Cyclictest does this for you (look up /dev/cpu_dma_latency to see how to do it in your application).
Alternatively, you can disable the "cpuidle" infrastructure in the kernel to prevent this from ever occurring.
Set a high priority for the realtime thread, above 50 (preferably 99) (*)
Reasoning: You need to place your priority above the majority of the kernel -- much of a PREEMPT_RT kernel (including IRQs) runs at a priority of 50.
For cyclictest, you can do this with the "-p#" option, e.g. "-p99".
Your application's memory must be locked. (*)
Reasoning: If your application's memory isn't locked, then the kernel may need to re-map some of your application's address space during execution, triggering high latencies.
For cyclictest, this may be done with the "-m" option.
To do this in your own application, see the RT_PREEMPT howto.
You must unload the nvidia, nouveau, and i915 modules if they are loaded (or not build them in the first place) (*)
Reasoning: These are known to cause high latencies. Hopefully you don't need them on a realtime system :P
Your realtime task must be coded to be realtime
For example, you cannot do file access or dynamic memory allocation via malloc(). Many system calls are off-limits (it's hard to find which ones are acceptable, IMO).
cyclictest is mostly already coded for realtime operation, as are many realtime audio applications. You do need to run it with the "-n" flag, however, or it will not use a realtime-safe sleep call.
The actual execution of cyclictest should have at least the following set of parameters:
sudo cyclictest -p99 -m -n
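For the /dev/cpu_dma_latency step mentioned above, here is a minimal sketch of what an application can do itself. The kernel's PM QoS interface accepts a 32-bit value written to that device and honours the request only while the file stays open. The function name is an assumption, and the path is a parameter only so the sketch can be tried against an ordinary file; opening the real device normally requires root.

```python
import os
import struct


def hold_cpu_latency(max_latency_us=0, path="/dev/cpu_dma_latency"):
    """Request a CPU wakeup-latency ceiling; it lasts until the returned fd is closed."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # The kernel expects a native 32-bit signed integer (microseconds)
        os.write(fd, struct.pack("i", max_latency_us))
    except OSError:
        os.close(fd)
        raise
    return fd
```

Keep the returned descriptor open for as long as the realtime work runs, and close it with `os.close(fd)` afterwards to let the CPU enter deep sleep states again.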

Buffering on top of VFS

The problem I'm trying to deal with is saving a large number (millions) of small files (up to 50 KB each), which are sent over the network. The saving is done sequentially: the server receives a file or a directory (via the network) and saves it to disk; the next one arrives and is saved, and so on.
Apparently, the performance is not acceptable when multiple server processes coexist (say, 5 processes all reading from the network and writing at the same time), because the I/O scheduler doesn't manage to merge the writes efficiently.
A suggested solution is to implement some sort of buffering: each server process would have a 50 MB cache into which it writes the current file, does a chdir, etc.; when the buffer is full, it is synced to disk, producing an I/O burst.
My questions to you:
1) I know that a buffering mechanism (the disk buffer) already exists; do you think the above scenario would bring any improvement? (The design is much more complicated, and it's not easy to implement a simple test case.)
2) Do you have any suggestions on where to look if I were to implement this?
Many thanks.

You're going to need to do better than "apparently the performance is not acceptable".
Specifically:
How are you measuring it? Do you have an exact, reproducible figure?
What is your target?
In order to do optimisation, you need two things: a method of measuring it (a metric) and a target (so you know when to stop, or how useful or useless a particular technique is).
Without either, you're sunk, I'm afraid.

How important are those writes? I have three suggestions (which can be combined), but one of them is a lot of work, and one of them is less safe...
Journaling
I'm guessing you're seeing some poor performance due in part to the journaling common to most modern Linux filesystems. The journaling causes barriers to be inserted into the IO queue when file metadata is written. You can try turning down the safety (and maybe turning up the speed) with mount(8) options barrier=0 and data=writeback.
But if there is a crash, the journal might not be able to prevent a lengthy fsck(8). And there's a chance the fsck(8) will wind up throwing away your data when fixing the problem. On the one hand, it's not a step to take lightly, on the other hand, back in the old days, we ran our ext2 filesystems in async mode without a journal both ways in the snow and we liked it.
IO Scheduler elevator
Another possibility is to swap the IO elevator; see Documentation/block/switching-sched.txt in the Linux kernel source tree. The short version is that deadline, noop, as, and cfq are available. cfq is the kernel default, and probably what your system is using. You can check:
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
The most important parts from the file:
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq
Changing the scheduler might be worthwhile, but depending upon the barriers inserted into the queue by the journaling requirements, there might not be much reordering possible. Still, it is less likely to lose your data, so it might be the first step.
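If you want to check the active scheduler from a script rather than by eye, the bracketed format shown above is easy to parse. This is a small sketch; the function names are made up for illustration.

```python
def parse_schedulers(text):
    """Return (active, available) from a sysfs line such as 'noop deadline [cfq]'."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # the bracketed entry is the one in use
            active = name
        available.append(name)
    return active, available


def read_schedulers(device):
    """Read and parse the scheduler list for a block device such as 'sda'."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_schedulers(f.read())
```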
Application changes
Another possibility is to drastically change your application to bundle files itself, and write fewer, larger files to disk. I know it sounds strange, but (a) the id Software development team packaged their maps, textures, objects, etc., into giant zip files that they would read into the program with a few system calls, unpack, and run with, because they found the performance much better than reading a few hundred or few thousand smaller files. Load times between levels were drastically shorter. (b) The Gnome and KDE desktop teams took different approaches to loading their icons and resource files: the KDE team packaged their many small files into larger packages of some sort, and the Gnome team did not. The Gnome team had longer startup delays and were hoping the kernel could make some efforts to improve their startup time. The kernel team kept suggesting the fewer, larger files approach.
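A minimal sketch of the bundling idea, assuming incoming files are held in memory and written out as one tar archive once a threshold is reached. The class name and the use of tar are illustrative choices, and the 50 MB default simply mirrors the buffer size proposed in the question.

```python
import io
import tarfile


class BundleWriter:
    """Collect small files in memory and flush them to disk as one tar archive."""

    def __init__(self, max_bytes=50 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.pending = []  # list of (name, data) tuples
        self.pending_bytes = 0

    def add(self, name, data):
        """Queue one file; return True when the caller should flush."""
        self.pending.append((name, data))
        self.pending_bytes += len(data)
        return self.pending_bytes >= self.max_bytes

    def flush(self, archive_path):
        """Write all queued files into a single archive; return how many were written."""
        with tarfile.open(archive_path, "w") as tar:
            for name, data in self.pending:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        count = len(self.pending)
        self.pending = []
        self.pending_bytes = 0
        return count
```

Anything still queued is lost if the process dies before flush, so the threshold is a trade-off between write efficiency and how much data you can afford to lose.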

Creating or renaming a file, syncing it, having lots of files in a directory, and having lots of files (with tail waste) are some of the slow operations in your scenario. However, the only way to avoid them is to write fewer files (for example by writing out archives, concatenated files, or similar). I would actually try a (limited) parallel async or sync approach. The IO scheduler and caches are typically quite good.

Can I tell Linux not to swap out a particular process's memory?

Is there a way to tell Linux that it shouldn't swap out a particular process's memory to disk?
It's a Java app, so ideally I'm hoping for a way to do this from the command line.
I'm aware that you can set the global swappiness to 0, but is this wise?

You can do this via the mlockall(2) system call under Linux; this will work for the whole process, but do read about the argument you need to pass.
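Do you really need to pull the whole thing in-core? If it's a Java app, you would presumably lock the whole JVM in-core. I don't know of a command-line method for doing this. A wrapper program that calls mlockall and then exec will not work, because memory locks are not inherited across fork(2) and are removed on execve(2); the call has to come from inside the process itself.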
You might also look to see if one of the access pattern notifications in madvise(2) meets your needs. Advising the VM subsystem about a better paging strategy might work out better if it's applicable for you.
Note that a long time ago now under SunOS, there was a mechanism similar to madvise called vadvise(2).
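To show what the mlockall(2) call looks like from inside a process, here is a minimal sketch using ctypes. The flag values are the ones from <sys/mman.h> on x86 and ARM Linux and are an assumption on other architectures; the function name is made up. The call usually fails unless RLIMIT_MEMLOCK is high enough or the process has CAP_IPC_LOCK.

```python
import ctypes
import os

# Values from <sys/mman.h> on x86 and ARM Linux; some architectures differ.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_all_memory(flags=MCL_CURRENT | MCL_FUTURE):
    """Call mlockall(2) for this process; return True on success, False otherwise."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(flags) != 0:
        print("mlockall failed:", os.strerror(ctypes.get_errno()))
        return False
    return True
```

A JVM would need the equivalent call made from within the Java process, for example through a native library.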

If you wish to change the swappiness for a process, add it to a cgroup and set the value for that cgroup:
https://unix.stackexchange.com/questions/10214/per-process-swapiness-for-linux#10227

There exists a class of applications that you never want to swap. One such class is a database. Databases will use memory as caches and buffers for their disk areas, and it makes absolutely no sense that these are ever put to swap. The particular memory may hold some relevant data that is not needed for a week until one day when a client asks for it. Without the caching/swapping, the database would simply find the relevant record on disk, which would be quite fast; but with swapping, your service might suddenly be taking a long time to respond.
mysqld includes code to use the OS memory-locking system call (its --memlock option). On Linux, since at least 2.6.9, this system call will work for non-root processes that have the CAP_IPC_LOCK capability [1]. When using --memlock, the process must still work within the bounds of the LimitMEMLOCK limit [2]. One of the (few) good things about systemd is that you can grant the mysqld process these capabilities without requiring a special program. It can also set the rlimits, as you'd expect with ulimit. Here is an override file for mysqld that does the requisite steps, including a few others that you might need for a process such as a database:
[Service]
# Prevent mysql from swapping
CapabilityBoundingSet=CAP_IPC_LOCK
# Let mysqld lock all memory to core (don't swap)
LimitMEMLOCK=-1
# Do not kill this process when memory is low
OOMScoreAdjust=-900
# Use higher io scheduling
IOSchedulingClass=realtime
Type=simple
ExecStart=
ExecStart=/usr/sbin/mysqld --memlock $MYSQLD_OPTS
Note The standard community mysql currently ships with Type=forking and adds --daemonize in the option to the service on the ExecStart line. This is inherently less stable than the above method.
UPDATE I am not 100% happy with this solution. After several days of runtime, I noticed the process still had enormous amounts of swap! Examining /proc/XXXX/smaps, I note the following:
The largest contributor of swap is a stack segment: 437 MB and fluctuating. This presents obvious performance issues. It also indicates a stack-based memory leak.
There are zero Locked pages. This indicates the memlock option in MySQL (or Linux) is broken. In this case, it wouldn't matter much because MySQL can't memlock stack.
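The smaps check described above can be scripted. This is a small sketch that totals the Swap: and Locked: fields over all mappings; the function names are made up, and reading another user's smaps file generally needs root.

```python
def summarize_smaps(text):
    """Sum the Swap: and Locked: fields (in kB) over all mappings in smaps text."""
    totals = {"Swap": 0, "Locked": 0}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in totals:  # exact match, so SwapPss: is not counted
            totals[key] += int(rest.split()[0])
    return totals


def smaps_totals(pid):
    """Read /proc/<pid>/smaps and return the totals for that process."""
    with open(f"/proc/{pid}/smaps") as f:
        return summarize_smaps(f.read())
```

A non-zero Locked total after startup is a quick way to confirm that memory locking actually took effect.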

You can do that by the mlock family of syscalls. I'm not sure, however, if you can do it for a different process.

As superuser, you can 'nice' it to the highest priority level, -20, and hope that's enough to keep it from being swapped out. It usually is. Positive numbers lower the scheduling priority. Normal users cannot nice upwards (to negative numbers).

Except in extremely unusual circumstances, asking this question means that You're Doing It Wrong(tm).
Seriously, if Linux wants to swap and you're trying to keep your process in memory then you're putting an unreasonable demand on the OS. If your app is that important then 1) buy more memory, 2) remove other apps/daemons from the machine, or dedicate a machine to your app, and/or 3) invest in a really fast disk subsystem. These steps are reasonable for an important app. If you can't justify them, then you probably can't justify wiring memory and starving other processes either.

Why do you want to do this?
If you are trying to increase the performance of this app, then you are probably on the wrong track. The OS will swap out a process to free memory for disk cache, even if there is free RAM; the kernel knows best (actually, the smart guys that wrote the scheduler know best).
If you have a process that needs responsiveness (it's swapped out while not in use and you need it to restart quickly), then nicing it to a high priority, using mlock, or using a real-time kernel might help.

How to make Linux GUI "usable" when lots of disk activity is happening

If I start copying a huge file tree from one position to another or if some other process starts doing lots of disk activity, the foreground app (GUI) slows way down. For example, take a 2 GB file tree with 100k files in it. Open a console and do cp -r bigtree bigtree2. Then go to Firefox and start browsing. Firefox is almost unusable. Even if I set Firefox's nice level to really high priority (-20), it's still super slow with huge delays.
I remember some years ago when I worked on a Solaris box, the system behaved much better in similar circumstances.
My HD is using DMA, not PIO. It's SATA. Not mounted with the atime flag.

Linux has long had a problem with programs that hog all the system's "dirty" cache memory. What is happening is that the copy process is filling the write cache with the file data it is copying and it is doing it very quickly. So when Firefox comes along and needs to write it must first wait for dirty buffer space or an available disk queue write slot. While waiting it is competing with the copy process and the kernel's pdflush thread, which moves data from dirty buffers to the disk write queue.
Firefox has yet another problem in this scenario. It uses SQLite to store its bookmarks, history and other things. SQLite is an ACID-compliant database, and it uses a transaction system with its disk writes flushed to disk. So not only does it have to wait for buffer space, it must wait for the disk queue, which is full of copied file data, to clear out before it can acknowledge a successful write.
There has been a lot of tweaking done to the Linux disk queuing and buffering system. There are changes in almost every kernel release. Try one of the newer releases. You can also try tweaking the sysctl values. I sort of like these:
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 9000
vm.dirty_background_ratio = 4
vm.dirty_ratio = 80
You can also try tweaking the number of slots in the disk queue. This value is in /sys/block/sda/queue/nr_requests. You need to substitute sda with whatever your drive really is. More slots means more chances to merge IO requests and the CFQ IO scheduler can do a better job with priorities. Fewer slots usually means a shorter wait to get written to disk for synchronous IO like SQLite's transactions. Fewer slots also means a shorter wait to get read IO into the disk queue if a write-heavy process completely stuffs the queue with write IO.

Try ionice-ing or nice-ing the copy process. The issue is due to the fact that IO gets the same priority as the GUI, which for a desktop, affects perceived responsiveness.
There's an Ubuntu brainstorm about this currently.

You're not the first to notice this problem. Former kernel developer Con Kolivas (http://en.wikipedia.org/wiki/Con_Kolivas) found that a lot of companies are paying to improve Linux server performance at the expense of desktop performance. Con had an impressive set of patches for making the desktop more responsive. Unfortunately there was some sort of code war and eventually Con dropped out.
I would love to know how to petition the Linux kernel developers for better desktop performance. In the meantime, if you are willing to run kernel 2.6.22, you can run with the -ck patch set.

Make sure that DMA is enabled on all your drives that support it. Depending on your distribution this may not be the default. Read man hdparm, and look into your system's init mechanism.

Using "top" in Linux as semi-permanent instrumentation

I'm trying to find the best way to use 'top' as semi-permanent instrumentation in the development of a box running embedded Linux. (The instrumentation will be removed from the final-test and production releases.)
My first pass is to simply add this to init.d:
top -b -d 15 >/tmp/toploop.out &
This runs top in "batch" mode every 15 seconds. Let's assume that /tmp has plenty of space…
Questions:
Is 15 seconds a good value to choose for general-purpose monitoring?
Other than disk space, how seriously is this perturbing the state of the system?
What other (perhaps better) tools could be used like this?

Look at collectd. It's a very lightweight system monitoring framework coded for performance.

We use sysstat to monitor things like this.

You might find that vmstat and iostat with a delay and no repeat counter is a better option.

I suspect 15 seconds would be more than adequate unless you actually want to watch what's happening in real time, but that doesn't appear to be the case here.
As far as load goes, on an idling PIII 900 MHz with 768 MB of RAM running Ubuntu (not sure which version, but not more than a year old) I have top updating every 0.5 seconds and it's about 2% CPU utilization. At 15s updates, I'm seeing 0.1% CPU utilization.
Depending upon what exactly you want, you could use the output of uptime, free, and ps to get most, if not all, of top's information.

If you are looking for overall load, uptime is probably sufficient. However, if you want specific information about processes, you are adventurous, and have the /proc filesystem enabled, you may want to write your own tools. The primary benefit in this environment is that you can focus on exactly what you want and minimize the load introduced to the system.
The proc filesystem gives your application read access to the kernel memory that keeps track of many of the interesting variables. Reading from /proc is one of the lightest ways to get this information. Additionally, you may be able to get more information than top provides. I've done this in the past to get the amount of time a process spent in user and system mode. You can also use it to get the number of file descriptors open by the process, or detailed information about how the network system is working.
Much of this information is pre-processed by other applications, which you can use if they give you what you need. However, it is rather straightforward to read the raw information. Do a man proc for more information.
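As an illustration of reading the raw information, here is a minimal sketch that extracts user and system CPU time from /proc/<pid>/stat (fields 14 and 15, as documented in man proc). The function names are made up for this example.

```python
import os


def parse_cpu_times(stat_line, ticks_per_second):
    """Return (user_seconds, system_seconds) from one /proc/<pid>/stat line."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so cut at the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so fields 14 and 15 are at indexes 11 and 12
    utime, stime = int(fields[11]), int(fields[12])
    return utime / ticks_per_second, stime / ticks_per_second


def cpu_times(pid="self"):
    """Read the CPU times of a process, by default the calling one."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_cpu_times(f.read(), os.sysconf("SC_CLK_TCK"))
```

Sampling this every 15 seconds for the handful of processes you care about should cost far less than a full top pass over every process.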

Pity you haven't said what you are monitoring for.
You should decide whether 15 seconds is ok or not. Feel free to drop it way lower if you wish (and have a fast HDD)
No worries unless you are running a soft real-time system
Have a look at the tools suggested in other answers. I'll add another suggestion: "iotop", for answering "who is thrashing the HDD?" questions.

At work for system monitoring during stress tests we use a tool called nmon.
What I love about nmon is it has the ability to export to XLS and generate beautiful graphs for you.
It generates statistics for:
Memory Usage
CPU Usage
Network Usage
Disk I/O
Good luck :)
