Linux: echo 3 > /proc/sys/vm/drop_caches takes hours to complete

I have a Thecus N8900 NAS, a Linux-based file server providing files via NFS to six clients. For some reason that Thecus support has yet to explain, it runs a script that checks /proc/meminfo every 60 seconds, and if the disk cache exceeds 50% of available RAM it runs "echo 3 > /proc/sys/vm/drop_caches" to flush the cache.
Leaving aside the issue of whether that makes sense or not, the actual "echo 3 > /proc/sys/vm/drop_caches" command can take hours to complete, which seems way too long to me.
The big problem is that when this happens, the load on the machine spikes, as does the disk utilization, making all NFS traffic crawl until the command finally completes, at which point things are responsive again.
The NAS itself has 16 GB of RAM and 7 drives in a RAID 6 configuration (plus a hot spare), with no drive problems at all (according to S.M.A.R.T. tests).
So the question is: what would cause the drop_caches command to take so long?

The command itself should complete almost instantaneously; it is the consequences, i.e. everything having to be cached again, that can take a lot of time. The script doesn't make sense: if you can remove it completely, that would be a good idea. (Also, this is off topic on Stack Overflow.)
Edit: does it also execute a sync before echo 3 > /proc/sys/vm/drop_caches, as in
sync; echo 3 > /proc/sys/vm/drop_caches? Because the sync operation, which flushes all pending writes to disk, may take a while to complete. And while sync also has a performance cost, it can make some sense: in case of a sudden power failure the data has already been written to disk, so you are safe.
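If the script can't be removed, it may at least be worth separating the two costs. A minimal sketch, assuming root and the standard procfs paths:
# how much dirty/writeback data is still waiting to go to disk?
grep -E 'Dirty|Writeback' /proc/meminfo
# time the writeback flush and the actual cache drop separately
time sync
time sh -c 'echo 3 > /proc/sys/vm/drop_caches'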

Related

Is it unhealthy for an SSD if I write a 'vital signal' to check that a Python program is running?

A Python program that I'm building used to die for no apparent reason. I couldn't figure out why, so my workaround was to add a few lines that write the time to a 'vitality' file every time a certain line in the program is executed, which happens about every 0.1 seconds.
A separate script reads the 'vitality' file every 1 second, and when the vital sign doesn't update for, say, 10 seconds, the script kills the program and restarts it.
So far this workaround has been working great on the original problem, but now I'm rather concerned if the SSD will degrade by this or not.
Does writing 10 digits of unixtimestamp every 0.1s to a file have negligible effect on SSD health, or would it degrade the SSD fast?
Doing that will degrade the SSD and destroy it over time.
In my last job, the SSD health tool (smartctl) indicated that the 15 SSDs in our cluster product were wearing rapidly and had only months of life left. The team found that a third-party software package (etcd) was syncing a small amount of data to a filesystem on the SSD once per second, and each sync wrote at least an entire 16K block. Luckily, the problem was found early enough that we could patch it in a software update before suffering too many customer returns.
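For a rough sense of scale, assuming each of the question's 0.1-second writes really ends up as a full 16 KiB block on flash (as in the etcd case above) rather than 10 bytes:
# 10 writes/s * 16 KiB * 86400 s/day, expressed in GiB; prints about 13
echo $((10 * 16 * 86400 / 1024 / 1024)) GiB written per day
# versus only ~8 MiB/day of actual payload: the amplification is what does the damage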
Write the 'vitality' file somewhere else. It could be on a tmpfs like /var/run/user/. Or use a different vitality mechanism; something like supervisord can manage your task, run health checks and restart it on failure.
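For instance, a RAM-only heartbeat could look roughly like this (/dev/shm is used here as a commonly available tmpfs; the 10-second threshold and restart_the_program are placeholders):
# writer side: stamp a tmpfs file every 0.1 s, nothing ever reaches the SSD
while true; do date +%s > /dev/shm/vitality; sleep 0.1; done
# watchdog side: restart if the stamp is more than 10 seconds old
[ $(( $(date +%s) - $(cat /dev/shm/vitality) )) -gt 10 ] && restart_the_program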

ionice 'idle' not having the expected effects

We're working with a reasonably busy web server. We wanted to use rsync to do some data-moving which was clearly going to hammer the magnetic disk, so we used ionice to put the rsync process in the idle class. The queues for both disks on the system (SSD+HDD) are set to use the CFQ scheduler.
The result... was that the disk was absolutely hammered and the website performance was appalling.
I've done some digging to see if any tuning might help with this.
The man page for ionice says:
Idle: A program running with idle I/O priority will only get disk time
when no other program has asked for disk I/O for a defined grace period.
The impact of an idle I/O process on normal system activity should be zero.
This "defined grace period" is not clearly explained anywhere I can find with the help of Google. One posting suggest that it's the value of fifo_expire_async but I can't find any real support for this.
However, on our system, both fifo_expire_async and fifo_expire_sync are set sufficiently long (250ms, 125ms, which are the defaults) that the idle class should actually get NO disk bandwidth at all. Even if the person who believes that the grace period is set by fifo_expire_async is plain wrong, there's not a lot of wiggle-room in the statement "The impact of an idle I/O process on normal system activity should be zero".
Clearly this is not what's happening on our machine so I am wondering if CFQ+idle is simply broken.
Has anyone managed to get it to work? Tips greatly appreciated!
Update:
I've done some more testing today. I wrote a small Python app to read random sectors from all over the disk with short sleeps in between. I ran a copy of this without ionice and set it up to perform around 30 reads per second. I then ran a second copy of the app with various ionice classes to see if the idle class did what it said on the box. I saw no difference at all between the results when I used classes 1, 2, 3 (real-time, best-effort, idle). This, despite the fact that I'm now absolutely certain that the disk was busy.
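For reference, a rough shell equivalent of that test (run as root), assuming the HDD is /dev/sdb as in the settings further down; the offsets, rates and counts are purely illustrative:
# background load: small direct reads at random offsets, roughly 30 per second
while true; do
  dd if=/dev/sdb of=/dev/null bs=4k count=1 skip=$((RANDOM * 100)) iflag=direct 2>/dev/null
  sleep 0.03
done &
# competing reader; swap -c 3 (idle) for -c 2 (best-effort) and compare the run times
time ionice -c 3 dd if=/dev/sdb of=/dev/null bs=4k count=1000 iflag=direct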
Thus, I'm now certain that - at least for our setup - CFQ+idle does not work. [see Update 2 below - it's not so much "does not work" as "does not work as expected"...]
Comments still very welcome!
Update 2:
More poking about today. Discovered that when I push the I/O rate up dramatically, the idle-class processes DO in fact start to become starved. In my testing, this happened at I/O rates hugely higher than I had expected - basically hundreds of I/Os per second. I'm still trying to work out what the tuning parameters do...
I also discovered the rather important fact that async disk writes aren't included at all in the I/O prioritisation system! The ionice manpage I quoted above makes no reference to that fact, but the manpage for the syscall ioprio_set() helpfully states:
I/O priorities are supported for reads and for synchronous (O_DIRECT,
O_SYNC) writes. I/O priorities are not supported for asynchronous
writes because they are issued outside the context of the program
dirtying the memory, and thus program-specific priorities do not
apply.
This pretty significantly changes the way I was approaching the performance issues and I will be proposing an update for the ionice manpage.
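One practical consequence: for the idle class to apply to a bulk copy's writes, those writes have to be synchronous. A hedged sketch using dd's direct flags, with placeholder paths:
# both reads and writes bypass the page cache, so the I/O priority actually applies
ionice -c 3 dd if=/path/to/source of=/path/to/dest bs=1M iflag=direct oflag=direct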
Some more info on kernel and iosched settings (sdb is the HDD):
Linux 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
/etc/debian_version = 9.3
(cd /sys/block/sdb/queue/iosched; grep . *)
back_seek_max:16384
back_seek_penalty:2
fifo_expire_async:250
fifo_expire_sync:125
group_idle:8
group_idle_us:8000
low_latency:1
quantum:8
slice_async:40
slice_async_rq:2
slice_async_us:40000
slice_idle:8
slice_idle_us:8000
slice_sync:100
slice_sync_us:100000
target_latency:300
target_latency_us:300000
AFAIK, the only way to solve your problem is to use cgroup v2 (kernel 4.5 or newer). Please see the following article:
https://andrestc.com/post/cgroups-io/
Also note that you may use systemd's wrappers to configure cgroup limits on a per-service basis:
http://0pointer.de/blog/projects/resources.html
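A minimal sketch of what the cgroup article describes, assuming the unified cgroup v2 hierarchy is mounted at /sys/fs/cgroup; the device numbers and the 10 MB/s cap are only examples (check major:minor with lsblk):
# enable the io controller for child groups and create a group for the transfer
echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/slowcopy
# cap reads and writes on the HDD (major:minor 8:16 here) at about 10 MB/s
echo "8:16 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/slowcopy/io.max
# move the current shell into the group, then start rsync from it
echo $$ > /sys/fs/cgroup/slowcopy/cgroup.procs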
Add nocache to that and you're set (you can combine it with ionice and nice):
https://github.com/Feh/nocache
On Ubuntu install with:
apt install nocache
It simply bypasses the page cache for the command's I/O, so other processes won't starve when the cache would otherwise be flushed.
It's like calling the commands with O_DIRECT, so now you can limit the IO for example with:
systemd-run --scope -q --nice=19 -p BlockIOAccounting=true -p BlockIOWeight=10 -p "BlockIOWriteBandwidth=/dev/sda 10M" nocache youroperation_here
I usually use it with:
nice -n 19 ionice -c 3 nocache youroperation_here

Searching through really big files

I need to search through a TB of raw hard disk data to find a couple of things inside. I tried using sudo cat /dev/sdc | less, but this fails because it keeps everything it has read in RAM. I only have 8 GB of RAM and 8 GB of swap space, so putting a whole TB of data into RAM will not work.
I was wondering if I could somehow make less forget what it has read after the 1 GB mark, or maybe use another editor.
I accidentally repartitioned my drive and lost some important files. I tried some utilities but none of them worked so I tried this. I got a few of the files but I can't get the rest because the computer freezes and runs out of RAM.
I learned my lesson, I need to make more frequent backups. Any help is greatly appreciated.
The -B option to less is exactly what you ask for. It allows less to be forgetful. Combine with -b1048576 to allocate 1G (the -b unit is K)
Or do it the interactive way: run less normally, scroll down until the point where it starts to get a little laggy, then just type -B at the less prompt to activate the option (did you know you can set less options interactively?)
Just don't try to scroll backward very far or you'll be in forgotten-content land, where weird things happen.
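Putting that together, the invocation might look something like this (assuming the disk is /dev/sdc as in the question; the search string is a placeholder):
# cap less at roughly 1 GB of buffer space instead of keeping everything it reads
sudo cat /dev/sdc | less -B -b1048576
# or scan for text and print decimal byte offsets without buffering the device at all
sudo strings -t d /dev/sdc | grep 'some recognisable text'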
(Side note: I've done this kind of recovery before, and it's easier if you can find the filesystem structures (inode blocks etc.) that point to the data, rather than searching for the data in a big dump. Even if some of the inodes are gone, by first recovering everything you can from the surviving inodes you narrow down the range of unknown blocks where the other files might be.)

cp command time discrepancy

I'm not sure exactly what category to put this in.
I have tried the following with a 7.7 GB file on my CentOS 5.5 system:
time cp original copy
and
time cp copy copy2
Copying the copy takes about half as long as copying the original.
I thought maybe the OS was caching or something, so I went to another directory, copied a few small files, and went back to make the copy of the copy again, and it was still way faster.
Any ideas what's going on here? Is the OS caching the file or something?
What made me notice this problem is that I have some code that processes this file. I wanted to test it on two files, so I just made a copy. I then noticed that the original file takes the longest to process. What kind of diagnostics can I run on this?
The OS doesn't cache the file so much as it caches the disk blocks it read.
There's a couple of ways to try and account for caching when running timing tests. You could try to flush the OS disk buffers by allocating a huge amount of memory (I usually run something like perl -e '"\0"x1024x1024x1024' to do this); free before and after should give you an idea of how much data the OS has cached (under the buffers and cached columns).
Or, when you time your run, ignore the system time - that will be primarily I/O - and just watch the user time. Of course, different runs may very well be dealing with different amounts of data, so you would expect different amounts of I/O.
The most reliable way is to run the test several times and use the fastest time as the value to compare.
sync && echo 3 > /proc/sys/vm/drop_caches
time cp original copy
sync && echo 3 > /proc/sys/vm/drop_caches
time cp copy copy2

High %wa CPU load when running PHP as CLI

Sorry for the vague question, but I've just written some PHP code that executes itself as CLI, and I'm pretty sure it's misbehaving. When I run "top" on the command line it shows very little CPU used by any individual process, but between 40% and 98% going to iowait time (%wa). I usually have about 0.7% distributed between %us and %sy, with the rest idle (somewhere between 20% and 50% usually).
This server executes MySQL queries in easily 300x the time it takes other servers to run the same query, and it even takes what seems like forever to log on via SSH... so despite there being some idle CPU time left over, it seems clear that something very bad is happening. Whatever scripts are running are updating my MySQL database, but they seem to be exponentially slower than when they started.
I need some ideas to serve as launch points for me to diagnose what's going on.
Some things that I would like to know are:
How can I confirm how many scripts are actually running?
Is there any way to confirm that these scripts are actually shutting down when they are through, and not just "hanging around" taking up CPU time and memory?
What kind of bottlenecks should I be checking to make sure I don't create too many instances of this script, so that this doesn't happen again?
I realize this is probably a huge question, but I'm more than willing to follow any links provided and read up on this... I just need to know where to start looking.
High iowait means that your disk bandwidth is saturated. This might be just because you're flooding your MySQL server with too many queries, and it's maxing out the disk trying to load the data to execute them.
Alternatively, you might be running low on physical memory, causing large amounts of disk IO for swapping.
To start diagnosing, run vmstat 60 for 5 minutes and check the output - the si and so columns show swap-in and swap-out, and the bi and bo columns show other I/O. (Edit your question and paste the output in for more assistance.)
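Something along these lines, with the intervals only as examples (iostat comes from the sysstat package):
# swap activity (si/so) and block I/O (bi/bo), one line per minute for five minutes
vmstat 60 5
# per-device utilisation and wait times, to see which disk is saturated
iostat -x 60 5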
High iowait may mean you have a slow/defective disk. Try checking it out with a S.M.A.R.T. disk monitor.
http://www.linuxjournal.com/magazine/monitoring-hard-disks-smart
ps auxww | grep SCRIPTNAME
The same command will show whether instances are still hanging around after they should have finished.
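A slightly more direct way to count them, with SCRIPTNAME standing in for whatever the script is actually called:
# count running instances by matching the full command line
pgrep -fc SCRIPTNAME
# watch the count over time to check that finished runs really go away
watch -n 5 'pgrep -fc SCRIPTNAME'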
Why are you running more than one instance of your script to begin with?
