How to purge disk I/O caches on Linux?

I need to do it for more predictable benchmarking.

Sounds like you want the sync command, or the sync() function.
If you want disk cache flushing: echo 3 | sudo tee /proc/sys/vm/drop_caches

You can do it like this:
# sync                                  # push data modified through the FS out to the HDD cache, and flush the HDD cache
# echo 3 > /proc/sys/vm/drop_caches     # drop slab + pagecache (https://www.kernel.org/doc/Documentation/sysctl/vm.txt)
# blockdev --flushbufs /dev/sda
# hdparm -F /dev/sda
# THE NEXT COMMAND IS NOT FOR BENCHMARKING: it should be run before unplugging a
# drive; it flushes everything possible, guaranteed.
# echo 1 > /sys/block/sdX/device/delete
You can use strace to see that these end up as different syscalls (see the example below).
Also, depending on what exactly you are benchmarking, it may be desirable to turn off the drive's write cache entirely with hdparm (hdparm -W 0 /dev/sdX).
In any case, you cannot prevent the HDD from keeping the last 64/32/16 MB of recently used data in its on-board cache. To defeat that cache, write some amount of zeroes (and flush), then read some unrelated area of the disk; this is required because the cache may be split into a read part and a write part. After that you can benchmark the HDD.
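For example, something like this shows that sync, blockdev and hdparm end up in different calls (run as root; /dev/sda is just a placeholder):
strace -e trace=sync,syncfs sync                        # the sync()/syncfs() syscall
strace -e trace=ioctl blockdev --flushbufs /dev/sda     # the BLKFLSBUF ioctl
strace -e trace=ioctl hdparm -F /dev/sda                # a different ioctl (SG_IO or HDIO_*, depending on hdparm/kernel)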

Disk cache purging: echo 3 | sudo tee /proc/sys/vm/drop_caches
Command documentation: https://www.kernel.org/doc/Documentation/sysctl/vm.txt
Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation, and dirty objects are not freeable, the user should run "sync" first in order to make sure all cached objects are freed.
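Following that advice, the two steps can be combined into one line as a non-root user, e.g.:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches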

Short, good-enough answer (copy/paste friendly):
DISK=/dev/sdX # <===ADJUST THIS===
sync
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DISK
hdparm -F $DISK
Explanation:
sync: From the man page: flush file system buffers. Force changed blocks to disk, update the super block.
echo 3 > /proc/sys/vm/drop_caches: from the kernel docs, this will cause the kernel to drop clean caches
blockdev --flushbufs /dev/sda: from the man page: call block device ioctls [to] flush buffers.
hdparm -F /dev/sda: from the man page: Flush the on-drive write cache buffer (older drives may not implement this)
Although the blockdev and hdparm commands look similar, according to an answer above they issue different ioctls to the device.
Longer, probably better way:
(I'll assume that you have formatted the disk but you can adapt these commands if you want to write directly to the disk)
Run this only once before the 1st benchmark:
MOUNT=/mnt/test # <===ADJUST THIS===
# create a file with pseudo-random data. We will read it
# to fill the read cache of the HDD with garbage
dd if=/dev/urandom of=$MOUNT/temp-hddread.tmp bs=64M count=16
Run this every time you want to empty the caches:
DISK=/dev/sdX # <===ADJUST THIS===
MOUNT=/mnt/test # <===AND THIS===
# create a file with pseudo-random data to fill the write cache
# of the disk with garbage. Delete it afterwards; it's not useful anymore
dd if=/dev/urandom of=$MOUNT/temp-hddwrite.tmp bs=64M count=16
rm $MOUNT/temp-hddwrite.tmp
# see short good enough answer above
sync
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DISK
hdparm -F $DISK
# read the file with pseudo-random data to fill any read-cache
# the disk may have with garbage
dd if=$MOUNT/temp-hddread.tmp of=/dev/null
Run this when you're done:
MOUNT=/mnt/test # <===ADJUST THIS===
# delete the temporary file with pseudo-random data
rm $MOUNT/temp-hddread.tmp
Explanation:
The disk will probably have some hardware cache of its own. Some disks, by design or due to bugs, may not clear their caches when you issue the blockdev and hdparm commands. To compensate, we write and read pseudo-random data, hoping to fill these caches so that any cached data is evicted from them. How much data you need depends on the cache size. In the commands above I'm using dd to read/write 16*64MB = 1024MB; adjust the arguments if your HDD has a bigger cache (data sheets and experimentation are your friends, and it doesn't hurt to specify values above the actual size of the cache). I'm using /dev/urandom as the source of random data because it's fast and we don't care about true randomness (we only care about high entropy, because the disk firmware may compress data before storing it in the cache). I create /mnt/test/temp-hddread.tmp once at the start and read it every time I need enough random data; I create and delete /mnt/test/temp-hddwrite.tmp each time I need to write enough random data.
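If you want to size those dd runs to your drive, the on-board cache size can often be read from the drive's identify data (assuming an ATA drive that reports it; /dev/sdX is a placeholder):
sudo hdparm -I /dev/sdX | grep -i cache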
Credits
I wrote this answer based on the best parts of the existing answers.

Unmounting and re-mounting the disk under test will reset all caches and buffers.
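For example, assuming the filesystem under test is mounted at /mnt/test on /dev/sdX1:
sudo umount /mnt/test && sudo mount /dev/sdX1 /mnt/test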

Related

how to get Linux to automatically remove old pages from pagecache?

Ever since upgrading to Linux kernel 5.8 and later, I've been having problems with my system freezing up from running out of RAM, and it's all going to the pagecache.
I have a program that reorganizes the data from the OpenStreetMap planet_latest.osm.pbf file into a structure that's more efficient for my usage. However, because this file is larger than the amount of RAM on my system (60GB file versus 48GB RAM), the page cache fills up. Before kernel 5.8, the cache would reach full, and then keep chugging along (with an increase in disk thrashing). Since 5.8, the system freezes because it won't ever automatically release a page from the page cache (such as 30GB earlier in my sequential read of the planet_latest.osm.pbf file). I don't need to use my reorganizing program to hang the system; I found the following unprivileged command would do it:
cat planet_latest.osm.pbf >/dev/null
I have tried using the fadvise64() system call to manually force releases of pages in the planet file that I have already passed; it helps, but doesn't entirely solve the problem with the various output files my program creates (especially when those temporary output files are randomly read back later).
So, what does it take to get the 5.8 through 5.10 Linux kernel to actually automatically release old pages from the page cache when system RAM gets low?
To work around the problem, I have been using a script to monitor cache size and write to /proc/sys/vm/drop_caches when the cache gets too large, but of course that also releases new pages I am currently using along with obsolete pages.
while true ; do
    # column 6 of the "Mem:" line of free is the buff/cache figure (in KiB)
    H=`free | head -2 | tail -1 | awk '{print $6}'`
    if [ $H -gt 35000000 ]; then             # more than ~35 GB of cache
        echo -n $H " # " ; date
        echo 1 >/proc/sys/vm/drop_caches     # drop the pagecache
        sensors | grep '°C'                  # CPU temperatures, see note below
        H=`free | head -2 | tail -1 | awk '{print $6}'`
        echo -n $H " # "; date
    fi
    sleep 30
done
(the sensors stuff is to watch out for CPU overheating in stages of my program that are multi-threaded CPU-intensive rather than disk-intensive).
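(Side note: GNU dd can issue the same fadvise hint from the shell, dropping the cached pages of a single file instead of the whole page cache; the file name below is the one from this question:)
dd if=planet_latest.osm.pbf iflag=nocache count=0   # advise the kernel to drop this file's cached pages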
I have filed a bug report at kernel.org, but they haven't looked at it yet.

how to get rid of kswapd0 process running in linux

I frequently see kswapd0 running hot on one of our Linux machines. What could be the reason for that? Looking more into the issue, I understood that it is probably caused by low memory. I tried the options below to avoid it:
echo 1 > /proc/sys/vm/drop_caches
cat /proc/sys/vm/drop_caches
sudo cat /proc/sys/vm/swappiness
sudo sysctl vm.swappiness=60
but they did not yield useful results. What would be the best method to avoid this, or does it mean some action needs to be taken on the machine's RAM? Any suggestions on this?
Every time it happens, all the running apps are killed automatically and kswapd0 occupies the entire CPU and memory.

Backup for a linux system via osx

I have an ODROID (Raspberry Pi-like) machine with an Arch Linux system installed. Now I want to move the system from one microSD card (A) to another microSD card (B). When I tried this, the system became corrupted: information about file attributes was lost:
Copy files from A to the OS X host: cp -R /Volume/microsd_a/* ~/Desktop/backup
Copy files from the OS X host to B: cp -R ~/Desktop/backup/* /Volume/microsd_b
Is it really possible to copy a Linux system using an OS X host while preserving its integrity?
Update:
dd. I tried this, but there is a problem: my SD cards have different sizes, 64 GB and 16 GB, and the system installed on the 64 GB card takes no more than 8 GB. When I launched the copy, the output image file exceeded 16 GB and I killed the process. Besides, the MBR contains the partition table, which has to differ (one 64 GB partition vs. one 16 GB partition). Note that I do not need to copy the bootloader from the MBR; I can flash the bootloader by other means.
cp. What I was hoping to hear as an answer is the list of flags needed for this operation. Reading man cp didn't help me. cp -a does not copy all files because of a "Cannot allocate memory" error. I tried cp -aX, but no attributes were restored after copying the data to the second SD card.
tar. I tried multiple times with various flags; the last attempt was tar -cvpf followed by tar --same-owner -xpf, but the file attributes were still corrupted.
Again:
- Are you sure it is possible to preserve file attributes when copying ext4 -> APFS -> ext4?
- If it is possible, how does it work, and which command with which flags should I use?
cp -R changes permissions and time stamps and misses hidden files; you can't use that command to create a disk image.
What you need is a disk copy/clone. The command to use is dd.
Check out this webpage:
https://pbxbook.com/other/dd_clone.html
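A minimal sketch of that on the OS X side (the disk identifier /dev/disk2 is only an example; check which one is the card first):
diskutil list                        # find the SD card's identifier, e.g. /dev/disk2
diskutil unmountDisk /dev/disk2      # OS X must release the volumes before dd can read the raw device
sudo dd if=/dev/rdisk2 of=~/odroid-backup.img bs=1m
Note that this images the entire card, so restoring onto a smaller card would additionally require shrinking the source partition first (or imaging only the partition).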

How to measure IOPS for a command in linux?

I'm working on a simulation model where I want to determine when the storage IOPS capacity becomes a bottleneck (e.g. an HDD has ~150 IOPS, while an SSD can have 150,000). So I'm trying to come up with a way to measure the IOPS of a command (git) for some of its different operations (push, pull, merge, clone).
So far, I have found tools like iostat, however, I am not sure how to limit the report to what a single command does.
The best idea I can come up with is to determine my HDD's IOPS capacity, run time on the actual command to see how long it takes, and multiply that duration by the IOPS capacity:
HDD ->150 IOPS
time df -h
real 0m0.032s
150 * .032 = 4.8 IOPS
But, this is of course very stupid, because the duration of the execution may have been related to CPU usage rather than HDD usage, so unless usage of HDD was 100% for that time, it makes no sense to measure things like that.
So, how can I measure the IOPS for a command?
There are multiple time(1) commands on a typical Linux system; the default is a bash(1) builtin which is somewhat basic. There is also /usr/bin/time which you can run by either calling it exactly like that, or telling bash(1) to not use aliases and builtins by prefixing it with a backslash thus: \time. Debian has it in the "time" package which is installed by default, Ubuntu is likely identical, and other distributions will be quite similar.
Invoking it in a similar fashion to the shell builtin is already more verbose and informative, albeit perhaps more opaque unless you're already familiar with what the numbers really mean:
$ \time df
[output elided]
0.00user 0.00system 0:00.01elapsed 66%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+261minor)pagefaults 0swaps
However, I'd like to draw your attention to the man page which lists the -f option to customise the output format, and in particular the %w format which counts the number of times the process gave up its CPU timeslice for I/O:
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=184
$ \time -f 'ios=%w' du Maildir >/dev/null
ios=1
Note that the first run stopped for I/O 184 times, but the second run stopped just once. The first figure is credible, as there are 124 directories in my ~/Maildir: the reading of the directory and the inode gives roughly two IOPS per directory, less a bit because some inodes were likely next to each other and read in one operation, plus some extra again for mapping in the du(1) binary, shared libraries, and so on.
The second figure is of course lower due to Linux's disk cache. So the final piece is to flush the cache. sync(1) is a familiar command which flushes dirty writes to disk, but doesn't flush the read cache. You can flush that one by writing 3 to /proc/sys/vm/drop_caches. (Other values are also occasionally useful, but you want 3 here.) As a non-root user, the simplest way to do this is:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Combining that with /usr/bin/time should allow you to build the scripts you need to benchmark the commands you're interested in.
As a minor aside, tee(1) is used because this won't work:
sudo echo 3 >/proc/sys/vm/drop_caches
The reason? Although the echo(1) runs as root, the redirection is as your normal user account, which doesn't have write permissions to drop_caches. tee(1) effectively does the redirection as root.
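A minimal sketch of such a script (the git command and paths are placeholders):
#!/bin/sh
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
/usr/bin/time -f 'ios=%w elapsed=%es' git clone /path/to/repo /tmp/clone-test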
The iotop command collects I/O usage information about processes on Linux. By default it is an interactive command, but you can run it in batch mode with -b / --batch. Also, you can give it a list of processes with -p / --pid. Thus, you can monitor the activity of a git command with:
$ sudo iotop -p $(pidof git) -b
You can change the delay with -d / --delay.
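For example, to capture the I/O of a single run (the clone command is only a placeholder):
git clone /path/to/repo /tmp/clone-test &   # start the command to be measured in the background
sudo iotop -b -d 1 -p $!                    # batch-mode samples for just that PID, once per second
# stop with Ctrl-C when the clone finishes; very short commands may exit before iotop attaches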
You can use pidstat:
pidstat -d 2
More specifically, pidstat -d 2 | grep COMMAND, or pidstat -C COMMANDNAME -d 2.
The pidstat command is used for monitoring individual tasks currently being managed by the Linux kernel. It writes to standard output activities for every task selected with option -p or for every task managed by the Linux kernel if option -p ALL has been used. Not selecting any tasks is equivalent to specifying -p ALL but only active tasks (tasks with non-zero statistics values) will appear in the report.
The pidstat command can also be used for monitoring the child processes of selected tasks.
-C comm    Display only tasks whose command name includes the string comm. This string can be a regular expression.
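For example, to follow one specific invocation rather than every process matching a name (the clone command is only a placeholder):
git clone /path/to/repo /tmp/clone-test &   # start the command to be measured in the background
pidstat -d -p $! 1                          # one disk-I/O report per second for that PID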

Bash script doesn't wait until commands have been properly executed

I am working on a very simple script but for some reason parts of it seem to run asynchronously.
singlePartDevice() {
# http://www.linuxquestions.org/questions/linux-software-2/removing-all-partition-from-disk-690256/
# http://serverfault.com/questions/257356/mdadm-on-ubuntu-10-04-raid5-of-4-disks-one-disk-missing-after-reboot
# Create single partition
parted -s "$1" mklabel msdos
# Find size of disk
v_disk=$(parted -s "$1" print|awk '/^Disk/ {print $3}'|sed 's/[Mm][Bb]//')
parted -s "$1" mkpart primary ext3 4096 ${v_disk}
parted -s "$1" set 1 raid on
return 0
}
singlePartDevice "/dev/sdc"
singlePartDevice "/dev/sdd"
#/dev/sdc1 exists but /dev/sdd1 does not exist
sleep 5s
#/dev/sdc1 exists AND /dev/sdd1 does also exist
As you can see, before the call to sleep the script has only partially finished its job. How do I make my script wait until parted has successfully done its job?
(I am assuming that you are working on Linux due to the links in your question)
I am not very familiar with parted, but I believe that the partition device nodes are not created directly by it - they are created by udev, which is by nature an asynchronous procedure:
parted creates a partition
the kernel updates its internal state
the kernel notifies the udev daemon (udevd)
udevd checks its rule files (usually under /etc/udev/) and creates the appropriate device nodes
This procedure allows for clear separation of the device node handling policy from the kernel, which is a Good Thing (TM). Unfortunately, it also introduces relatively unpredictable delays.
A possible way to handle this is to have your script wait for the device nodes to appear:
while [ ! -e "/dev/sdd1" ]; do sleep 1; done
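Alternatively, since the delay comes from udev, you can ask udev to finish processing its queue instead of polling (assuming udevadm is available):
udevadm settle   # blocks until the udev event queue is empty, i.e. the device nodes have been created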
Assuming all you want to do is ensure that the partitions are created before proceeding, there are a couple of different approaches:
Check whether the parted process has completed before moving to the next step.
Check whether the partition device nodes exist before moving to the next step, e.g.:
until [ -b /dev/sdc1 ] && [ -b /dev/sdd1 ]; do
    sleep 5
done

Resources