I purchased a virtual server with 8 vCPUs, 16 GB of memory, and a 500 GB SSD volume (backed by Ceph RBD). I then used fio to test the server's I/O performance. To better understand the fio results, I also used blktrace during the test to capture the block-layer I/O trace.
seqwrite
fio --filename=/dev/vdc --ioengine=libaio --bs=4k --rw=write --size=8G --iodepth=64 --numjobs=8 --direct=1 --runtime=960 --name=seqwrite --group_reporting
fio output for seqwrite
parsed blktrace output for seqwrite
randread
fio --filename=/dev/vdc --ioengine=libaio --bs=4k --rw=randread --size=8G --iodepth=64 --numjobs=8 --direct=1 --runtime=960 --name=randread --group_reporting
fio output for randread
parsed blktrace output for randread
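For context, block-layer traces like the ones summarized above are typically captured and parsed along these lines (the exact options here are illustrative):
blktrace -d /dev/vdc -o vdc_trace          # capture while the fio job runs
blkparse -i vdc_trace -d vdc_trace.bin     # merge the per-CPU trace files into one binary dump
btt -i vdc_trace.bin                       # per-phase latency summary (Q2G, G2I, Q2M, I2D, D2C, ...)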
What I am trying to understand is the difference at the block layer between seqwrite and randread.
Why does randread have a large portion of I2D but seqwrite does not?
Why doesn't randread have Q2M?
(Note this isn't really a programming question so Stackoverflow is the wrong place to ask this... Maybe Super User or Serverfault would be a better choice?)
Why does randread have a large portion of I2D but seqwrite does not?
Did you realise each of your 8 numjobs is overwriting the same area as the other numjobs? This means the block layer may be able to throw subsequent requests away if an overwrite for the same region comes in close enough (which is somewhat likely in the sequential case)...
Why doesn't randread have Q2M?
It's hard to back merge random I/O with existing queued I/O as it's often discontiguous!
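If you want to see merge activity directly, iostat can report it per device (the rrqm/s and wrqm/s columns count read/write requests merged at the block layer; the device name is just the one from the question):
iostat -x vdc 1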
Related
I want to run in a configuration where writes are at most 64 KB in size.
(This isn't the right site for this type of question because it's not about programming - Super User or Serverfault look more appropriate)
I want to run in a configuration where writes are at most 64 KB in size.
Isn't that what the blocksize parameter does? For example:
fio --filename=/tmp/fio.tmp --size=128k --bs=64k --rw=write --name=example
This creates a 128 KiB file and then writes two 64 KiB I/Os to it (but be aware the filesystem/block layer may modify requests before they are submitted to the disk).
If you meant "I'd like to write a range of different sized I/Os with 64KiB being the maximum" then take a look at the blocksize_range and/or bssplit parameters.
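For example (the sizes and split percentages here are purely illustrative), to issue a mix of write sizes capped at 64 KiB:
fio --filename=/tmp/fio.tmp --size=1m --rw=write --name=mixed_capped --bsrange=4k-64k
# or, to control the distribution explicitly:
fio --filename=/tmp/fio.tmp --size=1m --rw=write --name=split_capped --bssplit=4k/25:16k/25:64k/50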
I am doing FIO testing of /dev/pmem for sequential read using the command:
fio --name=readf --filename=/dev/pmem --iodepth=4 --ioengine=libaio --direct=1 --buffered=0 --groupreprting --timebased --bs=64k --size=10g --rw=read --norandommap --refillbuffers=1 --randrepeat=0 --runtime=300
The question is vague and can be read as "what does your maximum disk bandwidth depend on?" Factors include:
Speed of the underlying device.
State of the underlying device.
Busyness of the system.
Amount of I/O that can be sent down in tandem.
How I/O is batched together.
Block size chosen (disks generally have an optimal size).
Whether the I/O is sequential or random.
Whether there is other I/O happening to the same disk (e.g. SMART updates).
Whether cache on the device is exhausted.
Whether the device has to do maintenance.
Size of caches.
Amount of I/O put down.
Compressibility of data.
Configuration parameters of the OS.
Configuration parameters of the hardware.
Size of the region I/O is done within.
In the job given, things that stand out are:
The iodepth looks a bit low; try boosting it until you no longer see a benefit.
Setting norandommap is meaningless for a sequential job.
Setting both direct=1 and buffered=0 is redundant (they mean the same thing).
groupreprting and timebased are spelt incorrectly (they should be group_reporting and time_based).
refillbuffers is also spelt wrong (it should be refill_buffers), and you might get away with scramble_buffers for lower overhead, at the risk of less random data (a corrected command is sketched after this list).
You might get some benefit from pinning fio to appropriate CPUs
You might get some benefit submitting I/O in batches
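Putting the fixes above together, a corrected version of the job might look something like this (the iodepth value is only an illustration of "boost it until you see no benefit"):
fio --name=readf --filename=/dev/pmem --iodepth=32 --ioengine=libaio --direct=1 --group_reporting --time_based --bs=64k --size=10g --rw=read --refill_buffers --randrepeat=0 --runtime=300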
I was testing with different numbers of jobs and expected to get the total I/O throughput for each job count.
I expected the number of jobs to be positively correlated with total I/O throughput.
The test I conducted on the SSD workstation is below.
The result didn't make sense, because the I/O throughput with 1 job was higher than with multiple jobs.
FIO Test Result using SSD
However, when I tested it on my MacBook with VirtualBox (with an SSD configured), the result was different.
FIO Test Result using VirtualBox
These are the FIO parameters that I used in the test:
filename=/dev/sdd
bs=4k
numjobs=1 ~ 64
iodepth=32
direct=1
ioengine=libaio
rw=read
runtime=20
group_reporting=1
Is there something that I have done wrong?
I believe I have used the parameters wrong in this case.
I figured out the answer: the reason was how FIO should properly be used.
If FIO uses an asynchronous engine, the iodepth should be increased; if FIO is set to a synchronous engine, the number of jobs should be increased to raise the throughput.
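For illustration (the device path and values below are just examples), the two approaches might look like this:
# synchronous engine: scale parallelism with numjobs
fio --name=sync_jobs --filename=/dev/sdd --ioengine=psync --rw=read --bs=4k --direct=1 --numjobs=8 --runtime=20 --group_reporting
# asynchronous engine: scale parallelism with iodepth
fio --name=async_depth --filename=/dev/sdd --ioengine=libaio --rw=read --bs=4k --direct=1 --iodepth=32 --numjobs=1 --runtime=20 --group_reporting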
The answer over on fio -numjobs bigger, the iops will be smaller, the reason is? might have some information applicable here. However, as you're using an SSD, I'd note that your iodepth (32) roughly matches the typical maximum number of SATA commands you can have outstanding. This would mean your first job already generated the maximum throughput and you're deep into diminishing returns: adding more simultaneous jobs just leads to more queuing, which means more latency. All that extra work isn't generating a benefit but is chewing up resources, which means the I/O that could be processed quickly isn't being submitted as fast, and thus performance is past the peak.
I was trying to get performance numbers (simple 4K random read) using the fio tool with the libaio ioengine.
I observed that if direct I/O is disabled (direct=0), the IOPS fell drastically; when direct=1 was provided, the IOPS were 50 times better!
Setup: fio is run from a Linux client connected to a PCIe-based appliance over Fibre Channel.
Here is a snippet from my fio config file:
[global]
filename=/dev/dm-30
size=10G
runtime=300
time_based
group_reporting
[test]
rw=randread
bs=4k
iodepth=16
runtime=300
ioengine=libaio
refill_buffers
ioscheduler=noop
#direct=1
With this setup, I observed around 8000 IOPS, and when I enabled direct=1 in the config file shown above, the IOPS jumped to 250K! (which is realistic for the setup I am using)
So my question is: if we use the libaio engine, does using buffered I/O cause any issues? Is it mandatory to stick to direct I/O when using libaio?
Per the docs on Kernel Asynchronous I/O (AIO) Support for Linux:
What Does Not Work?
AIO read and write on files opened without O_DIRECT (i.e. normal buffered filesystem AIO). On ext2, ext3, jfs, xfs and nfs, these do not return an explicit error, but quietly default to synchronous or rather non-AIO behaviour (i.e io_submit waits for I/O to complete in these cases). For most other filesystems, -EINVAL is reported.
In short, if you don't use O_DIRECT, AIO still "works" for many of the most common filesystems, but becomes a slow form of synchronous I/O (you may as well have just used read/write and saved yourself a few system calls). The massive performance increase is the result of actually benefiting from asynchronous behaviour.
So to answer the question in your title: Yes, libaio should only be used with unbuffered/O_DIRECT file descriptors if you expect to derive any benefit from it.
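One way to see this for yourself (a sketch; the fio arguments are only illustrative and strace adds noticeable overhead) is to time how long fio spends inside io_submit with and without direct=1:
strace -f -T -e trace=io_submit,io_getevents fio --name=aio_check --filename=/dev/dm-30 --ioengine=libaio --rw=randread --bs=4k --iodepth=16 --direct=0 --runtime=10 2>&1 | tail -n 30
# With direct=0 each io_submit call tends to take about as long as the I/O itself
# (effectively synchronous); with direct=1 it should return almost immediately.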
My application uses O_DIRECT to flush 2 MB worth of data directly to a 3-way-striped storage (mounted as an LVM volume).
I am getting very poor write speed on this storage. iostat shows that the large requests are being broken up into smaller ones.
avgrq-sz is < 20... There isn't much reading on that drive.
It takes around 2 seconds to flush 2 MB worth of contiguous memory blocks (using mlock), sector aligned (using posix_memalign), whereas tests with dd and iozone rate the storage as capable of > 20 Mbps of write speed.
I would appreciate any clues on how to investigate this issue further.
PS: If this is not the right forum for this query, I would appreciate pointers to one that would be.
Thanks.
Write IO breakups on linux?
The disk itself may have a maximum request size, there is a tradeoff between block size and latency (the bigger the request sent to the disk, the longer it will likely take to be consumed), and there can be constraints on how much vectored I/O a driver can consume in a single request. Given all the above, the kernel will "break up" single requests that are too large when submitting them further down the stack.
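One concrete limit you can check yourself (standard sysfs paths; substitute your own device name for sdX):
cat /sys/block/sdX/queue/max_sectors_kb      # largest request (in KiB) the kernel will currently build
cat /sys/block/sdX/queue/max_hw_sectors_kb   # hard limit imposed by the hardware/driver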
I would appreciate any clues on how to investigate this issue further.
Unfortunately it's hard to say why the avgrq-sz is so small (if it's in sectors, that's about 10 KBytes per I/O) without seeing the code actually submitting the I/O (maybe your program is submitting 10 KByte buffers?). We also don't know whether iozone and dd were using O_DIRECT during the questioner's tests. If they weren't, their I/O would have been going into the writeback cache and streamed out later, and the kernel can do that in a more optimal fashion.
Note: Using O_DIRECT is NOT a go-faster stripe. In the right circumstances O_DIRECT can lower overhead, BUT writing O_DIRECTly to the disk increases the pressure on you to submit I/O in parallel (e.g. via AIO/io_uring or via multiple processes/threads) if you want to reach the highest possible throughput, because you have robbed the kernel of its best way of creating parallel submission to the device for you.
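As a starting point, you can watch what request sizes actually reach the device while your application is writing (substitute your LVM device for dm-X; these are standard blktrace/iostat invocations):
blktrace -d /dev/dm-X -o - | blkparse -i -   # live view of each request as it is queued and dispatched, with sizes
iostat -x dm-X 1                             # coarser: per-second request sizes (avgrq-sz) and merge counters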