I ran an experiment to check the LBAs of files using IOzone.
I ran IOzone with the flags -o (sync write), -i 0 (sequential write), -t 4 (4 threads) and -r 32k (32-kilobyte record size).
I expected the LBAs of the 4 files to be interleaved at the 32-kilobyte record size and to interfere with each other. But the result is strange: the files are laid out in larger chunks than I expected, and each chunk is 2x bigger than the previous one. I first tried this on ext4 and assumed the cause was ext4's multi-block allocator, but the same thing happens on ext3. I then checked the request size using blktrace, and it was 32 KB as I expected.
I can't figure out the reason, so I'm posting this along with the results of my experiment.
Please help me out.
Thank you for reading my question.
When I look at du -hs for a directory, I get around 956K, but when I do du -b for the same directory I get around 604347 bytes, which would be around 590K. Why is there such a large difference (590K vs 956K) between these two commands? Am I reading it incorrectly?
The option -b does two things: first, it reports the apparent size instead of the disk usage; second, it reports with a granularity of a single byte.
The main reason for a file to take up more space than its apparent size, or even actual size, is that the filesystem works with fixed-size blocks. So even a file of only 1 byte will take up a whole block.
What you see is the effect of many small files.
In the ext4 filesystem, the default block size is 4096 bytes. In your scenario, a smaller block size would probably be beneficial. On the other hand, if your disk is supposed to store mostly large files, it makes sense to increase the block size. The organization of blocks, after all, also takes up space on the disk. So the value 4096 is a compromise in this regard, and it also fits modern hard drives, which internally work with 4096-byte sectors.
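The effect is easy to demonstrate. The sketch below (assuming GNU coreutils) creates a 1-byte file and compares its apparent size with its disk usage; on a typical ext4 filesystem the latter will be a full 4096-byte block:

```shell
# Create a 1-byte file in a temporary directory.
dir=$(mktemp -d)
printf 'x' > "$dir/tiny"

# Apparent size: exactly 1 byte.
apparent=$(du -b "$dir/tiny" | cut -f1)

# Disk usage, reported in byte units: typically a whole block
# (4096 bytes on default ext4), whatever the file's content size.
usage=$(du -B1 "$dir/tiny" | cut -f1)

echo "apparent=$apparent usage=$usage"
rm -r "$dir"
```

The exact usage figure depends on the filesystem the temporary directory lives on, but the apparent size is always 1.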
If you run man du you will see that -b is equivalent to --apparent-size --block-size=1, and this means:
print apparent sizes, rather than disk usage; although the apparent
size is usually smaller, it may be larger due to holes in ('sparse')
files, internal fragmentation, indirect blocks, and the like.
So the -b option will usually show values smaller than the disk usage of the file/directory in question, although, as the manual explains, this is not always the case.
Suppose there is a big file on a 1 TB SSD (read and write at 500 MB/sec) with an ext4 file system. The file is close to 1 TB in size. How can I estimate the speed of fseek() to the middle of the file? Is it going to take seconds or a few milliseconds? Thanks.
To estimate the latency of fseek, we should break it into two parts: software work and hardware seek time. The software work is done by the ext4 filesystem implementation (FS; in Linux this sits under the VFS subsystem of the kernel), which will generate several "random" requests (I/O operations) to the hardware block storage device. The hardware will then take some time to handle each random request.
Classic UNIX filesystems (UFS/FFS), and the Linux filesystems designed after them, use a superblock to describe the FS layout on disk, store files as inodes (with arrays of inodes in known places), and store file data in blocks of fixed size (up to 4 KB in Linux). To find an inode from a file name, the OS must read the superblock, find every directory in the path, and read data from each directory to find what inode number the file has (ls -i will show you the inodes of the current dir). Then, using data from the superblock, the OS can calculate where the inode is stored and read it.
The inode contains a list of the file's data blocks, usually in tree-like structures; see https://en.wikipedia.org/wiki/Inode_pointer_structure or http://e2fsprogs.sourceforge.net/ext2intro.html
The first part of the file, several dozen KB, is stored in blocks listed directly in the inode (direct blocks; 12 in ext2/3/4). For larger files the inode has a pointer to one block with a list of file blocks (indirectly addressed blocks). If the file is larger still, the next pointer in the inode is used to describe "double indirect blocks": it points to a block which enumerates other blocks, each of which contains pointers to blocks with the actual data. Sometimes a triple indirect block pointer is needed. These trees are rather efficient; at every level there is a fan-out of ~1024 (4 KB block, 4 bytes per block pointer). So, to access data from the middle of the file, ext2/3/4 may generate up to 4-5 low-level I/O requests (the superblock is cached in RAM, and the inode may be cached too). These requests are not contiguous in their addresses, so they are almost random seeks on the block device.
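As a sanity check on the tree fan-out, here is a quick back-of-the-envelope calculation, assuming 4 KB blocks and 4-byte block pointers (~1024 pointers per indirect block; with 8-byte pointers the fan-out would halve to ~512). It ignores other on-disk limits and only counts what the direct plus single/double/triple indirect pointers can address:

```shell
# Blocks reachable: 12 direct + p single-indirect + p^2 double + p^3 triple,
# where p is the number of pointers per 4 KB indirect block.
maxtib=$(awk 'BEGIN {
    p = 1024
    blocks = 12 + p + p^2 + p^3
    printf "%.0f", blocks * 4096 / 2^40   # bytes -> TiB
}')
echo "max addressable file size: ~${maxtib} TiB"
```

So the pointer tree alone can address roughly 4 TiB; real ext2/3 limits are lower for other reasons.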
Modern Linux filesystems (ext4, XFS) have an optimization for huge file storage called extents (https://en.wikipedia.org/wiki/Extent_(file_systems)). Extents allow the FS to describe file placement not as block lists, but as an array of (start_block, number_of_consecutive_blocks) pairs. Each extent can cover from a fraction of a MB up to 128 MB. The first 4 extents are stored inside the inode; additional extents are again stored as a tree-like structure. So, with extents, you may need 2-4 random I/O operations to access the middle of the file.
HDDs have slow access times for random requests, because they must physically move the heads to the correct circular track (and position the heads exactly on the track, which takes some fraction of a rotation, like 1/8 or 1/16) and then wait for up to 1 rotation (revolution) of the disks (platters) to reach the requested part of the track. Typical rotation speeds of HDDs are 5400 and 7200 rpm (revolutions per minute; 90 rps and 120 rps), or for high-speed enterprise HDDs 10000 rpm and 15000 rpm (160 rps and 250 rps). So the mean time needed to get data from a random position on the disk is around 0.7-1 rotations, and for a typical 7200 rpm HDD (120 rps) it is around 1/120 seconds = 8 ms (milliseconds) = 0.008 s. 8 ms are needed for every random request, and there are up to 4-5 random requests in your situation, so with an HDD you can expect times of up to nearly 40 ms to seek in the file. (The first seek will cost more; subsequent seeks may be cheaper, as part of the block pointer tree gets cached by the OS; seeks into the next few blocks are very cheap, because Linux can read them just after the first seek was requested.)
An SSD has no rotating or moving parts, and any request on an SSD is handled the same way. The SSD controller resolves the requested block id to an internal NAND chip + block id using its own translation tables, and then reads the real data. Reading data from NAND involves checking error correction codes, and sometimes several internal rereads are needed to read the block correctly. Reads are slower with the cheaper NAND type, TLC (3 bits of data stored in every cell, 8 levels); faster with MLC (2 bits of data, 4 levels); and very fast with SLC (1 bit, only 2 levels), though SLC SSDs are practically non-existent. Reads are also slower in worn-out SSDs, or in SSDs whose firmware has errors (wrong models of cell charge degradation).
The speed of such random accesses on an SSD is very high; it is usually declared in the SSD specification as something like 50000-100000 IOPS (I/O operations per second, usually with 4 KB requests). High IOPS counts may be declared for deeper queues, so the real average random read latency of an SSD (at QD1) is 200-300 microseconds per request (0.2-0.3 ms; as of 2014; some part of the latency is due to the slow SATA/SCSI emulation, and NVMe SSDs may be faster as they use a simpler software stack). With our 4-5 requests, we can estimate fseek on an SSD at a few milliseconds, for example up to 1-1.5 ms, or sometimes longer.
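The estimates above can be condensed into one small calculation. The inputs (roughly 8 ms per random HDD request at 7200 rpm, ~0.25 ms per QD1 random SSD read, 5 requests for a worst-case seek) are the answer's ballpark figures, not measurements:

```shell
# Worst case: up to 5 random block-layer requests per fseek+read.
requests=5

# HDD: one full 7200 rpm rotation = 60000 ms / 7200 ~= 8.3 ms per request.
hdd_ms=$(awk -v n="$requests" 'BEGIN { printf "%.0f", n * 60000 / 7200 }')

# SSD: ~0.25 ms per QD1 random read.
ssd_ms=$(awk -v n="$requests" 'BEGIN { printf "%.2f", n * 0.25 }')

echo "HDD worst case: ~${hdd_ms} ms, SSD worst case: ~${ssd_ms} ms"
```

That is the ~40 ms HDD figure and the low-single-digit-millisecond SSD figure quoted above.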
You can check the time needed for fseek using strace -T ./your_fseek_program. It will report the time needed to execute every syscall. But to get the real latency, you should check not only the seek time but also the time of the following read syscall. Before each run of this test you may want to flush the kernel caches with the echo 3 > /proc/sys/vm/drop_caches command, run as root (https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache).
You may also try some I/O benchmarks, like iozone, iometer or fio to estimate seek latencies.
I am currently trying to split a large file (9 GB) into smaller chunks to read into a database one at a time. Unfortunately, I only have a 20 GB SSD in the machine (it is a cheap VPS), so I only have 8 GB free; hence splitting the file so I can read and delete as I go. While I did think of scaling the VPS up for a short period of time, apparently I cannot do that right now, so I was stuck looking for other options.
I was wondering, therefore, if it is possible to use the split command to break a file into, say, 9 parts while incrementally removing the old file so that it can fit, instead of copying it (as split usually does).
I have looked in the manpages and see no reference to this process.
Thanks!
You could use tail -c 1G bigfile >lastchunk to save the last GB of bigfile into lastchunk, then truncate -s -1G bigfile to remove the last GB from bigfile (and free the disk space). Repeat until you are left with only handy-sized chunks.
Of course, the problem is how easy this is to get wrong. If the truncate removes a different number of bytes than the number of bytes read out by tail, you will either lose bytes or have duplicates, resulting in corrupt data. Using multipliers like G should reduce the possibility of harm. Still, have a backup and do a test run first.
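Here is a dry run of the scheme on a small scratch file, so the byte accounting can be verified before trying it on the real 9 GB file (GNU coreutils assumed; chunk size shrunk to 4 bytes for the demo). Note that the chunks come off the end of the file, so they must be reassembled in reverse order:

```shell
dir=$(mktemp -d)
printf 'abcdefghij' > "$dir/bigfile"       # stand-in for the 9 GB file
cp "$dir/bigfile" "$dir/original"          # keep a copy to verify against

chunk=4                                    # would be 1G in the real run
i=0
while [ -s "$dir/bigfile" ]; do
    tail -c "$chunk" "$dir/bigfile" > "$dir/part.$i"
    size=$(stat -c %s "$dir/bigfile")
    rest=$(( size > chunk ? size - chunk : 0 ))
    truncate -s "$rest" "$dir/bigfile"     # frees the space just copied out
    i=$(( i + 1 ))
done

# Chunks were taken from the end, so concatenate them last-to-first.
: > "$dir/rebuilt"
j=$(( i - 1 ))
while [ "$j" -ge 0 ]; do
    cat "$dir/part.$j" >> "$dir/rebuilt"
    j=$(( j - 1 ))
done

cmp -s "$dir/original" "$dir/rebuilt" && status=OK || status=CORRUPT
echo "round trip: $status"
rm -r "$dir"
```

Computing the remaining size with stat, instead of relying on a relative truncate -s -SIZE, also handles the final short chunk cleanly.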
My PC (with 4 GB of RAM) is running several IO-bound applications, and I want to avoid as many writes as possible on my SSD.
In /etc/sysctl.conf file I have set:
vm.dirty_background_ratio = 75
vm.dirty_ratio = 90
vm.dirty_expire_centisecs = 360000
vm.swappiness = 0
And in /etc/fstab I added the commit=3600 parameter.
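For reference, a hypothetical /etc/fstab line with that mount option (the device and mount point here are placeholders, not taken from the question):

```
/dev/sda2  /  ext4  defaults,commit=3600  0  1
```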
According to the free command, my PC usually stays at about 1 GB of RAM used by applications and about 2500 MB of available RAM. So with my settings I should be able to write at least about 1500-2000 MB of data without actually writing to the disk.
I have done some tests with moderate writes (300 MB - 1000 MB), and with the free and cat /proc/meminfo | grep Dirty commands I noticed that often, a short time after these writes (far less than the dirty_expire_centisecs time), the dirty bytes go down to a value close to 0.
I suspect that subsequent read operations fill the page cache until the machine is near an OOM condition and is forced to flush the dirty writes, ignoring my sysctl.conf settings (correct me if my hypothesis is wrong).
So the question is: is it possible to disable only read caching (AFAIK not possible), or at least to change the page cache replacement policy to give more priority to the write cache, so that the read cache cannot force a flush of dirty writes (maybe by tweaking the kernel source code...)? I know that I could easily solve this problem using tmpfs or a union-fs like AUFS or OverlayFS, but for many reasons I would like to avoid them.
Sorry for my bad English; I hope you understand my question. Thank you.
I have a user-space application that generates big SCSI writes (details below). However, when I look at the SCSI commands that reach the SCSI target (i.e. the storage, connected over FC), something is splitting these writes into 512K chunks.
The application basically does 1M-sized O_DIRECT writes straight to the device:
fd = open("/dev/sdab", ..|O_DIRECT);
write(fd, ..., 1024 * 1024);
This code causes two SCSI WRITEs to be sent, 512K each.
However, if I issue a direct SCSI command, without the block layer, the write is not split.
I issue the following command from the command line:
sg_dd bs=1M count=1 blk_sgio=1 if=/dev/urandom of=/dev/sdab oflag=direct
I can see one single 1M-sized SCSI WRITE.
The question is, what is splitting the write and, more importantly, is it configurable?
The Linux block layer seems to be the culprit (since SG_IO doesn't pass through it), and 512K seems too arbitrary a number not to be some sort of configurable parameter.
As described in an answer to the "Why is the size of my IO requests being limited, to about 512K" Unix & Linux Stack Exchange question and the "Device limitations" section of the "When 2MB turns into 512KB" document by kernel block layer maintainer Jens Axboe, this can be because your device and kernel have size restrictions (visible in /sys/block/<disk>/queue/):
max_hw_sectors_kb: the maximum size of a single I/O the hardware can accept
max_sectors_kb: the maximum size the block layer will send
max_segment_size and max_segments: the DMA engine limitations for scatter-gather (SG) I/O (the maximum size of each segment and the maximum number of segments for a single I/O)
The segment restrictions matter a lot when the buffer the I/O is coming from is not contiguous; in the worst case each segment can be as small as a page (which is 4096 bytes on x86 platforms). This means the SG list for one I/O can limit it to a size of 4096 * max_segments.
The question is, what is splitting the write
As you guessed the Linux block layer.
and, more importantly, is it configurable?
You can fiddle with max_sectors_kb, but the rest is fixed and comes from device/driver restrictions (so in your case I'm going to guess probably not, but you might see bigger I/Os directly after a reboot due to less memory fragmentation).
512K seems too arbitrary a number not to be some sort of a configurable parameter
The value is likely related to fragmented SG buffers. Let's assume you're on an x86 platform and have a max_segments of 128, so:
4096 * 128 / 1024 = 512
and that's where 512K could come from.
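A quick way to check these numbers on your own system is to read the queue directory directly (a sketch; the device name sda is an assumption, and some files may be absent on some kernels or non-Linux systems, in which case the example value of 128 segments is used):

```shell
q=/sys/block/sda/queue                     # substitute your device

for f in max_hw_sectors_kb max_sectors_kb max_segments max_segment_size; do
    if [ -r "$q/$f" ]; then
        printf '%-18s %s\n' "$f" "$(cat "$q/$f")"
    fi
done

# Worst case: every scatter-gather segment holds a single 4 KiB page,
# so one I/O tops out at max_segments * 4 KiB.
segs=$( [ -r "$q/max_segments" ] && cat "$q/max_segments" || echo 128 )
worst_kb=$(( segs * 4096 / 1024 ))
echo "worst-case single I/O: ${worst_kb} KiB"
```

With the assumed 128 segments this prints 512 KiB, matching the split observed in the question.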
Bonus chatter: according to https://twitter.com/axboe/status/1207509190907846657 , if your device uses an IOMMU rather than a DMA engine then you shouldn't be segment limited...
The blame is indeed on the block layer; the SCSI layer itself has little regard for the size. You should check, though, that the underlying layers are actually able to pass your request through, especially with regard to direct I/O, since that may be split into many small pages and require a scatter-gather list longer than what the hardware, or even just the drivers, can support (libata is/was somewhat limited).
You should look at and tune /sys/class/block/$DEV/queue; there are assorted files there, and the one most likely to match what you need is max_sectors_kb, but you can just try things out and see what works for you. You may also need to tune the partitions' variables as well.
There's a max-sectors-per-request attribute of the block driver. I'd have to check how to modify it. You used to be able to get this value via blockdev --getmaxsect, but I'm not seeing the --getmaxsect option in my machine's blockdev.
Looking at the following files should tell you whether the logical block size is different, possibly 512 in your case. I am not, however, sure whether you can write to these files to change those values (the logical block size, that is).
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
Try ioctl(fd, BLKSECTSET, &blocks).