How to estimate the seek speed in a file system - Linux

Suppose there is a big file on a 1 TB SSD (reads and writes at 500 MB/s) with an ext4 file system. The file is close to 1 TB in size. How can I estimate the speed of fseek() to the middle of the file? Is it going to take seconds or a few milliseconds? Thanks.

To estimate the latency of fseek, break it into two parts: software work and hardware seek latency. The software work is done by the ext4 filesystem implementation (in Linux this sits under the kernel's VFS subsystem), which will generate several "random" requests (I/O operations) to the underlying block storage device. The hardware then takes some time to handle each random request.
Classic UNIX filesystems (UFS/FFS) and the Linux filesystems designed after them use a superblock to describe the on-disk layout, represent files as inodes (stored in arrays at known locations), and store file data in fixed-size blocks (up to 4 KB in Linux). To find the inode for a file name, the OS must read the superblock, walk every directory in the path, and read directory data to learn which inode number the file has (ls -i will show you the inodes in the current directory). Then, using data from the superblock, the OS can calculate where that inode is stored and read it.
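As a small illustration of that lookup (a hedged sketch, not from the answer above; any existing path will do), every component of a path resolves to its own inode number, the same numbers ls -i prints:

import os

path = "/usr/share/dict"                    # example path; pick any that exists
parts = ["/"]
for part in path.strip("/").split("/"):
    parts.append(os.path.join(parts[-1], part))
for p in parts:
    print(os.stat(p).st_ino, p)             # inode number of each path component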
The inode contains the list of the file's data blocks, usually in a tree-like structure; see https://en.wikipedia.org/wiki/Inode_pointer_structure or http://e2fsprogs.sourceforge.net/ext2intro.html
The first part of the file, several dozen KB, is stored in blocks listed directly in the inode (direct blocks; 12 of them in ext2/3/4). For larger files the inode has a pointer to one block containing a list of file blocks (indirectly addressed blocks). If the file is larger still, the next pointer in the inode is used to describe "double indirect blocks": it points to a block that enumerates other blocks, each of which contains pointers to blocks with the actual data. Sometimes a triple indirect block pointer is needed. These trees are rather efficient; at every level the fan-out is roughly a thousand (4 KB block, 4-byte block pointers in ext2/3, so ~1024 pointers per block). So, to access data from the middle of the file, ext2/3/4 may generate up to 4-5 low-level I/O requests (the superblock is cached in RAM, and the inode may be cached too). These requests are not adjacent in address, so they are essentially random seeks on the block device.
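Here is a rough back-of-the-envelope sketch of that estimate in Python, assuming 4 KB blocks, 4-byte block pointers and the classic 12-direct/single/double/triple indirect layout (assumptions for illustration, not measurements):

BLOCK = 4096
PTRS_PER_BLOCK = BLOCK // 4        # ~1024 pointers fit in one indirect block
DIRECT = 12

def metadata_reads_for_offset(offset_bytes):
    # How many pointer blocks must be read to locate the data block?
    block_index = offset_bytes // BLOCK
    if block_index < DIRECT:
        return 0                   # pointer sits in the inode itself
    block_index -= DIRECT
    if block_index < PTRS_PER_BLOCK:
        return 1                   # single indirect block
    block_index -= PTRS_PER_BLOCK
    if block_index < PTRS_PER_BLOCK ** 2:
        return 2                   # double indirect chain
    return 3                       # triple indirect chain

print(metadata_reads_for_offset(500 * 10**9))   # middle of a ~1 TB file -> 3
# Add the data block itself, plus the inode if it is not cached:
# roughly 4-5 random reads in total, as estimated above.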
Modern Linux filesystems (ext4, XFS) have an optimization for storing huge files, called extents (https://en.wikipedia.org/wiki/Extent_(file_systems)). Extents let the FS describe file placement not as block lists but as an array of file fragments, i.e. (start_block, number_of_consecutive_blocks) pairs. Each fragment can cover from a fraction of a MB up to 128 MB. The first 4 extents are stored inside the inode; additional extents are again stored in a tree-like structure. So, with extents, you may need 2-4 random I/O operations to access the middle of the file.
HDDs have slow access times for random requests, because they must physically move the heads to the correct circular track (and settle the heads exactly on the track, which takes some fraction of a rotation, like 1/8 or 1/16), and then wait up to one full rotation (revolution) of the platters for the requested part of the track. Typical rotation speeds of HDDs are 5400 and 7200 rpm (revolutions per minute; 90 rps and 120 rps), or for high-speed enterprise HDDs 10000 and 15000 rpm (~167 rps and 250 rps). So the mean time needed to get data from a random position on the disk is around 0.7-1 rotations, and for a typical 7200 rpm HDD (120 rps) it is around 1/120 s = 8 ms (milliseconds) = 0.008 s. 8 ms are needed for every random request, and there are up to 4-5 random requests in your situation, so with an HDD you may expect up to around 40 ms to seek in the file. (The first seek will cost more; later seeks may be cheaper because parts of the block pointer tree get cached by the OS, and seeks to the next few blocks are very cheap because Linux can read them right after the first seek is serviced.)
An SSD has no rotating or moving parts, and every request is handled the same way. The SSD controller resolves the requested block id to an internal NAND chip + block id using its own translation tables, and then reads the real data. Reading data from NAND involves checking error correction codes, and sometimes several internal re-reads are needed to read a block correctly. Reads are slower with the cheaper NAND type, TLC, which stores 3 bits of data per cell (8 voltage levels); faster with MLC (2 bits, 4 levels); and fastest with the now rare SLC (1 bit, only 2 levels). Reads are also slower on worn-out SSDs, or on SSDs whose firmware mis-models cell charge degradation.
The speed of such random accesses on an SSD is very high; it is usually declared in the SSD specification as something like 50000-100000 IOPS (I/O operations per second, usually 4 KB). High IOPS figures may be quoted for deep queues, though, so the real average random read latency of an SSD at QD1 is 200-300 microseconds per request (0.2-0.3 ms, as of 2014; part of that latency is the slow SATA/SCSI emulation, and NVMe SSDs may be faster because they use a simpler software stack). With our 4-5 requests we can estimate fseek (plus the following read) on an SSD at a few milliseconds, for example up to 1-1.5 ms, or sometimes longer.
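Putting the two estimates together, a tiny sketch using the assumed numbers from above (7200 rpm HDD, ~250 µs QD1 SSD reads; illustrative, not measured):

requests = 5                                  # metadata + data, worst case

hdd_random_read_s = 0.9 * (1 / 120)           # ~0.7-1 rotation at 120 rps
print(f"HDD: ~{requests * hdd_random_read_s * 1000:.0f} ms")    # about 38 ms

ssd_random_read_s = 250e-6                    # 200-300 us per QD1 4 KB read
print(f"SSD: ~{requests * ssd_random_read_s * 1000:.2f} ms")    # about 1.25 ms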
You can check the time needed for fseek using strace -T ./your_fseek_program, which reports the time taken by every syscall. To get the real latency, check not only the seek but also the time of the next read syscall. Before each run of this test you may want to flush kernel caches with echo 3 > /proc/sys/vm/drop_caches as root (https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache).
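If you prefer to measure it directly from code, a minimal sketch (hypothetical path, not from the answer; it uses lseek/read rather than stdio fseek, and is only meaningful after dropping the caches as described above):

import os, time

path = "/path/to/bigfile"                  # placeholder path
size = os.path.getsize(path)

fd = os.open(path, os.O_RDONLY)
t0 = time.perf_counter()
os.lseek(fd, size // 2, os.SEEK_SET)       # the seek itself is just bookkeeping
os.read(fd, 4096)                          # the following read triggers the real I/O
t1 = time.perf_counter()
os.close(fd)

print(f"seek+read at mid-file: {(t1 - t0) * 1000:.2f} ms")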
You may also try some I/O benchmarks, like iozone, iometer or fio to estimate seek latencies.

Related

Parallel read file with dask

I have a question about the delayed decorator. It may seem similar to the following question, "Dask: How would I parallelize my code with dask delayed?", but even there it is not answered. I have the following code:
import dask
import pandas as pd

@dask.delayed
def remove_unnessasey_data(temp, l1):
    # do some work (elided in the question)
    return temp

@dask.delayed
def change_structure(temp):
    # do some work (elided in the question) producing temp1
    return temp1

@dask.delayed
def read_one(filename):
    return pd.read_csv(filename)
and then:
def f(filenames):
    results = []
    for filename in filenames:
        results.append(change_structure(
            remove_unnessasey_data(read_one(filename), l1)))  # l1 defined elsewhere
    return results

results = dask.compute(*f(filenames))
According to this it should increase the speed, but the speed is the same as when I read the big file in chunks. Can anybody explain why?
I am aware of the GIL, but according to the documentation this should enhance the speed.
according to this it should increase the speed
Bollocks. That documentation, for lack of a better word, is wrong in general.
Saying that doing IO in parallel will increase performance in general displays a significant misunderstanding of how most filesystems and disk storage systems work.
Why?
Seek time.
Generally, filesystems store files in chunks that are as contiguous as possible. To read position X in a file, the disk heads first have to be positioned over the track that contains the sector X lives in. That takes time. Then the system has to wait until that sector rotates under the disk heads. That again takes time.
It should be obvious why reading a file sequentially from a spinning disk is faster - to read sector N, the disk heads have to first seek to the track that contains sector N. But because files are stored as contiguously as possible, the track that contains sector N also likely contains sector N+1, N+2, N+3, and quite a bit more. Toss in the read ahead caching that both the disk (disks are not usually dumb devices - they're pretty much full-fledged IO computers that have built-in cache systems) and the filesystem do, and sequential reading of a file from a spinning disk tends to minimize the time spent looking for data.
Now try reading in parallel.
Thread A reads sector X. Disk seeks to track, waits for sector X to pass under the heads. While that's happening, thread B tries to read sector Y. Disk finally gets to read sector X, but has a pending command to read sector Y. Now disk has to seek heads to the proper track, perhaps abandoning the readahead it would have done to get sector X+1 for thread A's next read, wait for the heads to move, then wait for sector Y to pass under the heads to read.
Meanwhile, thread C issues a request to read sector Z...
And the disk heads dance all over the disk. Then wait for the proper sector to pass under the heads.
A typical consumer-grade 5,400 RPM SATA disk that nominally supports IO rates of 100 MB/sec can be reduced to a few KILOBYTES per second through such IO patterns.
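For a rough sense of the arithmetic behind that claim, here is a small Python sketch (illustrative assumed numbers, not from the answer): each scattered 4 KB read pays a full seek plus rotational latency, which collapses throughput to a small fraction of the sequential rate.

avg_seek_s = 0.011                        # ~11 ms average seek (assumed)
rotational_s = 0.5 * (1 / 90)             # half a rotation at 90 rps (5400 rpm)
io_size = 4 * 1024                        # 4 KB per scattered random read
throughput = io_size / (avg_seek_s + rotational_s)
print(f"random 4 KB reads: ~{throughput / 1024:.0f} KB/s vs ~100 MB/s sequential")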
Reading or writing data in parallel almost never increases speed, especially if you're using standard filesystems on spinning disks.
You can get better performance using SSD(s) if a single thread's IO doesn't saturate the storage system - not just the disk, but the entire path between CPU and disk. Many, many motherboards have cheap, slow disk controllers and/or insufficient IO bandwidth. How many people completely ignore the disk controller or the IO bandwidth of the motherboard when buying a computer?
There are filesystems that do support parallel IO for improved performance. They tend to be proprietary, expensive, and FAST. IBM's Spectrum Scale (originally GPFS) and Oracle's HSM (originally SAMFS/QFS) are two examples.

how to choose chunk size when reading a large file?

I know that reading a file with a chunk size that is a multiple of the filesystem block size is better.
1) Why is that the case? I mean, let's say the block size is 8 KB and I read 9 KB. That means it has to go and fetch 16 KB and then throw away the extra 7 KB.
Yes, it did some extra work, but does that make much of a difference unless your block size is really huge?
I mean, yes, if I am reading a 1 TB file then this definitely makes a difference.
The other reason I can think of is that the block size refers to a group of sectors on the hard disk (please correct me). So it could be pointing to 8 or 16 or 32 sectors, or just one. So your hard disk would essentially have to do more work if the block points to a lot more sectors? Am I right?
2) So let's say the block size is 8 KB. Do I now read 16 KB at a time? 1 MB? 1 GB? What should I use as a chunk size?
I know available memory is a limitation, but apart from that, what other factors affect my choice?
Thanks a lot in advance for all the answers.
Theoretically, the fastest I/O occurs when the buffer is page-aligned and its size is a multiple of the system block size.
If the file were stored contiguously on the hard disk, the fastest I/O throughput would be attained by reading cylinder by cylinder. (There might not even be any rotational latency then: when you read a whole track you don't need to start at its beginning; you can start in the middle and wrap around.) Unfortunately, nowadays that is nearly impossible, since the hard disk firmware hides the physical layout of the sectors and may use replacement sectors that require seeks even while reading a single track. The OS file system may also try to spread the file blocks all over the disk (or at least all over a cylinder group), to avoid long seeks over big files when accessing small files.
So instead of considering physical tracks, you may try to take into account the hard disk's buffer size. Most hard disks have a buffer size of 8 MB, some 16 MB. So reading the file in chunks of up to 1 MB or 2 MB should let the hard disk firmware optimize the throughput without stalling its buffer.
But then, if there are a lot of layers in between, e.g. a RAID, all bets are off.
Really, the best you can do is to benchmark your particular circumstances.
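For example, a quick-and-dirty benchmark sketch (hypothetical path; drop the page cache between runs, as mentioned earlier in this page, for meaningful numbers):

import time

def time_read(path, chunk_size):
    # Read the whole file in chunk_size pieces and time it.
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(chunk_size):
            pass
    return time.perf_counter() - start

for chunk in (4 * 1024, 64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    print(f"{chunk // 1024:>6} KB chunks: {time_read('/path/to/file', chunk):.2f} s")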

Consistency guarantee of file system regarding sequential write

My program (only 1 process and 1 thread) sequentially writes n consecutive chunks of data to a file on an HDD (a regular kind of HDD) using the plain old write system call. It's like some kind of append-only log file.
After a system crash (power failure, not HDD failure), I read back and verified that chunks[i] (0 < i < n) had been entirely written to disk (by checking its length). Maybe the chunk's content is not checksum-correct, but the whole chunks[i] still sits stably on the surface of the magnetic disk.
Is it safe for me to assume all other chunks before chunks[i] were entirely written too? Or could there exist one (or many) chunks[j] (0 < j < i) that is only partly written (or not written at all)? I know that random writes can be reordered to improve disk throughput, but can sequential writes be reordered too?
Yes, writes that appear sequential (to you) can be reordered before being written to disk, primarily because the order seen by your code (or even the OS) may not correspond directly to locations on the disk.
Although IDE disks did (at one time) use addressing based on specifying a track, head and sector that would hold a piece of data, they've long since converted to a system where you just have some number of sectors, and it's up to the disk to arrange those in an order that makes sense. It usually does a pretty good job, but in some cases (especially if a sector has gone bad and been replaced by a spare sector) it may make the most sense to write sectors out of order.
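The answer above doesn't prescribe a remedy, but if the application needs chunks[j] to be durable before chunks[j+1] is written, the usual approach (an assumption on my part, not stated in the answer) is to add an explicit flush between writes. A minimal sketch, assuming Linux and os.fdatasync:

import os

def append_chunks(path, chunks):
    # Append each chunk and flush it to stable storage before writing the next.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for chunk in chunks:
            os.write(fd, chunk)
            os.fdatasync(fd)   # barrier: earlier chunks reach the disk first
    finally:
        os.close(fd)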

writeback of dirty pages in linux

I have a question regarding the writeback of dirty pages. If a portion of a page's data is modified, will the writeback write the whole page to disk, or only the part of the page with modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to find out which bytes of a 4096-byte page have really changed and which are unchanged.
Theoretically, the disk driver layer could check which bytes have changed and skip writing the 512-byte blocks that are unchanged.
However, this would mean that, if the blocks are no longer in the disk cache, the page must first be read back from the hard disk to check what has changed before writing.
I do not think Linux does it that way, because reading the page back from disk would cost too much time.
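A small sketch, if you want to inspect the relevant granularities on your own machine (values vary by platform and filesystem; these calls are standard on Linux):

import os

print("page size     :", os.sysconf("SC_PAGE_SIZE"))   # typically 4096 on x86
print("fs block size :", os.statvfs("/").f_bsize)      # ext4 commonly uses 4096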
On each hardware interrupt, the CPU tries to transfer as much data as the hard disk controller can handle - this size is what we call the block size (or one sector, in Linux):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting too long on a single interrupt for a large file can make the system appear unresponsive, so it is logical to break transfers into smaller chunks (like 512 bytes) so that the CPU can handle other tasks between them. Therefore, whether you change one byte or 511 bytes, as long as they are within that single block, all of its data gets written at the same time. Throughout the Linux kernel, marking blocks dirty (or not) for writeback goes by a single identifier, the sector number, so anything smaller than a sector is too fine-grained to manage efficiently.
All that said, don't forget that the hard disk controller itself also has a minimum block size for write operations.

Historical perspective to Linux Filesystems

Jonathan Leffler's comment in the question "How can I find the Size of some specified files?" is thought-provoking. I will break it into parts for analysis.
-- files are stored on pages; you normally end up with more space being used than that calculation gives, because a 1-byte file (often) occupies one page (of maybe 512 bytes). The exact values vary - it was easier in the days of the 7th Edition Unix file system (though not trivial even then if you wanted to take account of indirect blocks referenced by the inode as well as the raw data blocks).
Questions about the parts
What is the definition of "page"?
Why is the word "maybe" in the after-thought "one page (of maybe 512 bytes)"?
Why was it easier to measure exact sizes in the "7th Edition Unix file system"?
What is the definition of "indirect block"?
How can you have references by two things: "the inode" and "the raw data blocks"?
Historical Questions Emerged
I. What is the historical context Leffler is speaking about?
II. Have the definitions changed over time?
I think he means block instead of page, a block being the minimum addressable unit on the filesystem.
Block sizes can vary.
Not sure why, but perhaps because the filesystem interface exposed APIs that allowed a more exact measurement.
An indirect block is a block, referenced from the inode, that contains pointers to further blocks rather than file data.
The inode occupies space (blocks) just as the raw data does. This is what the author meant.
As usual for Wikipedia pages, Block (data storage) is informative despite being far too exuberant about linking all keywords.
In computing (specifically data transmission and data storage), a block is a sequence of bytes or bits, having a nominal length (a block size). Data thus structured is said to be blocked. The process of putting data into blocks is called blocking. Blocking is used to facilitate the handling of the data-stream by the computer program receiving the data. Blocked data is normally read a whole block at a time. Blocking is almost universally employed when storing data to 9-track magnetic tape, to rotating media such as floppy disks, hard disks, optical discs and to NAND flash memory.
Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data, though the block size in file systems may be a multiple of the physical block size. In classical file systems, a single block may only contain a part of a single file. This leads to space inefficiency due to internal fragmentation, since file lengths are often not multiples of block size, and thus the last block of files will remain partially empty. This will create slack space, which averages half a block per file. Some newer file systems attempt to solve this through techniques called block suballocation and tail merging.
There's also a reasonable overview of the classical Unix File System.
Traditionally, hard disk geometry (the layout of blocks on the disk itself) has been CHS.
Head: the magnetic reader/writer on each (side of a) platter; can move in and out to access different cylinders
Cylinder: the set of tracks (one under each head) at a given arm position; each track passes under its head as the platter rotates
Sector: a constant-sized amount of data stored contiguously on a portion of the track; the smallest unit of data that the drive can deal with
CHS isn't used much these days, as
Hard disks no longer use a constant number of sectors per cylinder. More data is squeezed onto a platter by using a constant arclength per sector rather than a constant rotational angle, so there are more sectors on the outer cylinders than there are on the inner cylinders.
By the ATA specification, a drive may have no more than 2^16 cylinders, 2^4 heads, and 2^8 sectors per track; with 512 B sectors, this is a limit of 128 GB. Through BIOS INT13, it is not possible to access anything beyond 7.88 GB through CHS anyway.
For backwards-compatibility, larger drives still claim to have a CHS geometry (otherwise DOS wouldn't be able to boot), but getting to any of the higher data requires using LBA addressing.
CHS doesn't even make sense on RAID or non-rotational media.
but for historical reasons, this has affected block sizes: because sector sizes were almost always 512B, filesystem block sizes have always been multiples of 512B. (There is a movement afoot to introduce drives with 1kB and 4kB sector sizes, but compatibility looks rather painful.)
Generally speaking, smaller filesystem block sizes result in less wasted space when storing many small files (unless advanced techniques like tail merging are in use), while larger block sizes reduce external fragmentation and have lower overhead on large disks. The filesystem block size is usually a power of 2, is limited below by the block device's sector size, and is often limited above by the OS's page size.
The page size varies by OS and platform (and, in the case of Linux, can vary by configuration as well). Like block sizes, smaller page sizes reduce internal fragmentation but require more administrative overhead. A 4 kB page size is common on 32-bit platforms.
Now, on to indirect blocks. In the UFS design:
An inode describes a file.
The number of pointers to data blocks that an inode can hold is very limited (fewer than 16); the specific number varies in derived implementations.
For small files, the pointers can directly point to the data blocks that compose a file.
For larger files, there must be indirect pointers, which point to a block which only contains more pointers to blocks. These may be direct pointers to data blocks belonging to the file, or if the file is very large, they may be even more indirect pointers.
Thus the amount of storage required for a file may be greater than just the blocks containing its data, when indirect pointers are in use.
Not all filesystems use this method for keeping track of the data blocks belonging to a file. FAT simply uses a single file allocation table, which is effectively a gigantic series of linked lists, and many modern filesystems use extents.
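To see Leffler's point in practice, here is a short sketch (the path is a placeholder) comparing a file's logical size with the space actually charged to it, which on most Unix filesystems includes indirect/extent metadata blocks as well as data blocks:

import os

st = os.stat("/path/to/some/file")          # placeholder path
print("logical size :", st.st_size, "bytes")
print("space on disk:", st.st_blocks * 512, "bytes")   # st_blocks is in 512-byte units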
