How to choose chunk size when reading a large file? (Linux)

I know that reading a file in chunks whose size is a multiple of the filesystem block size is better.
1) Why is that the case? Let's say the block size is 8 KB and I read 9 KB. That means the filesystem has to go and fetch 16 KB and then throw away the extra 7 KB. Yes, it did some extra work, but does that make much of a difference unless the block size is really large? Granted, if I am reading a 1 TB file then it definitely adds up. The other reason I can think of is that a block refers to a group of sectors on the hard disk (please correct me if I'm wrong), so it could map to 1, 8, 16 or 32 sectors. Would the hard disk essentially have to do more work if a block points to many more sectors? Am I right?
2) So let's say the block size is 8 KB. Do I now read 16 KB at a time? 1 MB? 1 GB? What should I use as a chunk size? I know available memory is a limitation, but apart from that, what other factors affect my choice?
Thanks a lot in advance for all the answers.

Theoretically, the fastest I/O occurs when the buffer is page-aligned and its size is a multiple of the filesystem block size.
If the file were stored contiguously on the hard disk, the fastest throughput would be attained by reading cylinder by cylinder. (There might not even be any rotational latency then, since when you read a whole track you don't need to start at the beginning; you can start in the middle and wrap around.) Unfortunately, nowadays that is nearly impossible to do, since the hard disk firmware hides the physical layout of the sectors and may use replacement sectors that require seeks even while reading a single track. The OS file system may also try to spread the file blocks all over the disk (or at least all over a cylinder group) to avoid having to do long seeks over big files when accessing small files.
So instead of considering physical tracks, you may try to take the hard disk's buffer size into account. Most hard disks have an 8 MB buffer, some 16 MB. So reading the file in chunks of up to 1 MB or 2 MB should let the hard disk firmware optimize the throughput without stalling its buffer.
But then, if there are a lot of layers above, e.g. a RAID, all bets are off.
Really, the best you can do is to benchmark your particular circumstances.
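To make this concrete, here is a minimal sketch in C of reading a file in page-aligned, block-multiple chunks; the 1 MB chunk size matches the suggestion above but is otherwise an arbitrary starting point for your own benchmarks, not a recommendation:
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK_SIZE (1024 * 1024)   /* 1 MB: a multiple of any common block size */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return 1;
        }

        /* Page-aligned buffer; the alignment also satisfies the stricter
           requirements of O_DIRECT if you ever experiment with it. */
        void *buf;
        if (posix_memalign(&buf, (size_t)sysconf(_SC_PAGESIZE), CHUNK_SIZE) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        ssize_t n;
        long long total = 0;
        while ((n = read(fd, buf, CHUNK_SIZE)) > 0)
            total += n;                /* process the chunk here */
        if (n < 0)
            perror("read");

        printf("read %lld bytes\n", total);
        close(fd);
        free(buf);
        return 0;
    }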

Related

File Copy using mmap

Problem: transfer a 350 MB file from Linux networking box 1 to another Linux networking box 2, on the fly, when box 2 requests it.
The system has limited memory of 1 GB, and the 350 MB file is stored on disk. The system is actually busy doing a lot of other things.
What is the best approach to transfer the file on the fly, on demand? If I read the complete file from disk and hold it in RAM before transferring, that would take up a lot of memory. If I want to avoid that, would using mmap to transfer the file help? How would mmap fit into this scenario?
In most cases, you can (and should) use a buffered copy, along the lines of:
    while (read some data from the input into a buffer) {
        write data from the buffer to the output
    }
and you're done.
The buffer does not need to be large. Something on the order of 64 KB should be sufficient for most situations.
For the sending end only, you may be able to use the sendfile() system call as an optimization.
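For the sending side of the network-transfer question, a rough sketch of what that could look like on Linux (the helper name send_file and the assumption that sock_fd is an already-connected socket are mine, not from the question):
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Stream a file out of an already-connected socket. sendfile() copies the
       data inside the kernel, so the 350 MB never has to sit in a user-space
       buffer; the page cache handles the reads a chunk at a time. */
    int send_file(int sock_fd, const char *path)
    {
        int file_fd = open(path, O_RDONLY);
        if (file_fd < 0)
            return -1;

        struct stat st;
        if (fstat(file_fd, &st) < 0) {
            close(file_fd);
            return -1;
        }

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
            if (sent <= 0)
                break;               /* error (check errno) or nothing left to send */
        }

        close(file_fd);
        return offset == st.st_size ? 0 : -1;
    }
The plain read/write loop above remains the portable fallback where sendfile() is not available.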

How to estimate the seek speed in a file system

Suppose there is a big file on a 1 TB SSD (reads and writes at 500 MB/s) with an ext4 file system. The file is close to 1 TB in size. How can I estimate the speed of fseek() to the middle of the file? Is it going to take seconds or a few milliseconds? Thanks.
To estimate the latency of fseek, we should break it into two parts: software work and hardware seek latency. The software work is the implementation of the ext4 filesystem (in Linux this sits under the kernel's VFS subsystem), which generates several "random" requests (I/O operations) to the underlying block storage device. The hardware then needs some time to handle each random request.
Classic UNIX filesystems (UFS/FFS), and the Linux filesystems designed after them, use a superblock to describe the filesystem layout on disk, store files as inodes (kept in arrays of inodes at known locations), and store file data in blocks of fixed size (up to 4 KB on Linux). To find an inode from a file name, the OS must read the superblock, find every directory in the path, and read directory data to learn which inode number the file has (ls -i will show you the inodes of the current directory). Then, using data from the superblock, the OS can calculate where the inode is stored and read it.
The inode contains a list of the file's data blocks, usually in tree-like structures; see https://en.wikipedia.org/wiki/Inode_pointer_structure or http://e2fsprogs.sourceforge.net/ext2intro.html
The first part of the file, several dozen KB, is stored in blocks listed directly in the inode (direct blocks; 12 of them in ext2/3/4). For larger files the inode has a pointer to one block containing a list of file blocks (indirectly addressed blocks). If the file is larger still, the next pointer in the inode is used to describe "double indirect blocks": it points to a block that enumerates other blocks, each of which contains pointers to blocks with the actual data. Sometimes a triple indirect block pointer is needed. These trees are rather efficient; at every level the fan-out is on the order of 512-1024 (4 KB block, 4-8 bytes per pointer). So, to access data from the middle of the file, ext2/3/4 may generate up to 4-5 low-level I/O requests (the superblock is cached in RAM, and the inode may be cached too). These requests are not consecutive in address, so they are essentially random seeks on the block device.
Modern Linux filesystems (ext4, XFS) have an optimization for storing huge files, called extents (https://en.wikipedia.org/wiki/Extent_(file_systems)). Extents let the FS describe file placement not as block lists but as an array of file fragments, i.e. (start_block, number_of_consecutive_blocks) pairs. Each fragment typically covers from a fraction of a MB up to 128 MB. The first 4 extents are stored inside the inode; additional extents are again stored as a tree-like structure. So with extents you may need 2-4 random I/O operations to access the middle of the file.
HDDs have slow access times for random requests, because they have to physically move the heads to the correct circular track (and settle the heads exactly on the track, which costs a fraction of a rotation, say 1/8 or 1/16) and then wait up to one full rotation (revolution) of the platters for the requested part of the track to come around. Typical HDD rotation speeds are 5400 and 7200 rpm (revolutions per minute; 90 and 120 rps), or for high-speed enterprise HDDs 10000 and 15000 rpm (about 167 and 250 rps). So the mean time needed to get data from a random position on the disk is around 0.7-1 rotations, and for a typical 7200 rpm HDD (120 rps) that is around 1/120 s = 8 ms (0.008 s). Roughly 8 ms is needed for every random request, and there are up to 4-5 random requests in your situation, so with an HDD you can expect seeks within the file to take up to around 40 ms. (The first seek will cost more; later seeks may be cheaper because parts of the block-pointer tree are cached by the OS, and seeks to the next few blocks are very cheap because Linux can read them right after the first seek was requested.)
An SSD has no rotating or moving parts, and every request to an SSD is handled the same way. The SSD controller resolves the requested block id to an internal NAND chip + block id using its own translation tables and then reads the real data. Reading data from NAND involves checking error correction codes, and sometimes several internal re-reads are needed to read the block correctly. Reads are slower on the cheaper NAND type, TLC, with 3 bits of data per cell (8 charge levels); faster on MLC, with 2 bits per cell (4 levels); and very fast on the now rare SLC, with 1 bit and only 2 levels. Reads are also slower on worn-out SSDs, or on SSDs whose firmware has errors (e.g. wrong models of cell-charge degradation).
The speed of such random accesses on an SSD is very high; it is usually quoted in the SSD specification as something like 50,000-100,000 IOPS (I/O operations per second, usually with 4 KB requests). High IOPS figures may be quoted for deep queues, so the real average random read latency of an SSD at queue depth 1 is 200-300 microseconds per request (0.2-0.3 ms, as of 2014; part of that latency is slow SATA/SCSI emulation, and NVMe SSDs may be faster because they use a simpler software stack). With our 4-5 requests we can estimate an fseek-plus-read on an SSD at a few milliseconds, for example up to 1-1.5 ms, sometimes longer.
You can check the time needed for fseek using strace -T ./your_fseek_program. It reports the time taken by every syscall. To get the real latency, you should look not only at the seek itself but also at the read syscall that follows it. Before each run of this test you may want to flush the kernel caches as root with echo 3 > /proc/sys/vm/drop_caches (https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache).
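For example, a tiny test program along these lines (the file name comes from the command line; the 4 KB read size is arbitrary) can be run under strace -T to see the cost of the seek plus the read that follows it:
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }

        /* Seek to roughly the middle of the file and read one block.
           The fseek itself is cheap; the real cost shows up in the read,
           which has to walk the block-pointer/extent tree and hit the disk. */
        fseek(f, 0, SEEK_END);
        long mid = ftell(f) / 2;
        fseek(f, mid, SEEK_SET);

        char buf[4096];
        fread(buf, 1, sizeof buf, f);

        fclose(f);
        return 0;
    }
On 32-bit systems you would need fseeko/ftello (with _FILE_OFFSET_BITS=64) for a file this large.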
You may also try some I/O benchmarks, like iozone, iometer or fio to estimate seek latencies.

How does file system block size work?

All Linux file systems have a 4 KB block size (as I understand it). Let's say I have 10 MB of hard disk storage. That means I have 2560 blocks available, and let's say I copy 2560 files, each 1 KB in size. Each 1 KB file will occupy one block even though it does not fill the entire block.
So my entire disk is now full, yet I still have 2560 x 3 KB of free space inside those blocks. If I want to store another file of, say, 1 MB, will the file system allow me to store it? Will it write into the free space left in the individual blocks? Is there any concept that addresses this problem?
I would appreciate some clarification.
Thanks in advance.
It is true: you are, in a way, wasting disk space if you store a lot of files that are much smaller than the file system's block size.
The reason the block size is around 4 KB is the amount of metadata associated with blocks. The smaller the block size, the more metadata there is describing the locations of the blocks relative to the actual data, and the more fragmented the worst-case scenario becomes.
However, there are filesystems with different block sizes; most filesystems let you choose the block size, and typically the minimum is 512 bytes. If you are storing a lot of very small files, a small block size might make sense.
http://www.tldp.org/LDP/sag/html/filesystems.html
The XFS filesystem documentation has some comments on how to select the filesystem block size; it is also possible to define the directory block size:
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&srch=&fname=/SGI_Admin/LX_XFS_AG/sgi_html/ch02.html
You should consider setting a logical block size for a filesystem
directory that is greater than the logical block size for the
filesystem if you are supporting an application that reads directories
(with the readdir(3C) or getdents(2) system calls) many times in
relation to how much it creates and removes files. Using a small
filesystem block size saves on disk space and on I/O throughput for
the small files.
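If you want to check which block size a mounted filesystem is actually using, a small sketch using statvfs(3) can report it (the path "." is just an example; stat -f or tune2fs -l give the same information from the shell):
    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        struct statvfs sv;
        if (statvfs(".", &sv) != 0) {
            perror("statvfs");
            return 1;
        }

        /* f_frsize is the fragment (allocation) size used for the size fields;
           f_bsize is the preferred I/O block size. On ext4 both are normally 4096. */
        printf("block size: %lu bytes\n", (unsigned long)sv.f_frsize);
        printf("free space: %llu bytes\n",
               (unsigned long long)sv.f_bavail * sv.f_frsize);
        return 0;
    }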

Writeback of dirty pages in Linux

I have a question regarding the writeback of the dirty pages. If a portion of page data is modified, will the writeback write the whole page to the disk, or only the partial page with modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to find out which bytes of a 4096-byte page have really changed and which ones have not.
Theoretically, the disk driver layer could check whether bytes have changed and skip writing the 512-byte blocks that have not changed.
However, this would mean that, if the blocks are no longer in the disk cache, the page would first have to be read back from the hard disk to see what changed before writing.
I do not think Linux does it that way, because reading the page from disk would cost too much time.
Upon each hardware interrupt, the CPU wants to transfer as much data as the hard disk controller can handle; this size is the block size (or one sector, in Linux):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting too long on a single transfer of a large file can make the system appear unresponsive, so it is logical to break the data into smaller chunks (such as 512 bytes) so that the CPU can handle other tasks while each chunk is transferred. Therefore, whether you changed one byte or 511 bytes, as long as the change is within a single block, all of that block's data gets written at the same time. Throughout the Linux kernel, blocks are flagged dirty (or not) by a single identifier, the sector number, so anything smaller than a sector is too fine-grained to manage efficiently.
All that said, don't forget that the hard disk controller itself also has a minimum block size for write operations.
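One way to observe the page granularity yourself is to dirty a single byte of a file through a shared mapping and write it back with msync(): the kernel tracks dirtiness per page, so at least the whole containing page goes back to disk. A hedged sketch (the file name testfile is made up, and the file is assumed to already be at least one page long):
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile", O_RDWR);    /* assumed to exist and be >= 1 page */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        long page = sysconf(_SC_PAGESIZE);
        char *map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        map[0] = 'X';                 /* dirty a single byte ...              */
        msync(map, page, MS_SYNC);    /* ... writeback happens per dirty page */

        munmap(map, page);
        close(fd);
        return 0;
    }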

Historical perspective to Linux Filesystems

Jonathan Leffler's comment in the question "How can I find the Size of some specified files?" is thought-provoking. I will break it into parts for analysis.
-- files are stored on pages; you normally end up with more space being used than that calculation gives because a 1 byte file (often) occupies one page (of maybe 512 bytes). The exact values vary - it was easier in the days of the 7th Edition Unix file system (though not trivial even then if you wanted to take account of indirect blocks referenced by the inode as well as the raw data blocks).
Questions about the parts
What is the definition of "page"?
Why is the word "maybe" in the after-thought "one page (of maybe 512 bytes)"?
Why was it easier to measure exact sizes in the "7th Edition Unix file system"?
What is the definition of "indirect block"?
How can you have references by two things: "the inode" and "the raw data blocks"?
Historical questions emerged:
I. What is the historical context Leffler is speaking about?
II. Have the definitions changed over time?
I think he means block instead of page, a block being the minimum addressable unit on the filesystem.
Block sizes can vary.
Not sure why, but perhaps the filesystem interface exposed APIs that allowed a more exact measurement.
An indirect block is a block referenced by a pointer.
The inode occupies space (blocks) just as the raw data does; this is what the author meant.
As usual for Wikipedia pages, Block (data storage) is informative despite being far too exuberant about linking all keywords.
In computing (specifically data transmission and data storage), a block is a sequence of bytes or bits, having a nominal length (a block size). Data thus structured is said to be blocked. The process of putting data into blocks is called blocking. Blocking is used to facilitate the handling of the data-stream by the computer program receiving the data. Blocked data is normally read a whole block at a time. Blocking is almost universally employed when storing data to 9-track magnetic tape, to rotating media such as floppy disks, hard disks, optical discs and to NAND flash memory.
Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data, though the block size in file systems may be a multiple of the physical block size. In classical file systems, a single block may only contain a part of a single file. This leads to space inefficiency due to internal fragmentation, since file lengths are often not multiples of block size, and thus the last block of files will remain partially empty. This will create slack space, which averages half a block per file. Some newer file systems attempt to solve this through techniques called block suballocation and tail merging.
There's also a reasonable overview of the classical Unix File System.
Traditionally, hard disk geometry (the layout of blocks on the disk itself) has been described as CHS:
Head: the magnetic reader/writer on each (side of a) platter; it can move in and out to access different cylinders
Cylinder: the set of tracks (one per head) at a given arm position; a track is the ring that passes under one head as the platter rotates
Sector: a constant-sized amount of data stored contiguously on a portion of a track; the smallest unit of data that the drive can deal with
CHS isn't used much these days, as
Hard disks no longer use a constant number of sectors per cylinder. More data is squeezed onto a platter by using a constant arclength per sector rather than a constant rotational angle, so there are more sectors on the outer cylinders than there are on the inner cylinders.
By the ATA specification, a drive may have no more than 2^16 cylinders, 2^4 heads, and 2^8 sectors per track; with 512 B sectors, this is a limit of 128 GiB. Through BIOS INT13, it is not possible to access anything beyond 7.88 GB via CHS anyway.
For backwards-compatibility, larger drives still claim to have a CHS geometry (otherwise DOS wouldn't be able to boot), but getting to any of the higher data requires using LBA addressing.
CHS doesn't even make sense on RAID or non-rotational media.
but for historical reasons, this has affected block sizes: because sector sizes were almost always 512B, filesystem block sizes have always been multiples of 512B. (There is a movement afoot to introduce drives with 1kB and 4kB sector sizes, but compatibility looks rather painful.)
Generally speaking, smaller filesystem block sizes result in less wasted space when storing many small files (unless advanced techniques like tail merging are in use), while larger block sizes reduce external fragmentation and have lower overhead on large disks. The filesystem block size is usually a power of 2, is limited below by the block device's sector size, and is often limited above by the OS's page size.
The page size varies by OS and platform (and, in the case of Linux, can vary by configuration as well). As with block size, smaller page sizes reduce internal fragmentation but require more administrative overhead. A 4 kB page size is common on 32-bit platforms.
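Rather than assuming 4 kB, the running page size can be queried at runtime, for example:
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* sysconf(_SC_PAGESIZE) returns the page size the kernel is using */
        printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
        return 0;
    }
(getconf PAGESIZE gives the same answer from the shell.)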
Now, on to indirect blocks. In the UFS design:
An inode describes a file.
The number of pointers to data blocks that an inode can hold is very limited (fewer than 16); the specific number varies in derived implementations.
For small files, the pointers can directly point to the data blocks that compose a file.
For larger files, there must be indirect pointers, which point to a block which only contains more pointers to blocks. These may be direct pointers to data blocks belonging to the file, or if the file is very large, they may be even more indirect pointers.
Thus the amount of storage required for a file may be greater than just the blocks containing its data, when indirect pointers are in use.
Not all filesystems use this method for keeping track of the data blocks belonging to a file. FAT simply uses a single file allocation table, which is effectively a gigantic series of linked lists, and many modern filesystems use extents.
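As a back-of-the-envelope illustration of the classic indirect-block scheme, this sketch computes how much file data each level can address; the numbers (4 kB blocks, 4-byte block pointers, 12 direct pointers) are roughly the ext2/3 layout and are only an example:
    #include <stdio.h>

    int main(void)
    {
        /* Illustrative values, roughly ext2/3; other filesystems differ. */
        const unsigned long long block  = 4096;       /* filesystem block size        */
        const unsigned long long ptrs   = block / 4;  /* pointers per indirect block  */
        const unsigned long long direct = 12;         /* direct pointers in the inode */

        printf("direct blocks:     %llu KB\n", direct * block / 1024);
        printf("single indirect:  +%llu MB\n", ptrs * block >> 20);
        printf("double indirect:  +%llu GB\n", ptrs * ptrs * block >> 30);
        printf("triple indirect:  +%llu TB\n", ptrs * ptrs * ptrs * block >> 40);
        return 0;
    }
Each extra level of indirection is one more block to read, which is where the "up to 4-5 random I/O requests" estimate in the fseek answer above comes from.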

Resources