Is reading from disk with two threads faster? - multithreading

Let's say I have a big 200 GiB file.
It's stored on disk in one piece (sequential and not fragmented).
The disk is connected via SATA III or PCIe (M.2).
The buffer size (when calling read) is already large (1 GiB),
so increasing it further won't make reading faster.
Will reading the file with two threads be faster?
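For concreteness, here is a minimal sketch (my own illustration, not from the question, with error handling abbreviated) of what "reading with two threads" could look like: each thread reads its own half of the file with pread(), so the threads never share a file offset.

/* Sketch: read one large file with two threads, each covering half of the
 * file via pread() at independent offsets. Illustration only, not a benchmark. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (64 * 1024 * 1024)   /* 64 MiB per pread() call, an arbitrary choice */

struct part { int fd; off_t start; off_t len; };

static void *read_part(void *arg)
{
    struct part *p = arg;
    char *buf = malloc(CHUNK);
    off_t off = p->start, end = p->start + p->len;

    while (buf && off < end) {
        size_t want = (end - off) < CHUNK ? (size_t)(end - off) : CHUNK;
        ssize_t got = pread(p->fd, buf, want, off);  /* thread-safe: no shared offset */
        if (got <= 0)
            break;
        off += got;            /* normally the data in buf would be processed here */
    }
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    struct part a = { fd, 0, st.st_size / 2 };
    struct part b = { fd, st.st_size / 2, st.st_size - st.st_size / 2 };

    pthread_t t1, t2;
    pthread_create(&t1, NULL, read_part, &a);
    pthread_create(&t2, NULL, read_part, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    close(fd);
    return 0;
}

Compile with -pthread. Whether this beats a single sequential reader depends on the device, which is exactly the question above.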

Related

File Copy using mmap

Problem - Transfer a file of size 350 MB from Linux networking box 1 to another Linux networking box 2 on the fly, dynamically, when box 2 requests it.
The system has limited memory of 1 GB, and the 350 MB file is stored on disk. The system is actually busy doing a lot of other things.
What is the best approach to transfer the file on the fly, on demand? If I read the complete file from disk and store it in RAM before transferring, that would take up a lot of memory. If I want to avoid that, would using mmap to transfer the file help? How would mmap fit in this scenario?
In most cases, you can (and should) use a buffered copy, along the lines of:
while (read some data from the input into a buffer) {
    write data from the buffer to the output
}
and you're done.
The buffer does not need to be large. Something on the order of 64 KB should be sufficient for most situations.
For the sending end only, you may be able to use the sendfile() system call as an optimization.
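As a concrete (hypothetical) version of the loop above, a buffered copy between two already-open file descriptors could look like this:

/* Buffered copy: read into a modest fixed-size buffer, then write it out,
 * handling partial writes. Error handling is kept minimal. */
#include <stdio.h>
#include <unistd.h>

int copy_fd(int in_fd, int out_fd)
{
    char buf[64 * 1024];                      /* ~64 KB, as suggested above */
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                     /* write() may be partial */
            ssize_t w = write(out_fd, buf + off, n - off);
            if (w < 0) {
                perror("write");
                return -1;
            }
            off += w;
        }
    }
    if (n < 0) {
        perror("read");
        return -1;
    }
    return 0;
}

On Linux, the sending side can often replace this loop with sendfile(out_fd, in_fd, NULL, count) when out_fd is a socket, avoiding the extra copy through user space.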

How to estimate the seek speed in a file system

Suppose there is a big file on a 1 TB SSD (read and write at 500 MB/s) with an ext4 file system. This file is close to 1 TB in size. How can I estimate the speed of fseek() to the middle of the file? Is it going to take seconds or a few milliseconds? Thanks.
To estimate the latency of fseek, we should break it into two parts: software work and hardware seek time. The software work is done by the ext4 filesystem implementation (in Linux this sits under the VFS subsystem of the kernel), which will generate several "random" requests (I/O operations) to the hardware block storage device. The hardware then needs some time to handle each random request.
Classic UNIX filesystems (UFS/FFS) and the Linux filesystems designed after them use a superblock to describe the fs layout on disk, store files as inodes (arrays of inodes live in known places), and store file data in blocks of fixed size (up to 4 KB in Linux). To find the inode from a file name, the OS must read the superblock, find every directory in the path, and read the directory data to learn what inode number the file has (ls -i will show you the inodes of the current dir). Then, using data from the superblock, the OS can calculate where the inode is stored and read it.
The inode contains the list of the file's data blocks, usually in a tree-like structure; see https://en.wikipedia.org/wiki/Inode_pointer_structure or http://e2fsprogs.sourceforge.net/ext2intro.html
The first part of the file, a few dozen KB (48 KB with the 12 direct blocks of ext2/3/4), is stored in blocks listed directly in the inode (direct blocks). For larger files the inode has a pointer to one block holding a list of file blocks (indirectly addressed blocks). If the file is larger still, the next pointer in the inode is used to describe "double indirect blocks": it points to a block that enumerates other blocks, each of which contains pointers to blocks with the actual data. Sometimes a triple indirect block pointer is needed. These trees are rather efficient: at every level one block holds ~1024 pointers (4 KB block, 4 bytes per pointer). So, to access data in the middle of the file, ext2/3/4 may generate up to 4-5 low-level I/O requests (the superblock is cached in RAM, and the inode may be cached too). These requests are not consecutive in their addresses, so they are essentially random seeks on the block device.
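To put rough numbers on that tree, here is a back-of-the-envelope sketch (my own, assuming 4 KB blocks and 4-byte block pointers, i.e. ~1024 pointers per block) of how much file data each level of the block map can address:

/* Coverage of the classic ext2/3/4 block map, assuming 4 KB blocks and
 * 4-byte block pointers (1024 pointers per indirect block). */
#include <stdio.h>

int main(void)
{
    const long long blk = 4096, ptrs = 4096 / 4;       /* 1024 pointers per block */

    long long direct = 12 * blk;                       /* 12 direct blocks        */
    long long single = ptrs * blk;                     /* single indirect         */
    long long dbl    = ptrs * ptrs * blk;              /* double indirect         */
    long long triple = ptrs * ptrs * ptrs * blk;       /* triple indirect         */

    printf("direct blocks   : %lld KB\n", direct / 1024);                          /* 48 KB */
    printf("single indirect : %lld MB\n", single / (1024 * 1024));                 /* 4 MB  */
    printf("double indirect : %lld GB\n", dbl / (1024 * 1024 * 1024));             /* 4 GB  */
    printf("triple indirect : %lld TB\n", triple / (1024LL * 1024 * 1024 * 1024)); /* 4 TB  */
    return 0;
}

So a seek into the middle of a file that is hundreds of GB in size lands in the double or triple indirect range, which is where the extra random metadata reads come from.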
Modern Linux filesystems (ext4, XFS) have an optimization for huge files called extents (https://en.wikipedia.org/wiki/Extent_(file_systems)). Extents allow the FS to describe file placement not as block lists, but as an array of file fragments, i.e. (start_block, number_of_consecutive_blocks) pairs. Each fragment typically covers from a fraction of a MB up to 128 MB. The first 4 extents are stored inside the inode; more extents are again stored as a tree-like structure. So, with extents you may need 2-4 random I/O operations to access the middle of the file.
HDDs have slow access times for random requests, because they must physically move the heads to the correct circular track (and position the heads exactly on the track, which takes some fraction of a rotation, like 1/8 or 1/16) and then wait for up to one full rotation (revolution) of the platters for the requested part of the track to come around. Typical rotation speeds of HDDs are 5400 and 7200 rpm (revolutions per minute; 90 rps and 120 rps), or 10000 rpm and 15000 rpm (about 160 rps and 250 rps) for high-speed enterprise HDDs. So the mean time needed to get data from a random position on the disk is around 0.7-1 rotations, and for a typical 7200 rpm HDD (120 rps) it is around 1/120 s = 8 ms (milliseconds) = 0.008 s. Those 8 ms are needed for every random request, and there are up to 4-5 random requests in your situation, so with an HDD you may expect times of up to roughly 40 ms to seek within the file. (The first seek will cost more; later seeks may be cheaper because part of the block pointer tree is cached by the OS, and seeks to the next few blocks are very cheap because Linux can read them right after the first seek is serviced.)
An SSD has no rotating or moving parts, and every request is handled in the same way. The SSD controller resolves the requested block id to an internal NAND chip + block id using its own translation tables and then reads the real data. Reading data from NAND involves checking error correction codes, and sometimes several internal rereads are needed to read the block correctly. Reads are slower with the cheaper NAND type, TLC, which stores 3 bits of data per cell (8 levels); faster with MLC (2 bits, 4 levels); and very fast with the now rare SLC (1 bit, only 2 levels). Reads are also slower on worn-out SSDs, or on SSDs whose firmware has errors (wrong models of cell charge degradation).
The speed of such random accesses on an SSD is very high; it is usually declared in the SSD specification as something like 50000-100000 IOPS (I/O operations per second, usually with 4 KB requests). High IOPS counts may be declared for deeper queues, so the real average random read latency of an SSD (at QD1) is 200-300 microseconds per request (0.2-0.3 ms; as of 2014; part of that latency is the slow SATA/SCSI emulation, and NVMe SSDs may be faster because they use a simpler software stack). With our 4-5 requests we can estimate fseek on an SSD as a few milliseconds, for example up to 1-1.5 ms, or sometimes longer.
You can check the time needed for fseek using strace -T ./your_fseek_program. It will report the time needed to execute every syscall. To get the real latency, you should check not only the seek time but also the time of the following read syscall. Before each run of this test you may want to flush the kernel caches with echo 3 > /proc/sys/vm/drop_caches, run as root (https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache).
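The name your_fseek_program above is just a placeholder; a hypothetical version of such a test could look like this (seek to the middle of the file, then read one byte so the I/O actually happens):

/* Time a single seek+read in the middle of a file. Under strace -T the
 * interesting syscalls are the lseek and the following read. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    char c;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    lseek(fd, st.st_size / 2, SEEK_SET);   /* the seek itself only sets the offset... */
    read(fd, &c, 1);                       /* ...the read triggers the real disk access */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("seek+read took %.3f ms\n", ms);
    close(fd);
    return 0;
}

Run it right after dropping the caches as described above, otherwise you will mostly be measuring the page cache.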
You may also try some I/O benchmarks, like iozone, iometer or fio to estimate seek latencies.

How does file system block size work?

All Linux file systems have a 4 KB block size. Let's say I have 10 MB of hard disk storage. That means I have 2560 blocks available, and let's say I copy 2560 files, each 1 KB in size. Each 1 KB file will occupy a whole block even though it does not fill it.
So my entire disk is now filled, yet I still have 2560 x 3 KB of unused space inside the blocks. If I want to store another file of, say, 1 MB, will the file system allow me to store it? Will it write into the free space left in the individual blocks? Is there any concept addressing this problem?
I would appreciate some clarification.
Thanks in advance.
It is true: you are, in a way, wasting disk space if you store a lot of files that are much smaller than the smallest block size of the file system.
The reason the block size is around 4 KB is the amount of metadata associated with blocks: the smaller the block size, the more metadata there is about the locations of the blocks compared to the actual data, and the more fragmented the worst-case scenario becomes.
However, there are filesystems with different block sizes; most filesystems let you define the block size, and typically the minimum is 512 bytes. If you are storing a lot of very small files, a small block size might make sense.
http://www.tldp.org/LDP/sag/html/filesystems.html
The XFS filesystem documentation has some comments on how to select the filesystem block size - it is also possible to define the directory block size:
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&srch=&fname=/SGI_Admin/LX_XFS_AG/sgi_html/ch02.html
You should consider setting a logical block size for a filesystem
directory that is greater than the logical block size for the
filesystem if you are supporting an application that reads directories
(with the readdir(3C) or getdents(2) system calls) many times in
relation to how much it creates and removes files. Using a small
filesystem block size saves on disk space and on I/O throughput for
the small files.
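To see the effect on your own system, here is a small sketch (my own illustration) that compares a file's logical size with the space actually allocated for it; for a 1 KB file on a 4 KB-block filesystem it will typically report 4096 bytes allocated:

/* Compare logical file size with allocated space. st_blocks is counted in
 * 512-byte units; st_blksize is the preferred I/O block size. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    struct stat st;
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("logical size  : %lld bytes\n", (long long)st.st_size);
    printf("allocated     : %lld bytes (%lld x 512-byte units)\n",
           (long long)st.st_blocks * 512, (long long)st.st_blocks);
    printf("I/O block size: %lld bytes\n", (long long)st.st_blksize);
    return 0;
}

The same numbers are visible from the shell with stat(1) and du(1).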

Writeback of dirty pages in Linux

I have a question regarding the writeback of the dirty pages. If a portion of page data is modified, will the writeback write the whole page to the disk, or only the partial page with modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to find out which bytes of a 4096-byte page have really changed and which have not.
Theoretically, the disk driver layer could check which bytes have changed and skip writing the 512-byte blocks that have not changed.
However, this would mean that - if the blocks are no longer in the disk cache - the page must first be read back from the hard disk, to check what has changed, before writing.
I do not think Linux does it that way, because reading the page back from disk would cost too much time.
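To make the page granularity concrete, here is a small sketch (assuming a pre-existing file at the hypothetical path data.bin that is at least one page long): it dirties a single byte through a shared mapping and flushes it, and writeback still operates on the whole 4 KB page containing that byte.

/* Dirty one byte of a file-backed shared mapping and flush it. The kernel
 * tracks dirtiness per page, so the whole page is written back. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);           /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    long pagesz = sysconf(_SC_PAGESIZE);         /* 4096 on typical x86 */
    char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[100] ^= 1;                                 /* modify a single byte -> page becomes dirty */
    msync(p, pagesz, MS_SYNC);                   /* synchronous writeback of that page */

    munmap(p, pagesz);
    close(fd);
    return 0;
}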
On each hardware interrupt, the CPU wants to write as much data as the hard disk controller can handle - this size is defined as the block size (or ONE sector, in Linux):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting too long on a single transfer of a large file can make the system appear unresponsive, so it is logical to break the data into smaller chunks (like 512 bytes) so that the CPU can handle other tasks while each chunk is transferred. Therefore, whether you changed one byte or 511 bytes, as long as the change is within that single block, all of the block's data gets written at the same time. Throughout the Linux kernel, flagging blocks as dirty (to be written) or not goes by a single unique identifier, the sector number, so anything smaller than the sector size is too fine-grained to manage efficiently.
All that said, do not forget that the hard disk controller itself also has a minimum block size for write operations.

How does the CPU read from the disk?

I'm a bit confused about the whole idea of I/O; I want to know how the CPU reads from the disk (a SATA disk, for example).
When a program using read()/write() is compiled with a reference to a specific file and the CPU encounters this reference, does it read from the disk directly (via memory-mapped I/O ports)? Or does the data go through RAM and then get written back to disk?
I'd suggest reading:
http://www.makelinux.net/books/ulk3/understandlk-CHP-13-SECT-1
With a supplement of:
http://en.wikipedia.org/wiki/Direct_memory_access
With regards to buffering in RAM: most programming languages and operating systems buffer at least part of I/O operations (read and write) to memory. This is usually done asynchronously: i.e. a buffer is created, filled, and then processed. For a read, the CPU would (working with the disk controller) create IO instructions to fetch data and a place to put it in memory, fill that space, and then present its contents to the program making the request. For a write request, this would be queuing write operations and their associated data and then sending them off to the IO controller and eventually the disk to be executed. Buffering can happen in multiple places: on the CPU's caches, in RAM, (sometimes) on the disk controller, or on the hard disk itself. How much buffering is done, and exactly how the abstract sequence of operations I've mentioned is handled, differs depending on your hardware architecture, OS, and task.
As "Operating System Concepts" puts it, main memory is the only large storage area (millions to billions of bytes) that the processor can access directly.
So if you want to run a program or manipulate some data, they (the program and the data) must be in main memory.
