Parallel read file with dask - multithreading

I have a question about the delayed decorator. It may seem similar to the question "Dask: How would I parallelize my code with dask delayed?", but even there it is not answered. I have the following code:
import dask
import pandas as pd

@dask.delayed
def remove_unnessasey_data(temp, l1):
    # do some work
    return temp

@dask.delayed
def change_structure(temp):
    # do some work
    return temp1

@dask.delayed
def read_one(filename):
    return pd.read_csv(filename)
and then:
def f(filenames):
    results = []
    for filename in filenames:
        results.append(change_structure(remove_unnessasey_data(
            read_one(filename), l1)))
    return results

results = f(filenames)
results = dask.compute(*results)
According to this it should increase the speed, but the speed is the same as when I read the big file in chunks. Can anybody explain why?
I am aware of the GIL, but according to the documentation this should still enhance the speed.

according to this it should increase the speed
Bollocks. That documentation, for lack of a better word, is wrong in general.
Saying that doing IO in parallel will increase performance in general displays a significant misunderstanding of how most filesystems and disk storage systems work.
Why?
Seek time.
Generally, filesystems store files in chunks that are as contiguous as possible. To read position X in the file, the disk heads first have to be positioned over the track that holds the sector X is in. That takes time. Then the system has to wait until that sector rotates under the disk heads. That again takes time.
It should be obvious why reading a file sequentially from a spinning disk is faster - to read sector N, the disk heads have to first seek to the track that contains sector N. But because files are stored as contiguously as possible, the track that contains sector N also likely contains sector N+1, N+2, N+3, and quite a bit more. Toss in the read ahead caching that both the disk (disks are not usually dumb devices - they're pretty much full-fledged IO computers that have built-in cache systems) and the filesystem do, and sequential reading of a file from a spinning disk tends to minimize the time spent looking for data.
Now try reading in parallel.
Thread A reads sector X. Disk seeks to track, waits for sector X to pass under the heads. While that's happening, thread B tries to read sector Y. Disk finally gets to read sector X, but has a pending command to read sector Y. Now disk has to seek heads to the proper track, perhaps abandoning the readahead it would have done to get sector X+1 for thread A's next read, wait for the heads to move, then wait for sector Y to pass under the heads to read.
Meanwhile, thread C issues a request to read sector Z...
And the disk heads dance all over the disk. Then wait for the proper sector to pass under the heads.
A typical consumer-grade 5,400 RPM SATA disk that nominally supports IO rates of 100 MB/sec can be reduced to a few KILOBYTES per second through such IO patterns.
Reading or writing data in parallel almost never increases speed, especially if you're using standard filesystems on spinning disks.
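If you want to check this on your own storage, a rough benchmark sketch like the following (plain Python, standard library only; the file list is a placeholder) times one sequential pass over a set of files against the same reads issued from a thread pool. On a single spinning disk the threaded version is typically no faster, and often slower.

import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file list - point this at a few large files on the disk you
# care about, and drop the OS page cache between runs for honest numbers.
filenames = ["data1.csv", "data2.csv", "data3.csv", "data4.csv"]

def read_whole(path, chunk=1 << 20):
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                return total
            total += len(buf)

t0 = time.perf_counter()
seq_bytes = sum(read_whole(p) for p in filenames)
t1 = time.perf_counter()

with ThreadPoolExecutor(max_workers=len(filenames)) as pool:
    par_bytes = sum(pool.map(read_whole, filenames))
t2 = time.perf_counter()

print(f"sequential: {seq_bytes / (t1 - t0) / 1e6:.1f} MB/s")
print(f"threaded:   {par_bytes / (t2 - t1) / 1e6:.1f} MB/s")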
You can get better performance using SSD(s) if a single thread's IO doesn't saturate the storage system - not just the disk, but entire path from CPU to/from disk. Many, many motherboards have cheap, slow disk controllers and/or lack IO bandwidth. How many people completely ignore the disk controller or the IO bandwidth of the motherboard when buying a computer?
There are filesystems that do support parallel IO for improved performance. They tend to be proprietary, expensive, and FAST. IBM's Spectrum Scale (originally GPFS) and Oracle's HSM (originally SAMFS/QFS) are two examples.

Related

How to choose chunk size when reading a large file?

I know that reading a file with a chunk size that is a multiple of the filesystem block size is better.
1) Why is that the case? I mean, let's say the block size is 8 KB and I read 9 KB. This means the system has to go and fetch 16 KB and then throw away the extra 7 KB.
Yes, it did some extra work, but does that make much of a difference unless your block size is really huge?
I mean, yes, if I am reading a 1 TB file then this definitely makes a difference.
The other reason I can think of is that the block size refers to a group of sectors on the hard disk (please correct me). So it could be pointing to 8 or 16 or 32 sectors, or just one. So your hard disk would essentially have to do more work if the block points to a lot more sectors? Am I right?
2) So let's say the block size is 8 KB. Do I now read 16 KB at a time? 1 MB? 1 GB? What should I use as a chunk size?
I know available memory is a limitation but apart from that what other factors affect my choice?
Thanks a lot in advance for all the answers.
Theoretically, the fastest I/O occurs when the buffer is page-aligned and its size is a multiple of the system block size.
If the file were stored contiguously on the hard disk, the fastest I/O throughput would be attained by reading cylinder by cylinder. (There might not even be any rotational latency then, since when you read a whole track you don't need to start from the beginning; you can start in the middle and wrap around.) Unfortunately, nowadays it is nearly impossible to do that, since the hard disk firmware hides the physical layout of the sectors and may use replacement sectors that require seeks even while reading a single track. The OS file system may also try to spread the file blocks all over the disk (or at least, all over a cylinder group) to avoid having to do long seeks over big files when accessing small files.
So instead of considering physical tracks, you may try to take the hard disk buffer size into account. Most hard disks have a buffer size of 8 MB, some 16 MB. So reading the file in chunks of up to 1 MB or 2 MB should let the hard disk firmware optimize the throughput without stalling its buffer.
But then, if there are a lot of layers in between, e.g. a RAID, all bets are off.
Really, the best you can do is to benchmark your particular circumstances.
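A minimal sketch of such a benchmark (Python; the path is a placeholder, and you should either use a file larger than RAM or drop the page cache between runs so you measure the disk rather than memory) just times the same sequential read at several chunk sizes:

import time

PATH = "bigfile.bin"  # placeholder

for chunk in (4 << 10, 64 << 10, 1 << 20, 8 << 20):
    start = time.perf_counter()
    total = 0
    with open(PATH, "rb", buffering=0) as f:   # unbuffered raw reads
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    elapsed = time.perf_counter() - start
    print(f"{chunk >> 10:6d} KB chunks: {total / elapsed / 1e6:8.1f} MB/s")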

Fastest way to copy a large file locally

I was asked this in an interview.
I said let's just use cp. Then I was asked to mimic the implementation of cp itself.
So I thought okay, let's open the file, read it one byte at a time, and write it to another file.
Then I was asked to optimize it further. I thought let's read in chunks and write those chunks out. I didn't have a good answer for what a good chunk size would be. Please help me out with that.
Then I was asked to optimize even further. I thought maybe we could read from different threads in parallel and write in parallel.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean), since data from one thread might overwrite another's.
So I thought okay, let's read in parallel, put it in a queue, and then a single thread will take it off the queue and write it to the file one piece at a time.
Does that even improve performance? (Not for small files, where it would just add overhead, but for large files.)
Also, is there an OS trick where I could just point two files to the same data on disk? I know there are symlinks, but apart from that?
"The fastest way to copy a file" is going to depend on the system - all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media - but it doesn't have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into...
In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.
So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.
That will likely get you to 95% or even better of the system's maximum copy performance. Even standard C buffered fopen()/fread()/fwrite() probably gets at least 80-90% of the best possible performance.
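For reference, a minimal sketch of that single-threaded, large-chunk copy (Python, 1 MB buffer; the paths are placeholders). shutil.copyfile gives you essentially the same thing, with platform-specific fast paths on top.

def copy_file(src, dst, chunk=1 << 20):
    # Plain single-threaded copy in 1 MB chunks, as described above.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)

copy_file("input.dat", "output.dat")   # placeholder paths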
You can get the last few percentage points in a few ways. First, by matching your IO block size to a multiple of the file system's block size so that you're always reading full blocks from the filesystem. Second, you can use direct IO to bypass copying your data through the page cache. It will be faster to go disk->userspace or userspace->disk than it is to go disk->page cache->userspace and userspace->page cache->disk, but for single-spinning-disk copy that's not going to matter much, if it's even measurable.
You can use various dd options to test copying a file like this. Try using direct, or notrunc.
You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
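A hedged sketch of a sendfile()-based copy using Python's os.sendfile wrapper (Linux-specific; copying to a regular-file destination needs a reasonably recent kernel, and the paths are placeholders):

import os

def sendfile_copy(src, dst, chunk=1 << 20):
    # Ask the kernel to move the data directly, without a userspace buffer.
    in_fd = os.open(src, os.O_RDONLY)
    out_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        size = os.fstat(in_fd).st_size
        while offset < size:
            sent = os.sendfile(out_fd, in_fd, offset, min(chunk, size - offset))
            if sent == 0:
                break
            offset += sent
    finally:
        os.close(in_fd)
        os.close(out_fd)

sendfile_copy("input.dat", "output.dat")   # placeholder paths

Recent versions of shutil.copyfile already try os.sendfile on Linux, so in practice you often get this behaviour for free.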
Pre-allocating the destination file may or may not improve copy performance - that will depend on the filesystem. If the filesystem doesn't support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
There just isn't all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk - those disk heads will dance, and that will take time.
SSDs are much easier - to get maximal IO rates, just use parallel IO via multiple threads. But again, the "normal" IO will probably be at 80-90% of maximal.
Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software's IO patterns to the characteristics of the storage, and that can be quite complex.
Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It's really easy to drag a physical disk down to a few KB/sec IO rates - read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won't move much data at all while doing that.
In fact, not "killing yourself" with bad IO patterns is a lot more important than spending a lot of effort attempting to get a four or five percentage points faster in optimal cases.
Because if IO is a bottleneck on a simple system, just go buy a faster disk.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean), since data from one thread might overwrite another's.
Multithreading is not normally going to speed up a process like this. Any performance benefit you may gain could be wiped out by the synchronization overhead.
So I thought okay, let's read in parallel, put it in a queue, and then a single thread will take it off the queue and write it to the file one piece at a time.
That's only going to give an advantage on a system that supports asynchronous I/O.
To get the maximum speed you'd want to write in buffer sizes that are multiples of the disk's cluster size (assuming a hard-disk file system). This could be sped up on systems that permit queuing asynchronous I/O (as does, say, Windoze).
You'd also want to create the output file with its initial size being the same as the input file's. That way your write operations never have to extend the file.
Probably the fastest file copy possible would be to memory map the input and output files and do a memory copy. This is especially efficient on systems that treat mapped files as page files.
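A sketch of that mmap-based copy (Python; the paths are placeholders, the destination is preallocated to the source size as suggested above, and the copy is done in windows so a huge file does not need one enormous temporary buffer):

import mmap
import os

def mmap_copy(src, dst, window=64 << 20):
    size = os.path.getsize(src)
    with open(src, "rb") as fin, open(dst, "w+b") as fout:
        fout.truncate(size)                      # preallocate the destination
        if size == 0:
            return
        with mmap.mmap(fin.fileno(), size, access=mmap.ACCESS_READ) as msrc, \
             mmap.mmap(fout.fileno(), size, access=mmap.ACCESS_WRITE) as mdst:
            for off in range(0, size, window):
                # Each slice is an up-to-64 MB bytes object - a compromise
                # between simplicity and not materializing the whole file.
                mdst[off:off + window] = msrc[off:off + window]

mmap_copy("input.dat", "output.dat")   # placeholder paths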

How to estimate the seek speed in file system

Suppose there is a big file on a 1 TB SSD (read and write at 500 MB/sec) with an ext4 file system. This file is close to 1 TB in size. How can I estimate the speed of fseek() to the middle of the file? Is it going to take seconds or a few milliseconds? Thanks.
To estimate the latency of fseek, we should break it into two parts: software work and hardware seek latency. The software work is what the ext4 filesystem implementation does (in Linux this is part of the kernel's VFS subsystem); it generates several "random" requests (I/O operations) to the hardware block storage device. The hardware then needs some time to handle each random request.
Classic UNIX filesystems (UFS/FFS), and the Linux filesystems designed after them, use a superblock to describe the filesystem layout on disk, store files as inodes (there are arrays of inodes in known places), and store file data in blocks of fixed size (up to 4 KB in Linux). To find an inode from a file name, the OS must read the superblock, find every directory in the path, and read data from each directory to find what inode number the file has (ls -i will show you the inodes of the current dir). Then, using data from the superblock, the OS can calculate where the inode is stored and read it.
Inode contains list of file data blocks, usually in tree-like structures, check https://en.wikipedia.org/wiki/Inode_pointer_structure or http://e2fsprogs.sourceforge.net/ext2intro.html
The first part of the file, several dozen KB, is stored in blocks listed directly in the inode (direct blocks; 12 in ext2/3/4). For larger files, the inode has a pointer to one block that holds a list of file blocks (indirectly addressed blocks). If the file is larger still, the next pointer in the inode is used to describe "double indirect blocks": it points to a block that enumerates other blocks, each of which contains pointers to blocks with the actual data. Sometimes a triple indirect block pointer is needed. These trees are rather efficient; at every level there is a fan-out of ~512 (4 KB block, 8 bytes per pointer). So, to access data in the middle of the file, ext2/3/4 may generate up to 4-5 low-level I/O requests (the superblock is cached in RAM, and the inode may be cached too). These requests are not consecutive in address, so they are almost random seeks on the block device.
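Plugging in the answer's own figures (4 KB blocks, roughly 512 pointers per block, 12 direct blocks), a quick back-of-the-envelope calculation shows why a seek into the middle of a roughly 1 TB file lands in the triple-indirect range:

BLOCK = 4 * 1024     # block size
FANOUT = 512         # ~pointers per 4 KB block (8 bytes per pointer)

direct = 12 * BLOCK            # ~48 KB reachable via direct pointers
single = FANOUT * BLOCK        # ~2 MB via the single indirect block
double = FANOUT ** 2 * BLOCK   # ~1 GB via double indirect blocks
triple = FANOUT ** 3 * BLOCK   # ~512 GB via triple indirect blocks

print(f"direct: {direct >> 10} KB, single: {single >> 20} MB, "
      f"double: {double >> 30} GB, triple: {triple >> 30} GB")
# An offset near 500 GB therefore needs the triple-indirect chain:
# several extra metadata block reads before the data block itself,
# which is where the 4-5 random requests above come from.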
Modern Linux filesystems (ext4, XFS) have an optimization for huge file storage, called extents (https://en.wikipedia.org/wiki/Extent_(file_systems)). Extents allow the FS to describe file placement not as block lists, but as an array of file fragment / pointer pairs (start_block, number_of_consecutive_blocks). Each fragment can range from a fraction of a MB up to 128 MB. The first 4 extents are stored inside the inode; more extents are again stored as a tree-like structure. So, with extents you may need 2-4 random I/O operations to access the middle of the file.
HDDs have slow access times for random requests, because they must physically move the heads to the correct circular track (and position the heads exactly on the track, which needs some fraction of a rotation, like 1/8 or 1/16, to complete) and then wait for up to one rotation (revolution) of the platters to get the requested part of the track. Typical rotation speeds of HDDs are 5400 and 7200 rpm (revolutions per minute; 90 rps and 120 rps), or for high-speed enterprise HDDs, 10000 rpm and 15000 rpm (160 rps and 250 rps). So, the mean time needed to get data from a random position on the disk is around 0.7-1 rotations, and for a typical 7200 rpm HDD (120 rps) it is around 1/120 seconds = 8 ms (milliseconds) = 0.008 s. 8 ms is needed to get data for every random request, and there are up to 4-5 random requests in your situation, so with an HDD you may expect times of up to nearly 40 ms to seek within the file. (The first seek will cost more; later seeks may be cheaper as parts of the block pointer tree get cached by the OS, and seeks to the next few blocks are very cheap because Linux can read them right after the first seek was requested.)
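The arithmetic behind those figures, spelled out:

rpm = 7200
seconds_per_rev = 60 / rpm                  # 1/120 s, i.e. ~8.3 ms per rotation
mean_latency_ms = 1000 * seconds_per_rev    # ~1 rotation of waiting per random request
requests = 5                                # metadata walk plus the data block, worst case
print(f"~{mean_latency_ms:.1f} ms per request, ~{requests * mean_latency_ms:.0f} ms total")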
An SSD has no rotating or moving parts, and any request to an SSD is handled in the same way. The SSD controller resolves the requested block id to an internal NAND chip + block id using its own translation tables, and then reads the real data. Reading data from NAND involves checking error-correction codes, and sometimes several internal re-reads are needed to read the block correctly. Reads are slower on the cheaper NAND type, TLC, with 3 bits of data stored in every cell (8 levels); faster on MLC (2 bits of data, 4 levels); and very fast on the now-rare SLC SSDs (1 bit, only 2 levels). Reads are also slower on worn-out SSDs, or on SSDs whose firmware has errors (wrong models of cell-charge degradation).
The speed of such random accesses on an SSD is very high; it is usually declared in the SSD specification as something like 50000 - 100000 IOPS (I/O operations per second, usually 4 KB). High IOPS counts may be declared for deeper queues, so the real average random read latency of an SSD (at QD1) is 200-300 microseconds per request (0.2 - 0.3 ms; as of 2014; some part of the latency is the slow SATA/SCSI emulation; NVMe SSDs may be faster as they use a simpler software stack). With our 4-5 requests we can estimate fseek on an SSD as a few milliseconds, for example up to 1 - 1.5 ms or sometimes longer.
You can check the time needed for fseek using strace -T ./your_fseek_program. It reports the time needed to execute every syscall. But to get the real latency, you should check not only the seek time but also the time of the following read syscall. Before each run of this test you may want to flush the kernel caches with the echo 3 > /proc/sys/vm/drop_caches command as root (https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache).
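If you would rather measure from inside a program than with strace, a small sketch like this (the path is a placeholder; drop the caches first so you measure the storage, not RAM) times one seek plus the read that follows it:

import os
import time

PATH = "bigfile.bin"   # placeholder

fd = os.open(PATH, os.O_RDONLY)
try:
    middle = os.fstat(fd).st_size // 2
    t0 = time.perf_counter()
    os.lseek(fd, middle, os.SEEK_SET)   # the seek itself is just bookkeeping
    data = os.read(fd, 4096)            # this read is what actually hits the disk
    t1 = time.perf_counter()
    print(f"seek+read of {len(data)} bytes took {(t1 - t0) * 1000:.2f} ms")
finally:
    os.close(fd)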
You may also try some I/O benchmarks, like iozone, iometer or fio to estimate seek latencies.

Queueing writes to file system on Linux?

On a very large SMP machine with many CPUs, scripts are run with tens of simultaneous jobs (fewer than the number of CPUs), like this:
some_program -in FIFO1 >OUTPUT1 2>s_p1.log </dev/null &
some_program -in FIFO2 >OUTPUT2 2>s_p2.log </dev/null &
...
some_program -in FIFO40 >OUTPUT40 2>s_p40.log </dev/null &
splitter_program -in real_input.dat -out FIFO1,FIFO2...FIFO40
The splitter reads the input data flat out and distributes it to the FIFOs in order. (Records 1,41,81... to FIFO1; 2,42,82 to FIFO2, etc.) The splitter has low overhead and can pretty much process data as fast as the file system can supply it.
Each some_program processes its stream and writes it to its output file. However, nothing controls the order in which the file system sees these writes. The writes are also very small, on the order of 10 bytes. The script "knows" that there are 40 streams here and that they could be buffered in 20M (or whatever) chunks, and then each chunk written to the file system sequentially. That is, queued writes should be used to maximize write speed to the disks. The OS, however, just sees a bunch of writes at about the same rate on each of the 40 streams.
What happens in practice during a run is that the subprocesses get a lot of CPU time (in top, >80%), then a flush process appears (10% CPU) and all the others drop to low CPU (1%), then it goes back to the higher rate. These pauses go on for several seconds at a time. The flush means that the writes are overwhelming the file caching. Also I think the OS and/or the underlying RAID controller is probably bouncing the physical disk heads around erratically, which is reducing the ultimate write speed to the physical disks. That is just a guess though, since it is hard to say exactly what is happening, as there is a file cache (in a system with over 500 GB of RAM) and a RAID controller between the writes and the disks.
Is there a program or method around for controlling this sort of IO, forcing the file system writes to queue nicely to maximize write speed?
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time. If the data rate in the output streams was uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written because any process that cannot write an output will not read its next input, and that would stall the splitter, as all I/O is synchronous. The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
There are dozens of parameters for tuning the file system, some of those might help. The scheduler was changed from cfq to deadline because the system was locking up for minutes at a time with the former.
If the problem is sheer I/O bandwidth then buffering won't solve anything. In that case, you need to shrink the data or send it to a higher-bandwidth sink to improve and level your performance. One way to do that would be to reduce the number of parallel jobs, as @thatotherguy said.
If in fact the problem is with the number of distinct I/O operations rather than with the overall volume of data, however, then buffering might be a viable solution. I am unfamiliar with the buffer program you mentioned, but I suppose it does what its name suggests. I don't completely agree with your buffering comments, however:
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time.
You don't necessarily need big chunks. It would probably be ideal to chunk at the native block size of the file system, or a small integer multiple thereof. That might be, say, 4096- or 8192-byte chunks.
Moreover, I don't see why you think you have an "orderly queueing of writes" now, or why you're confident that such a thing is needed.
If the data rate in the output streams was uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written because any process that cannot write an output will not read its next input, and that would stall the splitter, as all I/O is synchronous.
Your splitter is writing to FIFOs. Though it may do so serially, that is not "synchronous" in the sense that the data needs to be drained out the other end before the splitter can proceed -- at least, not if the writes do not exceed the size of the FIFOs' buffers. FIFO buffer capacity varies from system to system, adapts dynamically on some systems, and is configurable (e.g. via fcntl()) on some systems. The default buffer size on modern Linux is 64kB.
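On Linux, for example, a pipe or FIFO's capacity can be raised with the F_SETPIPE_SZ fcntl. A hedged sketch (Linux-specific, capped by /proc/sys/fs/pipe-max-size; FIFO1 is just the question's FIFO name, and older Python builds may not define the constant, hence the numeric fallback):

import fcntl
import os

F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)   # raw Linux values as fallback
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

fd = os.open("FIFO1", os.O_RDONLY | os.O_NONBLOCK)
try:
    fcntl.fcntl(fd, F_SETPIPE_SZ, 1 << 20)            # request a 1 MB pipe buffer
    print("pipe capacity now:", fcntl.fcntl(fd, F_GETPIPE_SZ), "bytes")
finally:
    os.close(fd)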
The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
I think this is a problem that pretty much solves itself. If one of the buffers backs up enough to block the splitter, then that ensures that the competing processes will, before too long, give the blocked buffer the opportunity to write. But this is also why you don't want enormous buffers -- you want to interleave disk I/O from different processes relatively finely to try to keep everything going.
The alternative to an external buffer program is to modify your processes to perform internal buffering. That might be an advantage because it removes a whole set of pipes (to an external buffering program) from the mix, and it lightens the process load on the machines. It does mean modifying your working processing program, though, so perhaps it would be better to start with external buffering to see how well that does.
If the problem is that your 40 streams each have a high data rate, and your RAID controller cannot write to physical disk fast enough, then you need to redesign your disk system. Basically, divide it into 40 RAID-1 mirrors and write one file to each mirror set. That makes the writes sequential for each stream, but requires 80 disks.
If the data rate isn't the problem then you need to add more buffering. You might need a pair of threads. One thread to collect the data into memory buffers and another thread to write it into data files and fsync() it. To make the disk writes sequential it should fsync each output file one at a time. That should result in writing large sequential chunks of whatever your buffer size is. 8 MB maybe?
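A minimal sketch of that collector/writer split (Python threads and a bounded queue; the stream names, record source, and the 8 MB buffer size are all placeholders):

import os
import queue
import threading

BUF_SIZE = 8 * 1024 * 1024          # flush ~8 MB at a time, as suggested above
write_q = queue.Queue(maxsize=8)    # bounded, so collectors cannot run far ahead

def collector(stream_name, records):
    # Accumulate small records into one big buffer, then hand it to the writer.
    buf = bytearray()
    for rec in records:             # "records" stands in for reading a FIFO
        buf += rec
        if len(buf) >= BUF_SIZE:
            write_q.put((stream_name, bytes(buf)))
            buf.clear()
    if buf:
        write_q.put((stream_name, bytes(buf)))

def writer():
    # Single writer: one large append + fsync at a time, so the disk sees
    # big sequential writes instead of interleaved 10-byte ones.
    while True:
        item = write_q.get()
        if item is None:            # sentinel: all collectors are done
            return
        name, chunk = item
        with open(name + ".out", "ab") as f:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())

# Example wiring with fake data for two streams:
streams = {f"stream{i}": (b"x" * 10 for _ in range(200_000)) for i in range(2)}
collectors = [threading.Thread(target=collector, args=(n, r)) for n, r in streams.items()]
w = threading.Thread(target=writer)
w.start()
for t in collectors:
    t.start()
for t in collectors:
    t.join()
write_q.put(None)
w.join()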

Consistency guarantee of file system regarding sequential write

My program (only 1 process and 1 thread) sequentially writes n consecutive chunks of data to a file on an HDD (a regular kind of HDD) using the plain old write system call. It's like some kind of append-only log file.
After a system crash (power failure, not HDD failure), I read back and verified that chunks[i] (0 < i < n) had been entirely written to disk (by checking its length). Maybe the content of the chunk is not checksum-correct, but the whole of chunks[i] still sits stably on the surface of the magnetic disk.
Is it safe for me to assume all other chunks before chunks[i] were entirely written too? Or could there exist one (or many) chunks[j] (0 < j < i) that is only partly written (or not written at all) to disk? I know that random writes can be reordered to improve disk throughput, but can sequential writes be reordered too?
Yes, writes that appear sequential (to you) can be reordered before being written to disk, primarily because the order seen by your code (or even the OS) may not correspond directly to locations on the disk.
Although IDE disks did (at one time) use addressing based on specifying a track, head and sector that would hold a piece of data, they've long since converted to a system where you just have some number of sectors, and it's up to the disk to arrange those in an order that makes sense. It usually does a pretty good job, but in some cases (especially if a sector has gone bad and been replaced by a spare sector) it may make the most sense to write sectors out of order.
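So if your recovery logic depends on "chunks[i] is durable, therefore every chunk before it is durable", you have to impose that ordering yourself, typically by flushing before moving on. A hedged sketch of the pattern (the path and chunks are placeholders, and disk write caches and filesystem barrier settings can still undermine it):

import os

def append_chunks(path, chunks, sync_every=1):
    # Append-only log writer: fdatasync() after every chunk (or batch of
    # chunks) so that a chunk can only be on disk if its predecessors are.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for i, chunk in enumerate(chunks, 1):
            os.write(fd, chunk)
            if i % sync_every == 0:
                os.fdatasync(fd)
    finally:
        os.fdatasync(fd)
        os.close(fd)

append_chunks("app.log", [b"record-%d\n" % i for i in range(10)])   # placeholder data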
