My program (only 1 process and 1 thread) sequentially writes n consecutive chunks of data to a file on an HDD (a regular kind of HDD) using the plain old write system call. It's like some kind of append-only log file.
After a system crash (power failure, not HDD failure), I read back and verified that chunks[i] (0 < i < n) had been entirely written to disk (by checking its length). Maybe the content of the chunk is not checksum-correct, but the whole of chunks[i] still sits stably on the surface of the magnetic disk.
Is it safe for me to assume that all other chunks before chunks[i] were entirely written down too? Or could there be one (or many) chunks[j] (0 < j < i) that was only partly written to disk, or not written at all? I know that random writes can be reordered to improve disk throughput, but can sequential writes be reordered too?
Yes, writes that appear sequential (to you) can be reordered before being written to disk, primarily because the order seen by your code (or even the OS) may not correspond directly to locations on the disk.
Although IDE disks did (at one time) use addressing based on specifying a track, head and sector that would hold a piece of data, they've long since converted to a system where you just have some number of sectors, and it's up to the disk to arrange those in an order that makes sense. It usually does a pretty good job, but in some cases (especially if a sector has gone bad and been replaced by a spare sector) it may make the most sense to write sectors out of order.
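If you need chunks[0..i] to be durable before chunks[i+1] is written, you have to ask for that explicitly rather than rely on write ordering. A minimal sketch, assuming an append-only log like yours (the file name and chunk contents are placeholders, not from your program), that calls fdatasync() after each chunk so a later chunk cannot reach the platters while an earlier one has not:

/* Hedged sketch: append a chunk and force it to stable storage before
 * the caller writes the next one. File name and data are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int append_chunk(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* write() may be short */
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    /* Without an explicit flush, the kernel and the drive's write cache
     * are free to delay and reorder the data on its way to the platters. */
    return fdatasync(fd);
}

int main(void)
{
    int fd = open("append.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    const char chunk[] = "one log record";
    if (append_chunk(fd, chunk, sizeof chunk - 1) != 0)
        perror("append_chunk");
    close(fd);
    return 0;
}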
I have a question about the dask.delayed decorator. It may seem similar to the question "Dask: How would I parallelize my code with dask delayed?", but even there it is not answered. I have the following code:
import dask
import pandas as pd

@dask.delayed
def remove_unnessasey_data(temp, l1):
    # do some work
    return temp

@dask.delayed
def change_structure(temp):
    # do some work
    return temp1

@dask.delayed
def read_one(filename):
    return pd.read_csv(filename)
and then:
def f(filenames):
    results = []
    for filename in filenames:
        results.append(change_structure(
            remove_unnessasey_data(read_one(filename), l1)))
    return results

results = f(filenames)
result = dask.compute(*results)
according to this it should increase the speed, but the speed is the same as when I simply read the big file in chunks. Can anybody explain why?
I am aware of the GIL, but according to the documentation it should enhance the speed.
according to this it should increase the speed
Bollocks. That documentation, for lack of a better word, is wrong in general.
Saying that doing IO in parallel will increase performance in general displays a significant misunderstanding of how most filesystems and disk storage systems work.
Why?
Seek time.
Generally, filesystems store files in chunks that are as contiguous as possible. To read position X in the file, the disk heads first have to be positioned over the track that holds the sector that X is in. That takes time. Then the system has to wait until that sector rotates under the disk heads. That again takes time.
It should be obvious why reading a file sequentially from a spinning disk is faster - to read sector N, the disk heads have to first seek to the track that contains sector N. But because files are stored as contiguously as possible, the track that contains sector N also likely contains sector N+1, N+2, N+3, and quite a bit more. Toss in the read ahead caching that both the disk (disks are not usually dumb devices - they're pretty much full-fledged IO computers that have built-in cache systems) and the filesystem do, and sequential reading of a file from a spinning disk tends to minimize the time spent looking for data.
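To make the contrast concrete, here is a hedged sketch of the pattern that does work well on a spinning disk: a single thread reading sequentially in large chunks, with a readahead hint. The file name and buffer size are placeholders, not anything from the question.

/* Hedged sketch: single-threaded sequential read in 1 MB chunks.
 * The file name is a placeholder; error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big_input.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the kernel we will read front-to-back, which encourages
     * aggressive readahead into the page cache. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    size_t bufsize = 1 << 20;            /* 1 MB per read() */
    char *buf = malloc(bufsize);
    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0) {
        /* process n bytes of buf here */
    }

    free(buf);
    close(fd);
    return 0;
}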
Now try reading in parallel.
Thread A reads sector X. Disk seeks to track, waits for sector X to pass under the heads. While that's happening, thread B tries to read sector Y. Disk finally gets to read sector X, but has a pending command to read sector Y. Now disk has to seek heads to the proper track, perhaps abandoning the readahead it would have done to get sector X+1 for thread A's next read, wait for the heads to move, then wait for sector Y to pass under the heads to read.
Meanwhile, thread C issues a request to read sector Z...
And the disk heads dance all over the disk. Then wait for the proper sector to pass under the heads.
A typical consumer-grade 5,400 RPM SATA disk that nominally supports IO rates of 100 MB/sec can be reduced to a few KILOBYTES per second through such IO patterns.
Reading or writing data in parallel almost never increases speed, especially if you're using standard filesystems on spinning disks.
You can get better performance using SSD(s) if a single thread's IO doesn't saturate the storage system - not just the disk, but the entire path from CPU to/from disk. Many, many motherboards have cheap, slow disk controllers and/or lack IO bandwidth. How many people completely ignore the disk controller or the IO bandwidth of the motherboard when buying a computer?
There are filesystems that do support parallel IO for improved performance. They tend to be proprietary, expensive, and FAST. IBM's Spectrum Scale (originally GPFS) and Oracle's HSM (originally SAMFS/QFS) are two examples.
I was asked this in an interview.
I said let's just use cp. Then I was asked to mimic the implementation of cp itself.
So I thought, okay, let's open the file, read it one byte at a time, and write that to another file.
Then I was asked to optimize it further. I thought, let's read in chunks and write those chunks. But I didn't have a good answer for what a good chunk size would be. Please help me out with that.
Then I was asked to optimize even further. I thought maybe we could read from different threads in parallel and write in parallel.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean), since data from one thread might overwrite another's.
So I thought, okay, let's read in parallel, put the chunks in a queue, and then have a single thread take them off the queue and write them to the file one by one.
Does that even improve performance? (I mean not for small files, where it would just be overhead, but for large files.)
Also, is there an OS trick where I could just point two files to the same data on disk? I mean, I know there are symlinks, but apart from that?
"The fastest way to copy a file" is going to depend on the system - all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media - but it doesn't have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into...
In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.
So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.
That will likely get you to 95% or even better of the system's maximum copy performance. Even standard C buffered fopen()/fread()/fwrite() probably gets at least 80-90% of the best possible performance.
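As a baseline, here is a hedged sketch of that single-threaded, large-chunk copy loop (paths are placeholders; this is the plain read()/write() version, without direct IO or block-size matching):

/* Hedged sketch: copy a file in 1 MB chunks with plain read()/write().
 * Paths are placeholders and error handling is minimal. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int copy_file(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0)
        return -1;

    size_t bufsize = 1 << 20;                     /* 1 MB chunks */
    char *buf = malloc(bufsize);
    ssize_t n;
    while ((n = read(in, buf, bufsize)) > 0) {
        ssize_t off = 0;
        while (off < n) {                         /* handle short writes */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) { n = -1; break; }
            off += w;
        }
        if (n < 0)
            break;
    }

    free(buf);
    close(in);
    close(out);
    return n < 0 ? -1 : 0;
}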
You can get the last few percentage points in a few ways. First, by matching your IO block size to a multiple of the file system's block size so that you're always reading full blocks from the filesystem. Second, you can use direct IO to bypass copying your data through the page cache. It will be faster to go disk->userspace or userspace->disk than it is to go disk->page cache->userspace and userspace->page cache->disk, but for single-spinning-disk copy that's not going to matter much, if it's even measurable.
You can use various dd options to test copying a file like this. Try oflag=direct, or conv=notrunc.
You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
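A hedged sketch of what a sendfile()-based copy might look like on Linux (paths are placeholders; before kernel 2.6.33 the destination had to be a socket rather than a regular file):

/* Hedged sketch: copy via sendfile() so the data never enters userspace.
 * Paths are placeholders and error handling is minimal. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int copy_with_sendfile(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) < 0)
        return -1;

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* The kernel copies straight from the source file's page cache
         * to the destination file. */
        ssize_t n = sendfile(out, in, NULL, (size_t)remaining);
        if (n <= 0)
            return -1;
        remaining -= n;
    }
    close(in);
    close(out);
    return 0;
}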
Pre-allocating the destination file may or may not improve copy performance - that will depend on the filesystem. If the filesystem doesn't support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
There just isn't all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk - those disk heads will dance, and that will take time.
SSDs are much easier - to get maximal IO rates, just use parallel IO via multiple threads. But again, the "normal" IO will probably be at 80-90% of maximal.
Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software's IO patterns to the characteristics of the storage, and that can be quite complex.
Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It's really easy to drag a physical disk down to a few KB/sec IO rates - read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won't move much data at all while doing that.
In fact, not "killing yourself" with bad IO patterns is a lot more important than spending a lot of effort attempting to get a four or five percentage points faster in optimal cases.
Because if IO is a bottleneck on a simple system, just go buy a faster disk.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean), since data from one thread might overwrite another's.
Multithreading is not normally going to speed up a process like this. Any performance benefit you may gain could be wiped out by the synchronization overhead.
So I thought, okay, let's read in parallel, put the chunks in a queue, and then have a single thread take them off the queue and write them to the file one by one.
That's only going to give an advantage on a system that supports asynchronous I/O.
To get the maximum speed you'd want to write in buffer sizes that are multiples of the disk's cluster size (assuming a disk-based file system). This could be sped up on systems that permit queuing asynchronous I/O (as, say, Windows does).
You'd also want to create the output file with its initial size the same as the input file's. That way your write operations never have to extend the file.
Probably the fastest file copy possible would be to memory-map the input and output files and do a memory copy. This is especially efficient on systems that treat mapped files as page files.
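A hedged sketch of that memory-mapped approach (paths are placeholders; zero-length files and most error handling are ignored):

/* Hedged sketch: map the source read-only, size the destination, map it
 * writable, and memcpy between the mappings. Paths are placeholders. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int copy_with_mmap(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) < 0)
        return -1;

    /* The destination must already have the right size before it can
     * be written through a mapping. */
    if (ftruncate(out, st.st_size) < 0)
        return -1;

    void *s = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
    void *d = mmap(NULL, st.st_size, PROT_WRITE, MAP_SHARED, out, 0);
    if (s == MAP_FAILED || d == MAP_FAILED)
        return -1;

    memcpy(d, s, st.st_size);   /* pages fault in and are written back lazily */

    munmap(s, st.st_size);
    munmap(d, st.st_size);
    close(in);
    close(out);
    return 0;
}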
I'm working with a C program that does a bunch of things, but at some point it fills an entire HDD with multiple fixed-size fwrite calls to a single file.
The calls are something like this:
fwrite(some_memory,size_element,total_elements,file);
When I measure the wall time of these calls, each call takes a bit longer than the previous one. For example, writing in chunks of 900 MB, the first call (with an empty disk) finishes within 7 seconds, but the last ones take somewhere between 10 and 11 seconds (with the disk almost at full capacity).
Is this expected behavior? Is there any way of getting consistent write times independent of the disk's current capacity?
I'm using an ext4 filesystem on a 2 TB WD Green drive.
I'd say this is to be expected, as your early calls are most likely satisfied by the kernel's writeback cache and thus return slightly quicker, because at the time fwrite returns not all of the data has reached the disk yet. However, I don't know how much memory your system has compared to the 900 MB of data you're trying to write, so this is a guess...
If the kernel's cache becomes full (e.g. because the disk can't keep up), then your userspace program is made to block until the cache is sufficiently empty to accept a bit more data, leaky-bucket style. Only once all the data has gone into the cache can the fwrite complete. However, at that point you are likely making another fwrite call that tops the cache up again, and that subsequent call is forced to wait a bit longer because the cache has not been fully emptied. I'd imagine you would reach a fixed point though...
To see whether caches really are behind the behaviour, you could issue an fsync after each fwrite (destroying your performance), take the time from fwrite submission to fsync completion, and see if the variance is still so large.
Another thing you could do that might help is to preallocate the full size of the file up front so that the filesystem isn't forced to keep regrowing it as new data is appended to the end (this should cut down on metadata operations and fight fragmentation too).
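For example, something like the following could be run once before the write loop. This is a hedged sketch, not part of your program; the path and size are placeholders:

/* Hedged sketch: reserve the file's full size up front so the
 * filesystem doesn't have to grow it on every fwrite. Path and size
 * are placeholders. */
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int preallocate(const char *path, off_t total_size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    /* Returns 0 on success, an error number otherwise; on ext4 this
     * reserves the blocks without actually writing zeroes. */
    int err = posix_fallocate(fd, 0, total_size);
    close(fd);
    return err == 0 ? 0 : -1;
}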
The dirty_* knobs in https://www.kernel.org/doc/Documentation/sysctl/vm.txt are also likely coming into play given the large amounts of data you are writing.
On a very large SMP machine with many CPUs, scripts are run with tens of simultaneous jobs (fewer than the number of CPUs), like this:
some_program -in FIFO1 >OUTPUT1 2>s_p1.log </dev/null &
some_program -in FIFO2 >OUTPUT2 2>s_p2.log </dev/null &
...
some_program -in FIFO40 >OUTPUT40 2>s_p40.log </dev/null &
splitter_program -in real_input.dat -out FIFO1,FIFO2...FIFO40
The splitter reads the input data flat out and distributes it to the FIFOs in order. (Records 1,41,81... to FIFO1; 2,42,82 to FIFO2, etc.) The splitter has low overhead and can pretty much process data as fast as the file system can supply it.
Each some_program processes its stream and writes it to its output file. However, nothing controls the order in which the file system sees these writes, and the writes are also very small, on the order of 10 bytes. The script "knows" that there are 40 streams and that they could be buffered in 20 MB (or whatever) chunks, with each chunk then written to the file system sequentially. That is, queued writes should be used to maximize write speed to the disks. The OS, however, just sees a bunch of writes at about the same rate on each of the 40 streams.
What happens in practice during a run is that the subprocesses get a lot of CPU time (in top, >80%), then a flush process appears (10% CPU) and all the others drop to low CPU (1%), then it goes back to the higher rate. These pauses go on for several seconds at a time. The flush means that the writes are overwhelming the file caching. I also think the OS and/or the underlying RAID controller is probably bouncing the physical disk heads around erratically, which is reducing the ultimate write speed to the physical disks. That is just a guess, though, since it is hard to say exactly what is happening with a file cache (on a system with over 500 GB of RAM) and a RAID controller sitting between the writes and the disks.
Is there a program or method around for controlling this sort of IO, forcing the file system writes to queue nicely to maximize write speed?
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time. If the data rate in the output streams was uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written because any process that cannot write an output will not reads its next input, and that would stall the splitter, as all I/O is synchronous. The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
There are dozens of parameters for tuning the file system, some of those might help. The scheduler was changed from cfq to deadline because the system was locking up for minutes at a time with the former.
If the problem is sheer I/O bandwidth then buffering won't solve anything. In that case, you need to shrink the data or send it to a higher-bandwidth sink to improve and level your performance. One way to do that would be to reduce the number of parallel jobs, as @thatotherguy said.
If in fact the problem is with the number of distinct I/O operations rather than with the overall volume of data, however, then buffering might be a viable solution. I am unfamiliar with the buffer program you mentioned, but I suppose it does what its name suggests. I don't completely agree with your buffering comments, however:
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time.
You don't necessarily need big chunks. It would probably be ideal to chunk at the native block size of the file system, or a small integer multiple thereof. That might be, say, 4096- or 8192-byte chunks.
Moreover, I don't see why you think you have an "orderly queueing of writes" now, or why you're confident that such a thing is needed.
If the data rate in the output streams was uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written because any process that cannot write an output will not read its next input, and that would stall the splitter, as all I/O is synchronous.
Your splitter is writing to FIFOs. Though it may do so serially, that is not "synchronous" in the sense that the data needs to be drained out the other end before the splitter can proceed -- at least, not if the writes do not exceed the size of the FIFOs' buffers. FIFO buffer capacity varies from system to system, adapts dynamically on some systems, and is configurable (e.g. via fcntl()) on some systems. The default buffer size on modern Linux is 64kB.
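On Linux the relevant fcntl() operations are F_GETPIPE_SZ and F_SETPIPE_SZ (kernel 2.6.35 and later). A hedged sketch; the descriptor and the 1 MB target are placeholders:

/* Hedged sketch: inspect and enlarge a FIFO's buffer on Linux.
 * F_SETPIPE_SZ is Linux-specific and needs _GNU_SOURCE. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

void show_and_grow_pipe(int fifo_fd)
{
    long cur = fcntl(fifo_fd, F_GETPIPE_SZ);
    printf("current FIFO buffer: %ld bytes\n", cur);

    /* Ask for a 1 MB buffer; the kernel may round up, and unprivileged
     * processes are limited by /proc/sys/fs/pipe-max-size. */
    if (fcntl(fifo_fd, F_SETPIPE_SZ, 1 << 20) < 0)
        perror("F_SETPIPE_SZ");
}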
The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
I think this is a problem that pretty much solves itself. If one of the buffers backs up enough to block the splitter, then that ensures that the competing processes will, before too long, give the blocked buffer the opportunity to write. But this is also why you don't want enormous buffers -- you want to interleave disk I/O from different processes relatively finely to try to keep everything going.
The alternative to an external buffer program is to modify your processes to perform internal buffering. That might be an advantage because it removes a whole set of pipes (to an external buffering program) from the mix, and it lightens the process load on the machines. It does mean modifying your working processing program, though, so perhaps it would be better to start with external buffering to see how well that does.
If the problem is that your 40 streams each have a high data rate, and your RAID controller cannot write to physical disk fast enough, then you need to redesign your disk system. Basically, divide it into 40 RAID-1 mirrors and write one file to each mirror set. That makes the writes sequential for each stream, but requires 80 disks.
If the data rate isn't the problem then you need to add more buffering. You might need a pair of threads: one thread to collect the data into memory buffers, and another thread to write it into data files and fsync() it. To make the disk writes sequential, it should fsync each output file one at a time. That should result in writing large sequential chunks of whatever your buffer size is. 8 MB maybe?
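A hedged, deliberately simplified sketch of the buffering part, with the writer folded into the same thread for brevity (the 8 MB size, the struct names and the missing error handling are all placeholders; the actual suggestion above is to run stream_flush in a separate writer thread):

/* Hedged sketch: per-stream in-memory buffering so the filesystem sees
 * one large sequential write (followed by fsync) instead of many tiny
 * ones. Names and sizes are placeholders. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (8 * 1024 * 1024)      /* 8 MB per stream */

struct stream {
    int    fd;                          /* output file descriptor */
    char  *buf;
    size_t used;
};

static int stream_init(struct stream *s, int fd)
{
    s->fd = fd;
    s->used = 0;
    s->buf = malloc(BUF_SIZE);
    return s->buf ? 0 : -1;
}

static void stream_flush(struct stream *s)
{
    size_t off = 0;
    while (off < s->used) {             /* one big sequential write */
        ssize_t n = write(s->fd, s->buf + off, s->used - off);
        if (n <= 0)
            break;                      /* error handling omitted */
        off += (size_t)n;
    }
    fsync(s->fd);                       /* one file at a time, as suggested */
    s->used = 0;
}

static void stream_append(struct stream *s, const void *data, size_t len)
{
    if (s->used + len > BUF_SIZE)
        stream_flush(s);
    memcpy(s->buf + s->used, data, len);
    s->used += len;
}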
I would like to store a couple of entries in a file (optimized for reading), and a good data structure for that seems to be a B+ tree. It offers O(log(n)/log(b)) access time, where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiple of the device's block size?
Is it right that I should only use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n, p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or to mmap() them instead? Or is there only a difference if I mmap many blocks and access only a few? Which is better?
Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes, by adding padding bytes between the data too?
Many thanks,
Christoph
In general, file systems create new files at the beginning of a new block, because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are typically 512 bytes, or 4096 bytes on newer drives; pages are usually 4096 bytes).
One exception to this rule that comes to mind is ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch more pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, and not otherwise.
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
Filesystems which support delayed allocation don't decide where on disc a new file goes until the data is actually flushed. Lots of newer filesystems support packing very small files into their own pages or sharing them with metadata (for example, reiser puts very tiny files into the inode?). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
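A hedged sketch of that on Linux (the path is a placeholder; exact O_DIRECT alignment requirements vary by filesystem and kernel version):

/* Hedged sketch: query the filesystem block size with statfs() and do
 * an aligned O_DIRECT read of block 0. O_DIRECT needs _GNU_SOURCE. */
#define _GNU_SOURCE
#include <sys/vfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    struct statfs sfs;
    if (statfs("/data/btree.idx", &sfs) < 0) { perror("statfs"); return 1; }
    size_t blk = (size_t)sfs.f_bsize;          /* filesystem block size */

    int fd = open("/data/btree.idx", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires the buffer, offset and length to be aligned. */
    void *buf;
    if (posix_memalign(&buf, blk, blk) != 0)
        return 1;

    ssize_t n = pread(fd, buf, blk, (off_t)0 * blk);   /* block p = 0 */
    printf("read %zd bytes (block size %zu)\n", n, blk);

    free(buf);
    close(fd);
    return 0;
}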
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.
Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
On Windows, memory-mapped files work faster than the file API (ReadFile). I guess on Linux it's the same, but you can conduct your own measurements.