Algorithm to optimally distribute multiple writer threads on multiple physical disks

I have a logical store which has multiple physical disks assigned to it:
STORE
X:\
Y:\
Z:\
...
I also have a pool of threads that write data (size unknown) to the STORE. Is there an algorithm which I can use (load balancing, scheduling... etc.) to help me determine on which physical disk I should write?
Factors to take into consideration:
Free disk space available.
Disk utilization (proper distribution of threads across physical disks).
Free space % on all disks should be more or less the same.
Notes:
Each thread has its own data to process, so a single thread can sleep if its data is not available.
Disks are not necessarily the same size.
One or more disks could be taken offline.
One or more disks could be added to the STORE.
UPDATE:
I should've explained the objective of these threads better in my question: these threads read from different data sources/streams and write immediately to disk(s). Buffering the streams in memory is not much of an option because they tend to grow huge quickly.

Whatever you go with is going to require some tuning. What I describe below is a simple and effective starting point that might very well fit your needs.
First, I doubt that you actually need three threads to handle writing to three disk drives. The amount of processing required to orchestrate this is actually quite small.
As a first cut, you could do a simple round-robin scheduling with one thread and asynchronous writes. That is, you just have a circular queue that you fill with [X, Y, Z]. When a request comes in, you take a disk from the front of the queue and initiate an asynchronous write to that drive.
When the next request comes in, you again take the first item from the queue and issue an asynchronous write.
When an asynchronous write completes, the disk to which the data was written is added to the end of the queue.
If a drive is taken offline, it's removed from the queue. If a drive is added to the store, you make a new entry for it in the queue.
An obvious problem with the above is what to do if you get more concurrent write requests than you have drives. Using the technique I described above, the thread would have to block until there is a drive available. If you have to support bursts of activity, you could easily create a request queue into which requests are written (with their associated data). The thread doing the orchestration, then, would read an item from the queue, get a disk drive from the drive queue, and start the asynchronous write.
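For concreteness, here is a minimal sketch of that single-threaded orchestration. WriteRequest, DriveScheduler, and startAsyncWrite are hypothetical names; startAsyncWrite stands in for whatever asynchronous write mechanism you use (overlapped I/O with a completion port, for example), which would call onWriteComplete when the write finishes.

#include <algorithm>
#include <deque>
#include <queue>
#include <string>
#include <vector>

// Hypothetical request type: an identifier plus the bytes to write.
struct WriteRequest {
    std::string name;
    std::vector<char> data;
};

class DriveScheduler {
public:
    explicit DriveScheduler(std::deque<std::string> drives)
        : drives_(std::move(drives)) {}

    // Called when a new write request arrives.
    void submit(WriteRequest req) {
        pending_.push(std::move(req));
        dispatch();
    }

    // Called by the I/O completion handler when an asynchronous write finishes.
    void onWriteComplete(const std::string& drive) {
        drives_.push_back(drive);            // drive is free again: back of the queue
        dispatch();
    }

    void addDrive(std::string drive) { drives_.push_back(std::move(drive)); dispatch(); }
    void removeDrive(const std::string& drive) {
        drives_.erase(std::remove(drives_.begin(), drives_.end(), drive), drives_.end());
    }

private:
    void dispatch() {
        // Start writes while there is both a free drive and a pending request.
        while (!drives_.empty() && !pending_.empty()) {
            std::string drive = drives_.front();
            drives_.pop_front();
            WriteRequest req = std::move(pending_.front());
            pending_.pop();
            startAsyncWrite(drive, std::move(req));   // platform-specific; not shown
        }
    }

    // Hypothetical hook: issue the asynchronous write and arrange for
    // onWriteComplete(drive) to be called from the completion notification.
    void startAsyncWrite(const std::string& drive, WriteRequest req);

    std::deque<std::string> drives_;      // round-robin queue of free drives
    std::queue<WriteRequest> pending_;    // requests waiting for a free drive
};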
Note that with this setup, no drive can be doing more than a single write at a time. That's typically not a problem, because the drive hardware generally can't handle multiple concurrent writes anyway.
Keeping free space percentage relatively the same across drives might not be much harder. You could keep track of the free space percentage on each drive easily enough, and rather than using a FIFO queue for the drives, use a priority queue so that you always write to the drive that has the highest free space percentage. That will work well as long as your average write size isn't a huge percentage of a drive's free space.
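Since the store members are drive roots like X:\, a sketch of that selection step could query Win32's GetDiskFreeSpaceEx each time a drive is needed (names here are illustrative; an offline drive simply gets skipped):

#include <windows.h>
#include <string>
#include <vector>

// Sketch: pick the free drive with the highest free-space percentage.
// Assumes `freeDrives` holds roots such as "X:\\" that are currently idle.
std::string pickDrive(const std::vector<std::string>& freeDrives) {
    std::string best;
    double bestPct = -1.0;
    for (const auto& root : freeDrives) {
        ULARGE_INTEGER freeToCaller, totalBytes, totalFree;
        if (!GetDiskFreeSpaceExA(root.c_str(), &freeToCaller, &totalBytes, &totalFree))
            continue;                        // drive offline or removed? skip it
        double pct = static_cast<double>(totalFree.QuadPart) /
                     static_cast<double>(totalBytes.QuadPart);
        if (pct > bestPct) { bestPct = pct; best = root; }
    }
    return best;                             // empty string if no drive is available
}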
Edit
Note that I said asynchronous writes. So you can have as many concurrent writes as you have drives. Those writes are running concurrently and will notify on an I/O completion port when done. There's no need for multiple threads.
As for the priority queue, there are plenty of those to choose from, although finding a good concurrent priority queue is a bit more work. In the past I've just used locks to synchronize access to my own priority queue implementation. I guess I should formalize that at some point.
You could play with what I describe above by, for example, adding two or more entries in the queue for each drive. More for faster drives, fewer for slower drives. It's unclear how well that would work, but it's probably worth trying. If those "drives" are high performance network storage devices, they might actually be able to handle multiple concurrent writes better than a typical local disk drive can. But at some point you'll have to buffer writes because your computer can almost certainly create data much faster than your drives can write. The key is making your buffer large enough to handle the normal bursts of data, and also robust enough to block the program briefly if the buffer fills up.

Related

Does threading a lot lead to thrashing?

Does threading a lot lead to thrashing if each new thread wants to access memory (specifically the same database, in my case) and perform read/write operations throughout its lifetime?
I assume that this is true. If my assumption is true, then what is the best way to maximize the CPU utilization? And how can I determine what specific number of threads will give good CPU utilization?
If my assumption is wrong, please do give proper illustrations to let me understand the scenario clearly.
Trashy code causes thrashing, not threads. All code is run by some thread, even main(). Temporary objects are garbage collected the same way on any thread.
The subtle part is when each thread preloads its own objects to perform the work, which can duplicate a lot of the same data. It's usually a small sacrifice to make to get the power of concurrency. But it's not trash (no leak, no deterioration).
There is one exception: when some 3rd party code caches material in thread locals... You could end up caching the same stuff on each thread. Not really a leak, but not efficient.
Rule of thumb for number of threads? Depends on the task.
If the tasks are pure computation like math, then you should not exceed the number of non-hyperthreaded cores.
If the job is memory intensive along with pure computation work (most cases), then the number of hyperthreaded cores is your target (because the CPU will use the idle time of memory accesses to run another thread's computations).
If the job is mostly large sequential disk I/O, then your number of threads should not be much above the number of disk spindles available. This is very approximate, since disk caches, DMA, SSDs, RAID and so on completely affect how the disk layer can service your threads without idling. The same holds for random access. However, virtualization these days will throw all your estimates out the window: disk I/O could be much more available than you think, but also much worse.
If the jobs are mostly network I/O waits, then you are not really limited on your side; I would go with about 3x the number of cores to start. That multiplier simply presumes such a thread waits on the network for 2/3 of its time, which is very low in practice; it could easily spend 99% of its time waiting on network I/O (a 100x multiplier). Which is why you see NIO sockets everywhere: to deal with many connections with fewer, busier threads.
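Those rules of thumb could be captured in a small starting-point helper. This is only a sketch under the assumptions above (it additionally assumes 2 logical cores per physical core when estimating the non-hyperthreaded count); measure and tune from there.

#include <algorithm>
#include <thread>

// Rough starting points only; measure and tune for the actual workload.
enum class Workload { PureCompute, MemoryBound, DiskIo, NetworkIo };

unsigned suggestedThreads(Workload w, unsigned spindles = 1) {
    // Logical (hyperthreaded) cores; hardware_concurrency() may return 0.
    unsigned logical = std::max(1u, std::thread::hardware_concurrency());
    switch (w) {
        case Workload::PureCompute: return std::max(1u, logical / 2); // ~physical cores, assuming 2-way SMT
        case Workload::MemoryBound: return logical;                   // all hyperthreaded cores
        case Workload::DiskIo:      return std::max(1u, spindles);    // roughly one per spindle
        case Workload::NetworkIo:   return logical * 3;               // the 3x starting multiplier above
    }
    return logical;
}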
No; you could have hundreds of idle threads waiting for work and not see any thrashing. Thrashing is caused by the application's working set exceeding the available memory, so active pages need to be reloaded from disk (or even written out to disk when temporary storage needs saving to be reloaded later).
Threads share an address space; having many active leads to diminishing returns due to lock contention. So in the DB case, many processes reading tables can proceed simultaneously, yet updates of dependent data need to be serialised to keep the data consistent, which may cause lock contention and limit parallel processing.
Poorly written queries which need to load & sort large tables into memory may cause thrashing when they exceed free RAM (perhaps a poor choice of indexes). You can increase query throughput, and utilise CPUs more, by having large RAM disk caches and using SSDs to reduce random data access times.
On memory-intensive computations, cache sizes may become important: fewer threads whose data stays in cache (with CPU prefetches minimising stalls) work better than more threads competing to load their data from main memory.

Reducing seek times when reading many small files

I need to write some code (in any language) to process 10,000 files that reside on a local Linux filesystem. Each file is ~500KB in size, and consists of fixed-size records of 4KB each.
The processing time per record is negligible, and the records can be processed in any order, both within and across different files.
A naïve implementation would read the files one by one, in some arbitrary order. However, since my disks are very fast to read but slow to seek, this will almost certainly produce code that's bound by disk seeks.
Is there any way to code the reading up so that it's bound by disk throughput rather than seek time?
One line of inquiry is to try and get an approximate idea of where the files reside on disk, and use that to sequence the reads. However, I am not sure what API could be used to do that.
I am of course open to any other ideas.
The filesystem is ext4, but that's negotiable.
Perhaps you could do the reads by scheduling all of them in quick succession with aio_read. That would put all reads in the filesystem read queue at once, and then the filesystem implementation is free to complete the reads in a way that minimizes seeks.
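A sketch of that idea with POSIX AIO, assuming each ~500 KB file can be covered by a single read; real code would add error handling and would probably queue the files in batches rather than all 10,000 at once.

#include <aio.h>
#include <fcntl.h>
#include <string>
#include <vector>

// Sketch: queue one aio_read per file up front so the kernel and filesystem
// see the whole batch and can complete the reads in whatever order is cheapest.
std::vector<aiocb*> queueReads(const std::vector<std::string>& paths,
                               size_t bufSize = 512 * 1024) {
    std::vector<aiocb*> inFlight;
    for (const auto& p : paths) {
        int fd = open(p.c_str(), O_RDONLY);
        if (fd < 0) continue;                      // skip unreadable files
        aiocb* cb = new aiocb();                   // zero-initialized control block
        cb->aio_fildes = fd;
        cb->aio_buf    = new char[bufSize];
        cb->aio_nbytes = bufSize;
        cb->aio_offset = 0;
        if (aio_read(cb) == 0) inFlight.push_back(cb);   // read is now in flight
    }
    return inFlight;
}
// Later: poll each entry with aio_error() until it stops returning EINPROGRESS,
// then collect the byte count with aio_return() and process the buffer.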
A very simple approach, though no results are guaranteed: open as many of the files at once as you can and read from all of them at once, either using threads or asynchronous I/O. This way the disk scheduler knows what you are reading and can reduce the seeks by itself. Edit: as wildplasser observes, parallel open() is probably only doable using threads, not async I/O.
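A sketch of that threaded variant: a few worker threads each claim the next filename and read the whole file, so the kernel always has several outstanding requests it can reorder.

#include <atomic>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Sketch: N threads each read whole files; with several reads outstanding
// the disk scheduler can reorder them to cut down on seeking.
void readAll(const std::vector<std::string>& files, unsigned nThreads = 8) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&] {
            for (size_t i = next++; i < files.size(); i = next++) {
                std::ifstream in(files[i], std::ios::binary);
                std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                                      std::istreambuf_iterator<char>());
                // ... process the 4 KB records in buf, in any order ...
            }
        });
    }
    for (auto& w : workers) w.join();
}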
The alternative is to try to do the heavy lifting yourself. Unfortunately this involves a difficult step: getting the mapping of the files to physical blocks. There is no standard interface to do that; you could probably extract the logic from something like ext2fsprogs or the kernel FS driver. And it involves reading the physical device underlying a mounted filesystem, which the OS can be writing to at the same time you're trying to get a consistent snapshot.
Once you get the physical blocks, just order them, reverse the mapping back to the file offsets and execute the reads in the physical block order.
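The ordering step itself is trivial once the mapping exists. A sketch, where the (block, filename) pairs are assumed to have been produced by whatever filesystem-specific method you settle on, and readWholeFile is a hypothetical helper:

#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Sketch: given each file's starting physical block (obtained elsewhere),
// visit the files in on-disk order so the head sweeps in one direction.
void readInDiskOrder(std::vector<std::pair<uint64_t, std::string>> files) {
    std::sort(files.begin(), files.end());          // ascending physical block
    for (const auto& entry : files) {
        const std::string& name = entry.second;
        // readWholeFile(name);                      // hypothetical helper
        (void)name;
    }
}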
Might I recommend using an SSD for the file storage? That should reduce seek times greatly, as there's no head to move.
Since the operations are similar and the data are independent, you can try using a thread pool to submit jobs that each work on a number of files (possibly a single file). An idle thread then picks up and completes a single job. This might help overlap IO operations with execution.
A simple way would be to keep the original program, but fork an extra process which has no other task than to prefetch the files and prime the disk buffer cache (a Unix/Linux system uses all "free" memory as disk buffer cache).
The main task will stay a few files behind (say ten). The hard part would be to keep things synchronised. A pipe seems the obvious way to accomplish this.
UPDATE:
Pseudo code for the main process:
1. fetch filename from worklist
2. if empty goto 2
3. (maybe) fork a worker process or thread
4. add to prefetch queue
5. add to internal queue
6. if fewer than XXX items on internal queue goto 1
7. fetch filename from internal queue
8. process it
9. goto 1
For the slave processes:
1. fetch from queue
2. if empty: quit
3. prefetch file
4. loop or quit
For the queue, a message queue seems most appropriate, since it maintains message boundaries. Another way would be to have one pipe per child (in the fork() case) or use mutexes (when using threads).
You'll need approximately seektime_per_file / processing_time_per_file worker threads/processes.
As a simplification: if seeking within the files is not required (only sequential access), the slave processes could consist of the equivalent of
dd if=name bs=500K
which could be wrapped into a popen() or a pipe+fork().
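A sketch of that last idea: each prefetch is just dd piped through popen(), with the output sent to /dev/null so the parent does not have to drain the pipe; merely reading the file is enough to prime the page cache for the main task.

#include <cstdio>
#include <string>

// Sketch: prefetch one file by reading it with dd and discarding the bytes.
// Reading it pulls the file into the OS page cache ahead of the main task.
void prefetch(const std::string& path) {
    std::string cmd = "dd if='" + path + "' of=/dev/null bs=512k 2>/dev/null";
    if (FILE* p = popen(cmd.c_str(), "r")) {
        pclose(p);   // waits for dd, so the prefetcher stays only a bounded distance ahead
    }
}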

Is it useful to use multithreading to handle files on a hard drive?

In terms of performance and speed of execution, is it useful to use multithreading to handle files on a hard drive? (To move files from one disk to another, or to check the integrity of files.)
I think it is mainly the speed of my HDD that will determine the speed of my processing.
Multithreading can help, at least sometimes. The reason is that if you are writing to a "normal" hard drive (e.g. not a solid state drive) then the thing that is going to slow you down the most is the hard drive's seek time (that is, the time it takes for the hard drive to reposition its read/write head from one position along the disk's radius to another). That movement is glacially slow compared to the rest of the system, and the time it takes for the head to seek is proportional to the distance it must travel. So, for example, the worst case scenario would be if the head had to move from the edge of the disk to the center of the disk after each operation.
Of course the ideal solution is to have the disk head never seek, or seek only very rarely, and if you can arrange it so that your program only needs to read/write a single file sequentially, that will be fastest. Or better yet, switch to an SSD, where there is no disk head, and the seek time is effectively zero. :)
But sometimes you need your drive to be able to read/write multiple files in parallel, in which case the drive head will (of necessity) be seeking back and forth a lot. So how can multithreading help in this scenario? The answer is this: with a sufficiently smart disk I/O subsystem (e.g. SCSI, I'm not sure if IDE can do this), the I/O logic will maintain a queue of all currently outstanding read/write requests, and it will dynamically re-order that queue so that the requests are fulfilled in the order that minimizes the amount of travel by the read/write head. This is known as the Elevator Algorithm, because it is similar to the logic used by an elevator to maximize the number of people it can transport in a given period of time.
Of course, the OS's I/O subsystem can only implement this optimization if it knows in advance what I/O requests are pending... and if you have only one thread initiating I/O requests, then the I/O subsystem will only know about the current request. (i.e. it can't "peek" into your thread's userland request queue to see what your thread will want next). And of course your userland thread doesn't know the details of the disk layout, so it's difficult (impossible?) to implement the Elevator Algorithm in user space.
But if your program has N threads reading/writing the disk at once, then the OS's I/O subsystem will be aware of up to N I/O requests at once, and can re-order those requests as it sees fit to maximize disk performance.
Perhaps your main concern should be code maintainability. Threading helps hugely, IMO, because it does not permit the kind of hackery that single-threading permits.

Data access synchronization between multiple threads

I'm trying to implement a multi threaded, recursive file search logic in Visual C++. The logic is as follows:
Threads 1 and 2 will start at a directory location and match the files present in the directory against the search criteria. If they find a child directory, they add it to a work queue. Once a thread finishes with the files in a directory, it grabs another directory path from the work queue. The work queue is an STL stack guarded with CriticalSections around the push(), pop(), and top() calls.
If the stack is empty at any point, the threads wait for a tiny amount of time before retrying. Also, when all the threads are in the waiting state, the search is marked as complete.
This logic works without any problems, but I feel that I'm not getting the full potential of using threads because there isn't a drastic performance gain compared to using a single thread. I feel the work stack is the bottleneck but can't figure out how to do away with the locking part. I tried another variation where each thread has its own stack and adds a work item to the global stack only when its local stack size crosses a fixed number of work items. If the local stack is empty, threads try fetching from the global stack. I didn't find a noticeable difference even with this variation. Does anyone have any suggestions for improving the synchronization logic?
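For reference, a sketch of the kind of guarded stack described above, with std::mutex standing in for the Win32 CRITICAL_SECTION; folding top() and pop() into a single call also avoids two threads grabbing the same directory.

#include <mutex>
#include <optional>
#include <stack>
#include <string>

// Sketch of the guarded work stack, using std::mutex in place of a
// Win32 CRITICAL_SECTION. top() and pop() are combined into one call so a
// directory cannot be handed to two threads.
class WorkStack {
public:
    void push(std::string dir) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push(std::move(dir));
    }
    std::optional<std::string> tryPop() {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        std::string dir = std::move(items_.top());
        items_.pop();
        return dir;
    }
private:
    std::mutex m_;
    std::stack<std::string> items_;
};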
I really doubt that your work stack is the bottleneck. The disk only has one head, and can only read one stream of data at a time. As long as your threads are processing the data as fast as the disk can supply it, there's not much else you can do that's going to have any significant effect on overall speed.
For other types of tasks your queue might become a significant bottleneck, but for this task, I doubt it. Keep in mind the time scales of the operations here. A simple operation that happens inside of a CPU takes considerably less than a nanosecond. A read from main memory takes on the order of tens of nanoseconds. Something like a thread switch or synchronization takes on the order of a couple hundred nanoseconds or so. A single head movement on the disk drive takes on the order of a millisecond or so (1,000,000 nanoseconds).
In addition to Jerry's answer: your bottleneck is the disk system. If you have a RAID array you might see some moderate improvement from using 2 or 3 threads.
If you have to search multiple drives (note: physical drives, not volumes on a single physical drive) you can use extra threads for each of them.

Does multithreading make sense for IO-bound operations?

When performing many disk operations, does multithreading help, hinder, or make no difference?
For example, when copying many files from one folder to another.
Clarification: I understand that when other operations are performed, concurrency will obviously make a difference. If the task was to open an image file, convert to another format, and then save, disk operations can be performed concurrently with the image manipulation. My question is when the only operations performed are disk operations, whether concurrently queuing and responding to disk operations is better.
Most of the answers so far have had to do with the OS scheduler. However, there is a more important factor that I think would lead to your answer. Are you writing to a single physical disk, or multiple physical disks?
Even if you parallelize with multiple threads...IO to a single physical disk is intrinsically a serialized operation. Each thread would have to block, waiting for its chance to get access to the disk. In this case, multiple threads are probably useless...and may even lead to contention problems.
However, if you are writing multiple streams to multiple physical disks, processing them concurrently should give you a boost in performance. This is particularly true with managed disks, like RAID arrays, SAN devices, etc.
I don't think the issue has as much to do with the OS scheduler as it does with the physical aspects of the disk(s) you're writing to.
That depends on your definition of "I/O bound" but generally multithreading has two effects:
Use multiple CPUs concurrently (which won't necessarily help if the bottleneck is the disk rather than the CPU[s])
Use a CPU (with another thread) even while one thread is blocked (e.g. waiting for I/O completion)
I'm not sure that Konrad's answer is always right, however: as a counter-example, if "I/O bound" just means "one thread spends most of its time waiting for I/O completion instead of using the CPU", but does not mean that "we've hit the system I/O bandwidth limit", then IMO having multiple threads (or asynchronous I/O) might improve performance (by enabling more than one concurrent I/O operation).
I would think it depends on a number of factors, like the kind of application you are running, the number of concurrent users, etc.
I am currently working on a project that has a high degree of linear (reading files from start to finish) operations. We use a NAS for storage, and were concerned about what happens if we run multiple threads. Our initial thought was that it would slow us down because it would increase head seeks. So we ran some tests and found out that the ideal number of threads is the same as the number of cores in the computer.
But your mileage may vary.
It can do, simply because whenever there is more work for a thread to do (identifying the next file to copy) the OS wakes it up, so threads are a simple way to hook into the OS scheduler and yet still write code in a traditional sequential way, instead of having to break it up into a state machine with callbacks.
This is mainly an assistance with clear programming rather than performance.
In most cases, using multiple threads for disk IO will not improve efficiency. Let's imagine 2 circumstances:
Lock-free file: we can split the file across threads by giving each a different IO offset. For instance, a 1024-byte file is split into n pieces and each thread writes its 1024/n bytes at its own offset (see the sketch after this answer). This causes a lot of extra disk head movement because of the different offsets.
Lock file: actually lock the IO operation in each critical section. This causes a lot of extra thread switches, and it turns out that only one thread can write to the file at a time.
Correct me if I'm wrong.
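A sketch of the first circumstance (splitWrite is a hypothetical helper): each thread writes its own slice at a distinct offset with pwrite(), so no lock is needed, yet on a single spindle the head still has to jump between the n regions.

#include <unistd.h>
#include <thread>
#include <vector>

// Sketch of the "lock-free" split: each thread pwrite()s its own slice at a
// different offset. No locking is required, but on one spindle the head still
// bounces between the n regions.
void splitWrite(int fd, const std::vector<char>& data, unsigned n) {
    const size_t chunk = data.size() / n;
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i) {
        threads.emplace_back([&, i] {
            const size_t off = static_cast<size_t>(i) * chunk;
            const size_t len = (i == n - 1) ? data.size() - off : chunk;
            pwrite(fd, data.data() + off, len, static_cast<off_t>(off));  // own offset, no lock
        });
    }
    for (auto& t : threads) t.join();
}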
No, it makes no sense. At some point, the operations have to be serialized (by the OS). On the other hand, since modern OS's have to cope with multiple processes anyway I doubt that there's an added overhead.
I'd think it would hinder the operations... You only have one controller and one drive.
You could use a second thread to do the operation, and a main thread that shows an updated UI.
I think it could worsen the performance, because the multiple threads will compete for the same resources.
You can test the impact of doing concurrent IO operations on the same device by copying a set of files from one place to another and measuring the time, then splitting the set into two parts and making the copies in parallel... the second option will be noticeably slower.
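A rough way to run that experiment (copyFiles and the destination directories are placeholders; note that the second pass will read from the page cache unless you drop caches or use a different source set between runs):

#include <chrono>
#include <filesystem>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

// Copy every file in `files` into the directory `dst`.
void copyFiles(const std::vector<fs::path>& files, const fs::path& dst) {
    for (const auto& f : files)
        fs::copy_file(f, dst / f.filename(), fs::copy_options::overwrite_existing);
}

// Time a sequential copy, then the same set split across two threads.
void compareCopies(const std::vector<fs::path>& files,
                   const fs::path& dstSequential, const fs::path& dstParallel) {
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    copyFiles(files, dstSequential);
    auto sequential = clock::now() - t0;

    std::vector<fs::path> first(files.begin(), files.begin() + files.size() / 2);
    std::vector<fs::path> second(files.begin() + files.size() / 2, files.end());

    auto t1 = clock::now();
    std::thread a([&] { copyFiles(first,  dstParallel); });
    std::thread b([&] { copyFiles(second, dstParallel); });
    a.join(); b.join();
    auto parallel = clock::now() - t1;

    // Compare `sequential` and `parallel` with std::chrono::duration_cast.
    (void)sequential; (void)parallel;
}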
