Does multithreading make sense for IO-bound operations?

When performing many disk operations, does multithreading help, hinder, or make no difference?
For example, when copying many files from one folder to another.
Clarification: I understand that when other operations are performed, concurrency will obviously make a difference. If the task were to open an image file, convert it to another format, and then save it, disk operations could be performed concurrently with the image manipulation. My question is whether, when the only operations performed are disk operations, concurrently queuing and responding to disk operations is better.

Most of the answers so far have had to do with the OS scheduler. However, there is a more important factor that I think would lead to your answer. Are you writing to a single physical disk, or multiple physical disks?
Even if you parallelize with multiple threads, IO to a single physical disk is intrinsically a serialized operation. Each thread would have to block, waiting for its chance to get access to the disk. In this case, multiple threads are probably useless, and may even lead to contention problems.
However, if you are writing multiple streams to multiple physical disks, processing them concurrently should give you a boost in performance. This is particularly true with managed disks, like RAID arrays, SAN devices, etc.
I don't think the issue has as much to do with the OS scheduler as it does with the physical characteristics of the disk(s) you're writing to.

That depends on your definition of "I/O bound" but generally multithreading has two effects:
Use multiple CPUs concurrently (which won't necessarily help if the bottleneck is the disk rather than the CPU[s])
Use a CPU (with another thread) even while one thread is blocked (e.g. waiting for I/O completion)
I'm not sure that Konrad's answer is always right, however: as a counter-example, if "I/O bound" just means "one thread spends most of its time waiting for I/O completion instead of using the CPU", but does not mean that "we've hit the system I/O bandwidth limit", then IMO having multiple threads (or asynchronous I/O) might improve performance (by enabling more than one concurrent I/O operation).
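For illustration, here is a minimal sketch (C++17, hypothetical file names) of what "more than one concurrent I/O operation" can look like: each std::async task blocks on its own read, so several requests can be in flight at once instead of one thread idling on each in turn. Whether this actually helps depends on the device and on whether you have already hit its bandwidth limit.

    // Minimal sketch: overlapping several independent reads with std::async.
    // File names are placeholders; benefit depends on the storage device.
    #include <fstream>
    #include <future>
    #include <iostream>
    #include <string>
    #include <vector>

    // Read a whole file into memory and return its size in bytes.
    static std::size_t read_file(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
        return data.size();
    }

    int main() {
        std::vector<std::string> paths = {"a.bin", "b.bin", "c.bin"};  // hypothetical inputs
        std::vector<std::future<std::size_t>> pending;

        // Each async task blocks on its own read, so several requests can be
        // outstanding at once instead of one thread idling on each in turn.
        for (const auto& p : paths)
            pending.push_back(std::async(std::launch::async, read_file, p));

        for (std::size_t i = 0; i < pending.size(); ++i)
            std::cout << paths[i] << ": " << pending[i].get() << " bytes\n";
    }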

I would think it depends on a number of factors, like the kind of application you are running, the number of concurrent users, etc.
I am currently working on a project that has a high degree of linear (reading files from start to finish) operations. We use a NAS for storage, and were concerned about what happens if we run multiple threads. Our initial thought was that it would slow us down because it would increase head seeks. So we ran some tests and found out that the ideal number of threads is the same as the number of cores in the computer.
But your mileage may vary.
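As a rough sketch of how you might pick that thread count programmatically (C++11; the fallback value is just an assumption for when the core count cannot be detected):

    #include <algorithm>
    #include <iostream>
    #include <thread>

    int main() {
        // hardware_concurrency() may return 0 if the value is not computable,
        // so fall back to a single worker in that case.
        unsigned cores = std::thread::hardware_concurrency();
        unsigned workers = std::max(1u, cores);
        std::cout << "Detected cores: " << cores
                  << ", using " << workers << " worker thread(s)\n";
    }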

It can do, simply because whenever there is more work for a thread to do (identifying the next file to copy), the OS wakes it up. Threads are a simple way to hook into the OS scheduler while still writing code in a traditional sequential way, instead of having to break it up into a state machine with callbacks.
This is mainly an assistance with clear programming rather than performance.

In most cases, using multiple threads for disk IO will not improve efficiency. Let's imagine two circumstances:
Lock-free file: we can split the file between threads by giving each a different IO offset. For instance, a 1024-byte file is split into n pieces and each thread writes its 1024/n bytes at its own offset. This causes a lot of extra disk head movement because of the differing offsets.
Locked file: actually lock the IO operation in each critical section. This causes a lot of extra thread switches, and it turns out that only one thread can write the file at a time.
Correct me if I'm wrong.
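For concreteness, a rough sketch (POSIX, C++) of the "lock-free" variant described above: each thread writes its own slice at a distinct offset via pwrite(), so no file lock is needed, though on a spinning disk the differing offsets can still cause the extra head movement mentioned. The file name and sizes are illustrative only.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        const int fd = open("out.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) return 1;

        const std::size_t total = 1024, pieces = 4, piece = total / pieces;
        std::vector<char> buf(total, 'x');          // data to write
        std::vector<std::thread> threads;

        for (std::size_t i = 0; i < pieces; ++i) {
            threads.emplace_back([&, i] {
                // pwrite() takes an explicit offset, so concurrent calls on the
                // same descriptor do not race on the shared file position.
                pwrite(fd, buf.data() + i * piece, piece,
                       static_cast<off_t>(i * piece));
            });
        }
        for (auto& t : threads) t.join();
        close(fd);
    }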

No, it makes no sense. At some point, the operations have to be serialized (by the OS). On the other hand, since modern OSes have to cope with multiple processes anyway, I doubt that there's any added overhead.

I'd think it would hinder the operations... You only have one controller and one drive.
You could use a second thread to do the operation, and a main thread that shows an updated UI.

I think it could worsen the performance, because the multiple threads will compete for the same resources.
You can test the impact of doing concurrent IO operations on the same device by copying a set of files from one place to another and measuring the time, then splitting the set in two and making the copies in parallel; the second option will be noticeably slower.
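A rough sketch (C++17) of that test; the paths are placeholders and the outcome depends entirely on the device:

    #include <chrono>
    #include <filesystem>
    #include <iostream>
    #include <thread>
    #include <vector>

    namespace fs = std::filesystem;

    static void copy_all(const std::vector<fs::path>& files, const fs::path& dst) {
        for (const auto& f : files)
            fs::copy_file(f, dst / f.filename(), fs::copy_options::overwrite_existing);
    }

    int main() {
        std::vector<fs::path> files = {"f1.dat", "f2.dat", "f3.dat", "f4.dat"};  // hypothetical
        const fs::path dst = "dest";
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        copy_all(files, dst);                                  // whole set, one thread
        auto t1 = clock::now();

        std::vector<fs::path> first(files.begin(), files.begin() + files.size() / 2);
        std::vector<fs::path> second(files.begin() + files.size() / 2, files.end());

        auto t2 = clock::now();
        std::thread worker([&] { copy_all(first, dst); });     // half the set in parallel
        copy_all(second, dst);                                 // other half on this thread
        worker.join();
        auto t3 = clock::now();

        auto ms = [](auto a, auto b) {
            return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
        };
        std::cout << "sequential: " << ms(t0, t1) << " ms, "
                  << "parallel: "   << ms(t2, t3) << " ms\n";
    }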

Related

Multithreading does not help for IO-intensive tasks?

I need to copy a set of files with the size of each file ranging from 1MB to 700MB. After I copy each file, I need to validate the checksum of each file against an entry in md5sum.txt.
I wanted to optimize this task and hence evaluated the performance by splitting the load among multiple threads. The results were not as expected: I was expecting the time taken for copying and validation to decrease as the number of threads increased, but it actually increased.
I have modified the ThreadPool source code shared in this link https://stackoverflow.com/a/22285532/1568395 to implement the threadpool.
The source code for the application can be found here
https://github.com/saai63/ThreadPool
The results for various numbers of threads showed the time taken increasing as more threads were added.
From my reading, the probable reason is that all of the tasks are now IO-bound, so all of the threads end up blocked on IO and cannot make progress in parallel, because the shared resource here is the HDD. I also understand that the HDD controller tries to optimize disk access by reducing seek time. Disks love sequential access patterns, and any concurrent accesses will disrupt this pattern, hence the delay for large files.
Is this the only reason for the delay, or are there other factors? Why does the time taken increase with the number of threads?
IO is always much slower than the CPU. When multiple threads try to read from an IO device, what they usually achieve is a "bull rush" to the device, which increases the "randomness" of the IO operations and thus makes everything slower. Fewer threads have a greater chance of sequential operations, which are notoriously faster.
In the case of multithreading, you share the CPU among threads. The CPU is switched between threads whenever the running thread goes into some sort of waiting state.
Here you have an IO-bound task, and there's no point in making your program multithreaded, as all of the threads will be relying on a single IO device.
Even if you implement a multiprocess solution (multiple processes on the same node), all processes will be waiting on the same IO device and won't give any performance improvement.
One solution would be building some sort of multi-node setup with a shared disk that supports simultaneous multi-client access.
Using that kind of approach you can divide your task among multiple nodes which access the same disk and perform the operation.
Edit:
I think the increase in time is because of the time taken by the operating system to service multiple threads. Switching the CPU and IO devices between threads takes longer as you increase the number of threads; a context switch is a compute-intensive operation in itself, and you also lose IO/CPU cache performance as you switch between threads.
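To quantify this, one option is to re-run the same copy workload with different worker counts. This is not the asker's ThreadPool code, just a rough sketch (C++17, placeholder file names, MD5 validation omitted) of that kind of measurement loop:

    #include <chrono>
    #include <filesystem>
    #include <iostream>
    #include <thread>
    #include <vector>

    namespace fs = std::filesystem;

    int main() {
        std::vector<fs::path> files = {"f1.bin", "f2.bin", "f3.bin", "f4.bin"};  // hypothetical
        const fs::path dst = "dest";

        for (unsigned workers = 1; workers <= 4; ++workers) {
            auto t0 = std::chrono::steady_clock::now();

            std::vector<std::thread> pool;
            for (unsigned w = 0; w < workers; ++w) {
                pool.emplace_back([&, w] {
                    // Static partitioning: worker w handles every workers-th file.
                    for (std::size_t i = w; i < files.size(); i += workers)
                        fs::copy_file(files[i], dst / files[i].filename(),
                                      fs::copy_options::overwrite_existing);
                });
            }
            for (auto& t : pool) t.join();

            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - t0).count();
            std::cout << workers << " thread(s): " << ms << " ms\n";
        }
    }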

Cost of a thread

I understand how to create a thread in my chosen language, and I understand about mutexes and the dangers of shared data etc., but I'm not sure about how the OS manages threads and the cost of each thread. I have a series of questions that all relate, and the clearest way to show the limit of my understanding is probably via these questions.
What is the cost of spawning a thread? Is it worth even worrying about when designing software? One of the costs of creating a thread must be its own stack pointer and program counter, then space to copy all of the working registers to as it is moved on and off a core by the scheduler, but what else?
Is the amount of stack available for one program split equally between the threads of a process, or allocated on a first-come, first-served basis?
Can I somehow check the hardware on start-up (of the program) for the number of cores? If I am running on a machine with N cores, should I keep the number of threads to N-1?
then space to copy all of the working registers to as it is moved on
and off of a core by the scheduler, but what else?
One less evident cost is the strain imposed on the scheduler, which may start to choke if it needs to juggle thousands of threads. The memory isn't really the issue. With the right tweaking you can get a "thread" to occupy very little memory, little more than its stack. This tweaking can be difficult (e.g. using clone(2) directly under Linux), but it can be done.
Is the amount of stack available for one program split equally between
threads of a process or on a first come first served
Each thread gets its own stack, and typically you can control its size.
If I am running on a machine with N cores, should I keep the number of
threads to N-1
Checking the number of cores is easy, but environment-specific. However, limiting the number of threads to the number of cores only makes sense if your workload consists of CPU-intensive operations, with little I/O. If I/O is involved you may want to have many more threads than cores.
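For illustration, a small sketch (Linux) that queries the core count and gives one thread an explicitly sized stack. std::thread has no portable stack-size control, so this drops down to pthread attributes; the 256 KiB figure is just an example.

    #include <pthread.h>
    #include <cstdio>
    #include <thread>

    static void* worker(void*) {
        std::puts("worker running on its own, explicitly sized stack");
        return nullptr;
    }

    int main() {
        unsigned cores = std::thread::hardware_concurrency();
        std::printf("hardware threads reported: %u\n", cores);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 256 * 1024);   // 256 KiB instead of the default (often 8 MiB)

        pthread_t tid;
        if (pthread_create(&tid, &attr, worker, nullptr) == 0)
            pthread_join(tid, nullptr);
        pthread_attr_destroy(&attr);
    }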
You should be as thoughtful as possible in everything you design and implement.
I know that a Java thread stack takes up about 1MB each time you create a thread, so they add up.
Threads make sense for asynchronous tasks that allow long-running activities to happen without preventing all other users/processes from making progress.
Threads are managed by the operating system. There are lots of schemes, all under the control of the operating system (e.g. round robin, first come first served, etc.)
It makes perfect sense to me to assign one thread per core for some activities (e.g. computationally intensive calculations, graphics, math, etc.), but that need not be the deciding factor. One app I develop uses roughly 100 active threads in production; it's not a 100 core machine.
To add to the other excellent posts:
'What is the cost of spawning a thread? Is it worth even worrying about when designing software?'
It is if one of your design choices is doing such a thing often. A good way of avoiding this issue is to create threads once, at app startup, by using pools and/or app-lifetime threads dedicated to operations. Inter-thread signaling is much quicker than continual thread creation/termination/destruction and also much safer/easier.
The number of posts concerning problems with thread stopping, terminating, destroying, thread count runaway, OOM failure etc. is legendary. If you can avoid doing it at all, great.
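A minimal sketch of that "create the threads once, then signal them work" idea (C++11, mutex plus condition variable; not production code, a real pool needs error handling and a richer shutdown story):

    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class TinyPool {
    public:
        explicit TinyPool(unsigned n) {
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~TinyPool() {
            {
                std::lock_guard<std::mutex> lk(m_);
                done_ = true;
            }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
        void submit(std::function<void()> job) {
            {
                std::lock_guard<std::mutex> lk(m_);
                jobs_.push(std::move(job));
            }
            cv_.notify_one();                  // wake one sleeping worker
        }

    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();                         // run outside the lock
            }
        }

        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> jobs_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    };

    int main() {
        TinyPool pool(2);                      // created once, at "app startup"
        for (int i = 0; i < 4; ++i)
            pool.submit([i] { std::cout << "job " << i << " done\n"; });
        // The destructor drains remaining jobs and joins the worker threads.
    }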

Linux File IO - Multithreading performance - writing to different files

I'm currently working on an audio recording application, that fetches up to 8 audio streams from the network and saves the data to the disk (simplified ;) ).
Right now, each stream gets handled by one thread -> the same thread also does the saving work on the disk.
That means I got 8 different threads that perform writes on the same disk, each one into a different file.
Do you think there would be an increase in disk I/O performance if all the writing work were done by one common thread (that would sequentially write the data into the particular files)?
OS is an embedded Linux, the "disk" is a CF card, the application is written in C.
Thanks for your ideas
Nick
The short answer: Given that you are writing to a Flash disk, I wouldn't expect the number of threads to make much difference one way or another. But if it did make a difference, I would expect multiple threads to be faster than a single thread, not slower.
The longer answer:
I wrote a similar program to the one you describe about 6 years ago -- it ran on an embedded PowerPC Linux card and read/wrote multiple simultaneous audio files to/from a SCSI hard drive. I originally wrote it with a single thread doing I/O, because I thought that would give the best throughput, but it turned out that that was not the case.
In particular, when multiple threads were reading/writing at once, the SCSI layer was aware of all the pending requests from all the different threads, and was able to reorder the I/O requests such that seeking of the drive head was minimized. In the single-thread-IO scenario, on the other hand, the SCSI layer knew only about the single "next" outstanding I/O request and thus could not do that optimization. That meant extra travel for the drive head in many cases, and therefore lower throughput.
Of course, your application is not using SCSI or a rotating drive with heads that need seeking, so that may not be an issue for you -- but there may be other optimizations that the filesystem/hardware layer can do if it is aware of multiple simultaneous I/O requests. The only real way to find out is to try various models and measure the results.
My suggestion would be to decouple your disk I/O from your network I/O by moving your disk I/O into a thread-pool. You can then vary the maximum size of your I/O-thread-pool from 1 to N, and for each size measure the performance of the system. That would give you a clear idea of what works best on your particular hardware, without requiring you to rewrite the code more than once.
If it's embedded Linux, I guess your machine has only one processor/core. In this case threads won't improve I/O performance at all. Of course the Linux block subsystem works well in a concurrent environment, but in your case (if my guess about the number of cores is right) there can't be a situation where several threads do something simultaneously.
If my guess is wrong and you have more than 1 core, then I'd suggest to benchmark disk I/O. Write a program that writes a lot of data from different threads and another program that does the same from only one thread. The results will show you everything you want to know.
I think there is no big difference between the multithreaded and single-threaded solutions in your case, but with multithreading you can synchronize between the receiving threads, and no single thread can hold up the others if it blocks in some system call.
I did practically the same thing on an embedded system. The problem was high CPU usage when the kernel flushed many cached dirty pages to the CF card: the pdflush kernel process took all the CPU time at that moment, and if you receive the stream via UDP, packets can be dropped because the CPU is busy when they arrive. I solved that problem by calling fdatasync() every time a smallish amount of data had been received.
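A rough sketch (Linux) of that pattern, with an illustrative chunk size and file name: write each received chunk, then fdatasync() it so the kernel never builds up a large backlog of dirty pages.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>
    #include <vector>

    int main() {
        const int fd = open("stream.raw", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) return 1;

        const std::size_t chunk = 64 * 1024;        // flush every 64 KiB received
        std::vector<char> buf(chunk, 0);            // stand-in for received audio data

        for (int i = 0; i < 16; ++i) {              // pretend 16 chunks arrive
            if (write(fd, buf.data(), buf.size()) != static_cast<ssize_t>(buf.size()))
                break;
            fdatasync(fd);                          // push this chunk to the CF card now
        }
        close(fd);
    }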

Delphi 2010: Advantage of running multi threads if cannot allocate memory to create object for calculation in each thread

My Previous Question
From the answer to that question, it seems that if my threads create objects, I will face a memory allocation/deallocation bottleneck, so running multiple threads may be slower than, or show no obvious time difference from, running without threads. What is the advantage of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in each thread?
What is the advantage of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in each thread?
It depends on where your bottlenecks are. If your bottleneck is the amount of memory available, then creating more threads won't help. Or, if I/O is the bottleneck, trying to parallelize will just slightly slow everything down because of context switching. It's like trying to make an underpowered car faster by putting wider tyres on it: fixing the wrong thing doesn't help.
Threads are useful when the bottleneck is the processor and there are several processors available.
Well, if you allocate chunks of memory in a loop, things will slow down.
If you can create your objects once at the beginning of TThread.execute, the overhead will be smaller.
Threads can also be beneficial if you have to wait for IO operations, or if you have expensive calculations to do on a machine with more than one physical core.
If you have memory intensive threads (many memory allocations/deallocations) you better use TopMM instead of FastMM:
http://www.topsoftwaresite.nl/
FastMM uses a lock which blocks all other threads, TopMM does not so it scales much better on multi cores/cpus!
When it comes to multithreading, shared-resource issues will always arise (with current technology). Every resource that may need serialization (RAM, disk, etc.) is a possible bottleneck. Multithreading is not a magic solution that turns a slow app into a fast one, and it does not always result in better speed. Done the wrong way, it can actually result in worse speed. The application should be analyzed to find possible bottlenecks, and some parts may need to be rewritten to minimize them using different techniques (i.e. preallocating memory, using async I/O, etc.). Anyway, performance is only one of the reasons to use more than one thread. There are several other reasons, for example letting the user interact with the application while background threads perform operations (i.e. printing, checking data, etc.) without "locking up" on the user. That way the application can seem "faster" (the user can keep on using it without waiting) even if it is actually slower (it takes more time to finish the operations than if it performed them serially).

Simultaneous Or Sequential writes-- Does it matter in terms of speed?

Simultaneous Or Sequential write operation-- Does it matter in terms of speed?
With multicore processor, does it make sense to parallelize all the file write operation using multi thread, just to get a boost of speed? Of course, all those write operations are independent.
Generally, no.
As of now, the physical write to disk IS the bottleneck, by some orders of magnitude, and in most scenarios it is rather sequential. By parallelizing writes you have a good chance of worsening performance by incurring seeks. Sequential reads and writes will largely outperform interleaving in most cases.
Per-disk parallelization (TCQ and NCQ) mainly work by reducing the seeks that are naturally required when different clients concurrently request data from different sections of the disk. If you can avoid these seeks in the first place, you are better off.
In some scenarios - RAID 1, JBOD, or when different streams of data arrive rather slowly - the right scheduling can improve your throughput, but that requires intimate knowledge of the hardware at hand, and other processes not spoiling your fun.
At best, you can leave that as a decision to the end user (e.g. give an option to turn it off), and provide performance measures to guide him. (You might even prove me wrong ;))
That depends on the disks and their controller. Do they have TCQ/NCQ? Is it RAID?
If so that might make some sense. With one regular SATA disk w/o NCQ, it won't.
Write the simplest code first, and see whether that performs well enough with the target environment. (Different disks, operating system versions, CPUs, drivers etc may well affect the result significantly.)
If the simplest correct code isn't fast enough, then it makes sense to try to work out faster ways of performing IO. At a guess, it might make sense to parallelize the write operations if you're writing to different disks, but possibly not otherwise. That's only a complete guess though.
Purely by coincidence, I'm planning to benchmark a related situation soon. I have a blog post describing the tests I intend to perform, and will update the entry with a link to results when I've got some. It's not quite the same as what you're describing, but close enough to perhaps be of interest.
Technically, you can mmap a file and have multiple threads write to it, but the disk will probably still create a bottleneck.
If you need to maximize I/O throughput, a starting point would be to investigate the asynchronous I/O your environment supports.
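For example, POSIX AIO is one such option on Linux/Unix (Windows has overlapped I/O, newer kernels have io_uring). A rough sketch, with an illustrative file name and buffer:

    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <iostream>

    int main() {
        const int fd = open("async.out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) return 1;

        static char data[4096];
        std::memset(data, 'a', sizeof data);

        struct aiocb cb;                      // control block describing the request
        std::memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = data;
        cb.aio_nbytes = sizeof data;
        cb.aio_offset = 0;

        if (aio_write(&cb) != 0) return 1;    // queue the write; returns immediately

        // ... the thread is free to do other work here ...

        const struct aiocb* list[] = {&cb};
        aio_suspend(list, 1, nullptr);        // block until this request completes
        std::cout << "wrote " << aio_return(&cb) << " bytes\n";

        close(fd);
    }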
This is a simple question, but the answer can be really, really complicated. Let's try to narrow down the scenario with some assumptions: the OS is Windows, and you have a relatively large number of writes that are truly independent.
You can skip the multi-threading by simply issuing the writes asynchronously.
Issue them all at once - let the OS schedule the writes
It doesn't matter if the writes are to the same file or to different files. Note, this is only true if the above assumption about the writes being independent is true.
Worst case, this won't be any slower than a single plain old everyday disk on a parallel ATA controller: it will be slow.
Best case, the OS can schedule the writes very efficiently. This would be true in the case of a storage system with lots of spindles, or with a disk that supports NCQ.
The key thing to remember here is that disk I/O (in general) isn't CPU bound, so going out of your way to use multi-core won't help you; it will just make life complex.
Note, you can help things if you order the writes so they are sequential in a file (overall) or sequential on the disk by sorting them by their extent.
If you are talking about writing to one file, the answer is no. You can't parallelize writing to one file since every process or thread has to acquire a lock for the file from the OS to do writes.
Otherwise it depends on the hardware controllers and type of storage, the OS kernel, and the filesystem implementation.
