Simultaneous or sequential write operations: does it matter in terms of speed?
With a multicore processor, does it make sense to parallelize all the file write operations using multiple threads, just to get a speed boost? Of course, all those write operations are independent.
Generally, no.
As of now, the physical write to disk IS the bottleneck, by some orders of magnitude, and in most scenarios it is largely sequential. By parallelizing writes you stand a good chance of worsening performance by incurring extra seeks. Sequential reads and writes will largely outperform interleaving in most cases.
Per-disk parallelization (TCQ and NCQ) mainly works by reducing the seeks that are naturally required when different clients concurrently request data from different sections of the disk. If you can avoid these seeks in the first place, you are better off.
In some scenarios - RAID 1, JBOD, or when different streams of data arrive rather slowly - the right scheduling can improve your throughput, but that requires intimate knowledge of the hardware at hand, and other processes not spoiling your fun.
At best, you can leave that as a decision to the end user (e.g. give an option to turn it off), and provide performance measures to guide him. (You might even prove me wrong ;))
That depends on the disks and their controller. Do they have TCQ/NCQ? Is it RAID?
If so, that might make some sense. With a single regular SATA disk without NCQ, it won't.
Write the simplest code first, and see whether that performs well enough with the target environment. (Different disks, operating system versions, CPUs, drivers etc may well affect the result significantly.)
If the simplest correct code isn't fast enough, then it makes sense to try to work out faster ways of performing IO. At a guess, it might make sense to parallelize the write operations if you're writing to different disks, but possibly not otherwise. That's only a complete guess though.
Purely by coincidence, I'm planning to benchmark a related situation soon. I have a blog post describing the tests I intend to perform, and will update the entry with a link to results when I've got some. It's not quite the same as what you're describing, but close enough to perhaps be of interest.
Technically, you can mmap a file and have multiple threads write to it, but the disk will probably still create a bottleneck.
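For what it's worth, a minimal sketch of that mmap idea (Python here; the file name, size and thread count are made up, and CPython's GIL limits CPU parallelism in the copy itself, which hardly matters since the disk is the limiting factor anyway):

    import mmap
    import threading

    SIZE = 64 * 1024 * 1024          # 64 MiB, arbitrary
    NUM_THREADS = 4

    with open("output.bin", "wb") as f:
        f.truncate(SIZE)             # pre-size the file so the mapping covers it

    with open("output.bin", "r+b") as f:
        mm = mmap.mmap(f.fileno(), SIZE)

        def fill(start, end):
            # Each thread writes a disjoint slice of the mapping.
            mm[start:end] = b"\x00" * (end - start)

        chunk = SIZE // NUM_THREADS
        threads = [threading.Thread(target=fill, args=(i * chunk, (i + 1) * chunk))
                   for i in range(NUM_THREADS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        mm.flush()                   # ask the OS to write the dirty pages back
        mm.close()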
If you need to maximize I/O throughput, a starting point would be to investigate the asynchronous I/O your environment supports.
This is a simple question, but the answer can be really, really complicated. Let's try to narrow down the scenario with some assumptions: the OS is Windows, and you have a relatively large number of writes that are truly independent.
You can skip the multi-threading by simply issuing the writes asynchronously.
Issue them all at once - let the OS schedule the writes
It doesn't matter if the writes are to the same file or to different files. Note, this is only true if the above assumption about the writes being independent is true.
Worst case, this won't be any slower than a single plain old everyday disk on a parallel ATA controller: it will be slow.
Best case, the OS can schedule the writes very efficiently. This would be true in the case of a storage system with lots of spindles, or with a disk that supports NCQ.
The key thing to remember here is that disk I/O (in general) isn't CPU bound, so going out of your way to use multi-core won't help you; it will just make life complex.
Note, you can help things if you order the writes so they are sequential in a file (overall) or sequential on the disk by sorting them by their extent.
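As a rough, language-neutral illustration of that last point (the answer above assumes Windows overlapped I/O; this sketch uses Python and the POSIX-only os.pwrite, and the file name and write list are made up): collect the independent writes, sort them by offset, and issue them in that order so the device sees a mostly sequential stream.

    import os

    # Hypothetical independent writes: (offset, data) pairs.
    pending = [
        (4096 * 7, b"g" * 4096),
        (4096 * 1, b"a" * 4096),
        (4096 * 3, b"c" * 4096),
    ]

    fd = os.open("data.bin", os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Sorting by offset makes the writes sequential within the file,
        # which usually also means fewer seeks on the disk.
        for offset, data in sorted(pending, key=lambda w: w[0]):
            os.pwrite(fd, data, offset)
    finally:
        os.close(fd)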
If you are talking about writing to one file, the answer is no. You can't parallelize writing to one file since every process or thread has to acquire a lock for the file from the OS to do writes.
Otherwise, this depends on the hardware controllers and type of storage, the OS kernel and the filesystem implementation.
Related
I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in-memory database (tens of GB) - the haystack. This is divided into tasks, where each task is to find each of a series of needles in the haystack, and each task is logically independent from the other tasks. (This is already distributed across multiple machines, where each machine has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
3 possible architectures:
A process loads the haystack into Posix shared memory.
Subsequent processes use the shared memory segment instead (like a cache); a rough sketch of this option follows the list.
A process loads the haystack into memory and then forks.
Each process uses the same memory because of copy on write semantics.
A process loads the haystack into memory and spawns multiple search threads
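For concreteness, a rough sketch of option 1, assuming Python's multiprocessing.shared_memory (which sits on top of POSIX shared memory on Linux); the segment name, element type and sizes are all illustrative:

    from multiprocessing import shared_memory
    import array, bisect

    # Loader process: build the (sorted) haystack and copy it into a named segment.
    haystack = array.array("q", range(0, 2_000_000, 2))      # stand-in for the real data
    data = haystack.tobytes()
    shm = shared_memory.SharedMemory(create=True, size=len(data), name="haystack")
    shm.buf[:len(data)] = data

    # A search process attaches by name instead of loading its own copy.
    attached = shared_memory.SharedMemory(name="haystack")
    view = attached.buf.cast("q")            # zero-copy view of the shared data
    pos = bisect.bisect_left(view, 123_456)  # the equal_range-style lookup would go here
    view.release()
    attached.close()

    shm.close()
    shm.unlink()                             # remove the segment when everyone is done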
The question is: is one method likely to be better than the others, and why? Or rather, what are the trade-offs?
(For argument's sake, assume performance trumps implementation complexity.)
Implementing two or three of them and measuring is possible, of course, but it is hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent there is no benefit from inter-thread communication being cheaper than inter-process communication (as per, for example, https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that, except for a monitoring thread/process, the others are never switched out.
General advice: be able to measure improvements! Without that, you may tweak all you like based on advice off the internet and still not get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself) but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the corresponding part is efficient ensures optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do log(n) comparisons, where each comparison requires a load of a certain element from the haystack. This load probably goes through all caches, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you then manage to trigger the cache loads for the needles first and then perform the comparison, you could reduce the time where either CPU or RAM are idle because they wait for new operations to perform. This is obviously (like others) a parameter you need to tweak for the system it runs on.
Even further, reconsider binary searching. Binary search performs reliably, with a good upper bound, on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit that knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This basically moves work from the RAM bus to the CPU, so it again depends which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have fewer than a certain number of elements left to consider.
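A minimal sketch of that last idea, assuming a sorted sequence of integers (the function name and the cutoff value are arbitrary):

    import bisect

    def guided_search(haystack, needle, cutoff=64):
        # Interpolation-style guessing on roughly uniform data, falling back to a
        # plain binary search once the remaining window is small.
        lo, hi = 0, len(haystack)
        while hi - lo > cutoff:
            span = haystack[hi - 1] - haystack[lo]
            if span <= 0:
                break                                   # all remaining keys are equal
            # Educated guess: assume roughly evenly spaced keys.
            guess = lo + int((needle - haystack[lo]) * (hi - 1 - lo) / span)
            guess = max(lo, min(guess, hi - 1))
            if haystack[guess] < needle:
                lo = guess + 1                          # needle lies to the right
            else:
                hi = guess                              # needle lies at or left of guess
        return bisect.bisect_left(haystack, needle, lo, hi)

    # e.g. guided_search(sorted_keys, 123456) returns the leftmost matching index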
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.
The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes as far as I can see is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in favour.
It might also be slightly easier to add processes than add threads. You would have to write some code to change the number of processing threads online (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.
I've been trying to figure out whether multithreading in an I/O-bound application will actually improve performance or reduce it. Many sources I have read are conflicting.
Take this one for example.
Why multithreading with io-bound is bad
The accepted answer is that if your application is io-bound multithreading will cause contention and slow down your application.
Whereas in this example, the answer with the highest votes states that it can improve throughput.
Why multithreading with io-bound is good
Am I misunderstanding something here?
In my situation I need to read from n disk locations n times a second. I'm finding it difficult to decide if I should be using threads at all.
For example, if I had 20 files on disk and 20 separate threads reading from disk in a state of waiting and waking, is this going to completely slow down my system?
If a pthread is executing code that reads from disk, would all the other 19 threads doing the same thing on different files be blocked?
Why do you need multithreading for this? If you work with a single disk, multithreading will provide the same performance in the best case, or a bit slower otherwise.
In this question:
Does it make sense to spawn more than one thread per processor?
which you labeled "why multithreading with io-bound is good", the top answer has 16 upvotes and states: "If your software makes frequent use of disk or network IO". Take particular note of the last part, "network IO". This is a distinguishing factor from your first linked question, which is only about threads and disk IO.
The only way that additional threads will help in an I/O-bound situation is if your storage system can handle multiple requests in parallel. This can be the case on high-end storage arrays, but is unlikely to be the case on a consumer device like the iPad.
In terms of performance and speed of execution, is it useful to use multithreading to handle files on a hard drive (to move files from one disk to another, or to check the integrity of files)?
I think it is mainly the speed of my HDD that will determine the speed of my processing.
Multithreading can help, at least sometimes. The reason is that if you are writing to a "normal" hard drive (e.g. not a solid state drive) then the thing that is going to slow you down the most is the hard drive's seek time (that is, the time it takes for the hard drive to reposition its read/write head from one distance along the disk's radius to another). That movement is glacially slow compared to the rest of the system, and the time it takes for the head to seek is proportional to the distance it must travel. So, for example, the worst-case scenario would be if the head had to move from the edge of the disk to the center of the disk after each operation.
Of course the ideal solution is to have the disk head never seek, or seek only very rarely, and if you can arrange it so that your program only needs to read/write a single file sequentially, that will be fastest. Or better yet, switch to an SSD, where there is no disk head, and the seek time is effectively zero. :)
But sometimes you need your drive to be able to read/write multiple files in parallel, in which case the drive head will (of necessity) be seeking back and forth a lot. So how can multithreading help in this scenario? The answer is this: with a sufficiently smart disk I/O subsystem (e.g. SCSI, I'm not sure if IDE can do this), the I/O logic will maintain a queue of all currently outstanding read/write requests, and it will dynamically re-order that queue so that the requests are fulfilled in the order that minimizes the amount of travel by the read/write head. This is known as the Elevator Algorithm, because it is similar to the logic used by an elevator to maximize the number of people it can transport in a given period of time.
Of course, the OS's I/O subsystem can only implement this optimization if it knows in advance what I/O requests are pending... and if you have only one thread initiating I/O requests, then the I/O subsystem will only know about the current request. (i.e. it can't "peek" into your thread's userland request queue to see what your thread will want next). And of course your userland thread doesn't know the details of the disk layout, so it's difficult (impossible?) to implement the Elevator Algorithm in user space.
But if your program has N threads reading/writing the disk at once, then the OS's I/O subsystem will be aware of up to N I/O requests at once, and can re-order those requests as it sees fit to maximize disk performance.
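As a rough illustration of that (Python; the file list and thread count are invented), a handful of threads each issuing plain blocking reads is enough to give the kernel a queue of outstanding requests to reorder:

    from concurrent.futures import ThreadPoolExecutor

    # With several threads blocked in read(), the OS I/O scheduler (and NCQ-capable
    # hardware) sees many outstanding requests at once and can reorder them to
    # reduce head travel. Blocking reads release CPython's GIL, so the threads
    # genuinely overlap on I/O.
    files = [f"/data/chunk_{i}.bin" for i in range(20)]

    def read_file(path):
        with open(path, "rb") as f:
            return len(f.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        total = sum(pool.map(read_file, files))

    print(total, "bytes read")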
Perhaps your main concern should be code maintainability. Threading helps hugely, IMO, because it does not permit the kind of hackery that single-threading permits.
While designing a fairly simple backup system for Linux in Python, I'm finding myself asking the question: could there be any time advantage to backing up several datasets/archives simultaneously?
My intuition tells me that writing to several archives simultaneously would not buy me much time as I/O would already be the greatest bottleneck.
On the other hand, if using something like bz2, would there be an advantage with multi-threading since higher demand of CPU will decrease I/O demand? Or is it a wash since all threads would be doing essentially the same thing and therefore sharing the same bottlenecks?
It depends on your system. If you have multiple disks, it could be very worthwhile to parallelize your backup job. If you have multiple processors, compressing multiple jobs in parallel may be worth your while.
If the processor is slow enough (and the disks are fast enough) that zipping makes your CPU a bottleneck, you'll make some gains on multicore or hyperthreaded processors. The reduced I/O demand from zipped data being written is almost certainly a win if your CPU can keep up with the read speed of your drive(s).
Anyway, this is all very system dependent. Try it and see. Run two jobs at once and then run the same two in serial and see which took longer. The cheap (coding-wise) way is to just run your backup script twice with different input and output parameters. Once you've established a winner, you can go farther down the path.
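If it helps, a minimal way to run that experiment from Python itself (the dataset paths are placeholders, and bz2 is assumed as the compressor; CPython's bz2 releases the GIL while compressing, so a thread pool is enough here, though a process pool would sidestep the question entirely):

    import bz2, shutil, time
    from concurrent.futures import ThreadPoolExecutor

    def backup(job):
        # Compress one dataset into a .bz2 archive (stand-in for a real backup job).
        src, dst = job
        with open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)

    jobs = [("dataset_a.bin", "dataset_a.bz2"),
            ("dataset_b.bin", "dataset_b.bz2")]

    start = time.perf_counter()
    for job in jobs:                                         # serial
        backup(job)
    serial = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:  # parallel
        list(pool.map(backup, jobs))
    parallel = time.perf_counter() - start

    print(f"serial: {serial:.1f}s  parallel: {parallel:.1f}s")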
When performing many disk operations, does multithreading help, hinder, or make no difference?
For example, when copying many files from one folder to another.
Clarification: I understand that when other operations are performed, concurrency will obviously make a difference. If the task was to open an image file, convert it to another format, and then save it, disk operations can be performed concurrently with the image manipulation. My question is whether, when the only operations performed are disk operations, concurrently queuing and responding to disk operations is better.
Most of the answers so far have had to do with the OS scheduler. However, there is a more important factor that I think would lead to your answer. Are you writing to a single physical disk, or multiple physical disks?
Even if you parallelize with multiple threads... IO to a single physical disk is intrinsically a serialized operation. Each thread would have to block, waiting for its chance to get access to the disk. In this case, multiple threads are probably useless... and may even lead to contention problems.
However, if you are writing multiple streams to multiple physical disks, processing them concurrently should give you a boost in performance. This is particularly true with managed disks, like RAID arrays, SAN devices, etc.
I don't think the issue has as much to do with the OS scheduler as it does with the physical aspects of the disk(s) you're writing to.
That depends on your definition of "I/O bound" but generally multithreading has two effects:
Use multiple CPUs concurrently (which won't necessarily help if the bottleneck is the disk rather than the CPU[s])
Use a CPU (with another thread) even while one thread is blocked (e.g. waiting for I/O completion)
I'm not sure that Konrad's answer is always right, however: as a counter-example, if "I/O bound" just means "one thread spends most of its time waiting for I/O completion instead of using the CPU", but does not mean that "we've hit the system I/O bandwidth limit", then IMO having multiple threads (or asynchronous I/O) might improve performance (by enabling more than one concurrent I/O operation).
I would think it depends on a number of factors, like the kind of application you are running, the number of concurrent users, etc.
I am currently working on a project that has a high degree of linear (reading files from start to finish) operations. We use a NAS for storage, and were concerned about what happens if we run multiple threads. Our initial thought was that it would slow us down because it would increase head seeks. So we ran some tests and found out that the ideal number of threads is the same as the number of cores in the computer.
But your mileage may vary.
It can do, simply because whenever there is more work for a thread to do (identifying the next file to copy) the OS wakes it up, so threads are a simple way to hook into the OS scheduler and yet still write code in a traditional sequential way, instead of having to break it up into a state machine with callbacks.
This is mainly an assistance with clear programming rather than performance.
In most cases, using multiple threads for disk IO will not improve efficiency. Let's imagine two circumstances:
Lock-free file: we can split the file across the threads by giving each a different IO offset. For instance, a 1024-byte file is split into n pieces and each thread writes its 1024/n bytes at its own offset (see the sketch at the end of this answer). This will cause a lot of extra disk head movement because of the different offsets.
Lock file: actually lock the IO operation for each critical section. This will cause a lot of extra thread switches, and it turns out that only one thread can write to the file at a time.
Correct me if I'm wrong.
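For reference, a minimal sketch of the lock-free variant described above (Python on a POSIX system; os.pwrite lets each thread write at its own offset without sharing a file position, and the sizes are the toy 1024-byte example from the answer):

    import os
    import threading

    PIECES = 4
    PIECE = 1024 // PIECES                   # each thread owns 256 bytes
    fd = os.open("split.bin", os.O_WRONLY | os.O_CREAT, 0o644)
    os.ftruncate(fd, 1024)                   # pre-size the file so every offset is valid

    def write_piece(i):
        # No locking needed: each thread writes a disjoint range at its own offset,
        # but those scattered offsets are exactly what makes the disk head jump around.
        os.pwrite(fd, bytes([i]) * PIECE, i * PIECE)

    threads = [threading.Thread(target=write_piece, args=(i,)) for i in range(PIECES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.close(fd)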
No, it makes no sense. At some point, the operations have to be serialized (by the OS). On the other hand, since modern OS's have to cope with multiple processes anyway I doubt that there's an added overhead.
I'd think it would hinder the operations... You only have one controller and one drive.
You could use a second thread to do the operation, and a main thread that shows an updated UI.
I think it could worsen the performance, because the multiple threads will compete for the same resources.
You can test the impact of doing concurrent IO operations on the same device by copying a set of files from one place to another and measuring the time, then splitting the set in two parts and making the copies in parallel... the second option will be noticeably slower.
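If you want to reproduce that measurement, a minimal sketch (Python; the source and destination paths are placeholders, and remember that the OS page cache favours whichever run goes second, so use distinct file sets or drop caches between runs):

    import shutil, time
    from concurrent.futures import ThreadPoolExecutor

    files = [f"src/file_{i:03d}.dat" for i in range(100)]   # hypothetical file set

    def copy_all(batch):
        for name in batch:
            shutil.copy(name, "dst/")

    start = time.perf_counter()
    copy_all(files)                                         # one sequential pass
    serial = time.perf_counter() - start

    half = len(files) // 2
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:         # the two halves in parallel
        list(pool.map(copy_all, [files[:half], files[half:]]))
    parallel = time.perf_counter() - start

    print(f"serial: {serial:.1f}s  parallel: {parallel:.1f}s")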