Does threading a lot lead to thrashing? - multithreading

Does threading a lot lead to thrashing if each new thread wants to access memory (specifically, the same database in my case) and perform read/write operations throughout its lifetime?
I assume that this is true. If my assumption is true, what is the best way to maximize CPU utilization? And how can I determine that a specific number of threads will give good CPU utilization?
If my assumption is wrong, please do give proper illustrations to let me understand the scenario clearly.

Trashy code causes thrashing, not threads. All code is run by some thread, even main(). Temporary objects are garbage collected the same way on any thread.
The subtle part is when each thread preloads its own objects to perform the work, which can duplicate many instances of the same classes. It's usually a small sacrifice to make to get the power of concurrency. But it's not trash (no leak, no deterioration).
There is one exception: when some 3rd party code caches material in thread locals... You could end up caching the same stuff on each thread. Not really a leak, but not efficient.
Rule of thumb for number of threads? Depends on the task.
If the tasks are pure computation like math, then you should not exceed the number of non-hyperthreaded cores.
If the job is memory intensive along with pure computation work (most cases), then the number of hyperthreaded cores is your target (because the CPU will use the idle time of one thread's memory access for another logical core's computations).
If the job is mostly large sequential disk I/O, then your number of threads should be not too much above the number of disk spindles available to read. This is VERY approximate, since disk caches, DMA, SSDs, RAID and the like completely change how well the disk layer can service your threads without idling. The same applies to random access. However, virtualization these days will throw all your estimates out the window. Disk I/O could be much more available than you think, but also much worse.
If the jobs are mostly network I/O waits, then it is not really limited from your side; I would go with about 3x the number of cores to start. This multiplier simply presumes that such a thread waits on the network for 2/3 of its time, which is very low in practice. It could spend 99% of its time waiting for network I/O (100x). That is why you see NIO sockets everywhere: they deal with many connections using fewer, busier threads.
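The 3x and 100x multipliers above follow from one piece of arithmetic: if a thread spends a fraction w of its time waiting, roughly 1 / (1 - w) threads are needed per core to keep that core busy. Here is a minimal sketch of that estimate. The function name `suggested_threads` is made up for illustration, and the result is only a starting point to measure from, not a guarantee.

```python
import os


def suggested_threads(cores: int, wait_fraction: float) -> int:
    """Rough thread count that keeps `cores` busy when each thread
    spends `wait_fraction` of its time blocked (0 <= wait_fraction < 1)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= wait_fraction < 1.0:
        raise ValueError("wait_fraction must be in [0, 1)")
    # One core can serve 1 / (1 - w) threads that each compute (1 - w) of the time.
    return max(1, round(cores / (1.0 - wait_fraction)))


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # logical cores as reported by the OS
    for wait in (0.0, 2 / 3, 0.99):
        print(f"wait fraction {wait:.2f}: about {suggested_threads(cores, wait)} threads")
```

With a wait fraction of 2/3 this gives 3 threads per core, and with 0.99 it gives 100 per core, matching the figures in the answer.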

No. You could have hundreds of idle threads waiting for work and not see any thrashing. Thrashing is caused by the application's working set exceeding available memory, so active pages have to be reloaded from disk (and written out to disk first when temporary variable storage must be saved to be reloaded later).
Threads share an address space, and having many active threads leads to diminishing returns due to lock contention. In the DB case, many processes reading tables can proceed simultaneously, yet updates of dependent data need to be serialised to keep the data consistent, which may cause lock contention and limit parallel processing.
Poorly written queries that need to load and sort large tables in memory may cause thrashing when they exceed free RAM (perhaps because of a poor choice of indexes). You can increase query throughput, and so utilise the CPUs more, by having large RAM disk caches and using SSDs to reduce random data access times.
On memory-intensive computations, cache sizes may become important: fewer threads whose data stays in cache, with CPU prefetching minimising stalls, work better than many threads competing to load their data from main memory.

Related

Multithreading does not help for IO intensive task?

I need to copy a set of files with the size of each file ranging from 1MB to 700MB. After I copy each file, I need to validate the checksum of each file against an entry in md5sum.txt.
I wanted to optimize this task and hence evaluated the performance by splitting the load among multiple threads. The results were not as expected. I was expecting that the time taken for copy and validation would decrease with increase in the number of threads, but the time taken actually increased.
I have modified the ThreadPool source code shared in this link https://stackoverflow.com/a/22285532/1568395 to implement the threadpool.
The source code for the application can be found here
https://github.com/saai63/ThreadPool
The results for various numbers of threads are as shown below:
As per my reading, the probable reason could be that all tasks are now IO bound tasks and hence all of the threads will be blocked on IO operation and hence cannot run in parallel as the shared resource here is the HDD. I also understand that HDD controller tries to optimize the disk access by reducing the seek time. Disks love sequential access patterns, and any concurrent accesses will disrupt this pattern and hence the delay for large files.
Is this the only reason for the delay, or are there other factors? Why does the time increase with the number of threads?
IO is always much slower than the CPU. When multiple threads try to read from an IO device, what they usually achieve is a "bull rush" to the device and increase the "randomness" of the IO operations, thus making it all slower. Fewer threads have a greater chance of sequential operations, which are notoriously faster.
With multithreading, you share the CPU among threads. The CPU is switched among threads whenever the running thread goes into some sort of waiting state.
Here you have an IO-bound task, and there's no point in making your program multithreaded, as all of the threads will be relying on a single IO device.
Even if you implement a multiprocess solution (multiple processes on same node), all processes will be waiting for the same IO device and won't give any performance optimization.
One solution would be building some sort of multi node solution with shared disk having simultaneous multi-client access support.
Using this kind of approach you can divide your task amongst multiple nodes, access same disk and perform operation.
Edit:
I think the increase in time is because of the time taken by the operating system to service multiple threads.
Switching the CPU and IO devices among threads takes longer as you increase the number of threads. A context switch is itself a compute-intensive task, and you also lose IO/CPU cache performance as you switch among threads.
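To see how the worker count affects a copy-and-verify job on your own disk, something like the following could be timed with 1, 2, 4, ... workers. This is a sketch in Python rather than the C++ ThreadPool from the question, and the names `md5_of`, `copy_and_verify` and `timed_run` are made up for illustration. It assumes the expected MD5 values have already been parsed from md5sum.txt into (path, digest) pairs.

```python
import hashlib
import shutil
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files are never held in memory whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src: Path, dst_dir: Path, expected_md5: str) -> bool:
    """Copy one file into dst_dir and check the copy against the expected digest."""
    dst = dst_dir / src.name
    shutil.copyfile(src, dst)
    return md5_of(dst) == expected_md5


def timed_run(jobs, dst_dir: Path, workers: int):
    """jobs is a list of (source path, expected md5). Returns (seconds, all_ok)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda job: copy_and_verify(job[0], dst_dir, job[1]), jobs)
        )
    return time.perf_counter() - start, all(results)
```

Keep in mind that a second run over the same files may be served from the OS page cache, so timings are only comparable if each run starts from the same cache state.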

How is processor speed distributed across threads?

Objective:
I am trying to estimate how fast my code will execute when run concurrently in multiple threads.
Question 1)
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run across multiple threads?
Question 2)
What impact, if any, does the presence of other threads have on the execution speed of each thread?
My Situation:
I traverse a graph in memory of worst-case size 1 million nodes. It's simply accessing 1 million memory addresses, one at a time. It takes half a second on 1 thread, and I was worried how this will scale with multiple users performing the same query. Every user request is handled by a separate thread, so 100 simultaneous users will require 100 simultaneous threads. Each thread is sharing the same resource, but read only. No writing. Is there any chance I could get each user to see roughly the same execution time?
Note: I know it will depend upon a number of factors but surely there must be some way of identifying whether or not your code will scale if you find it takes x amount of time for a single thread given x hardware. As final note I'd like to add I have limited experience with computer hardware architecture and how multi-threading works under the hood.
These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.
Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.
The way that most modern operating systems, including Windows, Unix/Linux, and MacOS, use the processor is by scheduling processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads that are waiting for processor time (this is a bit simplistic, as they all have some notions of process prioritization, so that high-criticality processes get pushed up the queue earlier than less important ones).
When a thread is using a processor core, it gets it all for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the thread uses up its time slice, or requests some resource that isn't immediately available, its turn at the processor core is ended, and the next scheduled task will begin. This tends to make pretty optimal use of the processor resources.
So what are the factors that determine how well a process will scale up?
What portion of its run time does a single process spend waiting for I/O and user input?
Do multiple threads hit the same resources, or different ones?
How much communication has to happen between threads? Between individual threads and your process's main thread? This takes synchronization, and introduces waiting.
How "tight" are the hotspots of the active thread? Can the body of it fit into the processor's cache, or does the (much slower) main memory have to be accessed over the bus?
As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
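One common way to put a number on "how linearly will it scale" is Amdahl's law, a standard model that the answer above does not name: if a fraction p of the work can run in parallel and the rest is serialized (waiting on shared resources, synchronization), the best speedup on n workers is 1 / ((1 - p) + p / n). A minimal sketch follows, with `amdahl_speedup` as an illustrative name. It gives an upper bound only, because it ignores cache and memory-bus contention.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the work
    scales perfectly across `workers` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for p in (0.5, 0.9, 1.0):
        print(f"p={p}: 4 workers give at most {amdahl_speedup(p, 4):.2f}x")
```

Fully independent threads (p = 1) scale linearly, while a job that is only half parallel tops out at 1.6x on four cores, which is why finding where the waiting occurs matters more than adding threads.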
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run across multiple threads?
No, you should determine it empirically.
What impact, if any, does the presence of other threads have on the execution speed of each thread?
Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features which can increase the clock of a lone busy CPU core to compensate for all the idle cores. This sort of feature might confound your measurements and expectations about scaling.
Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.
I know it will depend upon a number of factors
Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.
but surely there must be some way of identifying whether or not your code will scale
Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:
disk I/O
network I/O
memory I/O
computation
My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user, I can't assume it will always take 2 seconds if, say, 100 users are simultaneously running the same operation, correct?
If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.
If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.
EDIT given specifics:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time.
Your problem sounds memory+cache-bound. You should study the details of your target CPU/memory deployment or, if you are designing it, opt for high memory throughput.
A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
Cache contention may be a factor in scaling as well.
Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
limited experience with computer hardware architecture and how multi-threading works under the hood.
Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).

Dual-Core Hyperthreading: Should I use 4 threads or 3 or 2?

If you're spawning multiple threads (or processes) concurrently, is it better to spawn as many as the number of physical processors or the number of logical processors, assuming the task is CPU-bound? Or is it better to do something in between (say, 3 threads)?
Does the performance depend on the kind of instructions that are getting executed (say, would non-local memory access be much different from cache hits)? If so, in which cases is it better to take advantage of hyperthreading?
Update:
The reason I'm asking is, I remember reading somewhere that if you have as many tasks as the number of virtual processors, tasks on the same physical core can sometimes starve some CPU resources and prevent each other from getting as many resources as needed, possibly decreasing performance. That's why I'm wondering if having as many threads as virtual cores is a good idea.
The performance depends on a huge variety of factors. Most tasks are not strictly CPU bound, since even if all of the data is in memory it is usually not on-board in the processor cache. I have seen examples (like this one) where memory access patterns can dramatically change the performance profile of a given 'parallel' process.
In short, there is no perfect number for all situations.
Chances are pretty good that you will see a performance improvement running 2 threads per core with HyperThreading enabled. Jobs that appear to be entirely CPU bound usually aren't, and HyperThreading can extract a few "extra" cycles out of the occasional interrupt or context switch.
On the other hand, with a core iX processor that has Turbo Boost, you might actually do better running 1 thread per core to encourage the CPU to overclock itself.
At work, we routinely run many-core servers at full CPU doing various kinds of calculation for days at a time. A while back we measured the performance difference with and without HT. We found that on average, with HyperThreading, and running twice as many jobs at once, we could complete the same number of jobs about 10% faster than without HyperThreading.
Assume that 2 × cores is a good place to start, but the bottom line is: measure!
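Since the advice is to measure, it helps to decide up front which thread counts to try. Here is a small sketch that lists every count from one thread per physical core up to one per logical core. The function name is illustrative. Note that Python's `os.cpu_count()` reports logical cores only, so the physical count has to come from elsewhere (for example the third-party psutil package, via `psutil.cpu_count(logical=False)`, or the machine's spec sheet).

```python
import os


def candidate_thread_counts(physical_cores: int, logical_cores: int) -> list[int]:
    """Thread counts worth benchmarking for a CPU-bound job:
    from one per physical core up to one per logical core."""
    if physical_cores < 1 or logical_cores < physical_cores:
        raise ValueError("expected 1 <= physical_cores <= logical_cores")
    return list(range(physical_cores, logical_cores + 1))


if __name__ == "__main__":
    logical = os.cpu_count() or 1
    physical = max(1, logical // 2)  # assumption: 2 hardware threads per core
    print(candidate_thread_counts(physical, logical))
```

For the dual-core hyperthreaded machine in the question, this yields 2, 3 and 4 threads, each of which would then be timed on the real workload.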
I remember reading that hyperthreading can give you up to a 30% performance boost. In general, you'd do better to treat them as 4 different cores. Of course, in some specific circumstances (e.g. having the same long-running task bound to each core), you can divide your processing better by taking into account that some cores are just logical ones.
more info about hyperthreading itself here
Using Hyperthreading to run two threads on the same core, when both threads have similar memory access patterns but access disjoint data structures, would be very roughly equivalent to running them on two separate cores each with half the cache. If the memory-access patterns are such that half the cache would be sufficient to prevent thrashing, performance may be good. If the memory-access patterns are such that halving the cache induces thrashing, there may be a ten-fold performance hit (implying one would have been much better off without hyperthreading).
On the other hand, there are some situations where hyperthreading may be a huge win. If many threads will all be reading and writing the same shared data using lock-free data structures, and all threads must see a consistent view of the data, trying to run the threads on disjoint processors may cause thrashing, since only one processor at a time may have read-write access to any given cache line; running such threads on two cores may take longer than running only one at a time. Such cache arbitration is not required, however, when a piece of data is accessed by multiple threads on a single core. In those cases, hyperthreading can be a huge win.
Unfortunately, I don't know any way to give the scheduler any "hints" to suggest that some threads should share a core when possible, while others should run separately when possible.
HT allows a boost of approximately 10-30% for mostly CPU-bound tasks that use the extra virtual cores. Although these tasks may seem CPU-bound, unless they are custom-made assembly they will usually suffer from IO waits between RAM and local cache. This allows one thread running on a physical HT-enabled core to work while the other thread is waiting for IO. This does come with a disadvantage, though: the two threads share the same cache/bus, which leaves fewer resources for each and may cause both threads to pause while waiting for IO.
In that last case, running a single thread per core gives up some of the maximum theoretical simultaneous processing power (the 10-30%) in exchange for avoiding the slowdown of cache thrashing, which may be very significant in some applications.
Choosing which cores to use is just as important as choosing how many threads to run. If each thread is CPU-bound for roughly the same duration, it is best to set the affinity such that threads using mostly different resources find themselves on different physical cores, and threads using common resources are grouped on the same physical core (different virtual cores), so that common resources can be used from the same cache without extra IO wait.
Since each program has different CPU-usage characteristics, and cache thrashing may or may not be a major slowdown (it usually is), it is impossible to determine what the ideal number of threads should be without profiling first. One last thing to note is that the OS/kernel will also require some CPU and cache space. It is usually ideal to keep a single (physical) core set aside for the OS if real-time latency is required on CPU-bound threads, so as to avoid sharing cache/CPU resources. If threads are often waiting for IO and cache thrashing is not an issue, or if you are running a real-time OS specifically designed for the application, you can skip this last step.
http://en.wikipedia.org/wiki/Thrashing_(computer_science)
http://en.wikipedia.org/wiki/Processor_affinity
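As an illustration of setting affinity, here is a minimal Python sketch using `os.sched_setaffinity`, which is only available on some platforms (notably Linux), so the helper returns None elsewhere. The helper name `pin_current` is made up. Which logical CPU numbers share a physical core is machine-specific and has to be looked up separately, so the CPU set used here is just a placeholder.

```python
import os


def pin_current(cpus):
    """Restrict the caller to the given logical CPUs.

    Returns the resulting affinity set, or None where the platform
    does not expose sched_setaffinity (e.g. Windows, macOS).
    On Linux, pid 0 means the calling thread.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Placeholder choice: logical CPU 0 only. Pick real CPU numbers
    # from your machine's topology before relying on this.
    print(pin_current({0}))
```

Two threads that share data would be pinned to sibling logical CPUs of one physical core, and threads with disjoint working sets to different physical cores, as the answer above suggests.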
All of the other answers already give lots of excellent info. But one more point to consider is that the SIMD unit is shared between the logical cores of the same physical core. So, if you are running threads with SSE code, do you run them on all 4 logical cores, or just spawn 2 threads (assuming you have two physical cores)? For this odd case, it is best to profile with your app.

Delphi 2010: Advantage of running multi threads if cannot allocate memory to create object for calculation in each thread

My Previous Question
From the above answer, it seems that if my threads create objects, I will face a memory allocation/deallocation bottleneck, so the threaded version may run slower than, or take about the same time as, the unthreaded one. What are the advantages of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in my threads?
What are the advantages of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in my threads?
It depends on where your bottlenecks are. If your bottleneck is the amount of memory available, then creating more threads won't help. Or, if I/O is a bottleneck, trying to parallelize will just slightly slow down everything because of context switching. It's like trying to make an underpowered car faster by putting wider tyres in it: fixing the wrong thing doesn't help.
Threads are useful when the bottleneck is the processor and there are several processors available.
Well, if you allocate chunks of memory in a loop, things will slow down.
If you can create your objects once at the beginning of TThread.execute, the overhead will be smaller.
Threads can also be beneficial if you have to wait for IO operations, or if you have expensive calculations to do on a machine with more than one physical core.
If you have memory-intensive threads (many memory allocations/deallocations), you had better use TopMM instead of FastMM:
http://www.topsoftwaresite.nl/
FastMM uses a lock which blocks all other threads; TopMM does not, so it scales much better on multiple cores/CPUs!
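The "create your objects once at the beginning of TThread.execute" advice can be sketched as follows. This is Python rather than Delphi, and the names are made up for illustration. Python's allocator behaves differently from FastMM, so only the shape carries over: one scratch buffer per worker, allocated before the loop and reused for every chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum_chunks(chunks, buffer_size: int = 64 * 1024) -> int:
    """Sum the bytes of every chunk, staging each one in a scratch
    buffer that is allocated once, before the loop, and then reused."""
    scratch = bytearray(buffer_size)  # the single up-front allocation
    view = memoryview(scratch)        # slicing a view does not copy the data
    total = 0
    for chunk in chunks:
        n = len(chunk)
        if n > buffer_size:
            raise ValueError("chunk larger than scratch buffer")
        view[:n] = chunk              # overwrite in place, no new buffer
        total += sum(view[:n])
    return total


if __name__ == "__main__":
    work = [[b"\x01\x02", b"\x03"], [b"\x04" * 10]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each call builds its own scratch buffer, so nothing is shared.
        print(list(pool.map(checksum_chunks, work)))
```

Because every worker owns its buffer, the threads never contend on the memory manager inside the loop.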
When it comes to multithreading, shared-resource issues will always arise (with current technology). All resources that may need serialization (RAM, disk, etc.) are possible bottlenecks. Multithreading is not a magic solution that turns a slow app into a fast one, and it does not always result in better speed. Done the wrong way, it can actually result in worse speed. The application should be analyzed to find possible bottlenecks, and some parts may need to be rewritten to minimize them using different techniques (e.g. preallocating memory, using async I/O, etc.). Anyway, performance is only one of the reasons to use more than one thread. There are several other reasons, for example letting the user interact with the application while background threads perform operations (e.g. printing, checking data, etc.) without "locking" the user. That way the application can seem "faster" (the user can keep on using it without waiting) even if it is actually slower (it takes more time to finish the operations than if they were done serially).

Does multithreading make sense for IO-bound operations?

When performing many disk operations, does multithreading help, hinder, or make no difference?
For example, when copying many files from one folder to another.
Clarification: I understand that when other operations are performed, concurrency will obviously make a difference. If the task was to open an image file, convert to another format, and then save, disk operations can be performed concurrently with the image manipulation. My question is when the only operations performed are disk operations, whether concurrently queuing and responding to disk operations is better.
Most of the answers so far have had to do with the OS scheduler. However, there is a more important factor that I think would lead to your answer. Are you writing to a single physical disk, or multiple physical disks?
Even if you parallelize with multiple threads...IO to a single physical disk is intrinsically a serialized operation. Each thread would have to block, waiting for its chance to get access to the disk. In this case, multiple threads are probably useless...and may even lead to contention problems.
However, if you are writing multiple streams to multiple physical disks, processing them concurrently should give you a boost in performance. This is particularly true with managed disks, like RAID arrays, SAN devices, etc.
I don't think the issue has as much to do with the OS scheduler as it has to do with the physical aspects of the disk(s) you're writing to.
That depends on your definition of "I/O bound" but generally multithreading has two effects:
Use multiple CPUs concurrently (which won't necessarily help if the bottleneck is the disk rather than the CPU[s])
Use a CPU (with another thread) even while one thread is blocked (e.g. waiting for I/O completion)
I'm not sure that Konrad's answer is always right, however: as a counter-example, if "I/O bound" just means "one thread spends most of its time waiting for I/O completion instead of using the CPU", but does not mean that "we've hit the system I/O bandwidth limit", then IMO having multiple threads (or asynchronous I/O) might improve performance (by enabling more than one concurrent I/O operation).
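The counter-example can be tried out with simulated I/O. In this sketch `time.sleep` stands in for a blocking I/O call that is nowhere near the device's bandwidth limit. That is an assumption for illustration, not a model of a real disk, and `fake_io` and `elapsed` are made-up names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay: float) -> float:
    """Stand-in for a blocking I/O call: the thread just waits."""
    time.sleep(delay)
    return delay


def elapsed(delays, workers: int) -> float:
    """Wall-clock seconds to complete all the waits with `workers` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_io, delays))
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.05] * 4
    print(f"1 worker:  {elapsed(delays, 1):.2f} s")
    print(f"4 workers: {elapsed(delays, 4):.2f} s")
```

The waits overlap with four workers, so the total time drops. Once the real device is saturated, the extra threads stop helping, which is the situation the other answers describe.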
I would think it depends on a number of factors, like the kind of application you are running, the number of concurrent users, etc.
I am currently working on a project that has a high degree of linear (reading files from start to finish) operations. We use a NAS for storage, and were concerned about what happens if we run multiple threads. Our initial thought was that it would slow us down because it would increase head seeks. So we ran some tests and found out that the ideal number of threads is the same as the number of cores in the computer.
But your mileage may vary.
It can do, simply because whenever there is more work for a thread to do (identifying the next file to copy) the OS wakes it up, so threads are a simple way to hook into the OS scheduler and yet still write code in a traditional sequential way, instead of having to break it up into a state machine with callbacks.
This is mainly an assistance with clear programming rather than performance.
In most cases, using multi-thread for disk IO will not benefit efficiency. Let's imagine 2 circumstances:
Lock-Free File: We can split the file among the threads by giving each a different IO offset. For instance, a 1024-byte file is split into n pieces and each thread writes its own 1024/n bytes. This will cause a lot of extra disk head movement because of the different offsets.
Lock File: Actually lock the IO operation for each critical section. This will cause a lot of extra thread switches, and it turns out that only one thread can write to the file at a time.
Correct me if I'm wrong.
No, it makes no sense. At some point, the operations have to be serialized (by the OS). On the other hand, since modern OS's have to cope with multiple processes anyway I doubt that there's an added overhead.
I'd think it would hinder the operations... You only have one controller and one drive.
You could use a second thread to do the operation, and a main thread that shows an updated UI.
I think it could worsen the performance, because the multiple threads will compete for the same resources.
You can test the impact of doing concurrent IO operations on the same device by copying a set of files from one place to another and measuring the time, then splitting the set into two parts and making the copies in parallel... the second option will be noticeably slower.
