How many threads should I spawn for maximum performance? [closed]

I am writing a Rust script that needs to brute-force the solution to some calculation, and the inner computation is likely to run 2^80 times. That is a lot! I am trying to make it run as fast as possible and thus want to divide the work across multiple threads. However, if I understand correctly, this only accelerates my script if the threads actually run on different cores; otherwise they will not truly run simultaneously but will switch between one another while running.
How can I make sure they use different cores, and how can I know that no more cores are available?

TL;DR: Use std::thread::available_parallelism (or alternatively the num_cpus crate) to determine how many threads to run, and let your OS handle the rest.
Typically, when you create a thread, the OS thread scheduler is given free rein to decide where and when that thread executes, and it will do so in a way that best takes advantage of CPU resources. If you use fewer threads than the system has available, you are potentially missing out on performance. Using more is not particularly a problem, since the scheduler will do its best to balance the threads that have work to do, but the extra threads are a small waste of memory, OS resources, and context switches. Creating as many threads as there are logical CPU cores on your system is the sweet spot, and the function above will give you that number.
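For instance, here is a minimal sketch (the search-space size and the dummy predicate are made up for illustration) that queries the available parallelism and splits a brute-force range into one chunk per thread:

```rust
use std::thread;

fn main() {
    // Ask how many threads the OS can usefully run in parallel;
    // fall back to 1 if the value cannot be determined.
    let n_threads = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    // Hypothetical search space, split into one contiguous chunk per thread.
    const TOTAL: u64 = 1_000_000;
    let chunk = TOTAL / n_threads as u64;

    let handles: Vec<_> = (0..n_threads as u64)
        .map(|i| {
            let start = i * chunk;
            let end = if i == n_threads as u64 - 1 { TOTAL } else { start + chunk };
            thread::spawn(move || {
                // Stand-in for the real brute-force check.
                (start..end).filter(|x| x % 1_000_003 == 0).count()
            })
        })
        .collect();

    let hits: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{hits} candidates matched across {n_threads} threads");
}
```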
You could tell the OS exactly which cores should run which threads by setting their affinity, but that isn't really advisable: it won't make anything faster unless you start tuning your kernel or deliberately exploiting your NUMA topology.
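If you do want to experiment with pinning anyway, the standard library has no affinity API; a rough sketch using the third-party core_affinity crate (an assumption – check the crate's docs for the exact API) could look like this:

```rust
// Sketch only: `core_affinity` is a third-party crate, not part of std.
use std::thread;

fn main() {
    // Ask the OS which cores we may pin to.
    let core_ids = core_affinity::get_core_ids().expect("could not query core IDs");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|id| {
            thread::spawn(move || {
                // Pin this thread to one specific core.
                core_affinity::set_for_current(id);
                // ... the actual work goes here ...
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```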

Related

Can I use every thread my system has to offer at once? [closed]

I am writing a C++ program in which multithreading will be a major part of speeding up some tasks. However, since my processor exposes 8 hardware threads, I was wondering if I can actually use all 8 simultaneously? Would it actually speed things up if I used only 7, so that my OS can use the 8th thread without fighting for processing power and slowing things down?
For the moment my code is running on Ubuntu, if that makes any difference. The tasks involve my threads taking in a handful of data, doing a ton of number crunching, then saving the results directly to disk from the thread. I've tested a similar thread-to-disk approach before in a different project of mine, and my SSD can handle this no problem with the threads going, although my HDD bottlenecks (quite honestly, I kind of expected that).
So mainly I'm just wondering whether it is good practice to reserve a thread or two so my OS can run unimpeded, or whether there's no functional difference and I should just ramp my code all the way up to the maximum number of threads my system has.
The term "thread" in programming is different from the term "thread" as used in CPU design. When you run a multi-threaded program, you're effectively telling the operating system "here are the operations that I want to do in parallel". The operating system then schedules the threads by deciding which CPU core should run them (and if you have threaded CPUs, which of the core's threads), but there isn't a 1-to-1 correspondence; for example, if the operating system needs one CPU core for its own purposes, then it'll just use it itself, and put the threads of your program onto the other cores; even if there are more threads than cores, the operating system will simply put multiple threads on a single core and have the CPU alternate between them. (Operating systems will typically also move your threads from one core to another over the course of execution, in order to help balance how much the various cores are being used.)
This means that optimising a multi-threaded program for performance isn't as simple as creating a fixed number of software threads based on the number of CPU cores you have – sometimes, a lower number is good due to overhead in the threading mechanism, and sometimes, it's optimal to have a number of threads that's substantially greater than the number of CPU cores on the system (because this gives the OS more flexibility as to how to schedule them). As such, people generally recommend benchmarking various numbers of threads to see which is fastest for any given program.
If you're concerned about the threads of your program overwhelming the other tasks that the computer is trying to perform, there's an alternative approach to trying to fix that problem – on most operating systems, you can set a thread's priority. In cases where there are more threads trying to run than the CPU can handle simultaneously, lower-priority threads will be given less CPU time in order to reserve more CPU time for higher-priority threads. Lowering the priority of all your program's threads can thus be useful if it's causing the system to be less responsive when run.
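Thread priority is set through platform-specific APIs. As a rough illustration (in Rust, to match the other sketches on this page, and assuming the third-party thread_priority crate – verify its exact API against the crate's documentation), lowering the current thread's priority might look like this:

```rust
// Sketch only: `thread_priority` is a third-party crate, not a standard API.
use thread_priority::{set_current_thread_priority, ThreadPriority};

fn main() {
    // Lower this thread's priority so interactive tasks stay responsive.
    if set_current_thread_priority(ThreadPriority::Min).is_err() {
        eprintln!("could not lower thread priority; continuing at default");
    }
    // ... heavy number crunching continues here at reduced priority ...
}
```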
My recommendation is to start by running the program with a number of threads equal to the maximum number your CPU can run simultaneously (8 in your case), but it's probably also worth benchmarking that number minus 1 (7) and 1.5 times that number (12) to see if either of those works better. (I'd also look at the computer's CPU utilisation when running the benchmarks – if your program is I/O-bound rather than CPU-bound, then the CPU won't be able to run at full capacity no matter how many threads you have, and in that case you might be able to run the CPU part of your program at full speed even with fewer threads than cores.)
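A minimal benchmarking harness along those lines (sketched in Rust; the workload is a made-up stand-in for the real number crunching) might be:

```rust
use std::thread;
use std::time::Instant;

// Toy CPU-bound workload standing in for the real computation.
fn crunch(iterations: u64) -> u64 {
    (0..iterations).fold(0u64, |acc, x| acc.wrapping_mul(31).wrapping_add(x))
}

fn main() {
    const WORK: u64 = 50_000_000;
    // Candidate thread counts: cores, cores - 1, and 1.5 * cores.
    for n in [8u64, 7, 12] {
        let start = Instant::now();
        let handles: Vec<_> = (0..n)
            // Fixed total work, split across n threads (remainder ignored
            // for simplicity), so wall times are comparable across runs.
            .map(|_| thread::spawn(move || crunch(WORK / n)))
            .collect();
        for h in handles {
            let _ = h.join().unwrap();
        }
        println!("{n:2} threads: {:?}", start.elapsed());
    }
}
```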
If the computer becomes insufficiently responsive as a consequence, try lowering the program's priority first (technically: the priority of all the program's threads), and if that doesn't work, only then should you try lowering the number of threads (or banning them from using a specific CPU) in order to prevent them causing performance issues.

Understanding node.js [closed]

I have started reading node.js. I have a few questions:
Is node better than multi-threading just because it saves us from caring about deadlocks and reduces thread-creation overhead, or are there other factors too? Node does use threads internally, so we can't say that it saves thread-creation overhead – only that the threads are managed internally.
Why do we say that node is not good for multi-core processors? It creates threads internally, so it must be getting the benefits of multiple cores. And why do we say it is not good for CPU-intensive applications? We can always fork new processes for CPU-intensive tasks.
Are only functions with callbacks dispatched as threads, or are there other cases too?
Non-blocking I/O can be achieved using threads too: a main thread can always be ready to receive new requests. So what is the benefit?
Correct.
Node.js does scale across cores, through child processes, the cluster module, and other mechanisms.
Callbacks are just a common convention developers use to implement asynchronous methods. There is no technical reason why you have to include them. You could, for example, have all your async methods use promises instead.
Everything node does could be accomplished with threads, but there is less code/overhead involved with node.js's asynchronous I/O than there is with multi-threaded code. You do not, for example, need to create an instance of Thread or Runnable every time, as you would in Java.

Issues with using threading and multiprocessing python libraries? [closed]

How bad is it to create multiple processes and have those processes create threads? My task is both I/O- and CPU-bound.
It really depends on the specifics of your workload. For parallelizing CPU-bound work in Python, you should absolutely use the multiprocessing module, generally with as many processes as you have CPU cores. Any more than that and you end up hurting performance, because your OS has to do more context switching to give CPU time to each process.
Things are complicated somewhat by the addition of I/O-bound work. Generally, it's OK to handle I/O-bound work with threading in Python, because the GIL will be released while blocking I/O calls occur. However, it's important to remember that everything else that goes on in that thread will require the GIL – once the I/O operation completes, bubbling it back up into Python from the C code that ran it, passing that data somewhere to be processed, looping back around to make the blocking I/O call again, and so on. All of that requires the GIL. So there is a GIL-related performance cost to using threads, even for I/O-bound operations. If your I/O-bound threads reading from a socket are frequently receiving data, they'll end up needing to acquire the GIL quite a bit, which will probably have a noticeable impact on performance. If an I/O-bound thread spends most of its time blocking, it will spend most of its time without the GIL, and probably won't have a noticeable performance impact.
So, TL;DR: it might be fine to do what you're describing, or it might not. It's extremely dependent on the specifics of your workload. Really, your best option is to try it out and see how performance looks, then tweak the number of processes/threads you're running and compare.

Best strategy to execute tasks with high branch divergence [closed]

I have a project, written a few years ago, that computes N similar tasks in a row on a single CPU core.
These N tasks are completely independent so they could be computed in parallel.
However, the problem with these tasks is that the control flow inside each task differs greatly from one task to another, so the SIMT approach implemented in CUDA is more likely to impede than to help.
I came up with the idea of launching N blocks with 1 thread each, to break the warp dependency between threads.
Can anyone suggest a better way to optimise the computations in this situation, or point out possible pitfalls in my solution?
You are right about what causes, and what is caused by, divergence of threads within a warp. However, the launch configuration you mention (1 thread per block) throws away most of the GPU's potential. A warp (or half-warp) is the maximal unit of threads that is actually executed in parallel on a single multiprocessor, so having one thread per block across 32 blocks amounts to having 32 threads in a warp, each with a different path. The first case is actually worse, because the number of resident blocks per multiprocessor is quite limited (8 or 16, depending on compute capability).
Therefore, if you want to fully exploit the GPU's potential, keep Jack's comment in mind and try to reorganize the work so that all threads of a single warp follow the same execution path.

Two threads (x milliseconds per action) on one core == 2x time? [closed]

To explain the question: I have two threads, and each of them performs the same action, which takes x milliseconds. If my computer has one core, will it take about 2x milliseconds to perform both actions?
If the action is CPU-bound, meaning it basically consists only of computation, then yes: the total wall time will be a bit more than twice the time taken by one thread, due to context-switching overhead.
If the action involves some non-negligible I/O-related operations (reads from memory, disk, or network), then two threads on a single core might take a bit more than the time needed by one thread, but not necessarily twice that time. If the OS is able to have one thread do I/O while the other does computation, and alternate, then both threads might finish in the same wall time as a single thread.
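To see this for yourself, here is a rough experiment (sketched in Rust, assuming the third-party core_affinity crate to force both threads onto one core; the workload is a made-up CPU-bound stand-in for the "action"):

```rust
// Sketch only: `core_affinity` is a third-party crate used here to pin
// threads so the one-core case can be reproduced on a multi-core machine.
use std::thread;
use std::time::Instant;

// CPU-bound stand-in for the action that takes "x milliseconds".
fn action() -> u64 {
    (0..50_000_000u64).fold(0, |acc, x| acc.wrapping_add(x))
}

fn main() {
    let core = core_affinity::get_core_ids().expect("no core IDs")[0];

    // One pinned thread: measures x.
    let t = Instant::now();
    let h = thread::spawn(move || {
        core_affinity::set_for_current(core);
        action()
    });
    let _ = h.join().unwrap();
    println!("one thread:  {:?}", t.elapsed());

    // Two threads pinned to the same core: expect roughly 2x.
    let t = Instant::now();
    let handles: Vec<_> = (0..2)
        .map(|_| {
            thread::spawn(move || {
                core_affinity::set_for_current(core);
                action()
            })
        })
        .collect();
    for h in handles {
        let _ = h.join().unwrap();
    }
    println!("two threads: {:?}", t.elapsed());
}
```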
Yes. They will be executed one after the other, or interleaved somehow, but in total it will take double the time.
Yes, of course. If you have two threads and one CPU core, the threads will run one after the other, or in time slices. It is not possible for one core to run more than one thread of execution at a time.
Unless hyperthreading is being used – but that makes one core look like two (or more) cores, so it does not apply here.
