Best strategy to execute tasks with high branch divergence [closed] - multithreading

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have a project, written a few years ago, that computes N similar tasks sequentially on a single CPU core.
These N tasks are completely independent, so they could be computed in parallel.
However, the control flow inside each task differs greatly from one task to another, so the SIMT approach implemented in CUDA is more likely to impede than to help.
I came up with the idea of launching N blocks with 1 thread each, to break the warp dependency between threads.
Can anyone suggest a better way to optimise the computations in this situation, or point out possible pitfalls with my solution?

You are right about what causes, and what is caused by, divergence of threads within a warp. However, the launch configuration you mention (1 thread per block) throws away most of the GPU's potential. A warp (or half-warp) is the maximal unit of threads that is actually executed in parallel on a single multiprocessor, so having one thread per block across 32 blocks is effectively the same as having 32 threads in one warp, each taking a different path. The first case is actually worse, because the number of resident blocks per multiprocessor is quite limited (8 or 16, depending on compute capability).
Therefore, if you want to fully exploit the GPU's potential, keep Jack's comment in mind and try to reorganize the work so that the threads of a single warp follow the same execution path.

Related

How many threads should I spawn for maximum performance? [closed]

Closed 3 months ago.
I am writing a Rust program that needs to brute-force the solution to some calculation and is likely to run 2^80 iterations. That is a lot! I am trying to make it run as fast as possible, so I want to divide the work among multiple threads. However, if I understand correctly, this only speeds things up if the threads actually run on different cores; otherwise they will not truly run simultaneously, but just switch between one another.
How can I make sure they use different cores, and how can I know that no more cores are available?
TL;DR: Use std::thread::available_parallelism (or alternatively the num-cpus crate) to know how many threads to run and let your OS handle the rest.
Typically, when you create a thread, the OS thread scheduler is given free rein to decide where and when it executes, and it will do so in a way that makes the best use of CPU resources. If you use fewer threads than the system has available, you are potentially missing out on performance. If you use more, that is not a big problem either, since the scheduler will do its best to balance the threads that have work to do, but the extra threads are a small waste of memory, OS resources, and context switches. Creating as many threads as there are logical CPU cores on your system is the sweet spot, and the function above tells you that number.
You could tell the OS exactly which cores to run which threads on by setting their affinity, but that is generally not advisable: it will not make anything faster unless you are tuning your kernel or deliberately exploiting your NUMA topology.

Understanding node.js [closed]

Closed 6 years ago.
I have started reading node.js. I have a few questions:
Is node better than multi-threading just because it saves us from caring about deadlocks and reduces thread-creation overhead, or are there other factors too? Node does use threads internally, so we can't say that it avoids thread-creation overhead, only that the threads are managed internally.
Why do we say that node is not good for multi-core processors? It creates threads internally, so it must be getting the benefits of multiple cores. Why do we say it is not good for CPU-intensive applications? We can always fork new processes for CPU-intensive tasks.
Are only functions with callbacks dispatched as threads, or are there other cases too?
Non-blocking I/O can be achieved using threads too. A main thread may be always ready to receive new requests. So what is the benefit?
Correct.
Node.js does scale with cores, through child processes, the cluster module, and other mechanisms.
Callbacks are just a common convention developers use to implement asynchronous methods. There is no technical reason why you have to include them. You could, for example, have all your async methods use promises instead.
Everything node does could be accomplished with threads, but there is less code and overhead involved with node.js's asynchronous IO than with multi-threaded code. You do not, for example, need to create a Thread or Runnable instance every time, as you would in Java.

Is this a sane architecture for a multi-user network server? (How much overhead does pipes-concurrency introduce?) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I currently have it so that there is one thread handling the accept loop, one main thread for doing all the stateful logic stuff, and then 2 threads per connected client. One client thread is handling the input pipeline and using pipes-concurrency to send messages to the main logic thread. The other client thread is handling the output pipeline, getting messages from the main logic thread and sending them to the client.
My reasoning for doing it this way is that the main logic thread can use worker threads to do pure computations on an immutable state, then do all the state changes at once and loop back around with the new state. That way I can make use of multiple CPUs without having to worry about the problems of concurrent state modifications.
Is the overhead of STM/pipes-concurrency small enough that this is a reasonable approach when I will end up with a couple thousand connected clients sending two or three messages per second each?
Haskell green threads are cheap enough that I would definitely recommend the approach of 2 threads per client. Without seeing details, I can't comment on whether STM will be a bottleneck or not, but that's going to depend on your implementation. STM can definitely handle that level of workload, assuming it's used correctly.

Two threads (x milliseconds time for one action) in One Core == 2x time? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
To explain the question: I have two threads, and each of them performs the same action, which takes x milliseconds. If my computer has one core, will it take about 2x milliseconds to perform both actions?
If the action is CPU-bound, meaning it consists only of computation, then yes: the total wall time will be a bit more than twice the time taken by one thread, due to context-switching overhead.
If the action involves non-negligible IO-related operations (reads from memory, disk, or network), then two threads on a single core might take a bit more than the time needed by one thread, but not necessarily twice that time. If the OS can have one thread do IO while the other computes, and alternate between them, both threads may finish in roughly the same wall time as a single thread.
Yes. They will be executed one after the other, or somehow interleaved, but in total it will take about double the time.
Yes, of course. If you have two threads and one CPU core, the threads will run one after the other, or in time slices. A single core cannot run more than one thread of execution at a time.
Unless hyperthreading is being used; but that makes one core look like two (or more) logical cores, so it does not apply here.

Multi-threading, how do concurrent threads work? [closed]

Closed 8 years ago.
If I have a dual-core CPU, does that mean it can run a maximum of 2 threads?
If so, how does one run 4 concurrent threads when the CPU seemingly limits them to two?
This is a very big question.
Basically you are correct that with a dual-core CPU only two threads can be executing at any given instant. However, more than two threads can be scheduled to execute. Furthermore, a running thread can be interrupted at (almost) any time by the operating system, halting its execution so that another thread can run.
There are many factors that determine how threads are interrupted and run. Each thread is given a "time slice" in which to execute; once that time slice has elapsed, the thread may be stopped to let other threads run (if any are waiting). Thread priorities can also be assigned, allowing higher-priority tasks to take precedence over lower-priority ones.
Some work that can be offloaded from the main CPU (to the GPU or to a disk controller) can also be run in parallel with other threads.
I suggest that you read up on the basics of thread scheduling.
