Using CyclicBarrier to ensure 3 threads run one after another

By using wait/notify, join, or two CountDownLatches, I can make 3 threads run in sequence.
How can something similar be achieved with a CyclicBarrier?
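
One possible approach, as a minimal sketch rather than a definitive answer: give all three threads a shared CyclicBarrier and let them pass through one phase per thread. In phase i only thread i does its work, and every thread then waits at the barrier, so the next turn cannot start before the current one is finished. The class name SequentialBarrierDemo, the method runInOrder and the "T1".."T3" labels are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class SequentialBarrierDemo {
    // Starts `threads` workers; worker 0 does its step first, then worker 1, and so on.
    static List<String> runInOrder(int threads) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        CyclicBarrier barrier = new CyclicBarrier(threads);
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final int id = i;
            Thread worker = new Thread(() -> {
                try {
                    for (int phase = 0; phase < threads; phase++) {
                        if (phase == id) {
                            log.add("T" + (id + 1)); // placeholder for this thread's real work
                        }
                        barrier.await(); // nobody enters the next phase until this turn is over
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(runInOrder(3)); // prints [T1, T2, T3]
    }
}
```

The barrier is reused for every phase, which is what distinguishes it from a CountDownLatch. The cost is that all threads are alive and blocked while only one of them works.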

Related

Run 4 threads at the same time; when one thread completes, execute a new thread?

I have 30 tasks.
I want to run 4 threads at the same time to do the first 4 tasks.
When any thread completes, I want to start the next task, so that there are always 4 threads running at the same time.
After 28 tasks are completed (7 rounds), only 2 tasks (2 threads) remain.
How can I solve this? I use the threading namespace.
Thank you

You have not mentioned any particular language here, but if you are using Java, this is a classic use case for ThreadPoolExecutor.
If you are using another language, you can write your own simplified version of ThreadPoolExecutor. Basically:
- A thread-safe queue of tasks to be executed
- 4 threads reading from the queue and executing tasks
- Termination logic for your threads (a thread may terminate when it finds the queue empty, or wait for some time and then try again)
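
For the Java route, a minimal sketch could look like the following, assuming the 30 tasks are independent. Executors.newFixedThreadPool(4) returns a ThreadPoolExecutor with exactly 4 worker threads and an internal task queue, so a new task starts as soon as any worker becomes free. The class name FixedPoolDemo, the method runTasks and the squaring "work" are placeholders invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FixedPoolDemo {
    // Runs taskCount tasks on at most poolSize threads and returns the results in submission order.
    static List<Integer> runTasks(int taskCount, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= taskCount; i++) {
                final int taskId = i;
                // Queued tasks wait until one of the pool's threads is free.
                futures.add(pool.submit(() -> taskId * taskId)); // placeholder work
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown(); // lets the worker threads exit once the queue is drained
        }
    }

    public static void main(String[] args) {
        List<Integer> results = runTasks(30, 4);
        System.out.println(results.size() + " tasks done, last result " + results.get(29));
    }
}
```

With 30 tasks and 4 threads, the last two tasks simply run on two of the workers while the other two sit idle, so the 28 + 2 case needs no special handling.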

Thread synchronisation for very short tasks

I have a C++ application running on winapi. Portability is not an issue. All I want is maximum performance. I have a basic understanding of multithreading and synchronization issues, but limited experience with the multitude of options ranging from winapi over C++ threads to third party libraries.
In the performance critical core of my application I identified a loop, which could be parallelized. I managed to split the loop into 4 parts which do not depend on each other. I would like to delegate the job to 4 threads running in parallel. The main thread should wait until all 4 threads have done their job, before it continues.
Sounds very simple. However, currently the loop takes only about 10 microseconds when running on one thread. I'm afraid that synchronization methods which cause a switch to the kernel (events, mutexes, etc.) would produce more overhead than the parallelization could save. SRWLocks + condition variables claim to be very lightweight, but I didn't find a way to solve my synchronization with these tools.
Of course I could test all kinds of synchronization APIs, but I'm sure this has been done before.
So my question is: Is there a reasonable way to synchronize very short tasks and if so, what are the appropriate tools?

If you simply need to wait for threads to complete you would use WaitForMultipleObjects on the thread handles. The other direct option would be to use a synchronization barrier, a primitive that allows a group of threads to halt until all members of the group have reached the barrier, but that is generally for the case where there is more work for the spawned threads to perform after being released.
Your question of whether this would actually be of benefit in your particular case is one that can only be answered through implementation and timing. Note that if you are going to perform this testing, it should be done on a release build with optimizations enabled. It may well be the case that the amount of work is so small that the time spent on thread management dwarfs any benefit.

The update algorithm consists of two steps. Each of these steps can be applied to the knots in arbitrary order, but step 1 must be completed before step 2 can start. I can portion the whole net into four (or more) parts and delegate each part to a separate thread. My problem is: Each thread has to pause after step 1 and wait until all threads have finished their job. Then each thread performs step 2, waits for the other threads to complete, and so on.
You want to break the work into a large number of small chunks and have a fixed pool of threads take chunks of work. Do not make 8 threads on an 8-core machine and split the work into 8 chunks. That approach will work poorly if, for one reason or another, only 7 of those cores wind up doing work for you: the run will take twice as long, because during the second half of the time only one core is working.
The easy way is to have an extra dispatch thread. Just keep a "work unit" count somewhere protected by a mutex. When a thread finishes a work unit, have it decrement the "work unit" count. When it hits zero, broadcast a condition variable. That will wake the dispatch thread which will then do whatever it takes to get the worker threads going again. It can start them by setting the "work unit" count to the right level and broadcasting another condition variable that the worker threads wait for.
You can also just keep a count of which node needs to be done next and the number of nodes currently being worked on. That will require synchronization after each node though (to figure out which node to do next), so it may make more sense to have each thread grab some number of nodes, iterate over them, and then synchronize to grab another few nodes.
Avoid breaking the work into large chunks early. That can lead to the problem where you have 8 cores but 2 large work units left at some point. Remember, many modern CPUs run their cores at different speeds based on temperature and power measurements.
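
The chunk-grabbing idea combined with the "step 1 before step 2" requirement can be sketched as follows. This is written in Java for illustration, not as the Windows/C++ implementation the question asks about. A shared atomic cursor hands out small chunks of nodes, and a barrier separates the two steps; its barrier action resets the cursor before any thread is released. The class name TwoStepChunkedDemo, the method update and the arithmetic in both steps are assumptions invented for this example.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

class TwoStepChunkedDemo {
    static int[] update(int[] input, int threads, int chunk) {
        final int n = input.length;
        final int[] afterStep1 = new int[n];
        final int[] afterStep2 = new int[n];
        final AtomicInteger cursor = new AtomicInteger(0);
        // The barrier action runs once per trip, before the waiting threads are released.
        final CyclicBarrier barrier = new CyclicBarrier(threads, () -> cursor.set(0));

        Runnable worker = () -> {
            try {
                int start;
                // Step 1: grab small chunks until the nodes run out.
                while ((start = cursor.getAndAdd(chunk)) < n) {
                    int end = Math.min(start + chunk, n);
                    for (int i = start; i < end; i++) afterStep1[i] = input[i] + 1;
                }
                barrier.await(); // step 2 must not start before all of step 1 is done
                // Step 2: reads a neighbour's step 1 result, hence the barrier above.
                while ((start = cursor.getAndAdd(chunk)) < n) {
                    int end = Math.min(start + chunk, n);
                    for (int i = start; i < end; i++) {
                        afterStep2[i] = afterStep1[i] + afterStep1[(i + 1) % n];
                    }
                }
            } catch (InterruptedException | BrokenBarrierException e) {
                throw new IllegalStateException(e);
            }
        };

        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(worker);
            pool[t].start();
        }
        for (Thread thread : pool) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return afterStep2;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(update(new int[] {1, 2, 3, 4}, 4, 1))); // [5, 7, 9, 7]
    }
}
```

Because chunks are claimed on demand, a thread that is slowed down simply takes fewer chunks, and the others pick up the remainder. Whether this pays off for a loop of about 10 microseconds still has to be measured.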

What's the point of invoking sequential (synchronous) threads? Performance?

So with synchronous threads, you have some threads waiting for other threads to finish.
As we can see in the observer pattern, etc.
An Example:
Thread 3 Waiting for Thread 2 to execute Method C
Thread 2 waiting for Thread 1 to execute Method B
Thread 1 execute Method A
Does this scenario have any benefit in performance in contrast with:
Thread 1
Executing Method A
Executing Method B
Executing Method C
Maybe they just do it for the sake of splitting tasks according to behavior?

Multi-threaded linear system solution in OpenBLAS

I have code written in Fortran 95 and compiled with gfortran. I am also using OpenMP, and I have to handle very big arrays. In my code I also have to solve a system of linear equations using the solver DGTSV from OpenBLAS. I want to parallelize this solver as well, which OpenBLAS should be capable of, but I have trouble with the syntax. With the pseudo code below, all 4 CPUs are used to almost 100%, but I am not sure whether each core solves the linear equations separately or whether they split the problem into parts and calculate it in parallel.
The whole stuff is compiled using gfortran -fopenmp -lblas a.f95 -o a.out
So my pseudo code looks like
program a
implicit none
integer, parameter :: N = 200
real*8, dimension(N) :: D = 0.0
real*8, dimension(N-1):: DL = 0.0
real*8, dimension(N-1):: DU = 0.0
real*8, dimension(N) :: b = 0.0
integer :: info = 0
integer :: numthread=4
...
!$OMP PARALLEL NUM_THREADS(numthread)
...
!$OMP DO
...
!$OMP END DO
CALL DGTSV(N,1,DL,D,DU,b,N,info)
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
end program a
What do I have to do to parallelize the solver, so that each core calculates part of the solution?

Inside an OpenMP parallel region, all the threads execute the same code (as in MPI), and the work is only split when the threads reach a loop/section/task.
In your example, the work inside the loops (OMP DO) is distributed among the available threads. After the loop is done, an implicit barrier synchronizes all the threads, and then every thread calls the subroutine DGTSV, so the same solve is executed by each thread rather than being split among them. After the subroutine has returned, the next loop is split again.
@HristoIliev proposed using an OMP SINGLE construct. This restricts the piece of code inside to be executed by only one thread and forces all the other threads to wait for it (unless you specify nowait).
Nested parallelism, on the other hand, refers to the case where you declare a parallel region inside another parallel region. This also applies when you call an OpenMP-parallelized library inside a parallel region.
By default, OpenMP does not increase parallelism in nested parallel regions; instead, only the thread that enters the nested parallel region executes it. This behavior can be changed by setting the environment variable OMP_NESTED to true.
The OMP SINGLE solution is far better than splitting the parallel region in two, as the resources are reused for the next loop:
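!$OMP PARALLEL
!$OMP DO
DO ...
END DO
!$OMP END DO
! Only one thread calls the solver; the others wait at the end of SINGLE
!$OMP SINGLE
CALL DGTSV(...)
!$OMP END SINGLE
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP END PARALLEL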
To illustrate the usage of OMP_NESTED, I'll show you some results I had from an application which used FFTW (a Fast Fourier Transform implementation) configured to use OpenMP. The execution was performed on a 16-core, two-socket Intel Xeon E5 @2.46GHz node.
The following graphs show the amount of time spent in the whole application, where parallel regions appear when CPUs > 1, serialized regions when CPUs = 1 and synchronization regions when CPUs = 0.
The application is embarrassingly parallel, so in this particular case using nesting is not worthwhile (FFTW does not scale that well).
This is the OMP_NESTED=false execution. Observe how the amount of parallelism is limited by the number of threads spent in the external parallel region (ftdock).
This is the OMP_NESTED=true execution. In this case, it is possible to increase parallelism beyond the number of threads spent in the external parallel region. The maximum parallelism possible in this case is 16, reached either when 8 external threads each create a single peer to execute the internal parallel region or when 4 external threads each create 3 additional threads (8x2 = 4x4 = 16).

Misunderstanding the difference between single-threading and multi-threading programming

I have a misunderstanding of the difference between single-threading and multi-threading programming, so I want an answer to the following question to make everything clear.
Suppose that there are 9 independent tasks and I want to accomplish them with a single-threaded program and a multi-threaded program. Basically it will be something like this:
Single-thread:
- Execute task 1
- Execute task 2
- Execute task 3
- Execute task 4
- Execute task 5
- Execute task 6
- Execute task 7
- Execute task 8
- Execute task 9
Multi-threaded:
Thread1:
- Execute task 1
- Execute task 2
- Execute task 3
Thread2:
- Execute task 4
- Execute task 5
- Execute task 6
Thread3:
- Execute task 7
- Execute task 8
- Execute task 9
As I understand, only ONE thread will be executed at a time (get the CPU), and once the quantum is finished, the thread scheduler will give the CPU time to another thread.
So, which program will be finished earlier? Is it the multi-threaded program (logically)? or is it the single-thread program (since the multi-threading has a lot of context-switching which takes some time)? and why? I need a good explanation please :)

It depends.
How many CPUs do you have? How much I/O is involved in your tasks?
If you have only 1 CPU, and the tasks have no blocking I/O, then the single-threaded program will finish at the same time as or earlier than the multi-threaded one, as there is overhead to switching threads.
If you have 1 CPU, but the tasks involve a lot of blocking I/O, you might see a speedup by using threading, assuming work can be done while I/O is in progress.
If you have multiple CPUs, then you should see a speedup with the multi-threaded implementation over the single-threaded one, since more than 1 thread can execute in parallel. Unless of course the tasks are I/O dominated, in which case the limiting factor is your device speed, not CPU power.
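
Since the outcome depends on the machine, the simplest check is to time both variants. The following Java sketch runs 9 CPU-bound placeholder tasks once on a single thread and once split across 3 threads, as laid out in the question. The class name NineTasksDemo, the task body and the iteration count are assumptions chosen for illustration. No timings are claimed here because they vary with core count and load.

```java
class NineTasksDemo {
    // A CPU-bound placeholder task with no I/O.
    static long task(int id) {
        long sum = 0;
        for (int i = 0; i < 5_000_000; i++) {
            sum += (i ^ id);
        }
        return sum;
    }

    // Tasks 1..9 executed one after another on the calling thread.
    static long[] runSingleThreaded() {
        long[] results = new long[9];
        for (int i = 0; i < 9; i++) {
            results[i] = task(i + 1);
        }
        return results;
    }

    // Tasks 1-3, 4-6 and 7-9 executed on three separate threads.
    static long[] runWithThreeThreads() {
        long[] results = new long[9];
        Thread[] threads = new Thread[3];
        for (int t = 0; t < 3; t++) {
            final int first = t * 3;
            threads[t] = new Thread(() -> {
                for (int i = first; i < first + 3; i++) {
                    results[i] = task(i + 1); // each thread writes only its own slots
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join(); // join also makes the written results visible here
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        runSingleThreaded();
        long t1 = System.nanoTime();
        runWithThreeThreads();
        long t2 = System.nanoTime();
        System.out.println("single thread: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("three threads: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

A single run like this is only a rough indication, since JIT warm-up favors whichever variant runs second. For a serious comparison, repeat the measurement several times or use a benchmarking harness.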

As I understand, only ONE thread will be executed at a time
That would be the case if the CPU only had one core. Modern CPUs have multiple cores, and can run multiple threads in parallel.
On a CPU with at least three cores, the program running three threads would run almost three times faster. Only almost, because even if the tasks are independent, some resources in the computer, such as memory access, still have to be shared between the threads.

Well, this isn't entirely language agnostic. Some interpreted programming languages don't support real threads. That is, threads of execution can be defined by the program, but the interpreter is single-threaded, so all execution happens on one core of the CPU.
For compiled languages and other languages that support true multi-threading, the threads can be spread across the cores of the CPU. Most desktop computers now have 2 or 4 cores, so a multi-threaded program executing truly independent tasks can finish up to 2-4 times faster, depending on the number of available cores.

Assumption Set:
Single core with no hyperthreading;
tasks are CPU bound;
Each task takes 3 quanta of time;
Each scheduler allocation is limited to 1 quantum of time;
FIFO scheduler, nonpreemptive;
All threads hit the scheduler at the same time;
All context switches require the same amount of time;
Processes are delineated as follows:
Test 1: Single Process, single thread (contains all 9 tasks)
Test 2: Single Process, three threads (contain 3 tasks each)
Test 3: Three Processes, each single threaded (contain 3 tasks each)
Test 4: Three Processes, each with three threads (contain one task each)
With the above assumptions, they all finish at the same time. This is because an identical amount of CPU time is scheduled, context switches are identical, there is no interrupt handling, and nothing is waiting for I/O.
For more depth into the nature of this, please find this book.

The main difference between single-threading and multi-threading in Java is that in a single-threaded program one thread executes all the tasks of a process, while in a multi-threaded program multiple threads execute the tasks of a process.
A process is a program in execution. Process creation is a resource-consuming task, so instead of creating more processes it is possible to divide a single process into multiple units called threads and assign tasks to them. A thread is a lightweight process. When there is one thread in a process, it is called a single-threaded application. When there are multiple threads in a process, it is called a multi-threaded application.
