What's the difference between @spawn/fetch and @sync/@async in Julia multithreading?

To do multiple pieces of work asynchronously (mainly sorting vectors and computing functions, where the calculations are compute- or memory-bound), I can at the moment write those operations in the following ways:
Using Threads.@spawn:
_f1 = Threads.@spawn f1(x)
_f2 = Threads.@spawn f2(x)
_f3 = Threads.@spawn f3(x)
y1 = fetch(_f1)
y2 = fetch(_f2)
y3 = fetch(_f3)
There is also this pattern (seems cleaner):
@sync begin
    @async y1 = f1(x)
    @async y2 = f2(x)
    @async y3 = f3(x)
end
For this particular use case, which one is preferred?

Julia has the following support for parallelization (also see https://docs.julialang.org/en/v1/manual/parallel-computing/)
generating SIMD assembly code using the @simd and @inbounds macros (read https://docs.julialang.org/en/v1/base/simd-types/ for more details); see the sketch just after this list
Green threading (co-routines) - these are not actual threads but allow one task to wait for another, typically where the other tasks do not consume CPU on the current thread. Good examples include waiting for intensive I/O operations or orchestration of distributed processes. Green threads are very light and there can be thousands of them but they all are executed within a single (calling) system thread.
Threads via the Threads module. This allows you to spawn actual system threads. The advantage (compared to the next scenario in this list) is that all threads share the same process memory.
Multiprocessing/distributed computing via the Distributed module. The nice thing here is that you can use the same API when moving from a single machine to a cluster. In any HPC computing scenario this is the first choice to consider.
GPU computing
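As a rough illustration of the first item, here is a minimal @simd/@inbounds sketch (a simple reduction; simd_sum is a made-up name, and @inbounds is safe here only because eachindex guarantees valid indices):

# a minimal sketch of the @simd/@inbounds pattern for a reduction
function simd_sum(a::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(a)   # eachindex gives a valid range
        s += a[i]
    end
    return s
end

simd_sum(rand(1_000))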
In your post you are considering green threads vs real threads. The rule is simple:
if your functions f1, f2, f3 do mostly I/O and are not CPU-intensive (e.g. downloading files from the internet) or wait for other processes to complete - use green threading (@sync/@async)
otherwise use Threads (or distributed computing). This is particularly important for CPU-intensive jobs, where green threads will not give you any performance boost because they all run on a single system thread. A sketch of the threaded pattern follows.
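For this use case, since the work is compute-bound, a pattern that keeps the cleaner block style but still uses real threads is to fetch tasks created with Threads.@spawn. A minimal sketch (f1/f2/f3 here are hypothetical compute-bound placeholders; start Julia with e.g. julia --threads 4):

# hypothetical compute-bound stand-ins for f1, f2, f3
f1(x) = sum(abs2, x)
f2(x) = sort(x)
f3(x) = map(sin, x)

x = rand(10_000)

# Threads.@spawn runs each task on the threaded scheduler;
# fetch waits for a task and returns its result (rethrowing any task error)
tasks = [Threads.@spawn f(x) for f in (f1, f2, f3)]
y1, y2, y3 = fetch.(tasks)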


Julia: @async and multiple CPU cores/threads

Suppose I'm running an expensive computation in the background with @async. Will this computation be performed on the same thread the Julia runtime is running on (i.e. on the same CPU core)? If yes, will I have to then start Julia like julia --threads 2 and use Base.Threads?
@async spawns a green-thread coroutine. Such tasks all run on the same system thread, and hence are good for types of parallelism where you are waiting for external resources (I/O, remote jobs); they are not good for parallelizing numerical computations (unless those are done on remote workers).
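A quick way to see this behaviour: tasks that merely wait do overlap under @async. A small sketch using sleep to simulate I/O waits:

# three simulated 1-second I/O waits finish in about 1 second total,
# because sleep yields control to the other green threads
@time @sync begin
    @async sleep(1)
    @async sleep(1)
    @async sleep(1)
end
# replacing sleep with a CPU-bound loop would show no such speedup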
Basically, in Julia you have the following parallelism types:
@simd - utilize the Single Instruction, Multiple Data (SIMD) feature of a CPU
@async - coroutines
@threads - multithreading; requires setting the JULIA_NUM_THREADS environment variable (or starting Julia with --threads)
@distributed - multiprocessing on a single machine or multiple machines
GPU computing - start this journey with CUDA.jl
You can find several more detailed examples on StackOverflow for each topic above.
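To make the @async vs @threads distinction above concrete, a hedged sketch (busy_sum is a made-up compute-bound function; assumes Julia was started with several threads, e.g. julia --threads 4):

using Base.Threads

# made-up compute-bound work
function busy_sum(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

# green threads: all four tasks share one system thread, so expect
# roughly 4x the single-task time - no parallel speedup
@time @sync for _ in 1:4
    @async busy_sum(10^7)
end

# real threads: iterations are distributed over nthreads() cores
@time @threads for i in 1:4
    busy_sum(10^7)
end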

How Does LabVIEW Handle Multiprocessing and Multithreading?

INTRO
multiprocessing = using multiple processes (typically spread over multiple CPU cores) to complete a task; each process has its own memory space, so pipes and shared data structures are needed for the processes to "talk" to each other
multithreading = using multiple threads within a process, managed by a task scheduler, to complete a task; all threads share the process's memory
static (temporal) multithreading - take advantage of idle I/O time by scheduling another task to run during stalls (e.g. cache misses, waiting to read/write an I/O device); used for I/O-bound tasks
dynamic (simultaneous) multithreading - take advantage of instructions that can execute at the same time (on Intel chips this is called "Hyper-Threading"); used for CPU-bound tasks
e.g.
a = b*c //Task 1
d = e*f //Task 2
g = a*d //Task 3
// Task 1 and 2 don't depend on each other, and hence can be run in parallel
QUESTION
Given the above, how can I control in LabVIEW which cores I use to multiprocess a task (not multithread)?
LabVIEW inherently parses the dataflow out to multiple processors and multiple threads, extracting as much parallelism as the system is analyzed to support. THERE ARE ALMOST ZERO CASES WHERE YOU SHOULD SPECIFY THE THREADING MODEL OF THE CODE. The Timed Loop and Timed Structure capabilities should be considered strictly for real-time systems, not for execution on desktop systems (Windows, Mac, or Linux). If you attempt to specify the threading model, you will almost certainly get less performance than the sophisticated model already computed by the compiler and run-time engine.
As of NI LabVIEW version 8.5 the Timed Loop and Timed Sequence structures include a Processor input that allows you to manually assign available processors to handle the execution of the structures. You can configure the processor assignment by wiring an input to the Processor input of the Input Node for the structure or for frames of the structure.
http://www.ni.com/product-documentation/6400/en/

Why isn't there any software that forces the use of multiple cores?

So this is a purely hypothetical question. I have to first put up a disclaimer: I have literally no clue how processors work on a low or even a high level; however, low-level and high-level explanations are appreciated, as I can still wrap my head around the answers (maybe taking me a few hours).
So the question is: how come there is software that just cannot take advantage of multiple cores or threads?
Or, better worded: how come multithreading support has to be coded into the software, and isn't something the processor automatically assigns to all its cores, regardless of the code?
My very naive way of looking at it is that the software will request some calculation from the CPU, so why can't the CPU have a "master thread", which does nothing but assign the calculations to each of the other threads, and then forward the result back to the software as they come?
Because I know that a lot of software can only use one core at a time, and from my naive understanding of how a CPU works, there shouldn't be a reason stopping it from just sending the computations to all available cores.
On that note, the main question: Is it possible to create a software (or driver) which enables ANY software to use all available cores, regardless of how it has been coded?
No, for the same reason two women cannot deliver a baby in four and a half months.
Computation is transformation of the data, from the input to the output, each step reading the data it needs and producing its result.
It's clear that this means there are dependencies between the steps: (x + 1)^2 for x = 3 is 16, but to get this result we first perform the step y = x + 1 and then the step y^2.
We cannot compute y^2 before, or even concurrently with, x + 1 and get the correct result.
In short, not everything is parallelizable.
The CPU, as Harold pointed out, can exploit the intrinsic parallelism of some computation: (x + 1) + (x + 2) can be split into computing y = ( x + 1) and z = (x + 2) in parallel and then doing y + z.
It's all about the dependency chains of the computations.
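As a code analogy (Julia tasks standing in for the CPU's functional units - an illustration of the dependency structure, not of how the hardware actually works):

x = 3

# the independent subexpressions can proceed in parallel...
t1 = Threads.@spawn x + 1   # y = x + 1
t2 = Threads.@spawn x + 2   # z = x + 2
y, z = fetch(t1), fetch(t2)

# ...but the combining step depends on both results, so it must wait
result = y + z              # 4 + 5 = 9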
The hard thing about this optimisation is that, contrary to these examples, the instructions often have side effects and one must be very careful to take them into account.
Most effort nowadays goes into doing a fast prediction of when a normally forbidden optimisation is allowed, prediction that is accurate most but not all of the times, and doing a fast recovery from a misprediction.
Furthermore, there is a limit on the resources available when looking for or tracking these optimisations.
All this logic is packed into a core, it fetches, decodes, issues, dispatches, executes, and retires the instructions in a way that exploits the intrinsic parallelism.
Even with this help, the core usually has more functional units than a single program can use; this may be because the program uses only integers, for example.
Also, since modern CPUs are very complex, exploiting them fully is also complex.
That's why SMT (i.e. the two threads in each core) was introduced: each thread has its own program (context) but shares every other resource in the core, so while one program is using the integer units, another using the floating-point units can keep the CPU fully utilized.
However, each thread has its own context; it's as if each thread had its own values for x, y, z.
If we compute y = (x + 1) in Core0, we cannot send y^2 to Core1, because the y used would be the one in Core1 and thus the wrong one.
Thus, to parallelise a program, human intervention is necessary to split a single program into two or more threads.
Sending y^2 to Core1 would also require sending y, and that would be too slow; the reasons follow below.
When the cost of adding another core became lower than the cost of further optimising the core microarchitecture, the manufacturers started including multiple cores.
Why can't the mechanism used to exploit the intrinsic parallelism be extended to dispatch instructions to multiple cores/threads?
Because it is electronically impractical.
In order for it to work, there must be a shared context (set of variables x, y, ...) and having a single context being accessed by a lot of cores would make it slow.
It may not be intuitive to understand but choosing between 16 destinations is faster than choosing between 32. The same is true when managing 4 readers instead of 16.
Furthermore at the speeds of the modern CPUs, the geometry of the traces matters a lot.
So the cores are designed to be fast, with fast internal buses and fast, tightly coupled components working at more or less the same frequency.
The CPU uncore is designed to be as fast as possible, with fast decoupling between the cores and the other components working at different frequencies.
In short, dispatching instructions to other cores would be slow; communication between cores is orders of magnitude slower than communication within a core.
For general-purpose CPUs it is not convenient to send the data along with the program.
It is more performant to have the programmer program each core/thread individually and exchange the data when it is needed.
Special-purpose ASICs may take a different approach; for example, GPUs use a different parallelism model than CPUs.
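In code, "program each thread individually and exchange data when needed" might look like this Julia sketch (the workload is made up; a Channel carries the results between tasks):

# two independently programmed workers; results flow back over a Channel
results = Channel{Float64}(2)
Threads.@spawn put!(results, sum(rand(10^6)))   # worker 1
Threads.@spawn put!(results, sum(rand(10^6)))   # worker 2
total = take!(results) + take!(results)         # explicit data exchange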
so why can't the CPU have a "master thread", which does nothing but assign the calculations to each of the other threads, and then forward the result back to the software as they come?
It's actually interesting that you mention this, because it is sort of how high-performance CPUs work. There is no actual thread in the sense of "some code that's running and doing that distribution" though, the hardware itself distributes instructions (or parts of instructions, for complex instructions) over multiple functional units. That's a very fine-grained level of parallelism called Instruction Level Parallelism, it plays a large role in how fast modern CPUs are and unlike other forms of parallelism it can be extracted automatically. The extent to which that happens is mostly limited by the availability of extractible parallelism in the code, and the ability of the CPU to extract it.
Multiple (real) cores are multiple copies of such internally-parallel cores, parallelism on top of parallelism. HyperThreading (and similar SMT implementations) uses that internal parallelism to emulate multiple cores, which usually enables a higher utilization of the actual core, in some sense that is the reverse of what you described.

Multiple pre-emptive thread pools in TBB

We have a requirement to create a number of real-time processing chains, one running at n Hz and others running at x, y and z Hz in a single process, where x, y and z are some multiple (not necessarily a simple multiple) of n. For example, one chain running at 1 Hz and others running at 3 Hz and 4 Hz. Each processing chain needs to make use of TBB to parallelize some of its computations, and so needs a pool of TBB worker threads matching the number of hardware processors, but the higher-frequency chains need to pre-empt the lower-frequency chains in order for the system to work. How can this be achieved with TBB?
It seems from the documentation that it is not possible to create competing pools of TBB workers that will pre-empt each other, since TBB's task groups etc. seem to share a single pool of real threads, and there does not appear to be any way to set the real system priority of a TBB worker thread - or am I missing something?
First, TBB is not designed for real-time systems. Though, the requirement of a few Hz looks relaxed enough.
Second, TBB is not designed to serve as a thread pool. It provides a user-level non-preemptive task scheduler instead of preemptive thread scheduling at the OS level. TBB assumes that tasks are finite and small enough to provide points where the scheduler can switch between them more efficiently than the OS can switch thread execution contexts. Thus, except for some special cases, it does not make sense to request more worker threads than the hardware provides (i.e. to oversubscribe), and so thread-level priorities do not make sense with TBB either.
Though if you are not convinced by the arguments above and want to try your design, there is a way using two community-preview features: task_arena for work isolation and the task_scheduler_observer extensions to assign system priorities to the worker threads entering an arena.
The TBB task priority feature looks like a more natural fit for the described requirements. Though it is limited to only 3 priority levels, the priority of a task_group_context can be changed dynamically, which can be used to reorder tasks on the fly.

Multi-threaded linear system solution in OpenBLAS

I have code written in Fortran 95 using the gfortran compiler. I am also using OpenMP and have to handle very big arrays. In my code I also have to solve a system of linear equations using the solver DGTSV from OpenBLAS. I want to parallelize this solver as well using OpenBLAS, which should be capable of that, but I have trouble with the syntax. Using the attached pseudo code, all 4 CPUs are used at almost 100%, but I am not sure whether each core solves the linear equations separately or whether they split the work and calculate it in parallel.
The whole stuff is compiled using gfortran -fopenmp -lblas a.f95 -o a.out
So my pseudo code looks like
program a
implicit none
integer, parameter :: N = 200
real*8, dimension(N)   :: D = 0.0    ! main diagonal
real*8, dimension(N-1) :: DL = 0.0   ! sub-diagonal
real*8, dimension(N-1) :: DU = 0.0   ! super-diagonal
real*8, dimension(N)   :: b = 0.0    ! right-hand side, overwritten with the solution
integer :: info = 0
integer :: numthread = 4
...
!$OMP PARALLEL NUM_THREADS(numthread)
...
!$OMP DO
...
!$OMP END DO
CALL DGTSV(N,1,DL,D,DU,b,N,info)
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
end program a
What do I have to do to parallelize the solver, so that each core calculates part of it?
Inside an OpenMP parallel region, all the threads execute the same code (as in MPI), and the work is only split when the threads reach a loop/section/task construct.
In your example, the work inside the loops (OMP DO) is distributed among the available threads. After the loop is done, an implicit barrier synchronizes all the threads, and then each of them executes the call to DGTSV (i.e. every thread redundantly solves the same system). After the subroutine has returned, the next loop is split again.
@HristoIliev proposed using an OMP SINGLE clause. This restricts the enclosed piece of code to be executed by only one thread and forces all the other threads to wait for it (unless you specify nowait).
On the other hand, nested parallelism refers to the case where you declare a parallel region inside another parallel region. This also applies when you call an OpenMP-parallelized library from inside a parallel region.
By default, OpenMP does not increase parallelism in nested parallel regions; instead, only the thread that enters the inner parallel region executes it. This behavior can be changed by setting the environment variable OMP_NESTED to true.
The OMP SINGLE solution is far better than splitting the parallel region in two, as the resources are reused for the next loop:
!$OMP PARALLEL
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP SINGLE
CALL DGTSV(...)
!$OMP END SINGLE
!$OMP DO
DO ...
END DO
!$OMP END DO
!$OMP END PARALLEL
To illustrate the usage of OMP_NESTED, I'll show you some results from an application which used FFTW (a Fast Fourier Transform implementation) configured to use OpenMP. The runs were performed on a 16-core, two-socket Intel Xeon E5 @ 2.46 GHz node.
The following graphs show the amount of time spent in the whole application, where parallel regions appear when CPUs > 1, serialized regions when CPUs = 1 and synchronization regions when CPUs = 0.
The application is embarrassingly parallel, so in this particular case using nesting is not worthwhile (FFTW does not scale that well).
This is the OMP_NESTED=false execution. Observe how the amount of parallelism is limited by the number of threads spent in the external parallel region (ftdock).
This is the OMP_NESTED=true execution. In this case, it is possible to increase parallelism beyond the number of threads spent on the external parallel region. The maximum possible parallelism here is 16: either the 8 outer threads each create a single peer to execute the inner parallel region (8x2), or 4 outer threads each create 3 additional threads (4x4 = 16).
