Julia: @async and multiple CPU cores/threads - multithreading

Suppose I'm running an expensive computation in the background with @async. Will this computation be performed on the same thread the Julia runtime is running on (i.e. on the same CPU core)? If yes, will I have to start Julia like julia --threads 2 and use Base.Threads?

@async spawns a green-thread coroutine. All such tasks are spawned on the same system thread, so they are good for the types of parallelism where you are waiting on external resources (I/O, remote jobs). They are not good for things such as parallelizing your numerical computations (unless those are done on remote workers) - see the sketch after the list below.
Basically, in Julia you have the following parallelism types:
@simd - utilize the Single Instruction, Multiple Data (SIMD) feature of a CPU
@async - coroutines
@threads - multithreading; requires setting the JULIA_NUM_THREADS environment variable (or the --threads startup flag)
@distributed - multiprocessing on a single machine or across multiple machines
GPU computing - start this journey with CUDA.jl
You can find several more detailed examples on StackOverflow for each topic above.
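To make the single-system-thread point concrete, here is a minimal sketch in Python (the language used by the later examples on this page), whose asyncio coroutines behave like @async tasks in this respect; the worker names are illustrative. Every coroutine reports the same OS thread id, which is why coroutines help when you are waiting on I/O but do not give CPU-bound parallelism.

import asyncio
import threading

async def worker(name):
    # Every coroutine runs on the event loop's single OS thread.
    print(name, "runs on OS thread", threading.get_ident())
    await asyncio.sleep(0.1)  # yields control; fine for I/O-style waiting

async def main():
    await asyncio.gather(worker("a"), worker("b"), worker("c"))

asyncio.run(main())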

Related

How Does LabVIEW Handle Multiprocessing and Multithreading?

INTRO
multiprocessing = using multiple CPU cores to complete a task (each core has separate memory, thus requiring pipes and data structures for the cores to "talk" to each other)
multithreading = using multiple threads (on a single CPU core) with a task scheduler to complete a task (all threads share the same memory on the CPU core)
static (temporal) multithreading - take advantage of idle I/O time by scheduling other tasks to run, without pause, during stalls (i.e. while waiting to read/write to an I/O device); used for I/O-bound tasks
dynamic (simultaneous) multithreading - take advantage of instructions that can happen at the same time (on Intel chips, this is called "Hyperthreading"); used for CPU-bound tasks
e.g.
a = b*c //Task 1
d = e*f //Task 2
g = a*d //Task 3
// Task 1 and 2 don't depend on each other, and hence can be run in parallel
QUESTION
Given the above, how can I control in LabVIEW which cores I use to multiprocess a task (not multithread)?
LabVIEW inherently parses the dataflow out to multiple processors and multiple threads, with as much parallelism as the system is analyzed to be able to stand. THERE ARE ALMOST ZERO CASES WHERE YOU SHOULD SPECIFY THE THREADING MODEL OF THE CODE. The Timed Loop and Timed Structure capabilities should be considered strictly for real-time systems, not for execution on desktop systems (Windows, Mac, or Linux). If you attempt to specify the threading model, you will almost certainly get less performance than the sophisticated model already computed by the compiler and run-time engine.
As of NI LabVIEW version 8.5 the Timed Loop and Timed Sequence structures include a Processor input that allows you to manually assign available processors to handle the execution of the structures. You can configure the processor assignment by wiring an input to the Processor input of the Input Node for the structure or for frames of the structure.
http://www.ni.com/product-documentation/6400/en/

Understanding coroutine

From Wikipedia, the paragraph Comparison with threads states:
... This means that coroutines provide concurrency but not parallelism ...
I understand that a coroutine is lighter than a thread: no context switching is involved, and there are no critical sections, so no mutex is needed either. What confuses me is that the way it works does not seem to scale. According to Wikipedia, coroutines provide concurrency; they work cooperatively. A program with coroutines still executes instructions sequentially, which is exactly the same as threads on a single-core machine. But what about multicore machines, on which threads run in parallel while coroutines work the same as on a single-core machine?
My question is how coroutines will perform better than threads on multicore machines?
...what about multicore machines?...
Coroutines are a model of concurrency (in which two or more stateful activities can be in progress at the same time), but not a model of parallelism (in which the program would be able to use more hardware resources than a single, conventional CPU core can provide).
Threads can run independently of one another, and if your hardware supports it (i.e., if your machine has more than one core) then two or more threads can be performing their independent activities at the same instant in time.
But coroutines, by definition, are interdependent. A coroutine only runs when it is called by another coroutine, and the caller is suspended until the current coroutine calls it back. Only one coroutine from a set of coroutines can ever be actually running at any given instant in time.
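A minimal Python sketch of that hand-off, using plain generators as coroutines (the producer/consumer names are illustrative): control transfers explicitly at each send/yield, so only one of the two is ever actually running.

def producer(consumer):
    for item in range(3):
        print("producer makes", item)
        consumer.send(item)   # suspends the producer, resumes the consumer

def consumer():
    while True:
        item = yield          # suspended here until the producer sends
        print("consumer gets", item)

c = consumer()
next(c)        # prime the generator: run it up to its first yield
producer(c)    # output strictly interleaves: makes 0, gets 0, makes 1, ...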

Differences between Threading in Nvidia's GPUs and CPUs

I am trying to understand the difference between the threading techniques used by Nvidia GPUs and normal (multithreading) CPUs. In particular, my two questions are:
Which part of the system is responsible for thread scheduling, and according to which aspects are threads scheduled?
Are threads processed synchronously?
CUDA cores and CPU cores are completely different things - the shared name is more of a marketing choice.
What do you mean by responsible for thread scheduling? It's mostly both software and hardware. For instance, the CPU itself has little to do with the actual thread scheduling, but it provides the functionality necessary to implement a thread scheduler as part of the OS. So the scheduling parameters are defined by software; hence you should adapt your question to a specific OS.
One thing the CPU does provide is the so-called hardware threads. Each hardware thread allows the "parallel" execution of one software thread. (Note: with Hyperthreading, the execution is not truly parallel but rather interleaved.) The scheduler distributes all running threads over these hardware threads.
This is basically a MIMD-System.
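A quick way to see how many hardware threads the OS exposes, sketched in Python (the count in the comment is just an example):

import os
# On a 4-core CPU with Hyperthreading this typically prints 8:
# one logical core per hardware thread.
print("hardware threads:", os.cpu_count())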
The scheduling on graphics cards is far more complicated. In short:
You have a few thousand CUDA cores - but in contrast to the CPU, you cannot assign a unique application to each of them. The CUDA cores are organized in groups of threads (so-called warps, 32 threads each), and all threads inside the same warp execute the same instruction simultaneously.
This is called SIMT (Single Instruction, Multiple Threads).

How do multiprocessing, threading, and thread pooling work?

https://code.tutsplus.com/articles/introduction-to-parallel-and-concurrent-programming-in-python--cms-28612
From this link, which I have studied, I have a few questions.
Q1: How are the thread pool (concurrent) and threading different here? Why do we see the performance improvement? Threading with a queue has 4 threads, each of which runs cooperatively during idle time and picks an item from the queue once it gets a website response. As I see it, the thread pool does much the same: completing its work and then waiting for the manager to assign a task, which is very similar to picking a new item from the queue. I'm not sure how this is different or why I see the performance improvement. It seems I'm wrong in interpreting the pooling here. Could you explain?
Q2: Using multiprocessing, the time taken is longer. If I have a multiprocessor that can handle multiple processes at a time, then all 4 of my processes should be handled at once; that is when real parallelization happens. I also have a question here: since the 4 processes run the same function, doesn't the GIL try to stop them from executing the same piece of code? Suppose all of them share a common variable that gets updated, like the number of websites checked. How does the GIL work in these cases of multiprocessing?
Also, are the same processes used again and again, or do they get killed and created anew each time after their job? I think the same processes are reused. I also think the performance problem is the cost of process creation, compared to the lightweight threads of the concurrent threading phase. So could you explain in more detail how the GIL works here and how the processes run: do they run cooperatively (each process waiting for its turn, like threads within a process do), or do they use the multiple processors to run truly in parallel? My other question: if I have an 8-core machine, I think I can run 8 threads of the same process simultaneously, in parallel. Can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores? I think cores are only for the threads of one process, which would mean I can't run 8 processes on 8 cores, but that I can run as many processes as I have CPUs in a multiprocessor system - am I right? So can I run 2 processes with 4 threads each on my 8-core machine with 2 multiprocessors, each processor having 4 cores?
Python has a rich set of libraries for multitasking with processes and threads. However, there is overlap between the libraries, and the choice depends on how abstractly you view the computational tasks. For example, the concurrent.futures library views threads as asynchronous tasks, while the threading library deals with them as high-level threads. Further, _thread implements a low-level interface for threading, exposing all the synchronization mechanisms.
The GIL (Global Interpreter Lock) is just a synchronization primitive, specifically a mutex, which prevents multiple threads of the same process from executing Python bytecode simultaneously (so that certain objects remain consistent under concurrent operations). This is exactly why Python threads excel at I/O operations, in terms of speed, compared to compute-intensive tasks (owing to the fact that the GIL is released during certain blocking calls and in computationally intensive libraries such as numpy). Note that only the CPython and PyPy implementations of Python are constrained by the GIL mechanism.
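A minimal sketch of that effect (the burn function and the workload size are made up): a CPU-bound task gains nothing from a thread pool, because the GIL serializes the bytecode, but it does speed up with a process pool, since each process has its own interpreter and GIL.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # Pure-Python arithmetic holds the GIL for the whole loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(burn, [2_000_000] * 4))
    print(label, round(time.perf_counter() - start, 2), "s")

if __name__ == "__main__":  # guard needed so worker processes can import this module
    timed(ThreadPoolExecutor, "threads:")     # roughly serial under the GIL
    timed(ProcessPoolExecutor, "processes:")  # parallel: one GIL per process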
Now, let's see those questions...
How are the thread pool (concurrent) and threading different here? Why do we see the performance improvement?
Coming to the comparison between Threading and concurrent.futures.ThreadPoolExecutor (aka threading_squirrel vs future_squirrel), I've executed both programs with the same test case. There are two factors that contribute to this "performance improvement":
Network HEAD requests: Remember that network operations need not complete in the same time period every time you execute them... due to the very nature of packet transfer delays...
Order of thread execution: In the website you've linked, the author creates all threads initially, sets up the queue full of website links, and then starts all of them in a list-comprehension loop. In concurrent.futures' ThreadPoolExecutor, each time a task is submitted, a thread is assigned to it if the predefined maximum number of threads/workers has not been reached. I've changed the code to mirror this technique. It seems to give a speedup, as the first thread begins work early on and doesn't need to wait for the queue to be filled up...
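A minimal sketch of that submit-as-you-go pattern (the URLs are placeholders, and the third-party requests library is assumed to be installed):

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ["http://example.com/page%d" % i for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as ex:
    # Each submit hands a task to an idle worker right away;
    # no queue needs to be filled before the first thread starts.
    futures = [ex.submit(requests.head, url) for url in urls]
    for fut in as_completed(futures):
        print(fut.result().status_code)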
How does the GIL work in these cases of multiprocessing?
Remember that the GIL comes into effect only for the threads of a process, not between processes. The GIL locks up the whole interpreter's bytecode during a thread of execution, so the other threads have to wait for their turn. This is the reason multiprocessing uses processes instead of threads: each process has its own interpreter and, consequently, its own GIL.
Are the same processes used again and again, or do they get killed and created every time after their job?
The concept of pooling is to reduce the overhead of creating and destroying workers (be they threads or processes) during the computation. However, the processes are kind of "brand new" in the sense that the library effectively asks the OS to perform a fork on a UNIX-based OS or a spawn on an NT-based OS...
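A minimal sketch of worker reuse in a pool (which_worker is a made-up helper): 8 tasks are served by at most 2 distinct worker PIDs, so the processes are recycled rather than created once per task.

import os
from multiprocessing import Pool

def which_worker(_):
    return os.getpid()

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        pids = pool.map(which_worker, range(8))
    print(sorted(set(pids)))  # at most 2 distinct PIDs for the 8 tasks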
Also, are the processes running co-operatively?
Maybe. They have to run in co-operation if they use shared memory... (they need not be running together). There is definitely going to be a context switch if there are more processes than the OS can allocate to its processors' cores. They can run in parallel if there are no shared-memory updates to make.
If I have an 8-core machine, can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores?
Sure (subject to the GIL, in Python). Each process can be allocated to a processing unit for execution. A processing unit can be a physical or a virtual core of a CPU. As long as the OS scheduler supports it, it's possible. Any reasonable split-up of processes and threads is possible. If all are allocatable, that's the best situation; otherwise you will encounter context switches... (which are more expensive when it comes to processes).
Hope I've answered all those questions!
Here are a few resources:
MultiCore CPUs, Multithreading and context switching?
Why does multiprocessing use only a single core after I import numpy?
Bonus celery-squirrel resource

Misunderstanding the difference between single-threading and multi-threading programming

I have a misunderstanding of the difference between single-threading and multi-threading programming, so I want an answer to the following question to make everything clear.
Suppose that there are 9 independent tasks and I want to accomplish them with a single-threaded program and a multi-threaded program. Basically it will be something like this:
Single-thread:
- Execute task 1
- Execute task 2
- Execute task 3
- Execute task 4
- Execute task 5
- Execute task 6
- Execute task 7
- Execute task 8
- Execute task 9
Multi-threaded:
Thread1:
- Execute task 1
- Execute task 2
- Execute task 3
Thread2:
- Execute task 4
- Execute task 5
- Execute task 6
Thread3:
- Execute task 7
- Execute task 8
- Execute task 9
As I understand, only ONE thread will be executed at a time (get the CPU), and once the quantum is finished, the thread scheduler will give the CPU time to another thread.
So, which program will finish earlier? Is it the multi-threaded program (logically)? Or is it the single-threaded program (since multi-threading involves a lot of context switching, which takes some time)? And why? I need a good explanation, please :)
It depends.
How many CPUs do you have? How much I/O is involved in your tasks?
If you have only 1 CPU, and the tasks have no blocking I/O, then the single threaded will finish equal to or faster than multi-threaded, as there is overhead to switching threads.
If you have 1 CPU, but the tasks involve a lot of blocking I/O, you might see a speedup by using threading, assuming work can be done when I/O is in progress.
If you have multiple CPUs, then you should see a speedup with the multi-threaded implementation over the single-threaded one, since more than one thread can execute in parallel. Unless, of course, the tasks are I/O dominated, in which case the limiting factor is your device speed, not CPU power.
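A minimal sketch of the blocking-I/O case in Python (sleep stands in for a blocking I/O call, and the timings in the comments are approximate): nine half-second waits take about 4.5 s sequentially but about 0.5 s with nine threads, because the waiting overlaps.

import time
from threading import Thread

def task():
    time.sleep(0.5)  # simulated blocking I/O; the CPU is free while we wait

start = time.perf_counter()
for _ in range(9):
    task()
print("single-threaded:", round(time.perf_counter() - start, 1), "s")  # ~4.5 s

start = time.perf_counter()
threads = [Thread(target=task) for _ in range(9)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("multi-threaded:", round(time.perf_counter() - start, 1), "s")   # ~0.5 s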
As I understand, only ONE thread will be executed at a time
That would be the case if the CPU only had one core. Modern CPUs have multiple cores, and can run multiple threads in parallel.
The program running three threads could run almost three times faster on a CPU with at least three cores. Even if the tasks are independent, there are still some resources in the computer that have to be shared between the threads, like memory access.
Well, this isn't entirely language-agnostic. Some interpreted programming languages don't support real threads. That is, threads of execution can be defined by the program, but the interpreter is single-threaded, so all execution happens on one core of the CPU.
For compiled languages and languages that support true multithreading, a single CPU can have many cores. Actually, most desktop computers now have 2 or 4 cores. So a multi-threaded program executing truly independent tasks can finish 2-4 times faster, based on the number of available cores in the CPU.
Assumption Set:
Single core with no hyperthreading;
tasks are CPU bound;
Each task takes 3 quanta of time;
Each scheduler allocation is limited to 1 quantum of time;
FIFO scheduler, non-preemptive;
All threads hit the scheduler at the same time;
All context switches require the same amount of time;
Processes are delineated as follows:
Test 1: Single Process, single thread (contains all 9 tasks)
Test 2: Single Process, three threads (contain 3 tasks each)
Test 3: Three Processes, each single threaded (contain 3 tasks each)
Test 4: Three Processes, each with three threads (contain one task each)
With the above assumptions, they all finish at the same time. This is because an identical amount of CPU time is scheduled, the context switches are identical, there is no interrupt handling, and nothing is waiting for I/O.
For more depth into the nature of this, please find this book.
The main difference between single-threading and multi-threading in Java is that in a single-threaded program one thread executes the tasks of a process, while in a multi-threaded program multiple threads execute the tasks of a process.
A process is a program in execution. Process creation is a resource-consuming task, so it is possible to divide a process into multiple units called threads. A thread is a lightweight process. It is possible to divide a single process into multiple threads and assign tasks to them. When there is one thread in a process, it is called a single-threaded application. When there are multiple threads in a process, it is called a multi-threaded application.
Ruby vs Python vs Node.js: for performance in web apps, which involve a lot of I/O, non-blocking REST calls and DB queries have a big impact; being the only one of the three built around non-blocking I/O by default, Node.js is the winner by a wide margin.
