Asynchronous execution in Matlab - multithreading

The bottleneck of my Matlab code looks like this:
x = randn(10^4,1);
y = CustomFunction(x) + CustomFunction(x+0.5);
Each call to CustomFunction(x) consumes about 0.1 seconds. As you can see, both evaluations are independent from each other and in theory could be run in parallel. Is it possible to implement this operation in parallel and gain speed by using multithreads, multicores or GPU?
Thanks!

Related

Optimising python3 multiprocessing when each process requires multiple threads

I have a list with 12 items (each of which is a sample), and I need to analyse each of these samples using a function. The function is a wrapper for an external pipeline which needs four threads to run. Some pseudocode:
def my_wrapper(sample, other_arg):
cmd = 'external_pipeline --threads 4 --sample {0} --other {1}'.format(sample, other_arg)
subprocess.Popen(shlex.split(cmd),stderr=subprocess.PIPE,stdout=subprocess.PIPE).communicate()
Previously I ran the function for each sample serially using a loop, which worked, but is relatively inefficient given the my CPU has 20 cores. Example code:
sample_list = ['sample_'+str(x) for x in range(1,13)]
for sample in sample_list:
my_wrapper(sample)
I've tried to use multiprocessing to up the efficiency. I've done this successfully in the past using the starmap function from multiprocessing. Example:
with mp.Pool(mp.cpu_count()) as pool:
results = pool.starmap(my_wrapper, [(sample, other_arg) for sample in sample_list])
This approach has worked well previously when the function I'm calling requires only 1 thread/core per process. However, it doesn't seem to work as I naively expect/hope in my current circumstance. There are 12 samples, each needing to be analysed with 4 threads, but I only have 20 threads in total. Accordingly, I'd expect/hope for 5 samples to be run at a time (5 samples * 4 threads for each = 20 threads total). Instead, all samples appear to be analysed simultaneously, with all 20 threads being used, despite 48 threads being required for this to be efficient.
How might I efficiently run these samples so that only 5 are run in parallel (with each of these processes/jobs using 4 threads)? Do I need to specify a chunk size, or am I barking up the wrong tree with this thought?
Apologies for the vague title and post content, I wasn't sure how to word any of it better!
Limiting the number of cores in your multiprocessing pool will then save some cores free to run when running your wrapper. This will do the chunking for you:
with mp.Pool(mp.cpu_count() / num_cores_used_in_wrapper) as pool:
results = pool.starmap(my_wrapper, [(sample, other_arg) for sample in sample_list])

Julia: Macro threads and parallel

as we know, Julia supports parallelism and this is something rooted in the language which is very good.
I recently saw that Julia supports threads but it seems to me to be experimental. I noticed that in the case of using the Threads.#Threads macro there is no need for Shared Arrays which is perhaps a computational advantage since no copies of the objects are performed. I also saw that there is the advantage of not declaring all functions with #everywhere.
Can anyone tell me the advantage of using the #parallel macro instead of the #threads macro?
Below are two simple examples of using non-synchronized macros for parallelism.
Using the #threads macro
addprocs(Sys.CPU_CORES)
function f1(b)
b+1
end
function f2(c)
f1(c)
end
result = Vector(10)
#time Threads.#threads for i = 1:10
result[i] = f2(i)
end
0.015273 seconds (6.42 k allocations: 340.874 KiB)
Using the #parallel macro
addprocs(Sys.CPU_CORES)
#everywhere function f1(b)
b+1
end
#everywhere function f2(c)
f1(c)
end
result = SharedArray{Float64}(10)
#time #parallel for i = 1:10
result[i] = f2(i)
end
0.060588 seconds (68.66 k allocations: 3.625 MiB)
It seems to me that for Monte Carlo simulations where loops are mathematically independent and there is a need for a lot of computational performance the use of the #threads macro is more convenient. What do you think the advantages and disadvantages of using each of the macros?
Best regards.
Here is my experience:
Threads
Pros:
shared memory
low cost of spawning Julia with many threads
Cons:
constrained to a single machine
number of threads must be specified at Julia start
possible problems with false sharing (https://en.wikipedia.org/wiki/False_sharing)
often you have to use locking or atomic operations for the program to work correctly; in particular many functions in Julia are not threadsafe so you have to be careful using them
not guaranteed to stay in the current form past Julia 1.0
Processess
Pros:
better scaling (you can spawn them e.g. on a cluster of multiple machines)
you can add processes while Julia is running
Cons:
low efficiency when you have to pass a lot of data between processes
slower to start
you have to explicitly share code and data to/between workers
Summary
Processes are much easier to work with and scale better. In most situations they give you enough performance. If you have large data transfers between parallel jobs threads will be better but are much more delicate to correctly use and tune.

Parallel computation

All,
I would like to use Ilnumerics for computations to be made in parallel. They are completely uncoupled. I would need it for
1) random restarts for an optimiser (especially stochastic optimiser, e.g. simulated annealing) : solving the same optimisation problems starting in parallel from different points:
e.g.: argmin_x f(x) starting from x0_h h = 1,2,..,K
2) same optimisation to be run over a sets of uncoupled data; as an example, consider the following unconstrained optimisation problem:
given a function f (R^d x R^p) --> R of x \in R^d and p parameters p\in R^d
solve argmin_x f(x,p_h), h = 1, 2, ..., K.
I hope the notation is clear enough.
Would it be possible to run this loop in parallel, executing everytime some lambda expression involving ILnumerics objects and leveraging on multicores architectures?
Thanks in advance, as usual,
GL
It depends: ILNumerics automatically parallelizes mathematical expressions like
C = A + B[":;2"] / 0.4 * pinv(C) ...
By attempting to run multiple instances of such expressions in parallel, using multiple threads from the thread pool, you would end up producing a lot of contention by too many threads competing for the CPU time slots. In the result your algorithm may runs slower than without parallelizing it.
So, in that case you may disable the internal automatic parallelization ILNumerics does transparently for you:
Settings.MaxNumberThreads = 1;
Expressions like the one above will get evaluated within a single thread afterwards. However, now you are responsible for distributing computational tasks over multiple threads. And moreover, you will have to lock your arrays accordingly - because ILNumerics is not thread safe in general! This allows you to write concurrently to your output arrays but also brings the burdon of having to implement a correct locking scheme...

overriding default Parallel Collections behavior in scala

I have a large batched parallel computation that I use a parallel map for in scala. I have noticed that there appears to be a gradual downstepping of CPU usage as the workers finish. It all comes down to a call to a call inside of the Map object
scala.collection.parallel.thresholdFromSize(length, tasksupport.parallelismLevel)
Looking at the code, I see this:
def thresholdFromSize(sz: Int, parallelismLevel: Int) = {
val p = parallelismLevel
if (p > 1) 1 + sz / (8 * p)
else sz
}
My calculation works great on a large number of cores, and now I understand why..
thesholdFromSize(1000000,24) = 5209
thesholdFromSize(1000000,4) = 31251
If I have an array of length 1000000 on 24 CPU's it will partition all the way down to 5209 elements. If I pass that same array into the parallel collections on my 4 CPU machine, it will stop partitioning at 31251 elements.
It should be noted that the runtime of my calculations is not uniform. Runtime per unit can be as much as 0.1 seconds. At 31251 items, that's 3100 seconds, or 52 minutes of time where the other workers could be stepping in and grabbing work, but are not. I have observed exactly this behavior while monitoring CPU utilization during the parallel computation. Obviously I'd love to run on a large machine, but that's not always possible.
My question is this: Is there any way to influence the parallel collections to give it a smaller threshold number that is more suited to my problem? The only thing I can think of is to make my own implementation of the class 'Map', but that seems like a very non-elegant solution.
You want to read up on Configuring Scala parallel collections. In particular, you probably need to implement a TaskSupport trait.
I think all you need to do is something like this:
yourCollection.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(24))
The parallelism parameter defaults to the number of CPU cores that you have, but you can override it like above. This is shown in the source for ParIterableLike as well.
0.1 second is large time enough to handle it separately. Wrap processing of each unit (or 10 units) in a separate Runnable and submit all of them to a FixedThreadPool. Another approach is to use ForkJoinPool - then it is easier to control the end of all computations.

can i easily write a program to make use of Intel's Quad core or i7 chip if only 1 thread is used?

I wonder if in my program I have only 1 thread, can I write it so that the Quad core or i7 can actually make use of the different cores? Usually when i write programs on a Quad core computer, the CPU usage will only go to about 25%, and the work seems to be divided among the 4 cores, as the Task Manager shows. (the programs i wrote usually is Ruby, Python, or PHP, so they may not be so much optimized).
Update: what if i write it in C or C++ instead, and
for (i = 0; i < 100000000; i++) {
a = i * 2;
b = i + 1;
if (a == ... || b == ...) { ... }
}
and then use the highest level of optimization with the compiler. can the compiler make the multiplication happen on one core, and the addition happen on a different core, and therefore make 2 cores work at the same time? isn't that a fairly easy optimization to use 2 cores?
No. You need to use threads to execute multiple paths concurrently on multiple CPU's (be they real or virtual)... execution of one thread is inherently bound to one CPU as this maintains the "happens before" relationship between statements, which is central to how programs work.
First, unless multiple threads are created in the program, then there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100%, but all other cores are not being used. If all cores were used, then it would be theoretically possible for the process to hog 100% of the CPU resources.
As a side note, the graphs shown in Task Manager in Windows is the CPU utilization by all processes running at the time, not only for one process.
Secondly, the code you present could be split into code which can execute on two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other, and they only depend on i. With that type of situation, separating the inside of the for loop like the following could allow multi-threaded operation which could lead to increased performance:
// Process this in one thread:
for (int i = 0; i < 1000; i++) {
a = i * 2;
}
// Process this in another thread:
for (int i = 0; i < 1000; i++) {
b = i + 1;
}
However, what becomes tricky is if there needs to be a time when the results from the two separate threads need to be evaluated, such as seems to be implied by the if statement later on:
for (i = 0; i < 1000; i++) {
// manipulate "a" and "b"
if (a == ... || b == ...) { ... }
}
This would require that the a and b values which reside in separate threads (which are executing on separate processors) to be looked up, which is a serious headache.
There is no real good guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition probably will take different amount of times to execute), and that means that one thread may need to wait for another for the i values to get in sync before comparing the a and b that corresponds to the dependent value i. Or, do we make a third thread for value comparison and synchronization of the two threads? In either case, the complexity is starting to build up very quickly, so I think we can agree that we're starting to see a serious mess arising -- sharing states between threads can be very tricky.
Therefore, the code example you provide is only partially parallelizable without much effort, however, as soon as there is a need to compare the two variables, separating the two operations becomes very difficult very quickly.
Couple of rules of thumbs when it comes to concurrent programming:
When there are tasks which can be broken down into parts which involve processing of data that is completely independent of other data and its results (states), then parallelizing can be very easy.
For example, two functions which calculates a value from an input (in pseudocode):
f(x) = { return 2x }
g(x) = { return x+1 }
These two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as they are no states to share or handle between calculations, even if there were multiple values of x that needed to be calculated, even those can be split up further:
x = [1, 2, 3, 4]
foreach t in x:
runInThread(f(t))
foreach t in x:
runInThread(g(t))
Now, in this example, we can have 8 separate threads performing calculations. Not having side effects can be very good thing for concurrent programming.
However, as soon as there is dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases, these types of problems will have to be performed in serial as they await results from other calculations to be returned.
Perhaps the question comes down to, why can't compilers figure out parts that can be automatically parallelized and perform those optimizations? I'm not an expert on compilers so I can't say, but there is an article on automatic parallization at Wikipedia which may have some information.
I know Intel chips very well.
Per your code, "if (a == ... || b == ...)" is a barrier, otherwise the processor cores will execute all code parallelly, regardless of compiler had done what kind of optimization. That only requires that the compiler is not a very "stupid" one. It means that the hardware has the capability itself, not software. So threaded programming or OpenMP is not necessary in such cases though they will help on improving parallel computing. Note here doesn't mean Hyper-threading, just normal multi-core processor functionalities.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classical example which could be executed by multi-core/multi-channel IMC platforms (e.g. Intel Nehalem family such as Core i7) parallelly, no extra software optimization would be needed.
char buffer0[64];
char buffer1[64];
char buffer2[64];
char buffer[192];
int i;
for (i = 0; i < 64; i++) {
*(buffer + i) = *(buffer0 + i);
*(buffer + 64 + i) = *(buffer1 + i);
*(buffer + 128 + i) = *(buffer2 + i);
}
Why? 3 reasons.
1 Core i7 has a triple-channel IMC, its bus width is 192 bits, 64 bits per channel; and memory address space is interleaved among the channels on a per cache-line basis. cache-line length is 64 bytes. so basicly buffer0 is on channel 0, buffer1 will be on channel and buffer2 on channel 2; while for buffer[192], it was interleaved among 3 channels evently, 64 per channel. The IMC supports loading or storing data from or to multiple channels concurrently. That's multi-channel MC burst w/ maximum throughput. While in my following description, I'll only say 64 bytes per channel, say w/ BL x8 (Burst Length 8, 8 x 8 = 64 bytes = cache-line) per channel.
2 buffer0..2 and buffer are continuous in the memory space (on a specific page both virtually and physically, stack memroy). when run, buffer0, 1, 2 and buffer are loaded/fetched into the processor cache, 6 cache-lines in total. so after start the execution of above "for(){}" code, accessing memory is not necessary at all because all data are in the cache, L3 cache, a non-core part, which is shared by all cores. We'll not talk about L1/2 here. In this case every core could pick the data up and then compute them independently, the only requirement is that the OS supports MP and stealing task is allowed, say runtime scheduling and affinities sharing.
3 there're no any dependencies among buffer0, 1, 2 and buffer, so there're no execution stall or barriers. e.g. execute *(buffer + 64 + i) = *(buffer1 + i) doesn't need to wait the execution of *(buffer + i) = *(buffer0 + i) for done.
Though, the most important and difficult point is "stealing task, runtime scheduling and affinities sharing", that's because for a give task, there's only one task exection context and it should be shared by all cores to perform parallel execution. Anyone if could understand this point, s/he is among the top experts in the world. I'm looking for such an expert to cowork on my open source project and be responsible for parallel computing and latest HPC architectures related works.
Note in above example code, you also could use some SIMD instructions such as movntdq/a which will bypass processor cache and write memory directly. It's a very good idea too when perform software level optimization, though accessing memory is extremely expensive, for example, accessing cache (L1) may need just only 1 cycle, but accessing memory needs 142 cycles on former x86 chips.
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com to know the details.
Implicit parallelism is probably what you are looking for.
If your application code is single-threaded multiple processors/cores will only be used if:
the libraries you use are using multiple threads (perhaps hiding this usage behind a simple interface)
your application spawns other processes to perform some part of its operation
Ruby, Python and PHP applications can all be written to use multiple threads, however.
A single threaded program will only use one core. The operating system might well decide to shift the program between cores from time to time - according to some rules to balance the load etc. So you will see only 25% usage overall and the all four cores working - but only one at once.
The only way to use multiple cores without using multithreading is to use multiple programs.
In your example above, one program could handle 0-2499999, the next 2500000-4999999, and so on. Set all four of them off at the same time, and they will use all four cores.
Usually you would be better off writing a (single) multithreaded program.
With C/C++ you can use OpenMP. It's C code with pragmas like
#pragma omp parallel for
for(..) {
...
}
to say that this for will run in parallel.
This is one easy way to parallelize something, but at some time you will have to understand how parallel programs execute and will be exposed to parallel programming bugs.
If you want to parallel the choice of the "i"s that evaluate to "true" your statement if (a == ... || b == ...) then you can do this with PLINQ (in .NET 4.0):
//note the "AsParallel"; that's it, multicore support.
var query = from i in Enumerable.Range(0, 100000000).AsParallel()
where (i % 2 == 1 && i >= 10) //your condition
select i;
//while iterating, the query is evaluated in parallel!
//Result will probably never be in order (eg. 13, 11, 17, 15, 19..)
foreach (var selected in query)
{
//not parallel here!
}
If, instead, you want to parallelize operations, you will be able to do:
Parallel.For(0, 100000000, i =>
{
if (i > 10) //your condition here
DoWork(i); //Thread-safe operation
});
Since you are talking about 'task manager', you appear to be running on Windows. However, if you are running a webserver on there (for Ruby or PHP with fcgi or Apache pre-forking, ant to a lesser extent other Apache workers), with multiple processes, then they would tend to spread out across the cores.
If only a single program without threading is running, then, no, no significant advantage will come from that - you're only ruinning one thing at a time, other than OS-driven background processes.

Resources