Parallelization without Parallel Computing Toolbox - multithreading

I have the "standard" version of Matlab without any additional toolboxes installed.
Is it somehow possible to make use of multithreading (use all cores of a quad-core instead of only one) without installing the Parallel Computing Toolbox?
I guess it is not, but maybe someone has figured out a workaround?
Thank you very much!

There are several functions that are implemented using multithreading. If you use these functions, all cores will be used: http://www.mathworks.com/matlabcentral/answers/95958

You can use threads/parallelism in C, C++, or Java, all of which can be called from Matlab (Java probably being the fastest/simplest way).
A few observations:
a) Matlab's parallel constructs are quite heavyweight and will not give you a super-speedup. I personally prefer calling C/C++ code with OpenMP if I want fast-to-write parallelism; see the sketch after this list.
b) Matlab's functions, in general, are not thread-safe, so calling them from multithreaded non-Matlab code is dangerous.
c) In image processing, some of the functions in Matlab are GPU-accelerated, therefore they are quite fast on their own.
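
To make option (a) concrete, below is a minimal sketch of a MEX function that runs an OpenMP-parallel loop. The file name, the squaring example, and the build flags are my own illustration, not an official recipe (flag syntax varies by compiler and platform). Note that, per point (b), all MATLAB API calls stay outside the parallel region.

// square_omp.cpp: hypothetical example of an OpenMP-parallel MEX function.
// Build (GCC/Linux; flags vary by compiler):
//   mex CXXFLAGS="$CXXFLAGS -fopenmp" LDFLAGS="$LDFLAGS -fopenmp" square_omp.cpp
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("Expected one double array input.");

    const mwSize n = mxGetNumberOfElements(prhs[0]);
    const double *in = mxGetPr(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
    double *out = mxGetPr(plhs[0]);

    // The parallel region touches only plain C arrays; no calls back into
    // the (non-thread-safe) MATLAB API from worker threads.
    #pragma omp parallel for
    for (mwSignedIndex i = 0; i < (mwSignedIndex)n; ++i)
        out[i] = in[i] * in[i];
}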

Related

C++11 threading vs. OpenMP for simple parallel loops. Which, when?

This is something of a follow-up to this other question of mine.
I would like to know whether parallelized loops with a reduction operation, such as a parallelized integration, belong to the domain of applicability of C++11 threading, or whether OpenMP is better suited for tasks like this.
Now, consider the same setting but with threads executing computations that may throw exceptions. Does that change the scenario? Would C++11 threading now be better suited?
Thank you.
IMO, I would prefer OpenMP for any HPC / scientific and engineering computing codes. It more directly targets data parallelism. C++11 threading represents more task parallelism, which is preferable for other kinds of software (e.g., network server applications).
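For concreteness, here is a minimal sketch of the reduction case from the question: a numerical integration parallelized with an OpenMP loop. The integrand and bounds are my own illustrative choices, not from the original post.

// Midpoint-rule integration of exp(-x^2) over [0,1] with an OpenMP
// reduction. Compile with e.g. -fopenmp (GCC/Clang).
#include <cmath>
#include <cstdio>

int main()
{
    const int n = 10000000;
    const double a = 0.0, b = 1.0, h = (b - a) / n;
    double sum = 0.0;

    // Each thread accumulates a private partial sum; OpenMP combines
    // the partials when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        const double x = a + (i + 0.5) * h;
        sum += std::exp(-x * x);
    }

    std::printf("integral ~= %.10f\n", sum * h);
    return 0;
}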
The situation might change in the future; there are efforts to integrate more parallelism into C++, such as the parallel STL algorithms. However, we do not yet know what this parallelism will look like.
You also rarely build code from scratch. There are many performance-aware multithreaded libraries that support OpenMP (sorting, linear algebra, ...), but few that support C++11 threads.
As best I can determine, OpenMP represents greater performance potential, simply because there are a lot more tricks a compiler can use (particularly if your CPU supports vectorized computation) when it can be directly instructed to parallelize a construct. Host/dispatch threading models (like the threading models in Java and C++11) can't really do that without remarkably intelligent code-analysis tools.
However, OpenMP does represent a tax on both code readability and design flexibility. Parallel execution of heterogeneous tasks is possible in OpenMP, but much more verbose to implement and much more difficult to parse. And because it depends on preprocessor macros (which C++ purists don't like anyway), it's virtually impossible to set dynamic state about the threading model itself.
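As a small illustration of the verbosity point (the two task functions are made up for the example), running heterogeneous jobs concurrently in OpenMP requires the sections construct:

// Two unrelated tasks executed concurrently via OpenMP sections;
// compare with simply launching two std::threads.
#include <omp.h>
#include <cstdio>

static void load_assets() { std::printf("loading on thread %d\n", omp_get_thread_num()); }
static void build_index() { std::printf("indexing on thread %d\n", omp_get_thread_num()); }

int main()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        load_assets();

        #pragma omp section
        build_index();
    }
    return 0;
}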
Personally, having worked on enterprise-level code, I think I prefer host/dispatch threading (i.e., C++11 threads). It may represent a performance sacrifice, but as the saying goes: "Processor Cycles are much cheaper than Developer Cycles". And if you really, really are in a performance-constrained environment, it either means you have an algorithmic problem, which switching to OpenMP probably wouldn't fix, or it means you should probably be looking into compute cards or OpenCL/CUDA programming.
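One concrete point in favour of C++11 threading for the exception scenario raised in the question: an exception thrown inside a std::async task is captured and re-thrown from future::get(), whereas an exception must not escape an OpenMP parallel region. A minimal sketch (the chunked-sum splitting is my own illustration):

// Exceptions thrown in a task propagate to the caller via the future.
#include <cstdio>
#include <future>
#include <numeric>
#include <stdexcept>
#include <vector>

static double chunk_sum(const std::vector<double>& v, std::size_t lo, std::size_t hi)
{
    if (hi > v.size()) throw std::out_of_range("bad chunk bounds");
    return std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
}

int main()
{
    const std::vector<double> data(1000000, 0.5);
    const std::size_t half = data.size() / 2;

    auto f1 = std::async(std::launch::async, chunk_sum, std::cref(data), std::size_t{0}, half);
    auto f2 = std::async(std::launch::async, chunk_sum, std::cref(data), half, data.size());

    try {
        std::printf("sum = %f\n", f1.get() + f2.get());  // re-throws here if a task threw
    } catch (const std::exception& e) {
        std::printf("task failed: %s\n", e.what());
    }
    return 0;
}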

Why is devectorization encouraged in Julia?

Seems like writing devectorized code is encouraged in Julia.
There is even a package that tries to do that for you.
My question is why?
First of all, speaking from the user-experience aspect, vectorized code is more concise (less code, hence less likelihood of bugs), clearer (hence easier to debug), and a more natural way of writing code (at least for someone coming from a scientific computing background, whom Julia tries to cater to). Being able to write something like vector'*vector or vector'*Matrix*vector is very important, because it corresponds to the actual mathematical representation, and this is how scientific computing people think of it in their heads (not in nested loops). And I hate the fact that this is not the best way to write it, and that rewriting it as loops will be faster.
At the moment it seems like there is a conflict between the goal of writing the code that is fast and the code that is concise/clear.
Secondly, what is the technical reason for this? OK, I understand that vectorized code creates extra temporaries, etc., but vectorized functions (for example, broadcast(), map(), etc.) have the potential to be multithreaded, and I think the benefit of multithreading can outweigh the overhead of temporaries and the other disadvantages of vectorized functions, making them faster than regular for loops.
Do current implementations of vectorized functions in Julia do implicit multithreading under the hood?
If not, is there work / plans to add implicit concurrency to vectorized functions and to make them faster than loops?
For easy reading I decided to turn my comment marathon above into an answer.
The core development statement behind Julia is "we are greedy". The core devs want it to do everything, and do it fast. In particular, note that the language is supposed to solve the "two-language problem", and at this stage, it looks like it will accomplish this by the time v1.0 hits.
In the context of your question, this means that everything you are asking about is either already a part of Julia, or planned for v1.0.
In particular, this means that if your programming problem lends itself to vectorized code, then write vectorized code. If it is more natural to use loops, use loops.
By the time v1.0 hits, most vectorized code should be as fast, or faster, than equivalent code in Matlab. In many cases, this development goal has already been achieved, since many vector/matrix operations in Julia are sent to the appropriate BLAS routines by the compiler.
Regarding multi-threading: native multi-threading is currently being implemented for Julia, and I believe an experimental set of routines is already available on the master branch. The relevant issue page is here. Implicit multithreading for some vector/matrix operations is in theory already available in Julia, since Julia calls BLAS; I'm not sure whether it is switched on by default.
Be aware, though, that many vectorized operations will still (currently) be much faster in MATLAB, since MathWorks has been writing specialized multi-threaded C libraries for years and calling them under the hood. Once Julia has native multi-threading, I expect Julia will overtake MATLAB, since at that point the entire dev community can scour the standard Julia packages and upgrade them to take advantage of native multi-threading wherever possible.
In contrast, MATLAB does not have native multi-threading, so you are relying on MathWorks to provide specialized multi-threaded routines in the form of underlying C libraries.
You can and should write vector'*matrix*vector (or perhaps dot(vector, matrix*vector) if you prefer a scalar output). For things like matrix multiplication, you're much better off using vectorized notation, because it calls the underlying BLAS libraries which are more heavily optimized than code produced by basically any language/compiler combination.
In other places, as you say you can benefit from devectorization by avoiding temporary intermediates: for example, if x is a vector, the expression
y = exp(x).*x + 5
creates 3 temporary vectors: one for a = exp(x), one for b = a.*x and one for y = b + 5. In contrast,
y = [exp(z)*z+5 for z in x]
creates no temporary intermediates. Since loops and comprehensions in Julia are fast, there is no disadvantage to writing the devectorized version, and in fact it should perform slightly better (especially with performance annotations like @simd, where appropriate).
The arrival of threads may change things (making vectorized exp faster than a "naive" exp), but in general I'd say you should regard this as an "orthogonal" issue: Julia will likely make multithreading so easy to use that you yourself might write operations using multiple threads, and consequently the vectorized "library" routine still has no advantage over code you might write yourself. In other words, you might use multiple threads but still write devectorized code to avoid those temporaries.
In the longer term, a "sufficiently smart compiler" may avoid temporaries by automatically devectorizing some of these operations, but this is a much harder route, with potential traps for the unwary, than it may seem.
Your statement that "vectorized code is always more concise, and easier to understand" is, however, not true: many times while writing Matlab code, you have to go to extremes to come up with a vectorized way of writing what are actually simple operations when thought of in terms of loops. You can search the mailing lists for countless examples; one that I recall on SO is How to find connected components in a matrix using Julia.

Easiest way to use GPU for parallel for loop

I currently have a parallel for loop similar to this:
#include <ppl.h>  // for concurrency::parallel_for (assuming this is Microsoft's PPL)
using namespace concurrency;

int testValues[16] = {5,2,2,10,4,4,2,100,5,2,4,3,29,4,1,52};

parallel_for(1, 100, 1, [&](int i) {
    int var4;
    int values[16] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
    /* ...nested for loops */
    for (var4 = 0; var4 < 16; var4++) {
        if (values[var4] != testValues[var4]) break;
    }
    /* ...end nested loops */
});
I have optimised as much as I can, to the point that the only thing left to do is add more resources.
I am interested in utilising the GPU to help process the task in parallel. I have read that embarrassingly parallel tasks like this can make use of a modern GPU quite effectively.
Using any language, what is the easiest way to use the GPU for a simple parallel for loop like this?
I know nothing about GPU architectures or native GPU code.
As Li-aung Yip said in comments, the simplest way to use a GPU is with something like Matlab that supports array operations and automatically (more or less) moves those to the GPU. But for that to work you need to rewrite your code as pure matrix-based operations.
Otherwise, most GPU use still requires coding in CUDA or OpenCL (you would need to use OpenCL with an AMD card). Even if you use a wrapper for your favourite language, the actual code that runs on the GPU is still usually written in OpenCL (which looks vaguely like C). And so this requires a fair amount of learning/effort. You can start by downloading OpenCL from AMD and reading through the docs...
Both those options require learning new ideas, I suspect. What you really want, I think, is a high-level, but still traditional-looking, language targeted at the GPU. Unfortunately, they don't seem to exist much, yet. The only example I can think of is Theano; you might try that. Even there, you still need to learn Python/NumPy, and I am not sure how solid the Theano implementation is, but it may be the least painful way forwards (in that it allows a "traditional" approach: using matrices is in many ways easier, but some people seem to find that very hard to grasp, conceptually).
PS: it's not clear to me that a GPU will help your problem, by the way.
You might want to check out ArrayFire:
http://www.accelereyes.com/products/arrayfire
If you use OpenCL, you need to download separate implementations from the different device vendors: Intel, AMD, and NVIDIA.
You might want to look into OpenACC, which enables parallelism via directives. You can port your code (C/C++/Fortran) to heterogeneous systems while maintaining source code that still runs well on a homogeneous system. Take a look at this introductory video. OpenACC is not GPU programming as such, but a way of expressing parallelism in your code, which may help you achieve performance improvements without too much knowledge of low-level languages such as CUDA or OpenCL. OpenACC is available in commercial compilers from PGI, Cray, and CAPS (PGI offers new users a free 30-day trial).
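As a hedged sketch of what the directive-based approach looks like (the SAXPY-style loop and array sizes are my own illustration): a regular compiler ignores the pragma and runs the loop serially, while an OpenACC compiler such as PGI's pgc++ can offload it to the GPU.

// Annotated loop: copyin/copyout describe data movement between host and GPU.
// Build with an OpenACC compiler, e.g. pgc++ -acc; plain compilers simply
// ignore the pragma and produce a serial loop.
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20];

    for (int i = 0; i < n; ++i) b[i] = 2.0f;

    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i] + 1.0f;

    std::printf("a[0] = %f\n", a[0]);  // expect 5.0
    return 0;
}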

OpenMP with OCaml

Does anyone know if it's possible to use OpenMP with OCaml source code?
Or is there another environment, compatible with OCaml, that allows me to run parallel programs that exploit multiple cores?
If so, how? Do you have a simple example?
Currently there is OC4MC (OCaml 4 multicore) for shared-memory multiprocessing. I have not used the project, but there are fairly recent updates, so I can only assume the project is still moving forward.
JoCaml is another concurrency extension to OCaml, implementing the join calculus. I have also not used this project, but their site has been updated to mention OCaml 3.12, which came out fairly recently. Disregard; see comment.
If you can pry yourself away from the OpenMP paradigm, there are OCaml bindings for MPI. I use this project and have not had problems with it, and it's pretty easy to use if you are familiar with MPI.
Lastly, some (possibly unmaintained) packages pertaining to multi-core / parallel processing can be found on the OCaml Hump.

Which scripting languages support multi-core programming?

I have written a little python application and here you can see how Task Manager looks during a typical run.
(Task Manager screenshot; source: weinzierl.name)
While the application is perfectly multithreaded, unsurprisingly it uses only one CPU core.
Even though most modern scripting languages support multithreading, scripts can run on only one CPU core.
Ruby, Python, Lua, PHP all can only run on a single core.
Even Erlang, which is said to be especially good for concurrent programming, is affected.
Is there a scripting language that has built-in support for threads that are not confined to a single core?
WRAP UP
Answers were not quite what I expected, but the TCL answer comes close.
I'd like to add Perl, which (much like TCL) has interpreter-based threads.
Jython, IronPython and Groovy fall under the umbrella of combining a proven language with the proven virtual machine of another language. Thanks for your hints in this direction.
I chose Aiden Bell's answer as Accepted Answer.
He does not suggest a particular language but his remark was most insightful to me.
You seem to use a definition of "scripting language" that may raise a few eyebrows, and I don't know what that implies about your other requirements.
Anyway, have you considered TCL? It will do what you want, I believe.
Since you are including fairly general-purpose languages in your list, I don't know how heavy an implementation is acceptable to you. I'd be surprised if one of the zillion Scheme implementations doesn't do native threads; off the top of my head, I can only remember that MzScheme used to, but I seem to recall support was dropped. Certainly some of the Common Lisp implementations do this well. If Embeddable Common Lisp (ECL) does, it might work for you. I don't use it, though, so I'm not sure what the state of its threading support is, and this may of course depend on the platform.
Update: Also, if I recall correctly, GHC Haskell doesn't do quite what you are asking, but may do effectively what you want, since, again, as I recall, it will spin off a native thread per core or so and then run its threads across those.
You can freely multi-thread with the Python language in implementations such as Jython (on the JVM, as @Reginaldo mentions Groovy is) and IronPython (on .NET). For the classical CPython implementation of the Python language, as @Dan's comment mentions, multiprocessing (rather than threading) is the way to freely use as many cores as you have available.
Thread syntax may be static, but implementation across operating systems and virtual machines may change.
Your scripting language may use true threading on one OS and fake threads on another.
If you have performance requirements, it might be worth looking to ensure that the scripted threads fall through to the most beneficial layer in the OS. Userspace threads will be faster, but for largely blocking thread activity kernel threads will be better.
As Groovy is based on the Java virtual machine, you get support for true threads.
F# on .NET 4 has excellent support for parallel programming and extremely good performance as well as support for .fsx files that are specifically designed for scripting. I do all my scripting using F#.
An answer to this question has already been accepted, but just to add: besides TCL, the only other interpreted scripting language that I know of that supports multithreading and thread-safe programming is Qore.
Qore was designed from the bottom up to support multithreading; every aspect of the language is thread-safe; the language was designed to support SMP scalability and multithreading natively. For example, you can use the background operator to start a new thread or the ThreadPool class to manage a pool of threads. Qore will also throw exceptions for common threading errors (such as potential deadlocks, or misuse of threading APIs like trying to grab a lock that's already held by the current thread) so that they are immediately visible to the programmer.
Qore additionally supports thread resources; for example, a DatasourcePool allocation is treated as a thread-local resource. If you forget to commit or roll back a transaction before your thread ends, the thread-resource handling for the DatasourcePool class will roll back the transaction automatically and throw an exception with user-friendly information about the problem and how it was solved.
Maybe it could be useful for you; an overview of Qore's features is here: Why use Qore?.
CSScript in combination with Parallel Extensions shouldn't be a bad option. You write your code in pure C# and then run it as a script.
It is not related to the threading mechanism itself. The problem is that (in Python, for example) you have to get hold of the interpreter instance to run the script. To acquire the interpreter you have to lock it, since it maintains reference counts etc. and needs to avoid concurrent access to these objects. Python threads are pthreads, i.e. real threads, but when working with Python objects only one thread runs while the others wait. This is called the GIL (Global Interpreter Lock), and it is the main problem that makes real parallelism impossible inside a single process.
https://wiki.python.org/moin/GlobalInterpreterLock
Other scripting languages may have the same kind of problem.
Guile supports POSIX threads, which I believe are native OS threads (and so can run on multiple cores).
