From here, I learned how to run spatstat functions on multiple processors. I was wondering whether we can also run spatstat functions on a GPU. If so, I would be very grateful if you could explain how to run the following code on a GPU.
Best regards
library(spatstat)
ppplist <- replicate(4, cells, simplify = FALSE)
envlist <- parallel::mclapply(ppplist, spatstat::envelope, savefuns = TRUE, nsim = 10)
envfinal <- do.call(pool, envlist)
envfinal
If you're asking whether spatstat makes use of GPU hardware at a low level (in its internal C code), the answer is no. The internal C code is designed to be portable across a wide range of systems, rather than exploiting special hardware.
It may be possible to install R with support for GPU hardware. That would have an effect on the performance of spatstat because it would accelerate the base R functionality such as sorting, on which spatstat relies heavily.
Using GPUs does not necessarily make things run faster - it depends on the task and on the code - but we imagine that the spatstat simulation engine rmh could be made to run faster with GPUs.
I am new to the subject "modeling of physical systems". I read some basic literature and did some tutorials in Modelica and Simulink/Simscape. I wanted to ask whether I understand the following correctly:
Symbolic manipulation is the process of transforming a differential-algebraic system of equations (physical model: DAE) into a system of ordinary differential equations (ODE) that can be solved by standard solvers (Runge-Kutta, BDF, ...)
There are also solvers that can solve DAEs directly. But Modelica tools (OpenModelica, Dymola) and Simscape transform the system into an ODE (why is this method better than direct DAE solvers?)
A "flat Modelica code" is the result ( = ODE) of the transformation.
Thank you very much for your answers.
Symbolic processing for Modelica includes:
remove object-oriented structure and obtain a hybrid DAE (flat Modelica)
perform matching, index reduction, causalization to get an ODE
perform optimization (tearing, common subexpression elimination, etc)
generate code for a particular solver
OpenModelica can also solve the system in DAE mode without transforming it to ODE and I guess other Modelica tools can also do that.
A "flat Modelica code" is Modelica code where the object orientation is removed, connect equations are expanded to normal equations. The result is a hybrid DAE.
See Modelica Spec 3.3 for more info about all this (for example Appendix C):
https://modelica.org/documents/ModelicaSpec33Revision1.pdf
So I think your understanding of the terminology is very good too.
Due to the declarative (as opposed to imperative) way of programming in Modelica, we immediately get a very high number of algebraic equations. Solving these (partly) symbolically has, above all, these essential advantages:
Speed. Without eliminating algebraic loops, Modelica would not be practically usable for any real-world problem, and even with elimination, only in simple cases do no algebraic equations remain. It would be too slow and would force you to do the transformations manually in Modelica too (as in imperative languages, e.g. C/C++ or Simulink). Even today Modelica can still be slower than manually transformed and optimized solutions.
Moreover, Modelica applications often need to run simulations in real time.
Correctness. Symbolic transformations are based on proofs, and Modelica applications are often in the area of safety-critical or cyber-physical systems.
One additional consideration is that there are different forms of DAEs, and modeling often leads to high-index DAEs that are complicated to solve numerically (*). (Note that "high" means index greater than 1, typically 2, but sometimes even higher.)
Symbolic transformations can reduce high-index DAEs to semi-explicit index 1 DAEs, and then by (numerically) solving the systems of equations they are transformed into ODEs.
Thus, even if a tool solves DAEs directly, it is normally the semi-explicit index 1 DAEs that are solved, not the original high-index DAE.
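To make the step from a semi-explicit index 1 DAE to an ODE concrete, here is a minimal Python sketch. It is not taken from any Modelica tool: the example system, the function names, and the choice of Newton's method combined with explicit Euler are all assumptions made for illustration. The DAE has the form x' = f(x, y), 0 = g(x, y). At every step the algebraic equation is solved numerically for y, so the integrator only ever sees an ODE in x.

```python
# Hypothetical example system (an assumption, chosen for illustration):
#   x' = -x + y           differential equation
#   0  = y**3 + y - x     algebraic equation; dg/dy = 3*y**2 + 1 > 0, so index 1

def f(x, y):
    return -x + y

def g(x, y):
    return y**3 + y - x

def dg_dy(x, y):
    return 3.0 * y**2 + 1.0

def solve_algebraic(x, y_guess, tol=1e-12, max_iter=50):
    """Newton iteration for the y that satisfies g(x, y) = 0."""
    y = y_guess
    for _ in range(max_iter):
        step = g(x, y) / dg_dy(x, y)
        y -= step
        if abs(step) < tol:
            return y
    raise RuntimeError("Newton iteration did not converge")

def integrate(x0, t_end, n_steps):
    """Explicit Euler on x' = f(x, y(x)), where y(x) comes from Newton."""
    h = t_end / n_steps
    x, y = x0, 0.0
    for _ in range(n_steps):
        y = solve_algebraic(x, y)   # eliminate the algebraic variable
        x += h * f(x, y)            # ordinary ODE step in x only
    return x, solve_algebraic(x, y)

x_end, y_end = integrate(2.0, 1.0, 1000)
```

This only works because dg/dy is nonsingular, which is exactly the index 1 condition. For a high-index DAE the Newton step would break down, which is why index reduction has to happen first.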
(I know this answer is late. The hybrid part for the symbolic transformations is more complicated, still working on that.)
For more information see https://en.wikipedia.org/wiki/Differential-algebraic_system_of_equations
(*): There are some solvers for high index DAEs (in particular index 2), but typically they rely on a specific structure of the model and finding that structure requires similar techniques as reducing the index to 1.
Seems like writing devectorized code is encouraged in Julia.
There is even a package that tries to do that for you.
My question is why?
First of all, from the user-experience side, vectorized code is more concise (less code, so fewer opportunities for bugs), clearer (hence easier to debug), and a more natural way of writing code (at least for someone from a scientific computing background, which is the audience Julia tries to cater to). Being able to write something like vector'vector or vector'Matrix*vector is very important, because it corresponds to the actual mathematical representation, and this is how people in scientific computing think of it in their heads (not as nested loops). And I hate the fact that this is not the best way to write it, and that reparsing it into loops will be faster.
At the moment it seems like there is a conflict between the goal of writing the code that is fast and the code that is concise/clear.
Secondly, what is the technical reason for this? OK, I understand that vectorized code creates extra temporaries, etc., but vectorized functions (for example, broadcast(), map(), etc.) could potentially be multithreaded, and I think the benefit of multithreading could outweigh the overhead of temporaries and the other disadvantages of vectorized functions, making them faster than regular for loops.
Do current implementations of vectorized functions in Julia do implicit multithreading under the hood?
If not, is there work / plans to add implicit concurrency to vectorized functions and to make them faster than loops?
For easy reading I decided to turn my comment marathon above into an answer.
The core development statement behind Julia is "we are greedy". The core devs want it to do everything, and do it fast. In particular, note that the language is supposed to solve the "two-language problem", and at this stage, it looks like it will accomplish this by the time v1.0 hits.
In the context of your question, this means that everything you are asking about is either already a part of Julia, or planned for v1.0.
In particular, this means that if your programming problem lends itself to vectorized code, then write vectorized code. If it is more natural to use loops, use loops.
By the time v1.0 hits, most vectorized code should be as fast, or faster, than equivalent code in Matlab. In many cases, this development goal has already been achieved, since many vector/matrix operations in Julia are sent to the appropriate BLAS routines by the compiler.
Regarding multi-threading, native multi-threading is currently being implemented for Julia, and I believe an experimental set of routines is already available on the master branch. The relevant issue page is here. Implicit multithreading for some vector/matrix operations is already in theory available in Julia, since Julia calls BLAS. I'm not sure if it is switched on by default.
Be aware, though, that many vectorized operations will still (currently) be much faster in MATLAB, since MathWorks has been writing specialised multi-threaded C libraries for years and then calling them under the hood. Once Julia has native multi-threading, I expect Julia will overtake MATLAB, since at that point the entire dev community can scour the standard Julia packages and upgrade them to take advantage of native multi-threading wherever possible.
In contrast, MATLAB does not have native multi-threading, so you are relying on MathWorks to provide specialised multi-threaded routines in the form of underlying C libraries.
You can and should write vector'*matrix*vector (or perhaps dot(vector, matrix*vector) if you prefer a scalar output). For things like matrix multiplication, you're much better off using vectorized notation, because it calls the underlying BLAS libraries which are more heavily optimized than code produced by basically any language/compiler combination.
In other places, as you say, you can benefit from devectorization by avoiding temporary intermediates: for example, if x is a vector, the expression
y = exp(x).*x + 5
creates 3 temporary vectors: one for a = exp(x), one for b = a.*x and one for y = b + 5. In contrast,
y = [exp(z)*z+5 for z in x]
creates no temporary intermediates. Since loops and comprehensions in Julia are fast, there is no disadvantage to writing the devectorized version, and in fact it should perform slightly better (especially with performance annotations like @simd, where appropriate).
The arrival of threads may change things (making vectorized exp faster than a "naive" exp), but in general I'd say you should regard this as an "orthogonal" issue: Julia will likely make multithreading so easy to use that you yourself might write operations using multiple threads, and consequently the vectorized "library" routine still has no advantage over code you might write yourself. In other words, you might use multiple threads but still write devectorized code to avoid those temporaries.
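The difference between the two forms can be sketched outside Julia as well. The following Python sketch is only an illustration of where the intermediate arrays come from: the function names are made up, plain lists stand in for vectors, and nothing here says anything about actual Julia timings.

```python
import math

def vectorized(x):
    """Mimics y = exp(x).*x + 5, evaluated one whole-array operation at a time."""
    a = [math.exp(z) for z in x]          # intermediate array: exp(x)
    b = [ai * z for ai, z in zip(a, x)]   # intermediate array: a .* x
    y = [bi + 5 for bi in b]              # result array: b + 5
    return y

def fused(x):
    """Mimics y = [exp(z)*z+5 for z in x]: one pass, only the result is allocated."""
    return [math.exp(z) * z + 5 for z in x]

x = [0.0, 0.5, 1.0, 2.0]
y_vectorized = vectorized(x)
y_fused = fused(x)
```

Both functions perform the same arithmetic per element. The only difference is that the first one walks over the data three times and allocates an array at each stage, while the second walks over it once.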
In the longer term, a "sufficiently smart compiler" may avoid temporaries by automatically devectorizing some of these operations, but this is a much harder route, with potential traps for the unwary, than it may seem.
Your statement that "vectorized code is always more concise, and easier to understand" is, however, not true: many times while writing Matlab code, you have to go to extremes to come up with a vectorized way of writing what are actually simple operations when thought of in terms of loops. You can search the mailing lists for countless examples; one that I recall on SO is How to find connected components in a matrix using Julia.
Using the traditional, sequential reduction approach, the following graph is reduced as:
(+ (+ 1 2) (+ 3 4)) ->
(+ 3 (+ 3 4)) ->
(+ 3 7) ->
10
Graph reductions are, though, inherently parallel. One could, instead, reduce it as:
(+ (+ 1 2) (+ 3 4)) ->
(+ 3 7) ->
10
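The two strategies can be sketched in a few lines of Python. This is only a toy model with assumed names: nested tuples stand in for graph nodes, and "parallel" means that all reducible nodes are rewritten in the same step, not that real threads are used.

```python
def is_redex(e):
    """An addition is reducible once both of its arguments are numbers."""
    return isinstance(e, tuple) and isinstance(e[1], int) and isinstance(e[2], int)

def step_one(e):
    """Reduce only the leftmost reducible addition; return (new_expr, changed)."""
    if isinstance(e, int):
        return e, False
    op, left, right = e
    if is_redex(e):
        return left + right, True
    new_left, changed = step_one(left)
    if changed:
        return (op, new_left, right), True
    new_right, changed = step_one(right)
    return (op, left, new_right), changed

def step_all(e):
    """Reduce every currently reducible addition in the same step."""
    if isinstance(e, int):
        return e
    op, left, right = e
    if is_redex(e):
        return left + right
    return (op, step_all(left), step_all(right))

def trace(expr, step):
    """Apply `step` until a number remains; return all intermediate forms."""
    forms = [expr]
    while not isinstance(forms[-1], int):
        forms.append(step(forms[-1]))
    return forms

expr = ("+", ("+", 1, 2), ("+", 3, 4))
sequential = trace(expr, lambda e: step_one(e)[0])
parallel = trace(expr, step_all)
```

Here `sequential` goes through the same four forms as the first listing above, and `parallel` goes through the three forms of the second one.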
As far as I know, every functional programming language uses the first approach. I believe this is mostly because, on the CPU, scheduling threads overcompensate the benefits of doing parallel reductions. Recently, though, we've been starting to use the GPU more than the CPU for parallel applications. If a language ran entirely on the GPU, those communication costs would vanish.
Are there functional languages making use of that idea?
What makes you think that scheduling on the GPU would not overcompensate the benefits?
In fact, the kind of parallelism used in GPUs is far harder to schedule: it's SIMD parallelism, i.e. a whole batch of stream processors all do essentially the same thing at a time, except each one crunches a different bunch of numbers. So, not only would you need to schedule the subtasks, you would also need to keep them synchronised. Doing that automatically for general computations is virtually impossible.
Doing it for specific tasks works out quite well and has been embedded into functional languages; check out the Accelerate project.
SPOC provides some GPGPU access from OCaml.
on the CPU, scheduling threads overcompensate the benefits of doing parallel reduction
Thread scheduling is very effective in modern OSes. Thread initialization and termination may be a matter of concern, but there are plenty of techniques to eliminate those costs.
Graph reductions are, though, inherently parallel
As was mentioned in another answer, GPUs are very special devices. One can't simply take an arbitrary algorithm and make it 100 times faster just by rewriting it in CUDA. Speaking of CUDA, it is not exactly SIMD (Single Instruction, Multiple Data), but SIMT (Single Instruction, Multiple Threads). This is something far more complex, but let's think of it as a mere vector processing language. As the name suggests, vector processors are designed to handle dense vectors and matrices, i.e. simple linear data structures. So any branching within a warp serializes the divergent paths and can wipe out the benefit of parallelism. Modern architectures (Fermi+) are capable of processing even some trees, but this is rather tricky and the performance isn't great. So you simply can't accelerate arbitrary graph reduction.
What about functional languages for GPGPU? I don't believe they can be taken seriously. Most valuable CUDA code lives inside heavily optimized libraries written by PhDs, and it is aimed squarely at performance. The readability, declarativeness, clarity and even safety of functional languages don't matter there.
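The cost of branching inside a warp can be illustrated with a deliberately crude Python cost model. It is an assumption-laden toy, not a description of real hardware timing: the function name and the cycle counts are made up, and the only behaviour it encodes is that a lockstep group has to run every branch path that at least one of its lanes takes.

```python
def warp_cycles(conditions, then_cost, else_cost):
    """Cycles a lockstep group needs for `if c: A else: B` in this toy model.

    `conditions` holds one boolean per lane. Each path that is taken by at
    least one lane is executed by the whole group, with the other lanes masked.
    """
    cycles = 0
    if any(conditions):          # some lane takes the 'then' path
        cycles += then_cost
    if not all(conditions):      # some lane takes the 'else' path
        cycles += else_cost
    return cycles

uniform = warp_cycles([True] * 32, then_cost=10, else_cost=10)
divergent = warp_cycles([True] * 16 + [False] * 16, then_cost=10, else_cost=10)
```

In this model a group whose lanes all agree pays for one path, while a group that splits pays for both. Nested or data-dependent branching, as in general graph reduction, multiplies that effect.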
The language Obsidian is a domain specific language embedded in Haskell which targets GPGPU computations. It's rather more low-level than what you're asking for but I thought I'd mention it anyway.
https://github.com/BenjaminTrapani/FunGPU provides a Racket-like functional language that runs entirely on GPUs and other devices that can be targeted with SYCL. The runtime automatically schedules the independent sub-trees in a way that efficiently utilizes the GPU (multiple evaluations of the same instructions with different data are evaluated concurrently). It is still in early stages but could be worth experimenting with. It is already outperforming the Racket VM.
I have the "standard" version of Matlab without any additional toolboxes installed.
Is it somehow possible to make use of multithreading (use all cores of a quad-core instead of only one) without installing the Parallel Computing Toolbox?
I guess it is not, but maybe someone figured out a workaround?
Thank you very much!
There are several functions that are implemented using multi-threading. If you use these functions, all cores will be used: http://www.mathworks.com/matlabcentral/answers/95958
You can use threads/parallelism in C, C++ or Java, all of which can be called from Matlab (Java being probably the fastest/simplest way?).
A couple of observations:
a) Matlab's parallel constructs are quite heavyweight and will not give you a super-speedup. I personally prefer calling C/C++ code with OpenMP if I want fast-to-write parallelism.
b) Matlab's functions, in general, are not thread-safe, therefore calling them from multithreaded non-Matlab code is dangerous.
c) In image processing, some of the functions in Matlab are GPU-accelerated, therefore they are quite fast on their own.
I currently have a parallel for loop similar to this:
int testValues[16]={5,2,2,10,4,4,2,100,5,2,4,3,29,4,1,52};
parallel_for (1, 100, 1, [&](int i){
    int var4;
    int values[16]={-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
    /* ...nested for loops */
    /* compare the candidate against the reference, stop at the first mismatch */
    for (var4=0; var4<16; var4++) {
        if (values[var4] != testValues[var4]) break;
    }
    /* ...end nested loops */
});
I have optimised as much as I can to the point that the only thing more I can do is add more resources.
I am interested in utilising the GPU to help process the task in parallel. I have read that embarrassingly parallel tasks like this can make use of a modern GPU quite effectively.
Using any language, what is the easiest way to use the GPU for a simple parallel for loop like this?
I know nothing about GPU architectures or native GPU code.
as Li-aung Yip said in comments, the simplest way to use a GPU is with something like Matlab that supports array operations and automatically (more or less) moves those to the GPU. but for that to work you need to rewrite your code as pure matrix-based operations.
otherwise, most GPU use still requires coding in CUDA or OpenCL (you would need to use OpenCL with an AMD card). even if you use a wrapper for your favourite language, the actual code that runs on the GPU is still usually written in OpenCL (which looks vaguely like C). and so this requires a fair amount of learning/effort. you can start by downloading OpenCL from AMD and reading through the docs...
both those options require learning new ideas, i suspect. what you really want, i think, is a high level, but still traditional-looking, language targeted at the gpu. unfortunately, they don't seem to exist much, yet. the only example i can think of is theano - you might try that. even there, you still need to learn python/numpy, and i am not sure how solid the theano implementation is, but it may be the least painful way forwards (in that it allows a "traditional" approach - using matrices is in many ways easier, but some people seem to find that very hard to grasp, conceptually).
ps it's not clear to me that a gpu will help your problem, btw.
You might want to check out ArrayFire.
http://www.accelereyes.com/products/arrayfire
If you use OpenCL, you need to download separate implementations for the different device vendors: Intel, AMD, and Nvidia.
You might want to look into OpenACC, which enables parallelism via directives. You can port your code (C/C++/Fortran) to heterogeneous systems while maintaining source code that still runs well on a homogeneous system. Take a look at this introduction video. OpenACC is not GPU programming as such, but a way of expressing parallelism in your code, which may help you achieve performance improvements without much knowledge of low-level languages such as CUDA or OpenCL. OpenACC is available in commercial compilers from PGI, Cray, and CAPS (PGI offers new users a free 30 day trial).