Multi-core J -- Parallelisation - multithreading

Is there a way to get J to use multiple cores? I thought part of the benefit of APL/J was that the language constructs lent themselves well to parallel solutions.
Looking at my CPU usage (I'm on OSX) there's clearly only a single processor in use.
I've got a heavy-ish function f acting on a list, and I don't see why it couldn't divide the list into 4 pieces, and re-assemble the results?

ArrayFire may be worth looking into. It's OpenCL-based, with support for AMD/NVIDIA and a fallback to CPU. It does array processing. It should bind easily to J, as it does in Matlab.
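The "divide the list into 4 pieces and re-assemble the results" idea from the question can be sketched as follows. This is Python rather than J, and it only illustrates the shape of the computation; `f`, `split` and `chunked_map` are hypothetical names, not part of J or ArrayFire.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # Hypothetical stand-in for the "heavy-ish" function from the question.
    return x * x + 1

def split(items, pieces):
    # Cut the list into `pieces` contiguous slices of near-equal length.
    size, extra = divmod(len(items), pieces)
    out, start = [], 0
    for i in range(pieces):
        end = start + size + (1 if i < extra else 0)
        out.append(items[start:end])
        start = end
    return out

def chunked_map(func, items, pieces=4):
    # Apply func to each slice on its own worker, then concatenate in order.
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        parts = pool.map(lambda chunk: [func(x) for x in chunk],
                         split(items, pieces))
        return [y for part in parts for y in part]

data = list(range(10))
result = chunked_map(f, data)
```

Because `pool.map` yields the per-slice results in submission order, the re-assembled list matches a plain sequential map. In CPython, threads alone will not speed up CPU-bound pure-Python work, so a real speed-up would need a process pool or a native library behind `f`.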

Related

Are floating point operations deterministic when running in multiple threads?

Suppose I have a function that runs calculations, for example something like a dot product: I pass in arrays A and B of vectors and a float array C, and the function assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both the threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed to give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a bit more worried that CPU cores might not have the same instruction sets, even on consumer level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, c++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
1. Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
2. Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
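The first option (a map from rendering vertices to unique geometric vertices) might look roughly like the sketch below. It is written in Python for brevity rather than in the C++/Unity setting of the question, and every name in it (`build_geometric_map`, `scatter_back`, the sample positions) is a hypothetical illustration.

```python
def build_geometric_map(render_positions):
    # Map each rendering vertex to the index of its unique geometric position.
    unique_positions = []   # vertices of the geometric mesh
    index_of = {}           # position tuple -> geometric index
    render_to_geo = []      # one entry per rendering vertex
    for pos in render_positions:
        key = tuple(pos)    # duplicates are detected by exact equality
        if key not in index_of:
            index_of[key] = len(unique_positions)
            unique_positions.append(key)
        render_to_geo.append(index_of[key])
    return unique_positions, render_to_geo

def scatter_back(geo_results, render_to_geo):
    # Copy each geometric result to every rendering vertex that maps to it.
    return [geo_results[g] for g in render_to_geo]

render_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
unique_positions, render_to_geo = build_geometric_map(render_positions)

# Hypothetical "procedure": shift every geometric vertex along x.
moved = [(x + 0.5, y, z) for (x, y, z) in unique_positions]
new_render_positions = scatter_back(moved, render_to_geo)
```

With this layout each geometric position is computed once, so duplicated rendering vertices receive the same value by construction, whatever the floating-point behaviour of the procedure.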
Floating-point (FP) addition and multiplication are not associative (though they are commutative). As a result, (x+y)+z can give a different result than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats.
The C++ standard is not very restrictive about floating-point numbers. However, the widely used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant, and mainstream compilers (e.g. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default, since it assumes FP operations are associative for the sake of performance. Mainstream compilers have compilation flags that make the same assumption in order to speed up code. It is generally combined with others, such as the assumption that no FP value is NaN (e.g. -ffast-math). Such flags break IEEE-754 compliance, but they are often used in the 3D and video game industry to speed up code. IEEE-754 is not required by the C++ standard, but you can check for it with std::numeric_limits<T>::is_iec559.
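A quick way to see the non-associativity for yourself, sketched here in Python (whose float is a 64-bit IEEE-754 double on mainstream platforms); the variable names are only illustrative:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # a and b are added first
right = a + (b + c)   # b and c are added first

associative = (left == right)    # grouping changes the rounding
commutative = (a + b == b + a)   # swapping operands does not
```

The two groupings round at different points, so `left` and `right` differ in the last bit, while swapping the operands of a single addition never changes its result.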
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very high overhead (see this for more information).
Assuming that IEEE-754 compliance is not broken, the rounding mode is the same, and the threads do the operations in the same order, the result should be identical up to at least 1 ULP. In practice, if the code is compiled with the same mainstream compiler, the result should be exactly the same.
The thing is that using multiple threads often results in a non-deterministic order of the applied FP operations, which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use static partitioning and avoid atomic operations on FP variables or, more generally, any atomic operations that could result in a different ordering. The same applies to locks and other synchronization mechanisms.
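As a rough sketch of what static partitioning means here (in Python, with hypothetical names such as `static_partition_sum`): chunk boundaries are fixed in advance, each worker sums only its own chunk, and the partial sums are combined in chunk order rather than in completion order.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def static_partition_sum(values, pieces=4):
    # Fixed chunk boundaries: chunk k always covers the same indices.
    size = max(1, (len(values) + pieces - 1) // pieces)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        # pool.map returns results in chunk order, not completion order.
        partials = list(pool.map(sum, chunks))
    total = 0.0
    for p in partials:   # combine in a fixed order
        total += p
    return total

rng = random.Random(42)
values = [rng.uniform(-1e6, 1e6) for _ in range(10_000)]

# Repeated runs perform the same additions in the same order.
runs = {static_partition_sum(values) for _ in range(20)}
```

Every run executes the same sequence of additions, so `runs` ends up holding a single value however the threads are scheduled. Accumulating into one shared variable as workers finish would not give that guarantee.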
The same is true for GPUs. In fact, this problem is very frequent when developers use atomic FP operations, for example to sum values. They often do that because implementing fast reductions is complex (though more deterministic) and atomic operations are pretty fast on modern GPUs (since they use dedicated efficient units).
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure, a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.

Parallelism in functional languages

One of FP features advertised is that a program is "parallel by default" and that naturally fits modern multi-core processors. Indeed, reducing a tree is parallel by its nature. However, I don't understand how it maps to multi-threading. Consider the following fragment (pseudo code):
let y = read-and-parse-a-number-from-console
let x = get-integer-from-web-service-call
let r = 5 * x - y * 4
write-r-to-file
How can a translator determine which of tree branches should be run on a thread? After you obtained x or y it would be stupid to reduce 5 * x or y * 4 expressions on a separate thread (even if we grab it from a thread pool), wouldn't it? So how different functional languages handle this?
We're not quite there yet.
Programs in pure declarative style (functional style is included in this category, but so are some other styles) tend to be much more amenable to parallelisation, because all data dependencies are explicit. This makes it very easy for the programmer to manually use primitives the language provides for specifying that two independent computations should be done in parallel, regardless of whether they share access to any data; if everything's immutable and there are no side effects, then changing the order in which things are done can't affect the result.
If purity is enforced by the language (as in Haskell, Mercury, etc, but unlike in Scala, F#, etc where purity is encouraged but unenforced), then it is possible for the compiler to attempt to automatically parallelise the program, but no existing language that I know of does this by default. If the language allows unchecked impure operations then it's generally impossible for the compiler to do the analysis necessary to prove that a given attempt to auto-parallelise the program is valid. So I do not expect any such language to ever support auto-parallelisation very effectively.
Note that the pseudo program you wrote is probably not pure declarative code. let y = read-and-parse-a-number-from-console and let x = get-integer-from-web-service-call are calculating x and y from impure external operations, and there's nothing in the program that fixes the order in which they should run. It's possible in general that executing two impure operations in either order will produce different results, and running those two operations in different threads gives up control of the order in which they run. So if a language like that was to auto-parallelise your program, it would almost certainly either introduce horrible concurrency bugs, or refuse to significantly parallelise anything.
However, the functional style still makes it easy to manually parallelise such programs. A human programmer can tell that it almost certainly doesn't matter in which order you read from the console and the network. Knowing that there's no shared mutable state, they can decide to run those two operations in parallel without digging into their implementations (which you'd have to do in imperative algorithms where there might be mutable shared state even if it doesn't look like there is from the interface).
But the big complication that's in the way of auto-parallelising compilers for enforced-purity languages is knowing how much parallelisation to do. Running every computation possible in parallel vastly overwhelms any possible benefit with all the startup cost of spawning new threads (not to mention the context switches), as you try to run huge numbers of very short-lived threads on a small number of processors. The compiler needs to identify a much smaller number of "chunks" of computation that are reasonably large, and run the chunks in parallel while running the sub-computations of each chunk sequentially.
But only "embarrassingly parallel" programs decompose nicely into very large completely independent computations. Most programs are much more interdependent. So unless you only want to be able to auto-parallelise programs that are very easy to manually parallelise, your auto-parallelisation probably needs to be able to identify and run "chunks" in parallel which are partially dependent on each other, having them wait when they get to points that really need a result that's supposed to be computed by another "chunk". This introduces extra overhead of synchronisation between the threads, so the logic that chooses what to run in parallel needs to be even better in order to beat the trivial strategy of just running everything sequentially.
The developers of Mercury (a pure logic programming language) are working on various methods of tackling these problems, from static analysis to using profiling data. If you're interested, their research papers have a lot more information. I presume other researchers are working on this area in other languages, but I don't know much about any other projects.
In that specific example, the third statement depends on the first and the second, but there is no interdependency between the first and the second. Therefore, a runtime environment could execute read-and-parse-a-number-from-console on a different thread from get-integer-from-web-service-call, but the execution of the third statement would have to wait until the first two are finished.
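Done by hand, the dependency structure described above could be sketched like this in Python. The two stub functions are hypothetical stand-ins for the console read and the web service call, returning fixed values so the example is self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def read_and_parse_a_number_from_console():
    # Hypothetical stub; a real version would block on console input.
    return 3

def get_integer_from_web_service_call():
    # Hypothetical stub; a real version would block on a network request.
    return 10

with ThreadPoolExecutor(max_workers=2) as pool:
    # The first two statements do not depend on each other,
    # so both are started before either result is needed.
    y_future = pool.submit(read_and_parse_a_number_from_console)
    x_future = pool.submit(get_integer_from_web_service_call)
    # The third statement needs both values; .result() waits for each.
    r = 5 * x_future.result() - y_future.result() * 4
```

The arithmetic itself stays on the calling thread. Only the two slow, independent operations are worth handing to workers.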
Some languages or runtime environments may be able to calculate a partial result (such as y * 4) before obtaining an actual value of x. As a high level programmer though, you would be unlikely to be able to detect this.

Hypothesis testing and GPGPU

I'm very new to GPGPU and Programming. I'm interested to know if statistical hypothesis testing like one-sample Kolmogorov-Smirnov test (K–S test) and Levene's test could be implemented in GPGPU (SIMD) using CUDA? If so what will be the limitations?
I just read web definitions about these tests, but, if I understood correctly, they can be properly accelerated by the kind of parallelism expressed by SIMD (in particular as implemented by CUDA).
In K-S test, one has to compute the difference between a function and an estimate on N samples, then take the maximum difference. In other words, one has to perform the same operation on N different values, which is exactly SIMD (single instruction, multiple data).
In Levene's test, there is again the same difference, square and multiplication over N different values.
What SIMD can do is a sort of FOR statement over N value sets, provided that the iterations are independent of each other. Thus, in CUDA for example, the iterations can be allocated to the processing elements of the graphics device so that, executing in parallel, the FOR loop covers all the data in roughly the time of a single iteration, as long as there are enough processing elements for N.
The CUDA toolkit provides a specific C/C++ compiler (NVCC) where special instructions are dispatched to the GPGPU rather than to the CPU, therefore distributed to its parallel processing elements.
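To make the structure of the K-S computation concrete, here is a small sequential sketch in Python (not CUDA). The function names and the uniform reference distribution are assumptions chosen for illustration. The per-sample gap computation is the independent "same operation on N values" part, and the final maximum is a reduction.

```python
def ks_statistic(samples, cdf):
    # One-sample K-S statistic: largest gap between the empirical
    # distribution of the samples and the reference CDF.
    xs = sorted(samples)
    n = len(xs)
    # Map step: each index is independent of the others.
    gaps = [max((i + 1) / n - cdf(x), cdf(x) - i / n)
            for i, x in enumerate(xs)]
    # Reduce step: a maximum over all per-sample gaps.
    return max(gaps)

def uniform_cdf(x):
    # Reference distribution for the example: uniform on [0, 1].
    return min(1.0, max(0.0, x))

d = ks_statistic([0.9, 0.1, 0.5], uniform_cdf)
```

On a GPU, the map step would correspond to one thread per sample and the maximum to a parallel reduction. The initial sort is a separate step that is not element-wise independent.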

Multithreaded operators

How can an operator +, -, *, /, etc... be multithreaded?
In this paper on the language IDL, it claims that all operators use the 'thread pool' for increased execution speed.
How is it that one can use multiple threads to execute a statement such as 'b = a - a' (as on page 42) of the paper?
Can someone explain this? (I currently consider IDL a total ripoff, but maybe someone can change my mind.)
(Really this applies to any language: how can an operator be multithreaded in any computer programming language?)
I think it's important to also consider that not all operations with + are created equal. If you're using some sort of bignum library, for example, you might be able to separate a large number into smaller parts, do distinct sums of integers (in parallel), then carry over. In any case, it's not going to be a single-cycle addition of two integers. Multiplication involves a couple of steps, and division a lot of steps.
In the example given, the floating-point data (floating point means a non-trivial adding process) had "4.2 million data points": I doubt they were storing that in a small 32-bit register.
The "simple" operation for addition has suddenly become a huge iterative process... or maybe something a lot faster if they are able to do it in parallel.
While simple operations with small integers might not be worth threading, it's worthwhile to note that B=A+A, while seeming simple, could actually lead to many calculations. 1 line of code doesn't necessarily mean 1 operation.
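The bignum case mentioned above can be sketched as two phases: independent limb-wise sums, then a carry pass. The Python below is only an illustration under assumed names (`to_limbs`, `add_limbs`, base 10^9 limbs); it runs sequentially and just marks which phase could be split across threads.

```python
BASE = 10**9  # each limb holds nine decimal digits

def to_limbs(n):
    # Split a non-negative integer into little-endian limbs.
    limbs = []
    while True:
        n, low = divmod(n, BASE)
        limbs.append(low)
        if n == 0:
            return limbs

def from_limbs(limbs):
    return sum(limb * BASE**i for i, limb in enumerate(limbs))

def add_limbs(a, b):
    width = max(len(a), len(b))
    a = a + [0] * (width - len(a))
    b = b + [0] * (width - len(b))
    # Phase 1: limb-wise sums. Each position is independent,
    # so this part could be divided among threads.
    raw = [x + y for x, y in zip(a, b)]
    # Phase 2: carry propagation, done sequentially here.
    out, carry = [], 0
    for s in raw:
        carry, digit = divmod(s + carry, BASE)
        out.append(digit)
    if carry:
        out.append(carry)
    return out

p = 123456789012345678901234567890
q = 987654321098765432109876543210
total = from_limbs(add_limbs(to_limbs(p), to_limbs(q)))
```

A single `+` on such values therefore expands into many small additions, which is the point made above about one line of code not meaning one operation.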
I don't know about IDL, but it's certainly possible if you have some higher level types. For instance you could conveniently parallelize array operations. Presumably that's what the "4200000 pts" refers to, although someone decided to make the graphs really hard to read.
For comparison, in C (with possible OpenMP parallelization) you might have something like:
#pragma omp parallel for
for (int i = 0; i < sizeof(B) / sizeof(B[0]); i++) {
    B[i] -= A[i];  /* iterations are independent, so OpenMP can split them across threads */
}
In a higher-level setting, such as NumPy, Matlab or C++, it could be just B=B-A. All that said, B=A-A sounds confusingly like B=0 to me.
You asked for a parallel operator in a favorite language? Here's a bit of Haskell:
import Control.Parallel

-- Map f over a list, sparking evaluation of the rest of the list in parallel.
pmap _ [] = []
pmap f (x:xs) =
  let rest = pmap f xs
  in rest `par` (f x) : rest

parOp op a b = pmap (uncurry op) (zip a b)
parAdd = parOp (+)

main = do
  putStrLn $ show ([0..300] `parAdd` [500..800])
Yes, it's still a loop. A multitude of operations (not operators) is the key to this type of parallelism.
Primitive operations on matrices, arrays, etc. can be parallelised - scroll up to page 41 and you'll find:
For system comparison plots, the results are reported for arrays of 4.2 million elements.
Edit: Assume you have an array A = [1, 2, 3, 4, 5, 6].
Calculating B = A - A = [0, 0, 0, 0, 0, 0] involves 6 subtraction operations (1-1, 2-2, etc).
With a single CPU, regardless of the number of threads, the subtractions must be performed in series.
With multiple CPUs but only one thread, the subtractions are also performed in series; that is by definition of a thread.
With multiple CPUs and multiple threads, the subtractions can be divided amongst threads/CPUs and thus occur simultaneously (up to the number of CPUs available).
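The last case can be sketched as follows, in Python with hypothetical names. Each worker owns a disjoint index range of the output, similar in spirit to the OpenMP loop shown earlier in this thread. This is not how IDL implements its thread pool, only an illustration of dividing one array operation among threads.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_subtract(a, b, workers=3):
    n = len(a)
    out = [0] * n
    step = max(1, (n + workers - 1) // workers)

    def work(start):
        # Each worker writes only to its own index range,
        # so the workers never touch the same element.
        for i in range(start, min(start + step, n)):
            out[i] = a[i] - b[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, range(0, n, step)))
    return out

A = [1, 2, 3, 4, 5, 6]
B = parallel_subtract(A, A)
```

With six elements and three workers, each worker handles two subtractions. For arrays of a few million elements the same split is what makes a thread pool worthwhile.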

Why doesn't a swap / exchange operator exist in imperative or OO languages like C/C++/C#/Java...?

I was always wondering why such a simple and basic operation like swapping the contents of two variables is not built-in for many languages.
It is one of the most basic programming exercises in computer science classes; it is heavily used in many algorithms (e.g. sorting); every now and then one needs it and one must use a temporary variable or use a template/generic function.
It is even a basic machine instruction on many processors, so that the standard scheme with a temporary variable will get optimized.
Many less obvious operators have been created, like the assignment operators (e.g. +=, which was probably created for reflecting the cumulative machine instructions, e.g. add ax,bx), or the ?? operator in C#.
So, what is the reason? Or does it actually exist, and I always missed it?
In my experience, it isn't that commonly needed in actual applications, apart from the already-mentioned sort algorithms and occasionally in low level hardware poking, so in my view it's a bit too special-purpose to have in a general-purpose language.
As has also been mentioned, not all processors support it as an instruction (and many do not support it for objects bigger than a word). So if it were supported with some useful additional semantics (e.g. being an atomic operation) it would be difficult to support on some processors, and if it didn't have the additional semantics then it's just (seldom used) syntactic sugar.
The assignment operators (+= etc) were supported because these are much more common in real-world programs - and so the syntactic sugar they provide was more useful, and also as an optimisation - remember C dates from the late 60s/early 70s, and compiler optimisation wasn't as advanced (and the machines less capable, so you didn't want lengthy optimisation passes anyway).
Paul
C++ does have swapping.
#include <cassert>
#include <utility>  // std::swap (declared in <algorithm> before C++11)

int main()
{
    using std::swap;
    int a(3), b(5);
    swap(a, b);
    assert(a == 5 && b == 3);
}
Furthermore, you can specialise swap for custom types too!
It's a widely used example in computer science courses, but I almost never find myself needing it in real code - whereas I use += very frequently.
Yes, in sorting it would be handy - but you don't tend to need to implement sorting yourself, so the number of actual uses in source code would still be pretty low.
You do have the XOR operator, which can swap two variables of a primitive integer type without a temporary...
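For reference, the XOR trick amounts to three XOR assignments. A minimal sketch in Python, where `xor_swap` is just an illustrative name and the values must be integers:

```python
def xor_swap(a, b):
    # Swap two integers using XOR only, with no temporary variable.
    a ^= b   # a now holds a XOR b
    b ^= a   # b now holds the original a
    a ^= b   # a now holds the original b
    return a, b

x, y = xor_swap(3, 5)
```

In languages where the two operands can refer to the same storage (for example two pointers to one int in C), the in-place version zeroes the value instead of leaving it unchanged, which is one reason a plain temporary or a library swap is usually preferred. In Python itself the idiomatic form is simply `a, b = b, a`.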
I think they just forgot to add it :-) Yes, not all CPUs have this kind of instruction, so what? We have a bunch of other things that most CPUs don't have instructions to compute. It would be much easier/clearer and also faster (via an intrinsic) if we had it!

Resources