I am working on a presentation on multithreading, and I want to demonstrate how the number of possible instruction orderings can become factorially large.
Consider the trivial program
a++;
b++;
c++;
In a single-threaded program the three assembly instructions (read, add one, write) that make up each ++ operation have only one order (read a, add one to a, write a to memory, read b, ...).
In a program with just three threads executing these three lines in parallel there are many more configurations. The compiler can optimize and rewrite these instructions in any order, with the constraint that 'read', 'add one' and 'write' occur in order for each of a, b and c. How many valid orders are there?
Initial thoughts:
(3+3+3)! * 1/(3!+3!+3!) = 20160
where (3+3+3)! is the total number of permutations without constraint and 1/(3!+3!+3!) is the proportion of permutations that have the correct order.
This might be more of an elaborate comment, but...
In the single-threaded version the compiler can reorder those additions without changing the output; C++ compilers are allowed to do so. So there are 3! possibilities for the single thread, and that is assuming ++ is atomic.
When you go multithreaded, the sense of a single order of operations loses its meaning; depending on the architecture, operations can happen at precisely the same time. In fact, you do not even need threads for that, e.g. with SSE instructions.
What you are trying to count is executing 3 additions where load->inc->store are not atomic, on a single thread. IMO, the way to impose order on the total of 9 elements would be similar to yours, but the factor would be (3!*3!*3!).
First you take 9!, then you impose order on 3 of the elements by dividing by 3!, and then repeat the process two more times, giving 9!/(3! * 3! * 3!) = 1680. However, I get the feeling that this factor is too big.
I would ask a mathematician who is good with combinatorics. The equivalent question is: you have N×M coloured balls, where N is the number of variables and M is the number of atomic operations you need to execute on each. What is the number of distinct orders for the balls? The colour is the variable, and balls of the same colour are interchangeable because you know the 1st of a colour must be the load, the 2nd the increment, and the 3rd the store. So you get M = 3 balls for each of N = 3 colours. Maybe this representation would be better for a pure mathematician.
EDIT: Well, apparently, according to the Wikipedia article on permutations of multisets, my initial guess was right. Still, I would double-check it myself.
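If you want to sanity-check the 1680 on a slide, a brute-force count is easy. Here is a minimal Haskell sketch (my own illustration): the nine steps are encoded as the multiset "aaabbbccc", since within one variable the relative order of its three steps is fixed, which makes the three steps interchangeable markers.

import Data.List (permutations)
import qualified Data.Set as Set

-- An interleaving is an arrangement of the multiset {a,a,a,b,b,b,c,c,c}.
-- Count the distinct arrangements two ways: by enumeration, and by the
-- multinomial formula.
main :: IO ()
main = do
  print (Set.size (Set.fromList (permutations "aaabbbccc")))  -- 1680
  print (product [1..9] `div` (6 * 6 * 6))                    -- 9!/(3!*3!*3!) = 1680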
I have been working on a multi-GPU project where I have had problems with non-deterministic results. I was surprised when it turned out that the non-determinism was caused by a reduction clause executed on the CPU.
In the book Using OpenMP - The Next Step it is written that
"[...] the order in which threads combine their value to construct the
value for the shared result is non-deterministic."
Maybe I just don't understand how the reduction clauses are implemented. Does it mean that if I use schedule(monotonic:static) in combination with a reduction clause each thread will execute its chunk of the iterations in a deterministic order, but that the order in which the partial results are combined at the end of the parallel region is non-deterministic?
Does it mean that if I use schedule(monotonic:static) in combination
with a reduction clause each thread will execute its chunk of the
iterations in a deterministic order, but that the order in which the
partial results are combined at the end of the parallel region is
non-deterministic?
It is known that the end result is non-deterministic; detailed information can be found in:
What Every Computer Scientist Should Know about Floating Point Arithmetic. For instance:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
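You can check that example directly; here is a quick demonstration in Haskell (any language with IEEE doubles behaves the same way):

main :: IO ()
main = do
  let x = 1e30 :: Double
      y = -1e30
      z = 1
  print ((x + y) + z)  -- 1.0
  print (x + (y + z))  -- 0.0, because y + z rounds back to -1e30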
Now, regarding the order in which the threads perform the reduction: as far as I know, the OpenMP standard does not enforce any order, nor does it require the order to be deterministic. Hence, this is an implementation detail left up to the compiler implementing the OpenMP standard, and consequently it is something your code should not rely upon.
Programming language semantics usually declares that a+b+c+d is evaluated as ((a+b)+c)+d. This is not parallel, so an OpenMP reduction is probably evaluated as (a+b)+(c+d). And so on for larger numbers of summands.
So you immediately have that, because of the non-associativity of floating point arithmetic, the result may be subtly different from the sequential value.
But more importantly, the exact value will depend on precisely how the combination is done. Is it a+(b+c) (on 2 threads) or (a+b)+c? So the result is at least "indeterministic" in the sense that you cannot reconstruct how it was formed. It could probably even be done in two different ways if you run the same reduction twice. That's what I would call "non-deterministic", but look in the standard for the exact definition of the term.
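To make the dependence on combination order concrete, here is a small Haskell sketch (my own illustration) contrasting a left-to-right sum with a pairwise, tree-shaped sum, which is roughly the shape a two-thread reduction might produce; the input values are chosen to expose the difference:

-- Pairwise (tree) reduction: combine neighbours, then recurse.
pairwiseSum :: [Double] -> Double
pairwiseSum []  = 0
pairwiseSum [x] = x
pairwiseSum xs  = pairwiseSum (pair xs)
  where pair (a:b:rest) = (a + b) : pair rest
        pair rest       = rest

main :: IO ()
main = do
  let xs = [1e30, 1, -1e30, 1] :: [Double]
  print (foldl (+) 0 xs)  -- left-to-right: 1.0 (the first 1 is absorbed by 1e30)
  print (pairwiseSum xs)  -- pairwise:      0.0, since (1e30+1) + (-1e30+1) = 1e30 - 1e30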
By the way, if you want to get some idea of how OpenMP actually does it, write your own reduction operator, and let each invocation print out what it computes. Here is a decent illustration: https://victoreijkhout.github.io/pcse/omp-reduction.html#Initialvalueforreductions
By the way, the standard actually doesn't use the word "non-deterministic" for this case. The following passage explains the issue:
Furthermore, using different numbers of threads may result in
different numeric results because of changes in the association of
numeric operations. For example, a serial addition reduction may have
a different pattern of addition associations than a parallel
reduction.
I really love using total functions. That said, sometimes I'm not sure what the best approach is for guaranteeing totality. Let's say that I'm writing a function similar to chunksOf from the split package, where I want to split up a list into sublists of a given size. Now I'd really rather say that the input for the sublist size needs to be a positive Int (so excluding 0). As I see it I have several options:
1) all-out: make a newtype for PositiveInt, hide the constructor, and only expose safe functions for creating a PositiveInt (perhaps returning a Maybe or some union of Positive | Negative | Zero or what have you). This seems like it could be a huge hassle.
2) what the split package does: just return an infinite list of size-0 sublists if the size is <= 0. This seems like it risks bugs not getting caught, and worse: those bugs just infinitely hanging your program with no indication of what went wrong.
3) what most other languages do: error when the input is <= 0. I really prefer total functions though...
4) return an Either or Maybe to cover the case that the input might have been <= 0. Similar to #1, it seems like using this could just be a hassle.
This seems similar to this post, but my question has more to do with error conditions than with just being as precise about types as possible. I'm looking for thoughts on how to decide what the best approach for a case like this is. I'm probably most inclined towards doing #1 and just dealing with the added overhead, but I'm concerned that I'll be kicking myself down the road. Is this a decision that needs to be made on a case-by-case basis, or is there a general strategy that consistently works best?
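For what it's worth, here is a minimal sketch of what option #1 might look like; all of the names (PositiveInt, mkPositiveInt, chunksOf') are made up for illustration:

module PositiveInt (PositiveInt, getPositive, mkPositiveInt, chunksOf') where

import Data.List (unfoldr)

-- The constructor stays hidden; the only way in is mkPositiveInt.
newtype PositiveInt = PositiveInt { getPositive :: Int }

mkPositiveInt :: Int -> Maybe PositiveInt
mkPositiveInt n
  | n > 0     = Just (PositiveInt n)
  | otherwise = Nothing

-- Total by construction: n > 0, so splitAt always consumes input
-- and the unfold terminates on finite lists.
chunksOf' :: PositiveInt -> [a] -> [[a]]
chunksOf' (PositiveInt n) = unfoldr step
  where step [] = Nothing
        step xs = Just (splitAt n xs)

The upside is that the Maybe is paid once, at the boundary where the size enters your program, instead of at every call site; past that point chunksOf' is total with no caveats.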
This is my first run-in not only with bitwise ops in Python, but also with (to me) strange syntax.
for i in range(2**len(set_)//2):
    parts = [set(), set()]
    for item in set_:
        parts[i&1].add(item)
        i >>= 1
For context, set_ is just a list of 4 letters.
There's a bit to unpack here. First, I've never seen [set(), set()]. I must be using the wrong keywords, as I couldn't find it in the docs. It looks like it creates a matrix in pythontutor, but I cannot say for certain. Second, while parts[i&1] is a slicing operation, I'm not entirely sure why a bitwise operation is required. For example, 0&1 should be 1 and 1&1 should be 0 (carry the one), so binary 10 (or 2 in decimal)? Finally, the last bitwise operation is completely bewildering. I believe a right shift is the same as dividing by two (I hope), but why i>>=1? I don't know how to interpret that. Any guidance would be sincerely appreciated.
[set(), set()] creates a list consisting of two empty sets.
0&1 is 0, 1&1 is 1. There is no carry in bitwise operations. parts[i&1] therefore refers to the first set when i is even, the second when i is odd.
i >>= 1 shifts right by one bit (which is indeed the same as dividing by two), then assigns the result back to i. It's the same basic concept as using i += 1 to increment a variable.
The effect of the inner loop is to partition the elements of set_ into two subsets, based on the bits of i. If the limit in the outer loop had been simply 2 ** len(set_), the code would generate every possible such partitioning. But since that limit was divided by two, only half of the possible partitions get generated - I couldn't guess what the point of that might be without more context.
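For example, with set_ = ['a', 'b', 'c', 'd'] and i = 6 (binary 0110): the low bit of 6 is 0, so 'a' goes into parts[0]; after i >>= 1, i is 3, whose low bit is 1, so 'b' goes into parts[1]; then i becomes 1 and 'c' also goes into parts[1]; finally i is 0, so 'd' goes into parts[0]. The result is parts = [{'a', 'd'}, {'b', 'c'}].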
I've never seen [set(), set()]
This isn't anything interesting, just a list with two new sets in it. So you have seen it, because it's not new syntax. Just a list and constructors.
parts[i&1]
This tests the least significant bit of i and selects either parts[0] (if the lsb was 0) or parts[1] (if the lsb was 1). Nothing fancy like slicing, just plain old indexing into a list. The thing you get out is a set, .add(item) does the obvious thing: adds something to whichever set was selected.
but why i>>=1? I don't know how to interpret that
Take the bits in i and move them one position to the right, dropping the old lsb and keeping the sign, the way the usual fixed-width diagrams picture it.
Except of course that in Python you have arbitrary-precision integers, so the width is however long it needs to be instead of, say, 8 bits.
For positive numbers, the part about copying the sign is irrelevant.
You can think of a right shift by 1 as a flooring division by 2 (this is different from truncation: negative numbers are rounded towards negative infinity, e.g. -1 >> 1 == -1), but that interpretation is usually more complicated to reason about.
Anyway, the way it is used here is just a way to loop through the bits of i, testing them one by one from low to high, but instead of changing which bit it tests it moves the bit it wants to test into the same position every time.
Cirdec's answer to a largely unrelated question made me wonder how best to represent natural numbers with constant-time addition, subtraction by one, and testing for zero.
Why Peano arithmetic isn't good enough:
Suppose we use
data Nat = Z | S Nat
Then we can write
Z + n = n
S m + n = S(m+n)
We can calculate m+n in O(1) time by placing m-r debits (for some constant r), one on each S constructor added onto n. To get O(1) isZero, we need to be sure to have at most p debits per S constructor, for some constant p. This works great if we calculate a + (b + (c+...)), but it falls apart if we calculate ((...+b)+c)+d. The trouble is that the debits stack up on the front end.
One option
The easy way out is to just use catenable lists, such as the ones Okasaki describes, directly. There are two problems:
O(n) space is not really ideal.
It's not entirely clear (at least to me) that the complexity of bootstrapped queues is necessary when we don't care about order the way we would for lists.
As far as I know, Idris (a dependently typed, purely functional language which is very close to Haskell) deals with this in a quite straightforward way: the compiler is aware of Nats and Fins (upper-bounded Nats) and replaces them with machine integer types and operations whenever possible, so the resulting code is pretty efficient. However, that's not true for custom types (even isomorphic ones), nor at the compilation stage (there were some code samples using Nats for type checking which resulted in exponential growth in compile time; I can provide them if needed).
In the case of Haskell, I think a similar compiler extension could be implemented. Another possibility is to write TH macros that transform the code. Of course, neither option is easy.
My understanding is that, in basic computer programming terms, the underlying problem is that you want to concatenate lists in constant time. The lists don't have cheats like forward references, so you can't jump to the end in O(1) time, for example.
You can use rings instead, which you can merge in O(1) time, regardless of whether a+(b+(c+...)) or ((...+c)+b)+a logic is used. The nodes in the rings don't need to be doubly linked; a link to the next node is enough.
Subtraction is the removal of any node, O(1), and testing for zero (or one) is trivial. Testing for n > 1 is O(n), however.
If you want to reduce space, then at each operation you can merge the nodes at the insertion or deletion points and weight the remaining ones higher. The more operations you do, the more compact the representation becomes! I think the worst case will still be O(n), however.
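For what it's worth, here is how the O(1) splice might look in Haskell, using IORef for the mutable next links (my own rendering of the idea, not from the answer above):

import Data.IORef

-- A node of a singly linked ring: just a mutable link to the next node.
data Node = Node { next :: IORef Node }

-- A one-element ring: the node points at itself.
singleton :: IO Node
singleton = do
  r <- newIORef undefined
  let n = Node r
  writeIORef r n
  return n

-- Merge two rings in O(1): swap the successors of one node from each
-- ring. No traversal, so the nesting of the merges doesn't matter.
merge :: Node -> Node -> IO ()
merge a b = do
  an <- readIORef (next a)
  bn <- readIORef (next b)
  writeIORef (next a) bn
  writeIORef (next b) an

-- Subtraction on a ring of size >= 2: unlink the successor of a node, O(1).
removeNext :: Node -> IO ()
removeNext a = do
  n  <- readIORef (next a)
  nn <- readIORef (next n)
  writeIORef (next a) nn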
We know that there are two "extremal" solutions for efficient addition of natural numbers:
Memory efficient: the standard binary representation of natural numbers, which uses O(log n) memory and requires O(log n) time for addition. (See also the chapter "Binary Representations" in Okasaki's book.)
CPU efficient: one that takes just O(1) time. (See the chapter "Structural Abstraction" in the book.) However, that solution uses O(n) memory, as we'd represent a natural number n as a list of n copies of ().
I haven't done the actual calculations, but I believe for the O(1) numerical addition we won't need the full power of O(1) FIFO queues, it'd be enough to bootstrap standard list [] (LIFO) in the same way. If you're interested, I could try to elaborate on that.
The problem with the CPU-efficient solution is that we need to add some redundancy to the memory representation so that we can save enough CPU time. In some cases, adding such redundancy can be accomplished without compromising the memory size (as for the O(1) increment/decrement operation). But if we allow arbitrary tree shapes, as in the CPU-efficient solution with bootstrapped lists, there are simply too many tree shapes to distinguish them in O(log n) memory.
So the question is: Can we find just the right amount of redundancy, so that a sub-linear amount of memory is enough and with which we could achieve O(1) addition? I believe the answer is no:
Let's have a representation+algorithm that has O(1) time addition. Let's then have a number of the magnitude of m bits, which we compute as a sum of 2^k numbers, each of them of the magnitude of (m-k) bits. To represent each of those summands we need (regardless of the representation) a minimum of (m-k) bits of memory, so at the beginning we start with (at least) (m-k)*2^k bits of memory. Now at each of those 2^k additions we are allowed to perform a constant amount of operations, so we are able to process (and ideally remove) a total of C*2^k bits. Therefore at the end, the lower bound for the number of bits we need to represent the outcome is (m-k-C)*2^k bits. Since k can be chosen arbitrarily, our adversary can set k = m-C-1, which means the total sum will be represented with at least 2^(m-C-1) = 2^m/2^(C+1) ∈ Ω(2^m) bits. So a natural number n will always need Ω(n) bits of memory!
How can an operator +, -, *, /, etc... be multithreaded?
In this paper on the language IDL, it is claimed that all operators use the 'thread pool' for increased execution speed.
How is it that one can use multiple threads to execute a statement such as 'b = a - a' (as on page 42) of the paper?
Can someone explain this? (I currently consider IDL a total ripoff, but maybe someone can change my mind.)
(Really this applies to any language: how can an operator be multithreaded in any computer programming language?)
I think it's important to also consider that not all operations with + are created equal. If you're using some sort of bignum library, for example, you might be able to separate a large number into smaller parts, do distinct sums of integers (in parallel), then carry over. In any case, it's not going to be a single-cycle addition of two integers. Multiplication involves a couple of steps, and division a lot of steps.
In the example given, the operands were floating point (floating point means a non-trivial adding process) and there were "4.2 million data points": I doubt they were storing those in a small 32-bit register.
The "simple" operation of addition has suddenly become a huge iterative process... or maybe something a lot faster if they are able to do it in parallel.
While simple operations with small integers might not be worth threading, it's worth noting that B=A+A, while seeming simple, could actually lead to many calculations. One line of code doesn't necessarily mean one operation.
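As a toy illustration of that bignum point (my own sketch in Haskell, and sequential: the independent per-limb sums in the first step are what a library could hand out to threads):

-- Bignums as little-endian lists of limbs in base 10^9.
limbBase :: Integer
limbBase = 10 ^ 9

-- Step 1: add corresponding limbs independently (the parallelisable part).
-- Step 2: propagate carries; each limb sum is < 2*limbBase, so every
-- carry is 0 or 1.
addBig :: [Integer] -> [Integer] -> [Integer]
addBig xs ys = carry 0 (zipLong xs ys)
  where
    zipLong (a:as) (b:bs) = (a + b) : zipLong as bs
    zipLong as     []     = as
    zipLong []     bs     = bs
    carry 0 []     = []
    carry c []     = [c]
    carry c (d:ds) = let s = d + c
                     in s `mod` limbBase : carry (s `div` limbBase) ds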
I don't know about IDL, but it's certainly possible if you have some higher level types. For instance you could conveniently parallelize array operations. Presumably that's what the "4200000 pts" refers to, although someone decided to make the graphs really hard to read.
For comparison, in C (with possible OpenMP parallelization) you might have something like:
#pragma omp parallel for
for (int i = 0; i < sizeof(B)/sizeof(B[0]); i++) {
    B[i] -= A[i];
}
In a higher-level language, or with something like NumPy, Matlab or a C++ array library, it could be just B=B-A. All that said, B=A-A sounds confusingly like B=0 to me.
You asked for a parallel operator in a favorite language? Here's a bit of Haskell:
import Control.Parallel

-- Map f over a list, sparking evaluation of the rest of the results
-- in parallel with returning the head.
pmap _ [] = []
pmap f (x:xs) =
  let rest = pmap f xs
  in  rest `par` (f x : rest)

-- Lift a binary operator to act elementwise on two lists.
parOp op a b = pmap (uncurry op) (zip a b)

parAdd = parOp (+)

main = do
  putStrLn $ show ([0..300] `parAdd` [500..800])
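(One practical note: for the sparks created by par to actually run in parallel, the program has to be compiled with ghc -threaded and run with +RTS -N; otherwise the code still works but evaluates sequentially.)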
Yes, it's still a loop. A multitude of operations (not operators) is the key to this type of parallelism.
Primitive operations on matrices, arrays, etc. can be parallelised - scroll up to page 41 and you'll find:
For system comparison plots, the results are reported for arrays of 4.2 million elements.
Edit: Assume you have an array A = [1, 2, 3, 4, 5, 6].
Calculating B = A - A = [0, 0, 0, 0, 0, 0] involves 6 subtraction operations (1-1, 2-2, etc).
- With a single CPU, regardless of the number of threads, the subtractions must be performed in series.
- With multiple CPUs but only one thread, the subtractions are also performed in series - that's by definition of a thread.
- With multiple CPUs and multiple threads, the subtractions can be divided amongst threads/CPUs and thus occur simultaneously (up to the number of CPUs available).