parallel loop over pairs - multithreading

What is the best way in C++11 to perform a pairwise computation in multiple threads? What I mean is, I have a vector of elements, and I want to compute a function for each pair of distinct elements. The caveat is that I cannot use the same element in multiple threads at the same time, e.g. the elements have states that evolve during the computation, and the computation relies on that.

An easy way would be to group the pairs by offsets.
If v is a vector, then the pairs of elements N apart (mod v.size()) form two collections of pairs. Neither collection contains overlaps within itself.
Examine a 10 element vector 0 1 2 3 4 5 6 7 8 9. The pairs 1 apart are:
0 1, 1 2, 2 3, 3 4, 4 5, 5 6, 6 7, 7 8, 8 9, 9 0
If you split these by "parity" into two collections, you get:
0 1, 2 3, 4 5, 6 7, 8 9
1 2, 3 4, 5 6, 7 8, 9 0
You can work, in parallel, on each of the above collections. When the collection is finished, sync up, then work on the next collection.
Similar tricks work for 2 apart.
0 2, 1 3, 4 6, 5 7
2 4, 3 5, 6 8, 7 9
with leftovers:
8 0, 9 1
For every offset from 1 to n/2 there are 2 "collections" and leftovers.
Here is offset of 4:
0 4, 1 5, 2 6, 3 7
4 8, 5 9, 6 0, 7 1
and leftovers
8 2, 9 3
(I naively think the size of the leftovers is the vector size mod twice the offset)
Calculating these collections (and the leftovers) isn't hard; arranging to queue up threads and get the right tasks efficiently in the right threads is harder.
There are N choose 2, or (n^2-n)/2, pairs. This split gives you about 1.5n collections and leftovers, each of size at most n/2, and full parallelism within each collection.
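The offset grouping above can be sketched in a few lines of Python. This is only an illustration: the names `offset_collections` and `all_rounds` are made up here, and it assumes the leftovers are the last `n mod (2 * offset)` pairs, which matches the 10-element examples. The offset n/2 (for even n) is treated as a special case because each pair would otherwise appear twice.

```python
def offset_collections(n, d):
    """Split the pairs (i, (i + d) % n) into two collections plus leftovers."""
    if 2 * d == n:
        # Opposite elements: i and i + d pair up exactly once, nothing is left over.
        return [(i, i + d) for i in range(d)], [], []
    full = n - n % (2 * d)  # indices below `full` lie in complete blocks of 2 * d
    first, second, leftovers = [], [], []
    for i in range(n):
        pair = (i, (i + d) % n)
        if i >= full:
            leftovers.append(pair)   # assumption: the tail that does not fill a block
        elif (i // d) % 2 == 0:
            first.append(pair)       # even blocks of d indices
        else:
            second.append(pair)      # odd blocks of d indices
    return first, second, leftovers


def all_rounds(n):
    """Yield (offset, first, second, leftovers) for every offset from 1 to n // 2."""
    for d in range(1, n // 2 + 1):
        yield (d,) + offset_collections(n, d)


for d, first, second, leftovers in all_rounds(10):
    print(d, first, second, leftovers)
```

Note that the leftover pairs of one offset can share elements with each other (offset 3 on 10 elements is one such case), so in this sketch they would still need to be run one after another or split further.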
If you have a situation where some elements are far more expensive than others, and thus waiting for each collection to finish idles threads too much, you could add fine-grained synchronization.
Maintain a vector of atomic bools. Use that to indicate that you are currently processing an element. Always "lock" (set to true, and check that it was false before you set it to true) the lower index one before the upper one.
If you manage to lock both, process away. Then clear them both.
If you fail, remember the task for later, and work on other tasks. When you have too many tasks queued, wait on a condition variable, trying to check and set the atomic bool you want to lock in the spin-lambda.
Periodically kick the condition variable when you clear the locks. How often you do this will depend on profiling. You can perhaps kick without acquiring the mutex (but you must sometimes acquire the mutex after clearing the bools to deal with a race condition that could starve a thread).
Queue the tasks in the order indicated by the above collection system, as that reduces the likelihood of threads colliding. But with this system, work can still progress even if there is one task that is falling behind.
It adds complexity and synchronization, which could easily make it slower than the pure collection/cohort one.
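As a rough illustration of the fine-grained scheme, here is a Python sketch rather than C++11. It swaps in `threading.Lock` with non-blocking `acquire` as a stand-in for the vector of atomic bools, and it simply re-queues a blocked pair instead of waiting on a condition variable, so it busy-retries. The names `run_pairs`, `try_lock_pair` and `work` are hypothetical.

```python
import threading
from collections import deque


def run_pairs(n, pairs, work, num_threads=4):
    """Process every (i, j) in `pairs`, never touching an element from two threads at once."""
    busy = [threading.Lock() for _ in range(n)]  # stand-in for the vector of atomic bools
    tasks = deque(pairs)                         # popleft/append on a deque are thread-safe

    def try_lock_pair(i, j):
        lo, hi = min(i, j), max(i, j)
        if not busy[lo].acquire(blocking=False):  # always try the lower index first
            return False
        if not busy[hi].acquire(blocking=False):
            busy[lo].release()
            return False
        return True

    def worker():
        while True:
            try:
                i, j = tasks.popleft()
            except IndexError:
                return                   # nothing left to pick up
            if try_lock_pair(i, j):
                try:
                    work(i, j)
                finally:
                    busy[i].release()
                    busy[j].release()
            else:
                tasks.append((i, j))     # an element is in use: retry this pair later

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Feeding `pairs` in the collection order described earlier should keep failed lock attempts rare; whether that beats the plain collection-by-collection approach is something only profiling can tell.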

What are handy Haskell concepts to generate numbers of the form 2^m*3^n*5^l [duplicate]

This question already has answers here:
New state of the art in unlimited generation of Hamming sequence
(3 answers)
Closed 10 months ago.
I am trying to generate numbers of the form 2^m*3^n*5^l where m, n, and l are natural numbers including 0.
The sequence follows: 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27, 30, 32, .....
I am testing it by getting the one millionth number. I implemented it using list comprehension and sorting, but it takes too long. I want a faster solution. I have been spending days trying to do this to no avail.
I do not want a complete solution. I just want to know what Haskell concepts are necessary in accomplishing it.
Here's an approach that doesn't need any Haskell concepts, just some math and computer science.
Grab a library that offers priority queues.
Initialize a priority queue containing only the number 1.
Loop the following indefinitely: extract the minimum value from the queue. Put it next in the output list. Insert that number times 2, 3, and 5 as three individual entries in the queue. Make sure the queue insert function merges duplicates, because there will be a lot of them thanks to commutativity of multiplication.
If you have a maximum you're working up to, you can use it to prune insertions to the queue as a minor optimization. Alternatively, you could take advantage of actual Haskell properties and just return an infinite list using laziness.
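The priority-queue loop can be sketched as follows, in Python rather than Haskell, purely to show the shape of the algorithm. It assumes a `set` is an acceptable way to merge duplicates, and the generator name `hamming` is made up for this example. A generator plays the role of the lazy infinite list.

```python
import heapq
from itertools import islice


def hamming():
    """Yield 1, 2, 3, 4, 5, 6, 8, ... in increasing order, without end."""
    queue = [1]
    seen = {1}          # the set keeps duplicates out of the queue
    while True:
        smallest = heapq.heappop(queue)
        yield smallest
        for factor in (2, 3, 5):
            candidate = smallest * factor
            if candidate not in seen:
                seen.add(candidate)
                heapq.heappush(queue, candidate)


print(list(islice(hamming(), 19)))
```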
First, write a function of type Int -> Bool that determines if a given integer is in the sequence you defined. It would divide the number by 2 as many times as possible (without creating a fraction), then divide it by 3 as many times as possible, and finally divide it by 5 as many times as possible. After all of this, if the number is larger than 1, then it cannot be expressed as a product of twos, threes, and fives, so the function would return false. Otherwise, the number is in your sequence, so the function returns true.
Then take the infinite sequence of integers greater than 0, and use the function above to filter out all numbers that are not in the sequence.
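A small Python sketch of this filter idea, shown only to make the divide-out test concrete. The names `in_sequence` and `hamming_by_filter` are hypothetical, and the guard for values below 1 is an addition of this sketch.

```python
from itertools import count, islice


def in_sequence(n):
    """Return True if n has no prime factors other than 2, 3 and 5."""
    if n < 1:
        return False            # guard added for this sketch; the sequence starts at 1
    for p in (2, 3, 5):
        while n % p == 0:       # divide out p as many times as possible
            n //= p
    return n == 1


def hamming_by_filter():
    """Lazily filter the positive integers with in_sequence."""
    return (n for n in count(1) if in_sequence(n))


print(list(islice(hamming_by_filter(), 19)))
```

This is simple but tests every integer, so it slows down quickly as the members of the sequence grow sparse.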
Carl's approach can be improved by inserting fewer elements when removing the minimal element x: As 2<3<4<5<6 you can just
append 3*x/2 if x is even but not divisible by 4
append 4*x/3 if x is divisible by 3
append 5*x/4 if x is divisible by 4
append 6*x/5 if x is divisible by 5
In code it looks like this:
g2 x | mod x 4 == 0 = [5*div x 4]
     | even x = [3*div x 2]
     | otherwise = []
g3 x | mod x 3 == 0 = [4*div x 3]
     | otherwise = []
g5 x | mod x 5 == 0 = [6*div x 5]
     | otherwise = []
g x = concatMap ($ x) [g2, g3, g5]
So if you remove the minimal element x from the priority queue, you have to insert the elements of g x into the priority queue. On my laptop I get the millionth element after about 8 min, even if I use just a list instead of the better priority queue, as the list grows only to a bit more than 10000 elements.

Multithreading - Calculations between all pairs in a set

I have n elements (e.g. A, B, C and D) and need to do calculations between all of those.
Calculation 1 = A with B
Calculation 2 = A with C
Calculation 3 = A with D
Calculation 4 = B with C
Calculation 5 = B with D
Calculation 6 = C with D
In reality there are more than 1000 elements and I want to parallelise the process.
Note that I can't access an element from 2 threads simultaneously. This for example makes it impossible to do Calculation 1 and Calculation 2 at the same time because they both use the element A.
Edit: I could access an element from 2 threads, but it makes everything very slow if I just split up the calculations and depend on locks for thread safety.
Is there already a distribution algorithm for this kind of problem?
It seems like a lot of people must have had the same problem already, but I couldn't find anything in the great internet. ;)
Single thread example code:
for (int i = 0; i < elementCount; i++)
{
    for (int j = i + 1; j < elementCount; j++)
    {
        Calculate(element[i], element[j]);
    }
}
You can apply the round-robin tournament algorithm, which organizes all possible pairs (N*(N-1)/2 of them) into rounds.
All set elements (players) form two rows; each column is a pair in the
current round. The first element is fixed, the others are shifted cyclically.
So you can run up to N/2 threads to get results for the first set of pairs, then reorder the indexes and continue.
Excerpt from wiki:
The circle method is the standard algorithm to create a schedule for a round-robin tournament. All competitors are assigned to numbers, and then paired in the first round:
Round 1. (1 plays 14, 2 plays 13, ... )
1 2 3 4 5 6 7
14 13 12 11 10 9 8
then fix one of the competitors in the first or last column of the table (number one in this example) and rotate the others clockwise one position
Round 2. (1 plays 13, 14 plays 12, ... )
1 14 2 3 4 5 6
13 12 11 10 9 8 7
Round 3. (1 plays 12, 13 plays 11, ... )
1 13 14 2 3 4 5
12 11 10 9 8 7 6
until you end up almost back at the initial position
Round 13. (1 plays 2, 3 plays 14, ... )
1 3 4 5 6 7 8
2 14 13 12 11 10 9
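The circle method from the excerpt can be sketched in Python like this. The function name `circle_rounds` and the use of `None` as a bye for an odd number of players are choices made for this illustration. All pairs within one round are disjoint, so a round can be handed to up to N/2 threads at once.

```python
def circle_rounds(players):
    """Return a list of rounds; each round is a list of disjoint pairs."""
    ring = list(players)
    if len(ring) % 2:
        ring.append(None)              # odd count: None marks a bye
    n = len(ring)
    rounds = []
    for _ in range(n - 1):
        # Column i pairs the i-th entry of the top row with the i-th of the bottom row.
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        ring = [ring[0], ring[-1]] + ring[1:-1]   # keep the first fixed, rotate the rest
    return rounds


for number, pairs in enumerate(circle_rounds(range(1, 15)), start=1):
    print("Round", number, pairs)
```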
It is simple enough to prove there is no way to distribute your calculations so that collisions never occur (that is, unless you manually order the computations and place round boundaries, like @Mbo suggests), meaning that there is no distribution amongst multiple threads that will allow you to never lock.
Proof :
Given your requirement that any computation involving data object A should happen on a given thread T (only way to make sure you never lock on A).
Then it follows that thread T has to deal with at least one pair containing each other objects (B, C, D) of the input list.
It follows from the basic requirement that T is also to handle everything object-B related. And C. And D. So everything.
Therefore, only T can work.
QED. There is no possible parallelization that will never lock.
Way around #1 : map/reduce
That said... This is a typical case of divide and conquer. You are right that simple additions can require critical section locks, without the order of execution mattering. That is because your critical operation (addition) has a nice property, associativity: A+(B+C) = (A+B)+C, on top of being commutative.
In other words, this operation is a candidate for a (parallel-friendly) reduce operation.
So the key here is probably :
Emit a stream of all interesting pairs
Map each pair to one or more partial results
Group each partial result by its master object (A, B, C)
Reduce each group by combining the partial results
A sample (pseudo) code
// Needs java.util.Map, java.util.stream.Collectors, java.util.stream.IntStream and java.util.stream.Stream
static class Data { int i = 0; }
static class Pair { Data d1; Data d2; Pair(Data d1, Data d2) { this.d1 = d1; this.d2 = d2; } }
static class PartialComputation { Data d; int sum; PartialComputation(Data d, int sum) { this.d = d; this.sum = sum; } }
Data[] data = ...
// Emit every pair (idx, idx2) with idx < idx2
Stream<Pair> allPairs = IntStream.range(0, data.length).boxed()
    .flatMap(idx -> IntStream.range(idx + 1, data.length).mapToObj(idx2 -> new Pair(data[idx], data[idx2])));
Map<Data, Integer> sums = allPairs.parallel()
    // Map everything, in parallel, to partial results keyable by the original data object
    .flatMap(pair -> Stream.of(new PartialComputation(pair.d1, pair.d1.i + pair.d2.i),
                               new PartialComputation(pair.d2, pair.d2.i + pair.d1.i)))
    .collect(Collectors.groupingByConcurrent(
        partialComp -> partialComp.d,                          // Regroup by the original data object
        Collectors.summingInt(partialComp -> partialComp.sum)  // Reduce by summing
    ));
Way around 2 : trust the implementations
Fact is, uncontended locks in java have gotten cheaper. On top of that, pure locking sometimes has better alternatives, like Atomic types in Java (e.g. AtomicLong if you are summing stuff), that use CAS instead of locking, which can be faster (google for it, I usually refer to the Java Concurrency In Practice book for hard numbers.)
The fact is, if you have 1000 to 10k different elements (which translates to at least hundreds of thousands of pairs) and, like, 8 CPUs, the contention (or probability that at least 2 of your 8 threads will be processing the same element) is pretty low. And I would rather measure it first-hand than say upfront "I can not afford the locks", especially if the operation can be implemented using Atomic types.

More than expected jobs running in apache spark

I am trying to learn Apache Spark. This is the code I am trying to run. I am using the PySpark API.
data = xrange(1, 10000)
xrangeRDD = sc.parallelize(data, 8)
def ten(value):
    """Return whether value is below ten.
    Args:
        value (int): A number.
    Returns:
        bool: Whether `value` is less than ten.
    """
    if (value < 10):
        return True
    else:
        return False
filtered = xrangeRDD.filter(ten)
print filtered.collect()
print filtered.take(8)
print filtered.collect() gives this as output [1, 2, 3, 4, 5, 6, 7, 8, 9].
As per my understanding filtered.take(n) will take n elements from RDD and print it.
I am trying two cases :-
1)Giving value of n less than or equal to number of elements in RDD
2)Giving value of n greater than number of elements in RDD
I have pyspark application UI to see number of jobs that run in each case. In first case only one job is running but in second five jobs are running.
I am not able to understand why is this happening. Thanks in advance.
RDD.take tries to evaluate as few partitions as possible.
If you take(9) it will fetch partition 0 (job 1) find 9 items and happily terminate.
If you take(10) it will fetch partition 0 (job 1) and find 9 items. It needs one more. Since partition 0 had 9, it thinks partition 1 will probably have at least one more (job 2). But it doesn't! In 2 partitions it has found 9 items. So 4.5 items per partition so far. The formula divides it by 1.5 for pessimism and decides 10 / (4.5 / 1.5) = 3 partitions will do it. So it fetches partition 2 (job 3). Still nothing. So 3 items per partition so far, divided by 1.5 means we need 10 / (3 / 1.5) = 5 partitions. It fetches partitions 3 and 4 (job 4). Nothing. We have 1.8 items per partition, 10 / (1.8 / 1.5) = 8. It fetches the last 3 partitions (job 5) and that's it.
The code for this algorithm is in RDD.scala. As you can see it's nothing but heuristics. It saves some work usually, but it can lead to unnecessarily many jobs in degenerate cases.
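To make the arithmetic above easier to follow, here is a toy Python model of the heuristic as this answer describes it. It is not Spark's code: `simulate_take` is a hypothetical name, the rounding is taken from the worked numbers above, and the branch for "nothing found yet" is an assumption of this sketch. The list argument holds how many matching items each partition contains.

```python
def simulate_take(items_per_partition, num):
    """Return the number of jobs the described heuristic would launch for take(num)."""
    total = len(items_per_partition)
    scanned = found = jobs = 0
    while found < num and scanned < total:
        if scanned == 0:
            batch = 1        # the first job reads a single partition
        elif found == 0:
            batch = 1        # assumption of this sketch: nothing found yet, probe one more
        else:
            # Partitions needed overall: num / (found / scanned / 1.5), rounded down.
            target = (3 * num * scanned) // (2 * found)
            batch = max(1, target - scanned)
        batch = min(batch, total - scanned)
        found += sum(items_per_partition[scanned:scanned + batch])
        scanned += batch
        jobs += 1
    return jobs


layout = [9, 0, 0, 0, 0, 0, 0, 0]   # all nine matches sit in partition 0
print(simulate_take(layout, 9))     # one job
print(simulate_take(layout, 10))    # five jobs
```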

find the longest increasing subsequence (LIS)

Given A= {1,4,2,9,7,5,8,2}, find the LIS. Show the filled dynamic programming table and how the solution is found.
My book doesn't cover LIS so I'm a bit lost on how to start. For the DP table, I've done something similar with Longest Common Subsequences. Any help on how to start this would be much appreciated.
Already plenty of answers on this topic but here's my walkthrough, I view this site as a repository of answers for future posterity and this is just to provide additional insight when I worked through it myself.
The longest Increasing Subsequence (LIS) problem is to find the length of the longest subsequence of a given sequence such that all elements of the
subsequence are sorted in increasing order. For example, length of LIS for
{ 10, 22, 9, 33, 21, 50, 41, 60, 80 } is 6 and LIS is {10, 22, 33, 50, 60, 80}.
Let S[pos] be defined as the smallest integer that ends an increasing sequence of length pos.
Now iterate through every integer X of the input set and do the following:
If X > last element in S, then append X to the end of S. This essentially means we have found a new largest LIS.
Otherwise find the smallest element in S which is >= X, and change it to X. Because S is sorted at any time, the element can be found
using binary search in log(N).
Total runtime - N integers and a binary search for each of them - N * log(N) = O(N log N)
Now let's do a real example:
Set of integers: 2 6 3 4 1 2 9 5 8
Steps:
0. S = {} - Initialize S to the empty set
1. S = {2} - New largest LIS
2. S = {2, 6} - 6 > 2 so append that to S
3. S = {2, 3} - 6 is the smallest element > 3 so replace 6 with 3
4. S = {2, 3, 4} - 4 > 3 so append that to s
5. S = {1, 3, 4} - 2 is the smallest element > 1 so replace 2 with 1
6. S = {1, 2, 4} - 3 is the smallest element > 2 so replace 3 with 2
7. S = {1, 2, 4, 9} - 9 > 4 so append that to S
8. S = {1, 2, 4, 5} - 9 is the smallest element > 5 replace 9 with 5
9. S = {1, 2, 4, 5, 8} - 8 > 5 so append that to S
So the length of the LIS is 5 (the size of S).
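The steps above translate fairly directly into Python using the standard `bisect` module. This is a sketch of the same procedure, with `lis_tails` as a made-up name. Keep in mind that the returned list S gives the length of the LIS but is not necessarily an increasing subsequence of the input itself.

```python
from bisect import bisect_left


def lis_tails(seq):
    """Return S, where S[k] is the smallest value ending an increasing run of length k + 1."""
    tails = []
    for x in seq:
        pos = bisect_left(tails, x)   # index of the smallest element >= x
        if pos == len(tails):
            tails.append(x)           # x extends the longest run found so far
        else:
            tails[pos] = x            # x is a smaller ending for runs of this length
    return tails


print(lis_tails([2, 6, 3, 4, 1, 2, 9, 5, 8]))   # [1, 2, 4, 5, 8]
```

For the sequence in the question, `len(lis_tails([1, 4, 2, 9, 7, 5, 8, 2]))` gives 4.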
Let's take some other sequences to see that this covers the possible caveats; each presents its own issue.
Say we have 1,2,3,4,9,2,3,4,5,6,7,8,10.
Basically it builds out 1,2,3,4,9 first; then 2, 3 and 4 each replace themselves (no change), 5 replaces 9, and 6,7,8,10 are appended,
so S will look like 1,2,3,4,5,6,7,8,10.
Take the other case where we have 1,2,3,4,5,9,2,10;
this will give us 1,2,3,4,5,9,10.
Or take the case where we have 1,2,3,4,5,9,6,7,8,10;
this will give us 1,2,3,4,5,6,7,8,10.
So that kind of illuminates what goes on. In the first case the critical juncture is what happens when you hit the 2 after the 9:
how do you deal with it? Well, the block of 2,3,4 won't do anything really. When you hit 5 you replace the 9, because the 5 and 9
are virtually interchangeable: 9 ends the block of the first 5 increasing elements, and you replace 9 with 5 because 5 is smaller, so there
is greater potential to hit something > 5 later on. But you only replace the smallest element >= X. For example, in the last case,
if your 6 doesn't replace 9 but instead replaces 1, and 7 replaces 2, and 8 replaces 3, then we get a final array of 7 elements instead
of 9. So just do a couple of these and figure out the pattern; this logic isn't the easiest to translate to paper.
There's a very strong relation between LIS and LCS.
http://en.wikipedia.org/wiki/Longest_increasing_subsequence
This article explains it pretty well I think. Basically the idea is, you can reduce one problem to the other (this is the case in many situations involving Dynamic programming).

How to filter a list in J?

I'm currently learning the fascinating J programming language, but one thing I have not been able to figure out is how to filter a list.
Suppose I have the arbitrary list 3 2 2 7 7 2 9 and I want to remove the 2s but leave everything else unchanged, i.e., my result would be 3 7 7 9. How on earth do I do this?
The short answer
2 (~: # ]) 3 2 2 7 7 2 9
3 7 7 9
The long answer
I have the answer for you, but first you should get familiar with some details. Here we go.
Monads, dyads
There are two types of verbs in J: monads and dyads. The former accept only one parameter, the latter accept two parameters.
For example passing a sole argument to a monadic verb #, called tally, counts the number of elements in the list:
# 3 2 2 7 7 2 9
7
A verb #, which accepts two arguments (left and right), is called copy, it is dyadic and is used to copy elements from the right list as many times as specified by the respective elements in the left list (there may be a sole element in the list also):
0 0 0 3 0 0 0 # 3 2 2 7 7 2 9
7 7 7
Fork
There's a notion of fork in J, which is a series of 3 verbs applied to their arguments, dyadically or monadically.
Here's the diagram of a kind of fork I used in the first snippet:
x (F G H) y
      G
    /   \
   F     H
  / \   / \
 x   y x   y
It describes the order in which verbs are applied to their arguments. Thus these applications occur:
2 ~: 3 2 2 7 7 2 9
1 0 0 1 1 0 1
The ~: (not equal) is dyadic in this example and results in a list of boolean values which are true when an argument doesn't equal 2. This was the F application according to diagram.
The next application is H:
2 ] 3 2 2 7 7 2 9
3 2 2 7 7 2 9
] (identity) can be a monad or a dyad, but it always returns the right argument passed to a verb (there's an opposite verb, [ which returns.. Yes, the left argument! :)
So far, so good. F and H after application returned these values accordingly:
1 0 0 1 1 0 1
3 2 2 7 7 2 9
The only step to perform is the G verb application.
As I noted earlier, the verb #, which is dyadic (accepts two arguments), allows us to duplicate the items from the right argument as many times as specified in the respective positions in the left argument. Hence:
1 0 0 1 1 0 1 # 3 2 2 7 7 2 9
3 7 7 9
We've just got the list filtered out of 2s.
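For readers more used to mainstream languages, the same three steps can be mimicked in Python. This is only a loose analogy and not how J evaluates a fork: `copy` and `not_equal_fork` are names invented here, and J arrays are replaced by plain lists.

```python
def copy(counts, items):
    """Rough analogue of dyadic #: repeat each item as many times as its count."""
    return [item for count, item in zip(counts, items) for _ in range(count)]


def not_equal_fork(x, ys):
    """Rough analogue of x (~: # ]) ys: build the mask with ~:, then copy with it."""
    mask = [int(x != y) for y in ys]   # F: x ~: ys
    same = ys                          # H: x ] ys
    return copy(mask, same)            # G: (x F ys) # (x H ys)


print(copy([0, 0, 0, 3, 0, 0, 0], [3, 2, 2, 7, 7, 2, 9]))   # [7, 7, 7]
print(not_equal_fork(2, [3, 2, 2, 7, 7, 2, 9]))             # [3, 7, 7, 9]
```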
Reference
Slightly different kinds of fork, hook and other primitives (including the above-mentioned ones) are described in these two documents:
A Brief J Reference (175 KiB)
Easy-J. An Introduction to the World's most Remarkable Programming Language (302 KiB)
Other useful sources of information are the Jsoftware site with its wiki and a few mailing list archives on the internet.
Just to be sure it's clear, the direct way - to answer the original question - is this:
3 2 2 7 7 2 9 -. 2
This returns
3 7 7 9
The more elaborate method - generating the boolean and using it to compress the vector - is more APLish.
To answer the other question in the very long post, to return the first element and the number of times it occurs, is simply this:
({. , {. +/ .= ]) 1 4 1 4 2 1 3 5
1 3
This is a fork using "{." to get the first item, "{. +/ . = ]" to add up the number of times the first item equals each element, and "," as the middle verb to concatenate these two parts.
Also:
2 ( -. ~ ]) 3 2 2 7 7 2 9
3 7 7 9
There are a million ways to do this - it bothers me, vaguely, that these things don't evaluate strictly right to left, I'm an old APL programmer and I think of things as right to left even when they ain't.
If it were a thing that I was going to put into a program where I wanted to pull out some number and the number was a constant, I would do the following:
(#~ 2&~:) 1 3 2 4 2 5
1 3 4 5
This is a hook sort of thing, I think. The right half of the expression generates the truth vector regarding which are not 2, and then the octothorpe on the left has its arguments swapped so that the truth vector is the left argument to copy and the vector is the right argument. I am not sure that a hook is faster or slower than a fork with an argument copy.
+/3<+/"1(=2&{"1)/:~S:_1{;/5 6$1+i.6
156
The above program answers the question, "For all possible combinations of Yahtzee dice, how many have 4 or 5 matching numbers in one roll?" It generates all the permutations, in boxes, sorts each box individually, unboxing them as a side effect, and extracts column 2, comparing the box against their own column 2, in the only successful fork or hook I've ever managed to write. The theory is that if there is a number that appears in a list of 5, three or more times, if you sort the list the middle number will be the number that appears with the greatest frequency. I have attempted a number of other hooks and/or forks and every one has failed because there is something I just do not get. Anyway that truth table is reduced to a vector, and now we know exactly how many times each group of 5 dice matched the median number. Finally, that number is compared to 3, and the number of successful compares (greater than 3, that is, 4 or 5) are counted.
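The same count can be cross-checked with a brute-force Python sketch that follows the median idea described above. The function name is hypothetical; it enumerates all 6^5 rolls.

```python
from itertools import product


def rolls_with_four_or_five_matching():
    total = 0
    for roll in product(range(1, 7), repeat=5):
        middle = sorted(roll)[2]        # with 3+ equal dice, the median is that value
        if roll.count(middle) > 3:      # 4 or 5 dice match the median
            total += 1
    return total


print(rolls_with_four_or_five_matching())   # 156
```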
This program answers the question, "For all possible 8 digit numbers made from the symbols 1 through 5, with repetition, how many are divisible by 4?"
I know that you need only determine how many within the first 25 are divisible by 4 and multiply, but the program runs more or less instantly. At one point I had a much more complex version of this program that generated the numbers in base 5 so that individual digits were between 0 and 4, added 1 to the numbers thus generated, and then put them into base 10. That was something like 1+(8$5)#:i.5^8
+/0=4|,(8$10)#. >{ ;/ 8 5$1+i.5
78125
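A direct Python enumeration gives the same figure. This is a sketch with a made-up function name and default arguments chosen to match the question. It also agrees with the shortcut mentioned above: 5 of the 25 possible two-digit endings (12, 24, 32, 44, 52) are divisible by 4, and 5 * 5^6 = 78125.

```python
from itertools import product


def count_divisible_by_four(length=8, symbols=(1, 2, 3, 4, 5)):
    total = 0
    for digits in product(symbols, repeat=length):
        value = 0
        for d in digits:
            value = value * 10 + d      # read the digits as a base-10 number
        if value % 4 == 0:
            total += 1
    return total


print(count_divisible_by_four())   # 78125
```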
As long as I have solely verb trains and selection, I don't have a problem. When I start having to repeat my argument within the verb so that I'm forced to use forks and hooks I start to get lost.
For example, here is something I can't get to work.
((1&{~+/)*./\(=1&{))1 1 1 3 2 4 1
I always get Index Error.
The point is to output two numbers, one that is the same as the first number in the list, the second which is the same as the number of times that number is repeated.
So this much works:
*./\(=1&{)1 1 1 3 2 4 1
1 1 1 0 0 0 0
I compare the first number against the rest of the list. Then I do an insertion of an and compression - and this gives me a 1 so long as I have an unbroken string of 1's, once it breaks the and fails and the zeros come forth.
I thought that I could then add another set of parens, get the lead element from the list again, and somehow record those numbers, the eventual idea would be to have another stage where I apply the inverse of the vector to the original list, and then use $: to get back for a recursive application of the same verb. Sort of like the quicksort example, which I thought I sort of understood, but I guess I don't.
But I can't even get close. I will ask this as a separate question so that people get proper credit for answering.
