Multithreading - Calculations between all pairs in a set

I have n elements (e.g. A, B, C and D) and need to do a calculation for each pair of them.
Calculation 1 = A with B
Calculation 2 = A with C
Calculation 3 = A with D
Calculation 4 = B with C
Calculation 5 = B with D
Calculation 6 = C with D
In reality there are more than 1000 elements and I want to parallelise the process.
Note that I can't access an element from 2 threads simultaneously. This, for example, makes it impossible to do Calculation 1 and Calculation 2 at the same time, because both use element A.
Edit: I could access an element from 2 threads, but it makes everything very slow if I just split up the calculations and rely on locks for thread safety.
Is there an existing distribution algorithm for this kind of problem?
It seems like a lot of people must have had the same problem already, but I couldn't find anything on the great internet. ;)
Single thread example code:
for (int i = 0; i < elementCount; i++)
{
    for (int j = i + 1; j < elementCount; j++)
    {
        Calculate(element[i], element[j]);
    }
}

You can apply the round-robin tournament algorithm, which organizes all N*(N-1)/2 possible pairs into rounds.
All set elements (players) form two rows; each column is a pair in the current round. The first element is fixed, the others are shifted in a cyclic manner.
So you can run up to N/2 threads to get the results for the first set of pairs, then rotate the indexes and continue.
Excerpt from Wikipedia:
The circle method is the standard algorithm to create a schedule for a round-robin tournament. All competitors are assigned to numbers, and then paired in the first round:
Round 1. (1 plays 14, 2 plays 13, ... )
1 2 3 4 5 6 7
14 13 12 11 10 9 8
then fix one of the competitors in the first or last column of the table (number one in this example) and rotate the others clockwise one position
Round 2. (1 plays 13, 14 plays 12, ... )
1 14 2 3 4 5 6
13 12 11 10 9 8 7
Round 3. (1 plays 12, 13 plays 11, ... )
1 13 14 2 3 4 5
12 11 10 9 8 7 6
until you end up almost back at the initial position
Round 13. (1 plays 2, 3 plays 14, ... )
1 3 4 5 6 7 8
2 14 13 12 11 10 9
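To make the scheduling concrete, here is a minimal Python sketch of the circle method (my own illustration, not code from the answer above). It yields the pairs round by round; within a round no element appears twice, so each round's pairs can be handed to up to N/2 worker threads, with a barrier between rounds.

def round_robin_rounds(elements):
    """Yield rounds of disjoint pairs via the circle method.
    No element appears twice within a round, so all pairs in a
    round can be processed concurrently without locking."""
    items = list(elements)
    if len(items) % 2:
        items.append(None)          # dummy partner = a bye this round
    n = len(items)
    fixed, rest = items[0], items[1:]
    for _ in range(n - 1):
        row = [fixed] + rest
        pairs = [(row[i], row[n - 1 - i]) for i in range(n // 2)]
        yield [(a, b) for a, b in pairs if a is not None and b is not None]
        rest = rest[-1:] + rest[:-1]  # rotate all but the fixed element

# For the question's 4 elements this produces the 6 pairs in 3 rounds:
for rnd in round_robin_rounds('ABCD'):
    print(rnd)
# [('A', 'D'), ('B', 'C')]
# [('A', 'C'), ('D', 'B')]
# [('A', 'B'), ('C', 'D')]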

It is simple enough to prove there is no way to distribute your calculations so that collisions never occur (that is, unless you manually order the computations and place round boundaries, as @Mbo suggests), meaning that there is no distribution amongst multiple threads that will allow you to never lock.
Proof:
Given your requirement, any computation involving data object A has to happen on a given thread T (that is the only way to make sure you never lock on A).
It follows that thread T has to deal with at least one pair containing each of the other objects (B, C, D) in the input list.
By the same requirement, T then also has to handle everything object-B related. And C. And D. So everything.
Therefore, only T can work.
QED. There is no possible parallelization that will never lock.
Way around #1: map/reduce
That said... this is a typical case of divide and conquer. You are right that simple additions can require critical-section locks, even though the order of execution does not matter. That is because your critical operation (addition) has a nice property, associativity: A+(B+C) = (A+B)+C, on top of being commutative.
In other words, this operation is a candidate for a (parallel-friendly) reduce operation.
So the key here is probably:
Emit a stream of all interesting pairs
Map each pair to one or more partial results
Group each partial result by its master object (A, B, C)
Reduce each group by combining the partial results
Sample (pseudo) code:
static class Data { int i = 0; }
static class Pair { Data d1; Data d2; }                // (constructors assumed)
static class PartialComputation { Data d; int sum; }

Data[] data = ...
// Emit a stream of all interesting pairs (i < j)
Stream<Pair> allPairs = IntStream.range(0, data.length - 1).boxed()
    .flatMap(i -> IntStream.range(i + 1, data.length)
        .mapToObj(j -> new Pair(data[i], data[j])));
// Map every pair, in parallel, to partial results keyed by the original data objects,
// then group by object and reduce by summing
Map<Data, Integer> sums = allPairs.parallel()
    .flatMap(pair -> Stream.of(
        new PartialComputation(pair.d1, pair.d1.i + pair.d2.i),
        new PartialComputation(pair.d2, pair.d2.i + pair.d1.i)))
    .collect(Collectors.groupingByConcurrent(
        pc -> pc.d,                              // regroup by the original data object
        Collectors.summingInt(pc -> pc.sum)));   // reduce by summing
Way around #2: trust the implementations
Fact is, uncontended locks in Java have gotten cheaper. On top of that, pure locking sometimes has better alternatives, like the atomic types in Java (e.g. AtomicLong if you are summing stuff), which use CAS instead of locking and can be faster (google for it; I usually refer to the Java Concurrency in Practice book for hard numbers).
The fact is, if you have 1000 to 10k different elements (which translates to at least millions of pairs) and, say, 8 CPUs, the contention (the probability that at least 2 of your 8 threads will be processing the same element at the same time) is pretty low. I would rather measure it first-hand than say upfront "I cannot afford the locks", especially if the operation can be implemented using atomic types.

Related

pandas how to flatten a list in a column while keeping list ids for each element

I have the following df,
A id
[ObjectId('5abb6fab81c0')] 0
[ObjectId('5abb6fab81c3'),ObjectId('5abb6fab81c4')] 1
[ObjectId('5abb6fab81c2'),ObjectId('5abb6fab81c1')] 2
I'd like to flatten each list in A and assign the corresponding id to each element in the list, like:
A id
ObjectId('5abb6fab81c0') 0
ObjectId('5abb6fab81c3') 1
ObjectId('5abb6fab81c4') 1
ObjectId('5abb6fab81c2') 2
ObjectId('5abb6fab81c1') 2
I think the comment is coming from this question? You can use my original post or this one:
df.set_index('id').A.apply(pd.Series).stack().reset_index().drop('level_1',1)
Out[497]:
id 0
0 0 1.0
1 1 2.0
2 1 3.0
3 1 4.0
4 2 5.0
5 2 6.0
Or
pd.DataFrame({'id':df.id.repeat(df.A.str.len()),'A':df.A.sum()})
Out[498]:
A id
0 1 0
1 2 1
1 3 1
1 4 1
2 5 2
2 6 2
This probably isn't the most elegant solution, but it works. The idea here is to loop through df (which is why this is likely an inefficient solution), and then loop through each list in column A, appending each item and the id to new lists. Those two new lists are then turned into a new DataFrame.
a_list = []
id_list = []
for index, a, i in df.itertuples():
    for item in a:
        a_list.append(item)
        id_list.append(i)
df1 = pd.DataFrame(list(zip(a_list, id_list)), columns=['A', 'id'])
As I said, inelegant, but it gets the job done. There's probably at least one better way to optimize this, but hopefully it gets you moving forward.
EDIT (April 2, 2018)
I had the thought to run a timing comparison between mine and Wen's code, simply out of curiosity. The two variables are the length of column A, and the length of the list entries in column A. I ran a bunch of test cases, iterating by orders of magnitude each time. For example, I started with A length = 10 and ran through to 1,000,000, at each step iterating through randomized A entry list lengths of 1-10, 1-100 ... 1-1,000,000. I found the following:
Overall, my code is noticeably faster (especially at increasing A lengths) as long as the list lengths are less than ~1,000. As soon as the randomized list length hits the ~1,000 barrier, Wen's code takes over in speed. This was a huge surprise to me! I fully expected my code to lose every time.
Length of column A generally doesn't matter - it simply increases the overall execution time linearly. The only case in which it changed the results was for A length = 10. In that case, no matter the list length, my code ran faster (also strange to me).
Conclusion: If the list entries in A are on the order of a few hundred elements (or fewer) long, my code is the way to go. But if you're working with huge data sets, use Wen's! It's also worth noting that as you hit the 1,000,000 barrier, both methods slow down drastically. I'm using a fairly powerful computer, and each was taking minutes by the end (it actually crashed on the A length = 1,000,000 and list length = 1,000,000 case).
Flattening and unflattening can be done using these functions:
def flatten(df, col):
    col_flat = pd.DataFrame([[i, x] for i, y in df[col].apply(list).iteritems() for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    df = df.drop(col, 1)
    df = df.merge(col_flat, left_index=True, right_index=True)
    return df
Unflattening:
def unflatten(flat_df, col):
    return flat_df.groupby(level=0).agg({**{c: 'first' for c in flat_df.columns}, col: list})
After unflattening we get the same dataframe except for column order:
(df.sort_index(axis=1) == unflatten(flatten(df, col), col).sort_index(axis=1)).all().all()
>> True
To create a unique index you can call reset_index after flattening.
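As a side note (not from the answers above): recent pandas versions (0.25+) have a built-in DataFrame.explode that does exactly this flattening. A minimal sketch, with small integers standing in for the ObjectId values:

import pandas as pd

df = pd.DataFrame({'A': [[1], [2, 3], [4, 5]], 'id': [0, 1, 2]})
flat = df.explode('A').reset_index(drop=True)  # one row per list element, id repeated
print(flat)
#    A  id
# 0  1   0
# 1  2   1
# 2  3   1
# 3  4   2
# 4  5   2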

Sort numbers into groups so that the difference of their sums is minimal

I found a few threads that were similar; however, I believe mine is a bit unique. This will be difficult to write, so please bear with me.
I have a set of 10 accounts, each with a static number that cannot be split up. I have 3 employees that need these accounts split between them as evenly as possible. They cannot share an account.
For example:
(A)lpha = 15
(B)eta = 30
(C)harlie = 22
(D)elta = 19
(E)cho = 28
(F)ranklin = 3
(G)roto = 7
(H)enry = 28
(I)ndia = 38
(J)uliet = 48
The total sum is 238. In a perfect world, two people would get 79 and one person would get 80. However, remember we cannot break apart an account, so we need to combine accounts to get as close to an even spread as possible.
I need a formula for this since situations like this occur regularly and it takes some time to figure this out. I believe this would be best executed with a helper column.
The closest I have come to is:
FHJ = 79
ABCG = 74
DEI = 85
But since this is recurring and can happen over even more accounts, I need something I can reuse over and over.
Another, less complex but approximate solution would be to:
sort your accounts from highest to lowest number,
then sort the numbers into 3 groups (A, B, C):
starting with the 3 highest numbers (48|J, 38|I, 30|B), one goes to each of groups A, B and C
the next highest number (28|E) goes to the group with the lowest sum (C)
the next highest number (28|H) goes to the group with the lowest sum (B)
and so on …
You should end up with groups along the lines of (J, C, G, F) = 80, (I, H, A) = 81 and (B, E, D) = 77, which is different from your manual solution but closer. Comparing the differences:
Solution from above: 81 - 77 = 4
Your manual solution: 85 - 74 = 11
This algorithm is an approximation: it will not always find the best solution, but if the difference between the lowest and highest numbers is not too large, the result is very close to the best solution (see the sketch below).
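A minimal Python sketch of this greedy heuristic (my own illustration, using the question's numbers); it is the classic longest-processing-time idea: always drop the next largest account into the currently lightest group.

import heapq

accounts = {'A': 15, 'B': 30, 'C': 22, 'D': 19, 'E': 28,
            'F': 3, 'G': 7, 'H': 28, 'I': 38, 'J': 48}

groups = [(0, i) for i in range(3)]   # min-heap of (group_sum, group_index)
members = [[] for _ in range(3)]
heapq.heapify(groups)
for name, value in sorted(accounts.items(), key=lambda kv: -kv[1]):
    total, idx = heapq.heappop(groups)          # lightest group so far
    members[idx].append(name)
    heapq.heappush(groups, (total + value, idx))

for total, idx in sorted(groups):
    print(total, members[idx])
# 77 ['B', 'E', 'D']
# 80 ['J', 'C', 'G', 'F']
# 81 ['I', 'H', 'A']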
This is known as a partition problem. You could try implementing the pseudo-polynomial time algorithm from the Wikipedia page. You'll have to modify it for 3 partitions instead of 2.
INPUT: A list of integers S
OUTPUT: True if S can be partitioned into two subsets that have equal sum

function find_partition(S):
    n ← |S|
    K ← sum(S)
    P ← empty boolean table of size (floor(K/2) + 1) by (n + 1)
    initialize top row (P(0, x)) of P to True
    initialize leftmost column (P(x, 0)) of P, except for P(0, 0), to False
    for i from 1 to floor(K/2):
        for j from 1 to n:
            if (i - S[j-1]) >= 0:
                P(i, j) ← P(i, j-1) or P(i - S[j-1], j-1)
            else:
                P(i, j) ← P(i, j-1)
    return P(floor(K/2), n)
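For reference, a compact Python translation of the two-way version (my own sketch; the three-way variant needs a two-dimensional reachable-sums state rather than this single array):

def find_partition(S):
    """True if S can be split into two subsets with equal sum (subset-sum DP)."""
    K = sum(S)
    if K % 2:                             # an odd total can never split evenly
        return False
    half = K // 2
    reachable = [True] + [False] * half   # reachable[i]: some subset of S sums to i
    for s in S:
        for i in range(half, s - 1, -1):  # backwards, so each item is used at most once
            reachable[i] = reachable[i] or reachable[i - s]
    return reachable[half]

print(find_partition([15, 30, 22, 19, 28, 3, 7, 28, 38, 48]))  # True: e.g. 48+38+30+3 = 119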

x-.y And what about intersection?

x-.y includes all items of x except for those that are cells of y
But what if I want to get all items that are cells of x and of y?
I can achieve this by
x -.^:2 y
But it requires running an expensive operation twice.
Is there a better solution?
e. is often useful when working with sets.
x e. y
gives a list of matches:
for each item of x return 1 if it exists in the "set" y, 0 otherwise.
1 2 3 4 e. 5 9 2
0 1 0 0
Then,
x (e. # [) y
selects those elements that do exist in both lists.
1 2 3 4 (e. # [) 5 9 2
2
5 8 (e. # [) i.12
5 8
Doing -. twice is the classic way of implementing intersection in J.
The inefficiency is minor (a constant factor - and, in general, you should not concern yourself with efficiency issues in J unless they exceed a factor of 2 - when you have resource problems you're generally going to want to focus on the factor of 1000 or greater issues).
Put differently, if ([-.-.) or -.^:2 is too slow for you, then -. would also be too slow for you. (This can happen on extremely large data sets where the underlying implementation has been inefficient. Recent versions of J have had some work done to correct this issue.)
Disappointing, perhaps, but practical.

loop rolling algorithm

I have come up with the term loop rolling myself with the hope that it does
not overlap with an existing term. Basically I'm trying to come up with an
algorithm to find loops in a printed text.
Some examples from simple to complicated
Example1
Given:
a a a a a b c d
I want to say:
5x(a) b c d
or algorithmically:
for 1 .. 5
print a
end
print b
print c
print d
Example2
Given:
a b a b a b a b c d
I want to say:
4x(a b) c d
or algorithmically:
for 1 .. 4
print a
print b
end
print c
print d
Example3
Given:
a b c d b c d b c d b c e
I want to say:
a 3x(b c d) b c e
or algorithmically:
print a
for 1 .. 3
print b
print c
print d
end
print b
print c
print d
It doesn't remind me of any algorithm I know of. I feel like some of these
problems can be ambiguous, but finding one of the solutions is enough for me for
now. Efficiency is always welcome but not mandatory. How can I do this?
EDIT
First of all, thanks for all the discussion. I have adapted an LZW algorithm
from Rosetta Code and ran it on my
input:
abcdbcdbcdbcdef
which gave me:
a
b
c
d
8 => bc
10 => db
9 => cd
11 => bcd
e
f
where I have a dictionary of:
a a
c c
b b
e e
d d
f f
8 bc
9 cd
10 db
11 bcd
12 dbc
13 cdb
14 bcde
15 ef
7 ab
It looks good for compression but it's not quite what I wanted. What I need
is more like compression in the algorithmic representation from my examples,
which would have:
subsequent sequences (if a sequence is repeating, there would be no other sequence in between)
no dictionary, but only loops
irreducible
maximum sequence sizes (which would minimize the algorithmic representation)
and let's say nested loops are allowed (contrary to what I said before in the comment)
I'll start with an algorithm which gives maximum sequence sizes. Though it will not always minimize the algorithmic representation, it may be used as an approximation algorithm, or it may be extended into the optimal algorithm.
Start by constructing the suffix array for your text, along with the LCP array.
Sort an array of indexes into the LCP array so that indexes of larger LCP elements come first. This groups together repeating sequences of the same length and allows processing sequences in a greedy manner, starting from the maximum sequence sizes.
Extract the suffix array entries, grouped by LCP value (by group I mean all entries with the selected LCP value as well as all entries with larger LCP values), and sort them by position in the text.
Filter out entries whose positional difference is not equal to the LCP. For the remaining entries, take prefixes of length equal to the LCP. This gives all possible repeating sequences in the text.
Add the sequences, sorted by starting position, to an ordered collection (for example, a binary search tree). Sequences are added in order of appearance in the sorted LCP, so longer sequences are added first. Sequences are added only if they are independent or if one of them is completely nested inside the other; intersecting intervals are ignored. For example, in caba caba bab the sequence ab intersects with caba and so is ignored. But in cababa cababa babab one instance of ab is dropped, 2 instances are completely inside the larger sequence, and 2 instances are completely outside of it.
At the end, this ordered collection contains all the information needed to produce the algorithmic representation.
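A naive Python sketch of the first two steps (my own illustration; real implementations build these structures in O(n log n) or O(n) rather than by sorting whole suffixes):

def suffix_and_lcp(text):
    """Naive suffix array (sorted suffix start positions) and LCP array."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    def lcp(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    lcps = [lcp(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]
    return sa, lcps

sa, lcps = suffix_and_lcp('ababcabab')
print([ 'ababcabab'[i:] for i in sa ])  # ab abab ababcabab abcabab b bab babcabab bcabab cabab
print(lcps)                             # [2, 4, 2, 0, 1, 3, 1, 0], as in the example below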
Example:
Text ababcabab
Suffix array ab abab ababcabab abcabab b bab babcabab bcabab cabab
LCP array 2 4 2 0 1 3 1 0
Sorted LCP 4 3 2 2 1 1 0 0
Positional difference 5 5 2 2 2 2 - -
Filtered LCP - - 2 2 - - - -
Filtered prefixes (ab ab) (ab ab)
Sketch of an algorithm producing the minimal algorithmic representation:
Start with the first 4 steps of the previous algorithm. The fifth step should be modified: now it is not possible to ignore intersecting intervals, so every sequence is added to the collection. Since the collection now contains intersecting intervals, it is better implemented as some more advanced data structure, for example an interval tree.
Then recursively determine the length of the algorithmic representation for all sequences that contain any nested sequences, starting from the smallest ones. Once every sequence is evaluated, compute the optimal algorithmic representation for the whole text. The algorithm for processing either a sequence or the whole text uses dynamic programming: allocate a matrix with the number of columns equal to the text/sequence length and the number of rows equal to the length of the algorithmic representation; doing an in-order traversal of the interval tree, update this matrix with all sequences possible at each text position; when more than one value is possible for some cell, either choose any of them, or give preference to longer or shorter sub-sequences.
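For comparison, a naive brute-force sketch (my own illustration, unrelated to the suffix-array machinery above): it greedily rolls the longest adjacently repeated block, recurses into the block and the remainder, and reproduces the question's three examples.

def roll(seq):
    """Greedy loop rolling: replace the run of an adjacently repeated block
    that saves the most symbols with 'Nx(block)'. O(n^3) brute force."""
    best = None  # (saved_symbols, start, block_len, count)
    n = len(seq)
    for start in range(n):
        for block_len in range(1, (n - start) // 2 + 1):
            block = seq[start:start + block_len]
            count = 1
            while seq[start + count * block_len:start + (count + 1) * block_len] == block:
                count += 1
            if count > 1:
                saved = (count - 1) * block_len
                if best is None or saved > best[0]:
                    best = (saved, start, block_len, count)
    if best is None:
        return [str(x) for x in seq]
    _, start, block_len, count = best
    inner = roll(seq[start:start + block_len])   # nested loops are allowed
    return (roll(seq[:start])
            + ['%dx(%s)' % (count, ' '.join(inner))]
            + roll(seq[start + count * block_len:]))

print(' '.join(roll('aaaaabcd')))       # 5x(a) b c d
print(' '.join(roll('ababababcd')))     # 4x(a b) c d
print(' '.join(roll('abcdbcdbcdbce')))  # a 3x(b c d) b c e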

Minimum and maximum possible value of a shared variable when incremented by multiple threads

I have a global shared variable that is updated 5 times by each of 5 spawned threads. As per my understanding, the increment operation consists of 3 instructions:
load reg, M
inc reg
store reg, M
So I want to ask: in this scenario, what would be the maximum and minimum final values, given arbitrary interleaving of the 5 threads?
According to me, the maximum value will be 25 (I am 100% sure that it cannot be more than 25) and the minimum value is 5. But I am not so sure about the minimum value. Can it be less than 5 in some arbitrary interleaving?
Any inputs will be greatly appreciated.
/* Global variable */
int var = 0;

/* Thread function */
void thread_func()
{
    for (int c = 0; c < 5; c++)
        var++;
}
Given your definition of increment, I agree with your max of 25.
However, I believe the min can be 2 under the following scenario. I've named the 5 threads A, B, C, D and E.
A loads 0
C, D, E run to completion
B runs through 4 of its 5 iterations.
A increments 0 to 1 and stores the result (1).
B loads 1
A runs to completion
B increments 1 to 2 and stores 2.
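To make the interleaving concrete, here is a small Python re-enactment of that schedule (my own illustration; it replays the load/inc/store steps deterministically rather than using real threads):

var = 0
reg = {}  # simulated per-thread register

def load(t):
    reg[t] = var            # load reg, M

def inc(t):
    reg[t] += 1             # inc reg

def store(t):
    global var
    var = reg[t]            # store reg, M

def full_increment(t):
    load(t); inc(t); store(t)

load('A')                    # A loads 0
for t in 'CDE':              # C, D, E run to completion
    for _ in range(5):
        full_increment(t)
for _ in range(4):           # B runs through 4 of its 5 iterations
    full_increment('B')
inc('A'); store('A')         # A increments 0 to 1 and stores 1
load('B')                    # B loads 1
for _ in range(4):           # A runs its remaining 4 iterations to completion
    full_increment('A')
inc('B'); store('B')         # B increments 1 to 2 and stores 2
print(var)                   # 2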
If I use the same logic given by jtdubs, the minimum value should be 1 in the following case.
Let's use the same naming of the 5 threads: A, B, C, D, and E.
A loads 0
B, C, D, E run to completion, incrementing the value to 20 (5 increments by each of the 4 threads).
A increments 0 to 1 and stores the result, 1.
I agree with a minimum of 2 (not 1).
The "minimum equals 1" solution ignores the fact that A still hasn't run to completion after it stores 1 in the shared memory.
With no other thread left to "interfere", thread A must still run through its remaining 4 iterations, ending with the result 5.
What the minimum-of-2 solution sets up is an endgame between the two remaining threads A and B, after all other threads have finished running, leading to the minimum possible outcome:
B "wastes" 4 iterations only to load 1 again, increment it, and store 2 after A has run to completion.
