I am very confused why the concurrent.futures module is giving me different results each time. I have a function, say foo(), which runs on segments of a larger set of data d.
I consistently break this larger data set d into parts and make a list
d_parts = [d1, d2, d3, ...]
Then following the documentation, I do the following
results = [executor.submit(foo, d) for d in d_parts]
which is supposed to give me a list of "futures" objects in the order of foo(d1), foo(d2), and so on.
However, when I try to compile results with
done, _ = concurrent.futures.wait(results)
The list of results stored in done seems to be out of order, i.e. the results are not the returns of foo(d1), foo(d2), ... but follow some different ordering. Hence, running this program on the same data set multiple times yields different results, due to the indeterminacy of which job finishes first (the d1, d2, ... are roughly the same size). Is there a reason for this? It seems that wait() should preserve the ordering in which the jobs were submitted.
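For reference, wait() returns a pair of sets of futures (done, not_done), and sets have no order; the list built with submit() does preserve submission order. A minimal sketch of collecting results in order, assuming foo and d_parts as above:

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    futures = [executor.submit(foo, d) for d in d_parts]
    concurrent.futures.wait(futures)           # block until everything is done
    ordered = [f.result() for f in futures]    # results in submission order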
Thanks!
I have a uniform 2D coordinate grid stored in a numpy array. The values of this array are assigned by a function that looks roughly like the following:
def update_grid(grid):
    n, m = grid.shape
    for i in range(n):
        for j in range(m):
            # assignment to grid[i][j]
Calling this function takes 5-10 seconds for a 100x100 grid, and it needs to be called several hundred times during the execution of my main program. This function is the rate-limiting step, so I want to reduce its processing time as much as possible.
I believe that the assignment expression inside can be split up in a manner which accommodates multiprocessing. The value at each gridpoint is independent of the others, so the assignments can be split something like this:
from multiprocessing import Process

def update_grid(grid):
    n, m = grid.shape
    for i in range(n):
        for j in range(m):
            p = Process(target=assignment)  # assignment computes grid[i][j]
            p.start()
So my questions are:
1. Does the above loop structure ensure each process will only operate on a single gridpoint? Do I need anything else to allow each process to write to the same array, even if they're writing to different places in that array?
2. The assignment expression requires a set of parameters. These are constant, but each process will be reading them at the same time. Is this okay?
3. To explicitly write the code I've structured above, I would need to define my assignment expression as another function inside of update_grid, correct?
4. Is this actually worthwhile?
Thanks in advance.
Edit:
I still haven't figured out how to speed up the assignment, but I was able to avoid the problem by changing my main program. It no longer needs to update the entire grid with each iteration, and instead tracks changes and only updates what has changed. This cut the execution time of my program down from an hour to less than a minute.
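For completeness, one common way to parallelize the original per-point loop is to compute whole rows in worker processes and write them back in the parent, which sidesteps the shared-array questions entirely. A minimal sketch, where compute is a hypothetical stand-in for the real assignment expression:

import numpy as np
from multiprocessing import Pool

def compute(i, j, params):
    # Hypothetical placeholder for the real per-gridpoint assignment.
    return (i + j) * params

def compute_row(args):
    i, m, params = args
    return [compute(i, j, params) for j in range(m)]

def update_grid(grid, params):
    n, m = grid.shape
    with Pool() as pool:  # one task per row avoids per-point process overhead
        rows = pool.map(compute_row, [(i, m, params) for i in range(n)])
    grid[:] = np.array(rows)  # write back in the parent; no shared array needed

if __name__ == "__main__":
    grid = np.zeros((100, 100))
    update_grid(grid, params=0.5)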
I would like to create a variable number of threads in Prolog and make the main thread wait for all of them.
I have tried calling a join for each one of them in the predicate, but it seems like they wait for one another and run in sequential order.
I have also tried storing the ids of the threads in a list and join each one after but it still isn't working.
In the code sample, I have also tried passing the S parameter of thread_join along in the recursive call.
thr1(0) :- !.
thr1(N) :-
    thread_create(someFunction(N), Id, []),
    thread_join(Id, S),
    N1 is N - 1,
    thr1(N1).
I expect the output of the N threads to overlap when they print, but they run in sequential order.
Most likely the calls to your someFunction/1 predicate succeed faster than the time it takes to create the next thread, which is a relatively heavy operation, as SWI-Prolog threads are mapped to POSIX threads. Thus, to actually get overlapping results, the computation time of the thread goals must exceed thread-creation time. For a toy example of accomplishing that, see:
https://github.com/LogtalkDotOrg/logtalk3/tree/master/examples/threads/sync
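Note also that, in the code as posted, calling thread_join/2 immediately after thread_create/3 blocks until that thread terminates, which by itself forces sequential execution; the usual pattern is to create all threads first and join them afterwards. That pattern, sketched in Python for concreteness (some_function's body is a made-up stand-in):

import threading
import time

def some_function(n):
    # Stand-in goal: sleep long enough that its run time exceeds
    # thread-creation time, so the printed output can interleave.
    time.sleep(0.1)
    print(f"thread {n} finished")

threads = [threading.Thread(target=some_function, args=(n,)) for n in range(5)]
for t in threads:
    t.start()   # create and start all threads first...
for t in threads:
    t.join()    # ...then wait for all of them (like joining a list of Ids)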
I have to calculate the result of a stochastic algorithm multiple times. In the end I want to have all results in an array. The executions of the algorithm are independent of one another. In Julia this can be parallelized easily with a parallel for-loop like this:
@parallel (vcat) for i = 1:10
    rand()  # or any other algorithm yielding a number
end
But it seems a little inefficient if one thread gets the result of another thread and the two results are merged after every iteration of the for loop.
Is this correct? In that case, could it be that one thread ends up with a 100-element array and another one with a 200-element array, and these arrays are merged into a 300-element array?
Could I somehow prevent this and rewrite the above code to prevent multiple array allocations and maybe put the result that is calculated inside the for-loop into a pre-allocated array?
Or can I make the reduction operator smarter somehow?
You could use pmap for this. It can distribute the work in parallel over your workers, and then store the results of each job as a separate element in an array. You can then combine this array at the end.
Consider this example, where each job is to create a random vector of differing length, all of which are combined at the end:
addprocs(3)
Results = pmap(rand, 1:10)
Result = vcat(Results...) ## array of length 55.
pmap will assign each worker a job as soon as it finishes the job it is working on. As such, it can be more efficient than @parallel if your jobs are of variable length (see here for details).
The ... syntax breaks the elements of Results (i.e. the 10 vectors of varying length) into separate arguments to feed to the vcat function.
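For comparison, the same collect-then-concatenate pattern sketched in Python with multiprocessing.Pool (not from the original answer; random_vector is an illustrative stand-in):

import numpy as np
from multiprocessing import Pool

def random_vector(n):
    # Stand-in for "any algorithm yielding numbers": a length-n random vector.
    return np.random.rand(n)

if __name__ == "__main__":
    with Pool(3) as pool:                 # roughly analogous to addprocs(3)
        results = pool.map(random_vector, range(1, 11))
    result = np.concatenate(results)      # array of length 55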
Problem
Summary: apply a function F in parallel to each element of an array, where F is NOT thread-safe.
I have a set of elements E to process, let's say a queue of them.
I want to process all these elements in parallel using the same function f(E).
Now, ideally I could use a map-based parallel pattern, but the problem has the following constraints.
Each element contains a pair of two objects (E = (A, B)).
Two elements may share an object (E1 = (A1, B1); E2 = (A1, B2)).
The function f cannot process two elements that share an object, so E1 and E2 cannot be processed in parallel.
What is the right way of doing this?
My thoughts are as follows:
1. Trivial thought: keep a set of active As and Bs, and start processing an element only when no other thread is already using its A OR B. When you give an element to a thread, add its A and B to the active set. Pick the first element; if its objects are not in the active set, spawn a new thread, otherwise push it to the back of the queue of elements. Do this till the queue is empty. Will this cause a deadlock? Ideally, when processing of an element is over, its objects become available again, right?
2. The other thought is to make a graph of these connected objects. Each node represents an object (A / B), and each element is an edge connecting A and B. Then process the data in some order such that the elements being worked on never overlap.
Questions
How can we achieve this best?
Is there a standard pattern to do this?
Is there a problem with these approaches?
Not necessary, but if you could point out the TBB methods to use, that would be great.
The "best" approach depends on a lot of factors here:
How many elements "E" do you have, and how much work is needed for f(E)? --> Check whether it's really worth it to process the elements in parallel (if you need a lot of locking and don't have much work to do, you'll probably slow the process down by working in parallel).
Is there any possibility to change the design so that f(E) becomes thread-safe?
How many elements "A" and "B" are there? Is there any logic to which elements "E" share specific versions of A and B? --> If you can sort the elements E into separate lists where each A and B only appears in a single list, then you can process these lists parallel without any further locking.
If there are many different A's and B's and you don't share too many of them, you may want to do a trivial approach where you just lock each "A" and "B" when entering and wait until you get the lock.
Whenever you do "lock and wait" with multiple locks it's very important that you always take the locks in the same order (e.g. always A first and B second) because otherwise you may run into deadlocks. This locking order needs to be observed everywhere (a single place in the whole application that uses a different order can cause a deadlock)
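A minimal sketch of that fixed-order locking in Python (the Obj wrapper and the name-based ordering are illustrative assumptions, not from the original answer):

import threading

class Obj:
    # An A or B object carrying its own lock (illustrative wrapper).
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()

def process(element, f):
    a, b = element
    # Always take the two locks in one fixed global order (here: by name),
    # so no two threads can each hold one lock while waiting for the other.
    first, second = (a, b) if a.name <= b.name else (b, a)
    with first.lock:
        with second.lock:
            f(element)  # safe: no other thread is using this A or B now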
Edit: Also, if you do "try lock" you need to ensure that the order is always the same. Otherwise you can cause a livelock:
1. Thread 1 locks A
2. Thread 2 locks B
3. Thread 1 tries to lock B and fails
4. Thread 2 tries to lock A and fails
5. Thread 1 releases lock A
6. Thread 2 releases lock B
7. Go to 1 and repeat...
The chances that this actually repeats endlessly are relatively slim, but it should be avoided anyway.
Edit 2: In principle, I guess I'd just split E(Ax, Bx) into different lists based on Ax (e.g. one list for all E's that share the same A). Then process these lists in parallel with locking of "B" (there you can still "TryLock" and continue if the required B is already in use).
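A minimal sketch of that grouping step in Python (illustrative, not from the original answer):

from collections import defaultdict

def split_by_a(elements):
    # All elements sharing the same A go into one list; each list can then
    # be processed by its own worker. B objects still need locking, since
    # two lists may share a B.
    groups = defaultdict(list)
    for a, b in elements:
        groups[a].append((a, b))
    return list(groups.values())

# Example: E1 = (A1, B1) and E2 = (A1, B2) land in the same list.
print(split_by_a([("A1", "B1"), ("A1", "B2"), ("A2", "B1")]))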
Assume there are 2 threads performing operations on a shared queue q. The lines of code for each thread are numbered, and initially the queue is empty.
Thread A:
A1) q.enq(x)
A2) q.deq()
Thread B:
B1) q.enq(y)
Assume that the order of execution is as follows:
A1) q.enq(x)
B1) q.enq(y)
A2) q.deq()
and as a result we get y (i.e. q.deq() returns y)
This execution is based on a well-known book and is said to be sequentially consistent. Notice that the method calls don't overlap. How is that even possible? I believe that Thread A executed A1 without actually updating the queue until it proceeded to line A2, but that's just my guess. I'm even more confused when I look at this explanation from The Java Language Specification:
Sequential consistency is a very strong guarantee that is made about visibility and ordering in an execution of a program. Within a sequentially consistent execution, there is a total order over all individual actions (such as reads and writes) which is consistent with the order of the program, and each individual action is atomic and is immediately visible to every thread.
If that were the case, we would have dequeued x.
I'm sure I'm somehow wrong. Could somebody shed some light on this?
Note that the definition of sequential consistency says "consistent with program order", not "consistent with the order in which the program happens to be executed".
It goes on to say:
If a program has no data races, then all executions of the program will appear to be sequentially consistent.
(my emphasis of "appear").
Java's memory model does not enforce sequential consistency. As the JLS says:
If we were to use sequential consistency as our memory model, many of the compiler and processor optimizations that we have discussed would be illegal. For example, in the trace in Table 17.3, as soon as the write of 3 to p.x occurred, subsequent reads of that location would be required to see that value.
So Java's memory model doesn't actually support sequential consistency. Just the appearance of sequential consistency. And that only requires that there is some sequentially consistent order of actions that's consistent with program order.
Clearly there is some execution of threads A and B that could result in A2 returning y, specifically:
B1) q.enq(y)
A1) q.enq(x)
A2) q.deq()
So, even if the program happens to be executed in the order you specified, there is an order in which it could have been executed that is "consistent with program order" for which A2 returns y. Therefore, a program that returns y in that situation still gives the appearance of being sequentially consistent.
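To make that concrete, here is the reordered execution written out as a single-threaded Python sketch (illustrative only; it just replays the order given above):

from collections import deque

q = deque()
q.append('y')        # B1) q.enq(y)
q.append('x')        # A1) q.enq(x)
print(q.popleft())   # A2) q.deq() returns 'y'

Each thread's program order is respected (A1 before A2), yet the dequeue returns y, so the observed result matches a sequentially consistent execution.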
Note that this shouldn't be interpreted as saying that it would be illegal for A2 to return x, because there is a sequentially consistent sequence of operations that is consistent with program order that could give that result.
Note also that this appearance of sequential consistency only applies to correctly synchronized programs. If your program is not correctly synchronized (i.e. has data races) then all bets are off.