Computing c_i = √(a_i × b_i) in parallel using nested parallelism - multithreading

Let's say we have two vectors A = (a_i) and B = (b_i), each of size n, and we have to compute a new vector C = (c_i) as c_i = √(a_i × b_i) for i = 1, ..., n.
Main question: what would be the best way to compute the c_i in parallel (using nested parallelism, i.e. using spawn and sync)?
I think the sequential computation looks like this:
for (i = 1 to n) {
    C[i] = Math.sqrt(A[i] * B[i]);
}
Is there any way to use parallel for loops to compute C in parallel?
If so, I think the approach would be the following:
parallel for (i = 1 to n) {
    C[i] = Math.sqrt(A[i] * B[i]);
}
Is this correct?

Assuming that by best you mean fastest, the usual approach would be to divide A and B into chunks, spawn a separate thread for handling each of these chunks in parallel, and wait for all the threads to finish their tasks.
The optimal number of chunks for such a computation will most likely be the number of CPU cores on your machine. The pseudocode would look like:
chunkSize = ceiling(n / numberOfCPUs)
for (t = 1 to numberOfCPUs) {
    startIndex = (t - 1) * chunkSize + 1
    size = min(chunkSize, C.size - startIndex + 1)
    if (size > 0)   // trailing threads may have nothing to do when n is small
        threads.add(Thread.spawn(startIndex, size))
}
threads.joinAll()   // wait for every chunk to finish
Each thread, given its startIndex and size, computes:
for (i = startIndex to startIndex + size - 1) {
    C[i] = Math.sqrt(A[i] * B[i])
}
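A minimal Python sketch of the chunked approach described above. The name sqrt_products_chunked and the workers parameter are illustrative assumptions, not part of the original pseudocode. In standard CPython builds the global interpreter lock keeps threads from running this kind of arithmetic truly in parallel, so the sketch shows the partitioning and the join rather than a real speedup; a process pool would be the usual substitute there.

```python
import math
import os
from concurrent.futures import ThreadPoolExecutor


def sqrt_products_chunked(a, b, workers=None):
    """Compute c[i] = sqrt(a[i] * b[i]) with one contiguous chunk per worker."""
    n = len(a)
    workers = workers or os.cpu_count() or 1
    c = [0.0] * n
    chunk_size = max(1, math.ceil(n / workers))

    def work(start):
        # Each task owns the half-open index range [start, start + chunk_size).
        for i in range(start, min(start + chunk_size, n)):
            c[i] = math.sqrt(a[i] * b[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so exceptions raised in workers surface here.
        list(pool.map(work, range(0, n, chunk_size)))
    # Leaving the with-block waits for every chunk, i.e. the "sync" step.
    return c
```

Because the chunks are disjoint, no two tasks ever write the same element of c, which is why no locking is needed.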
Another approach would be to have a pool of threads and give those threads a single shared queue of indices 1, 2, ..., n. On each iteration, a thread polls the top index (call it i) and calculates C[i]. As soon as the queue is empty, the work is done. The problem here is that you need an additional synchronization mechanism to guarantee that every index is processed by exactly one thread. For simple tasks (like yours), such a mechanism might consume more resources than the actual calculation, but for relatively long-running tasks it works pretty well.
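The shared-queue variant can be sketched in Python as follows. The function name and the default of 4 workers are assumptions made for illustration; queue.Queue supplies the synchronization that guarantees each index is handed to exactly one thread.

```python
import math
import queue
import threading


def sqrt_products_shared_queue(a, b, workers=4):
    """Compute c[i] = sqrt(a[i] * b[i]); threads pull indices from one shared queue."""
    n = len(a)
    c = [0.0] * n
    indices = queue.Queue()
    for i in range(n):
        indices.put(i)

    def worker():
        while True:
            try:
                i = indices.get_nowait()  # thread-safe: one consumer per index
            except queue.Empty:
                return                    # queue drained, this worker is done
            c[i] = math.sqrt(a[i] * b[i])

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c
```

Every element costs one locked queue operation here, which illustrates why this layout tends to suit long-running tasks better than a single square root.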
There's also a hybrid approach (work stealing): you break the initial set of tasks into chunks and give each thread in the pool its own chunk, but when a thread is done with its chunk, it starts 'stealing' tasks from other threads so that it does not sit idle. On many real workloads it gives better results than either of the previous approaches.


How can I include a progress indicator in Octave for parallel computations?

I wrote a function in Octave that uses parcellfun from the parallel package to split calculations up across multiple threads.
Even with multithreading, though, some calculations may take multiple hours to finish, so I would like to include some kind of progress indicator along the way. In the non-parallel version, it was fairly simple to just send the iteration counter to a waitbox object. The parallel version causes some problems.
So far, I have tried to write an extra function that could be called by each parallel child. That function is as follows. It uses persistent variables to try and keep information between the threads.
function parallelWaitbox(i, s)
  persistent n = 0;   % Completed calculations
  persistent m = 100; % Total calculations
  persistent l = 0;   % Last percentage done (0:0.01:1)
  persistent h;       % Waitbox handle
  % Send 0 to initialize
  if (0 == i)
    n = 0;
    m = s;
    msg = sprintf("Total Operations: %i\r\n%i%% Complete", m, 0);
    h = waitbar(0, msg);
  endif
  % Send 1 to increment
  if (1 == i)
    n++;
    % Special case: max
    if (n == m)
      msg = sprintf("Total Operations: %i\r\n100%% Complete", m);
      waitbar(1, h, msg);
    else
      p = floor(100*n/m)/100;
      if p > l
        msg = sprintf("Total Operations: %i\r\n%i%% Complete", m, p*100);
        waitbar(p, h, msg);
        l = p;
      endif
    endif
  endif
endfunction
It is initialized with a call of parallelWaitbox(0,max) before the parcellfun call, and the parallel function calls parallelWaitbox(1) when it finishes. Unfortunately, because each thread is its own instance of Octave, they don't share this function, even when mlock() is called.
I tried to pass a handle to the parallelWaitbox function to the parallel function, in hopes it would help the different threads access the same version of the function, but it did not work.
I am not sure if passing a handle to the waitbox object would work, but even if it did there is no way to read from the waitbox that I am aware of, so the problem of keeping track of the current state would remain.
I know that I could use a for loop to split my parcellfun call up to 100 chunks, but I'd really rather avoid slowing my processing down. If there's a better way to do this, I'd love to know about it. I am not tied to the waitbox object if there is an alternative.
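For comparison, here is a small Python sketch (not Octave) of one way around the shared-state problem: the workers report nothing, and the parent counts completions as results arrive. The names run_with_progress and report are hypothetical, and the sketch makes no claim about whether parcellfun exposes a comparable hook.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(func, items, workers=4, report=print):
    """Apply func to every item in parallel; call report(percent) from the parent."""
    results = [None] * len(items)
    last_percent = -1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the position of its input.
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            percent = 100 * done // len(items)
            if percent > last_percent:  # only report when the integer percent changes
                report(percent)
                last_percent = percent
    return results
```

All progress state lives in the parent, so nothing has to be shared between workers and the work does not have to be split into artificial batches.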

Why does a nodejs array shift/push loop run 1000x slower above array length 87369?

Why is the speed of nodejs array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in .05 seconds; the latter, in 80 seconds -- 1600 times slower. (observed on 32-bit debian linux with node v0.10.29)
q = [];
// preload the queue with some data
for (i=0; i<87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i=0; i<100000; i++) {
    q.shift();
    q.push({});
    if (i%10000 === 0) process.stdout.write(".");
}
On 64-bit Debian Linux, node v0.10.29 starts to crawl at 85984 elements: .06 seconds below that size versus 56 seconds at it. Node v0.11.13 has similar breakpoints, but at different array sizes.
Shift is a very slow operation for arrays because all the elements have to be moved, but V8 can use a trick to make it fast when the array contents fit in a page (1 MB).
Empty arrays start with 4 slots, and as you keep pushing, V8 resizes the array using the formula 1.5 * (old length + 1) + 16.
var j = 4;
while (j < 87369) {
    j = (j + 1) + Math.floor(j / 2) + 16
}
So your array size actually ends up being 124517 items, which makes it too large.
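The growth loop is easy to check by hand. This Python port of the snippet above (the function name grown_capacity is an assumption) applies the same update rule and returns the first capacity that reaches the requested length:

```python
def grown_capacity(target, start=4):
    """Repeat the growth step from the answer until capacity >= target."""
    capacity = start
    while capacity < target:
        # Same rule as the JavaScript loop: (j + 1) + floor(j / 2) + 16
        capacity = (capacity + 1) + capacity // 2 + 16
    return capacity


print(grown_capacity(87369))  # prints 124517
```

The last two steps are 83000, which is still below 87369, and then 124517, matching the figure quoted in the answer.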
You can actually preallocate your array just to the right size and it should be able to fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (i=0; i<87369; i++) q[i] = {};
If you need a larger queue than that, use the right data structure.
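As an illustration of what "the right data structure" could look like, here is a minimal circular-buffer queue in Python. The class name ArrayQueue and its method names are assumptions for this sketch; the point is only that dequeuing advances a head index instead of moving every element.

```python
class ArrayQueue:
    """FIFO queue on a circular buffer; enqueue and dequeue are amortized O(1)."""

    def __init__(self, capacity=16):
        self._items = [None] * capacity
        self._head = 0   # index of the oldest element
        self._size = 0   # number of stored elements

    def __len__(self):
        return self._size

    def enqueue(self, item):
        if self._size == len(self._items):
            self._grow()
        tail = (self._head + self._size) % len(self._items)
        self._items[tail] = item
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise IndexError("dequeue from empty queue")
        item = self._items[self._head]
        self._items[self._head] = None  # drop the reference
        self._head = (self._head + 1) % len(self._items)
        self._size -= 1
        return item

    def _grow(self):
        # Double the storage and unroll the ring so the head sits at index 0.
        old, n = self._items, len(self._items)
        self._items = [old[(self._head + i) % n] for i in range(self._size)] + [None] * n
        self._head = 0
```

In day-to-day Python code, collections.deque already provides the same O(1) behaviour at both ends.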
I started digging into the v8 sources, but I still don't understand it.
I instrumented deps/v8/src/ (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only 1000 shifts per second because each one takes 1ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
Of that, the memmove takes only 0.000040 seconds; the bulk is spent in heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
  if (!InNewSpace(address)) {
    for (int i = 0; i < len; i++) {
      store_buffer_.Mark(address + start + i * kPointerSize);
    }
  }
}
which calls (store-buffer-inl.h):
void StoreBuffer::Mark(Address addr) {
  Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
  *top++ = addr;
  if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
    ASSERT(top == limit_);
  } else {
    ASSERT(top < limit_);
  }
}
When the code is running slow, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements is only called a handful of times at the end, with just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it hasn't fallen into place for me yet.
Edit: forget that last edit about output buffering artifacts, I was filtering duplicates.
This bug had been reported to Google, who closed it without studying the issue.
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
The response:
The GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?

lock-free bounded MPMC ringbuffer failure

I've been banging my head against (my attempt) at a lock-free multiple producer multiple consumer ring buffer. The basis of the idea is to use the innate overflow of unsigned char and unsigned short types, fix the element buffer to either of those types, and then you have a free loop back to beginning of the ring buffer.
The problem is - my solution doesn't work for multiple producers (it does though work for N consumers, and also single producer single consumer).
#include <atomic>

template<typename Element, typename Index = unsigned char> struct RingBuffer
{
    std::atomic<Index> readIndex;
    std::atomic<Index> writeIndex;
    std::atomic<Index> scratchIndex;
    Element elements[1 << (sizeof(Index) * 8)];

    RingBuffer() : readIndex(0), writeIndex(0), scratchIndex(0) {}

    bool push(const Element & element)
    {
        while(true)
        {
            const Index currentReadIndex = readIndex.load();
            Index currentWriteIndex = writeIndex.load();
            const Index nextWriteIndex = currentWriteIndex + 1;
            if(nextWriteIndex == currentReadIndex)
                return false; // queue is full

            // scratchIndex is the pseudo-lock: only one producer gets past this
            if(scratchIndex.compare_exchange_strong(
                currentWriteIndex, nextWriteIndex))
            {
                elements[currentWriteIndex] = element;
                writeIndex = nextWriteIndex;
                return true;
            }
        }
    }

    bool pop(Element & element)
    {
        while(true)
        {
            Index currentReadIndex = readIndex.load();
            const Index currentWriteIndex = writeIndex.load();
            const Index nextReadIndex = currentReadIndex + 1;
            if(currentReadIndex == currentWriteIndex)
                return false; // queue is empty

            element = elements[currentReadIndex];
            if(readIndex.compare_exchange_strong(
                currentReadIndex, nextReadIndex))
                return true;
        }
    }
};
The main idea for writing was to use a temporary index 'scratchIndex' that acts as a pseudo-lock to allow only one producer at any one time to copy-construct into the elements buffer, before updating the writeIndex and allowing any other producer to make progress. Before I am called a heathen for implying my approach is 'lock-free': I realise that this approach isn't exactly lock-free, but in practice (if it would work!) it is significantly faster than having a normal mutex!
I am aware of a (more complex) MPMC ringbuffer solution here, but I am really experimenting with my idea to then compare against that approach and find out where each excels (or indeed whether my approach just flat out fails!).
Things I have tried:
Using compare_exchange_weak
Using more precise std::memory_order's that match the behaviour I want
Adding cacheline pads between the various indices I have
Making elements std::atomic instead of just Element array
I am sure that this boils down to a fundamental segfault in my head as to how to use atomic accesses to get round using mutex's, and I would be entirely grateful to whoever can point out which neurons are drastically misfiring in my head! :)
This is a form of the A-B-A problem. A successful producer looks something like this:
1. load currentReadIndex
2. load currentWriteIndex
3. cmpxchg store scratchIndex = nextWriteIndex
4. store element
5. store writeIndex = nextWriteIndex
If a producer stalls for some reason between steps 2 and 3 for long enough, it is possible for the other producers to produce an entire queue's worth of data and wrap back around to the exact same index so that the compare-exchange in step 3 succeeds (because scratchIndex happens to be equal to currentWriteIndex again).
By itself, that isn't a problem. The stalled producer is perfectly within its rights to increment scratchIndex to lock the queue. Even if a magical ABA-detecting cmpxchg rejected the store, the producer would simply try again, reload exactly the same currentWriteIndex, and proceed normally.
The actual problem is the nextWriteIndex == currentReadIndex check between steps 2 and 3. The queue is logically empty if currentReadIndex == currentWriteIndex, so this check exists to make sure that no producer gets so far ahead that it overwrites elements that no consumer has popped yet. It appears to be safe to do this check once at the top, because all the consumers should be "trapped" between the observed currentReadIndex and the observed currentWriteIndex.
Except that another producer can come along and bump up the writeIndex, which frees the consumer from its trap. If a producer stalls between steps 2 and 3, when it wakes up the stored value of readIndex could be absolutely anything.
Here's an example, starting with an empty queue, that shows the problem happening:
Producer A runs steps 1 and 2. Both loaded indices are 0. The queue is empty.
Producer B interrupts and produces an element.
Consumer pops an element. Both indices are 1.
Producer B produces 255 more elements. The write index wraps around to 0, the read index is still 1.
Producer A awakens from its slumber. It had previously loaded both read and write indices as 0 (empty queue!), so it attempts step 3. Because the other producer coincidentally paused on index 0, the compare-exchange succeeds, and the store progresses. At completion the producer lets writeIndex = 1, and now both stored indices are 1, and the queue is logically empty. A full queue's worth of elements will now be completely ignored.
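The interleaving in this example can be replayed deterministically. The Python model below is a single-threaded sketch, not the original C++: the class Ring and the split into push_begin and push_finish are assumptions introduced so that the stall between the index loads and the compare-exchange can be scripted.

```python
SIZE = 256  # an unsigned char index wraps at 256


class Ring:
    def __init__(self):
        self.read = self.write = self.scratch = 0
        self.elements = [None] * SIZE

    def push_begin(self):
        """Load both indices and do the fullness check; None means 'full'."""
        r, w = self.read, self.write
        nxt = (w + 1) % SIZE
        return None if nxt == r else (w, nxt)

    def push_finish(self, snapshot, value):
        """Compare-exchange on scratch, store the element, publish writeIndex."""
        w, nxt = snapshot
        if self.scratch != w:
            return False  # compare-exchange failed
        self.scratch = nxt
        self.elements[w] = value
        self.write = nxt
        return True

    def push(self, value):
        snap = self.push_begin()
        return snap is not None and self.push_finish(snap, value)

    def pop(self):
        if self.read == self.write:
            return None  # logically empty
        value = self.elements[self.read]
        self.read = (self.read + 1) % SIZE
        return value

    def size(self):
        return (self.write - self.read) % SIZE


def replay():
    q = Ring()
    stalled = q.push_begin()            # Producer A loads 0 and 0, then stalls
    assert q.push("b0")                 # Producer B produces an element
    assert q.pop() == "b0"              # Consumer pops it; both indices are 1
    for i in range(255):                # Producer B produces 255 more elements
        assert q.push(i)
    before = q.size()
    assert q.push_finish(stalled, "a")  # Producer A wakes up; the cmpxchg succeeds
    return before, q.size()
```

Calling replay() returns the number of queued elements just before and just after Producer A resumes, which makes the loss visible without relying on thread timing.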
(I should mention that the only reason I can get away with talking about "stalling" and "waking up" is that all the atomics used are sequentially consistent, so I can pretend that we're in a single-threaded environment.)
Note that the way you are using scratchIndex to guard concurrent writes is essentially a lock; whoever successfully completes the cmpxchg gets total write access to the queue until it releases the lock. The simplest way to fix this failure is to just replace scratchIndex with a spinlock: it won't suffer from A-B-A, and it's what is actually happening.
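A rough Python sketch of that fix follows. Python has no spinlock primitive, so threading.Lock stands in for it, and the sketch assumes the fullness check moves inside the critical section so that it can no longer go stale. For simplicity the same lock also guards pop, which the answer does not require. The class name LockedRingBuffer is hypothetical.

```python
import threading


class LockedRingBuffer:
    """Bounded ring buffer; one slot stays unused to tell 'full' from 'empty'."""

    def __init__(self, capacity=256):
        self._elements = [None] * capacity
        self._read = 0
        self._write = 0
        self._lock = threading.Lock()  # stand-in for the spinlock

    def push(self, element):
        with self._lock:
            nxt = (self._write + 1) % len(self._elements)
            if nxt == self._read:      # checked while holding the lock
                return False           # full
            self._elements[self._write] = element
            self._write = nxt
            return True

    def pop(self):
        with self._lock:
            if self._read == self._write:
                return False, None     # empty
            element = self._elements[self._read]
            self._read = (self._read + 1) % len(self._elements)
            return True, element
```

Since the indices are only read and written under the lock, a producer can no longer act on a read index it observed before other threads made progress.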
bool push(const Element & element)
{
    while(true)
    {
        const Index currentReadIndex = readIndex.load();
        Index currentWriteIndex = writeIndex.load();
        const Index nextWriteIndex = currentWriteIndex + 1;
        if(nextWriteIndex == currentReadIndex)
            return false;
        if(scratchIndex.compare_exchange_strong(
            currentWriteIndex, nextWriteIndex))
        {
            elements[currentWriteIndex] = element;
            // Problem here!
            writeIndex = nextWriteIndex;
            return true;
        }
    }
}
I've marked the problematic spot. Multiple threads can get to the writeIndex = nextWriteIndex at the same time. The data will be written in any order, although each write will be atomic.
This is a problem because you're trying to update two values using the same atomic condition, which is generally not possible. Assuming the rest of your method is fine, one way around this would be to combine both scratchIndex and writeIndex into a single value of double-size. For example, treating two uint32_t values as a single uint64_t value and operating atomically on that.

Design pattern for asynchronous while loop

I have a function that boils down to:
bool doWork = true;
while(doWork)
{
    config = generateConfigurationForTesting();
    result = executeWork(config);
    doWork = !isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread safe, independent of previous iterations, and probably require more iterations than the maximum number of allowable threads?
The problem here is we don't know how many iterations are required in advance so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of arbitrarily chosen values and sleeping:
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number
dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
while(doWork)
{
    if(thread_count < max_threads)
    {
        dispatch_async(queue, ^{ Config myconfig = generateConfigurationForTesting();
                                 Result myresult = executeWork(myconfig);
                                 dispatch_async(queue, ^{ checkResult(myresult); }); });
        thread_count++;
    }
    else
        usleep(100); // don't consume too much CPU
}
void checkResult(Result value)
{
    if(value == good) doWork = false;
    thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique, or otherwise a generator that can produce a near-infinite number of configurations (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model you've created: your executor needs to be limited by some reasonable assumptions about the queue, and you don't want to over-generate, as that would just extend the length of the run after you have already found a value == good measurement.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option to making this look a little nicer would be to have checkResult add new elements into the queue if value!=good. This way, you load up the initial elements of the queue using dispatch_apply( 20, queue, ^{ ... }) and you don't need the thread_count at all. The first 20 will be added using dispatch_apply (or an amount that dispatch_apply feels is appropriate for your configuration) and then each time checkResult is called you can either set doWork=false or add another operation to queue.
dispatch_apply() works for this, just pass ncpu as the number of iterations (apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() unless !doWork).
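The shape of that last suggestion, a fixed number of workers that each keep looping until someone finds a good result, can be sketched in Python. This is not GCD code: search_until_done and its parameters are hypothetical names, and threading.Event plays the role of the shared doWork flag.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def search_until_done(generate, execute, is_good, workers=None):
    """Run `workers` loops in parallel until any of them produces a good result."""
    workers = workers or os.cpu_count() or 1
    done = threading.Event()       # thread-safe replacement for the doWork flag
    found = []
    found_lock = threading.Lock()

    def worker():
        try:
            while not done.is_set():
                result = execute(generate())
                if is_good(result):
                    with found_lock:
                        found.append(result)
                    done.set()     # tell the other workers to stop
        except Exception:
            done.set()             # do not leave the other workers spinning
            raise

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        for future in futures:
            future.result()        # wait, and re-raise any worker error
    return found
```

No thread counter or sleep is needed: the pool size bounds the concurrency, and several good results may be returned if workers finish at nearly the same time.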

Seeking help with a MT design pattern

I have a queue of 1000 work items and an n-proc machine (assume n =
4). The main thread spawns n (= 4) worker threads at a time (250 outer
iterations) and waits for all threads to complete before processing
the next n (= 4) items, until the entire queue is processed:
for(i= 0 to queue.Length / numprocs)
for(j= 0 to numprocs)
The work done by each (worker) thread is not homogeneous. Therefore, in
1 batch (of n), if thread 1 spends 1000 s doing work and the other 3
threads only 1 s, the above design is inefficient, because after 1 s the
other 3 processors are idling. Besides, there is no pooling: 1000
distinct threads are being created.
How do I use the NT thread pool (I am not familiar enough with it, hence
the long-winded question) and QueueUserWorkItem to achieve the above? The
following constraints should hold:
The main thread requires that all work items are processed before
it can proceed. So I would think that a waitall-like construct as above
is required.
I want to create as many threads as processors (i.e. not 1000 threads
at a time).
Also, I don't want to create 1000 distinct events, pass them to the worker
threads, and wait on all events using the QueueUserWorkItem API or
Existing code is in C++. I prefer C++ because I don't know C#.
I suspect that the above is a very common pattern and was looking for
input from you folks.
I'm not a C++ programmer, so I'll give you some half-way pseudocode for it:
tcount = 0
maxproc = 4
while queue_item = queue.get_next() # depends on implementation of queue
                                    # may well be:
                                    # for i=0; i<queue.length; i++
    while tcount == maxproc
        wait 0.1 seconds # or some other interval that isn't as cpu intensive
                         # as continuously running the loop
    tcount += 1 # must be atomic (reading the value and writing the new
                # one must happen without interruption from other
                # threads). In C++ a plain ++tcount is not atomic;
                # declare tcount as std::atomic<int> instead.
    new thread(worker, queue_item)
while tcount > 0         # the main thread must not proceed until
    wait 0.1 seconds     # every worker has finished
function worker(item)
    # stuff with item here...
    tcount -= 1 # must be atomic
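The same pattern, a fixed set of workers that pull items as they become free while the main thread waits for everything, is what a thread pool provides. The sketch below uses Python's ThreadPoolExecutor rather than the NT thread pool API, and process_all is a hypothetical name.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_all(items, worker, max_workers=None):
    """Run worker(item) for every item on a fixed pool and wait for all of them."""
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands the next item to whichever worker frees up first,
        # so one slow item does not stall the other workers.
        return list(pool.map(worker, items))
    # Leaving the with-block joins the pool: the "wait for all" step.
```

With max_workers=4 and 1000 items, only four threads are ever created, and no per-item event objects or manual thread counter are needed.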
