Seeking help with a multithreading (MT) design pattern

I have a queue of 1000 work items and an n-processor machine (assume n = 4). The main thread spawns n (= 4) worker threads at a time (25 outer iterations) and waits for all of them to complete before processing the next n (= 4) items, until the entire queue is processed:
for (i = 0 to queue.Length / numprocs)
{
    for (j = 0 to numprocs)
    {
        CreateThread(WorkerThread, WorkItem)
    }
    WaitForMultipleObjects(threadHandle[])
}
The work done by each worker thread is not homogeneous. Therefore, within one batch of n, if thread 1 spends 1000 s doing its work while the other 3 threads take only 1 s each, the design above is inefficient, because after 1 second the other 3 processors are idling. Besides, there is no pooling: 1000 distinct threads are being created over the run.
How do I use the NT thread pool (I am not familiar enough with it, hence the long-winded question) and QueueUserWorkItem to achieve the above? The following constraints should hold:
The main thread requires that all work items are processed before it can proceed, so I would think that a wait-all construct like the one above is required.
I want to create only as many threads as there are processors (i.e., not 1000 threads at a time).
I also don't want to create 1000 distinct events, pass them to the worker threads, and wait on all of them, whether via the QueueUserWorkItem API or otherwise.
Existing code is in C++. I prefer C++ because I don't know C#.
I suspect that the above is a very common pattern and was looking for input from you folks.

I'm not a C++ programmer, so I'll give you some half-way pseudocode for it:

tcount = 0
maxproc = 4

while queue_item = queue.get_next()   # depends on the queue implementation;
                                      # may well be: for i = 0; i < queue.length; i++
    while tcount == maxproc
        wait 0.1 seconds              # or some other interval that isn't as CPU
                                      # intensive as continuously running the loop
    tcount += 1                       # must be atomic: reading the value and writing
                                      # the new one must happen consecutively, without
                                      # interruption from other threads. Note that a
                                      # plain ++tcount is NOT atomic in C++; use an
                                      # interlocked or std::atomic increment.
    new thread(worker, queue_item)

function worker(item)
    # ...do stuff with item here...
    tcount -= 1                       # must be atomic as well
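Since the question asks specifically about QueueUserWorkItem, here is a minimal Win32 sketch (not a drop-in implementation) of the usual pattern: queue every item to the OS pool at once, let the pool size itself to the machine, and wait on a single event that the last item signals through an interlocked counter. WorkItem and ProcessWorkItem are stand-ins for the asker's own types, and error handling is omitted.

#include <windows.h>

struct Batch {
    LONG   remaining;  // work items still outstanding
    HANDLE done;       // signaled when remaining reaches zero
};

struct Job {
    Batch*    batch;
    WorkItem* item;    // WorkItem is the caller's own type (assumption)
};

DWORD WINAPI PoolWorker(void* p)
{
    Job* job = static_cast<Job*>(p);
    ProcessWorkItem(job->item);                      // the real work (assumed helper)
    if (InterlockedDecrement(&job->batch->remaining) == 0)
        SetEvent(job->batch->done);                  // the last finisher signals
    delete job;
    return 0;
}

void RunAll(WorkItem* items, int count)
{
    Batch batch;
    batch.remaining = count;
    batch.done = CreateEvent(NULL, TRUE, FALSE, NULL);   // manual-reset, unsignaled
    for (int i = 0; i < count; ++i)
        QueueUserWorkItem(PoolWorker, new Job{ &batch, &items[i] }, WT_EXECUTEDEFAULT);
    WaitForSingleObject(batch.done, INFINITE);           // the single wait-all
    CloseHandle(batch.done);
}

This satisfies all three constraints: the pool uses roughly one thread per processor, a slow item no longer holds up a whole batch of n, and there is exactly one event rather than 1000.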

Related

Computing c_i = sqrt(a_i * b_i) in parallel using nested parallelism

Let's say we have two vectors A = (a_i) and B = (b_i), each of size n, and we have to compute a new vector C = (c_i) as c_i = sqrt(a_i * b_i) for i = 1, ..., n.
Main question: what would be the best way to compute the c_i in parallel (using nested parallelism, i.e. using spawn and sync)?
I think the below understanding of the computation is correct:
for (i = 1 to n) {
    C[i] = Math.sqrt(A[i] * B[i]);
}
And is there any way to use parallel for loops to compute C in parallel? If so, I think the approach would be the following:
parallel for (i = 1 to n) {
    C[i] = Math.sqrt(A[i] * B[i]);
}
Is it correct?
Assuming that by best you mean fastest, the usual approach would be to divide A and B into chunks, spawn a separate thread to handle each chunk in parallel, and wait for all the threads to finish their tasks.
The optimal number of chunks for such a computation will most likely be the number of CPU cores on your machine. So the pseudocode would look like:
chunkSize = ceiling(n / numberOfCPUs)
for (t = 1 to numberOfCPUs) {
    startIndex = (t - 1) * chunkSize + 1
    size = min(chunkSize, C.size - startIndex + 1)
    threads.add(Thread.spawn(startIndex, size))
}
threads.join()
Where each thread, provided with its startIndex and size, computes:

for (i = startIndex to startIndex + size - 1) {
    C[i] = Math.sqrt(A[i] * B[i])
}
Another approach would be to have a pool of threads and give them a single shared queue of the indices 1, 2, ..., n. Each thread, on each iteration, polls the top index (call it i) and calculates C[i]. As soon as the queue is empty, the work is done. The problem here is that you need an additional synchronization mechanism to guarantee that every index is processed by exactly one thread. For some simple tasks (like yours) such a mechanism might consume more resources than the actual calculation, but for relatively long-running tasks it works pretty well.
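A minimal sketch of that shared-queue idea, assuming C++11: an atomic counter hands each thread the next free index, which gives the same "exactly one thread per index" guarantee with much cheaper synchronization than a locked queue:

#include <atomic>
#include <cmath>
#include <thread>
#include <vector>

void compute_shared_counter(const std::vector<double>& a,
                            const std::vector<double>& b,
                            std::vector<double>& c,
                            unsigned workers)
{
    std::atomic<std::size_t> next(0);    // the "queue" of indices, compressed
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);  // claim the next index
                if (i >= a.size()) break;                 // queue exhausted
                c[i] = std::sqrt(a[i] * b[i]);
            }
        });
    }
    for (auto& th : pool) th.join();
}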
There's also a hybrid approach, usually called work stealing: you break the initial set of tasks into chunks and provide each thread in the pool with its own chunk, but when a thread is done with its chunk, it starts 'stealing' tasks from other threads in order not to sit idle. On many real tasks it gives better results than either of the previous approaches.

Perl Queue and Threads abnormal exit

I am quite new to Perl, and especially to Perl threads.
I want to accomplish:
Have 5 threads that will enqueue data (random numbers) into a Thread::Queue.
Have 3 threads that will dequeue data from the Thread::Queue.
The complete code that I wrote in order to achieve the above:
#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;

my $queue = new Thread::Queue();
our @Enquing_threads;
our @Dequeuing_threads;

sub buildQueue
{
    my $TotalEntry = 1000;
    while ($TotalEntry-- > 0)
    {
        my $query = rand(10000);
        $queue->enqueue($query);
        print "Enque thread with TID " . threads->tid . " got $query,";
        print "Queue Size: " . $queue->pending . "\n";
    }
}

sub process_Queue
{
    my $query;
    while ($query = $queue->dequeue)
    {
        print "Dequeu thread with TID " . threads->tid . " got $query\n";
    }
}

push @Enquing_threads, threads->create(\&buildQueue) for 1..5;
push @Dequeuing_threads, threads->create(\&process_Queue) for 1..3;
Issues that I am facing:
The threads are not running as concurrently as expected.
The entire program exits abnormally with the following console output:
Perl exited with active threads:
8 running and unjoined
0 finished and unjoined
0 running and detached
Enque thread with TID 5 got 6646.13585023883,Queue Size: 595
Enque thread with TID 1 got 3573.84104215917,Queue Size: 595
Any help on code-optimization is appreciated.
This behaviour is to be expected: When the main thread exits, all other threads exit as well. If you don't care, you can $thread->detach them. Otherwise, you have to manually $thread->join them, which we'll do.
The $thread->join waits for the thread to complete, and fetches the return value (threads can return values just like subroutines, although the context (list/void/scalar) has to be fixed at spawn time).
We will detach the threads that enqueue data:
threads->create(\&buildQueue)->detach for 1..5;
Now for the dequeueing threads, we put them into a lexical variable (why are you using globals?), so that we can join them later:
my @dequeue_threads = map threads->create(\&process_Queue), 1 .. 3;
Then wait for them to complete:
$_->join for @dequeue_threads;
We know that the detached threads will finish execution before the program exits, because the only way for the dequeueing threads to exit is to exhaust the queue.
Except for one and a half bugs. You see, there is a difference between an empty queue and a finished queue. If the queue is just empty, the dequeueing threads will block on $queue->dequeue until they get some input. The traditional solution is to dequeue while the value they get is defined, and to break the loop by supplying as many undef values in the queue as there are threads reading from it. More modern versions of Thread::Queue have an end method that makes dequeue return undef for all subsequent calls.
The problem is when to end the queue. We should do this after all enqueueing threads have exited, which means we have to wait for them manually. Sigh.
my @enqueueing = map threads->create(\&enqueue), 1..5;
my @dequeueing = map threads->create(\&dequeue), 1..3;
$_->join for @enqueueing;
$queue->enqueue(undef) for 1..3;
$_->join for @dequeueing;
And in sub dequeue: while (defined( my $item = $queue->dequeue )) { ... }.
Using the defined test fixes another bug: rand can return zero, although this is quite unlikely and will slip through most tests. The contract of rand is that it returns a pseudo-random floating point number from zero inclusive up to, but excluding, some upper bound: a number from the interval [0, x). The bound defaults to 1.
If you don't want to join the enqueueing threads manually, you could use a semaphore to signal completion. A semaphore is a multithreading primitive that can be incremented and decremented, but never below zero. If a decrement operation would drop the count below zero, the call blocks until another thread raises the count. If the start count is 1, a semaphore can be used as a flag to lock a resource.
We can also start with the negative value 1 - $NUM_THREADS and have each thread increment the value, so that only when all threads have exited can it be decremented again.
use threads; # make a habit of importing `threads` as the first thing
use strict; use warnings;
use feature 'say';
use Thread::Queue;
use Thread::Semaphore;

use constant {
    NUM_ENQUEUE_THREADS => 5, # it's good to fix the thread counts early
    NUM_DEQUEUE_THREADS => 3,
};

sub enqueue {
    my ($out_queue, $finished_semaphore) = @_;
    my $tid = threads->tid;
    # iterate over ranges instead of using the while($maxval --> 0) idiom
    for (1 .. 1000) {
        $out_queue->enqueue(my $val = rand 10_000);
        say "Thread $tid enqueued $val";
    }
    $finished_semaphore->up;
    # try a non-blocking decrement. Returns true only for the last thread exiting.
    if ($finished_semaphore->down_nb) {
        $out_queue->end; # for sufficiently modern versions of Thread::Queue
        # $out_queue->enqueue(undef) for 1 .. NUM_DEQUEUE_THREADS;
    }
}

sub dequeue {
    my ($in_queue) = @_;
    my $tid = threads->tid;
    while (defined( my $item = $in_queue->dequeue )) {
        say "thread $tid dequeued $item";
    }
}

# create the queue and the semaphore
my $queue = Thread::Queue->new;
my $enqueuers_ended_semaphore = Thread::Semaphore->new(1 - NUM_ENQUEUE_THREADS);

# kick off the enqueueing threads -- they handle themselves
threads->create(\&enqueue, $queue, $enqueuers_ended_semaphore)->detach for 1 .. NUM_ENQUEUE_THREADS;

# start and join the dequeueing threads
my @dequeuers = map threads->create(\&dequeue, $queue), 1 .. NUM_DEQUEUE_THREADS;
$_->join for @dequeuers;
Don't be surprised if the threads do not seem to run in parallel, but sequentially: this task (enqueueing a random number) is very fast and is not well suited to multithreading (the enqueueing is more expensive than creating a random number).
Here is a sample run where each enqueuer only creates two values:
Thread 1 enqueued 6.39390993005694
Thread 1 enqueued 0.337993319585337
Thread 2 enqueued 4.34504733960242
Thread 2 enqueued 2.89158054485114
Thread 3 enqueued 9.4947585773571
Thread 3 enqueued 3.17079715055542
Thread 4 enqueued 8.86408863197179
Thread 5 enqueued 5.13654995317669
Thread 5 enqueued 4.2210886147538
Thread 4 enqueued 6.94064174636395
thread 6 dequeued 6.39390993005694
thread 6 dequeued 0.337993319585337
thread 6 dequeued 4.34504733960242
thread 6 dequeued 2.89158054485114
thread 6 dequeued 9.4947585773571
thread 6 dequeued 3.17079715055542
thread 6 dequeued 8.86408863197179
thread 6 dequeued 5.13654995317669
thread 6 dequeued 4.2210886147538
thread 6 dequeued 6.94064174636395
You can see that thread 5 managed to enqueue a few things before thread 4. Threads 7 and 8 don't get to dequeue anything; thread 6 is too fast. Also, all enqueuers finished before the dequeuers were spawned (for such a small number of inputs).

Design pattern for asynchronous while loop

I have a function that boils down to:
while (doWork)
{
    config = generateConfigurationForTesting();
    result = executeWork(config);
    doWork = isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread safe and independent of previous iterations, and that more iterations will probably be required than the maximum number of allowable threads?
The problem here is that we don't know in advance how many iterations are required, so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of the arbitrarily chosen values and the sleeping:
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number
dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

while (doWork)
{
    if (thread_count < max_threads)
    {
        dispatch_async(queue, ^{
            Config myconfig = generateConfigurationForTesting();
            Result myresult = executeWork(myconfig);
            dispatch_async(queue, ^{ checkResult(myresult); });
        });
        thread_count++;
    }
    else
        usleep(100); // don't consume too much CPU
}

void checkResult(Result value)
{
    if (value == good) doWork = false;
    thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique, or otherwise a generator that can produce a near-infinite number of configurations (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model that you've created, since your executor needs to be limited by some reasonable assumptions about the queue and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding value == good measurements.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option for making this look a little nicer would be to have checkResult add new elements to the queue if value != good. This way, you load up the initial elements of the queue using dispatch_apply(20, queue, ^{ ... }) and you don't need thread_count at all. The first 20 will be added using dispatch_apply (or whatever amount dispatch_apply feels is appropriate for your configuration), and then each time checkResult is called it can either set doWork = false or add another operation to the queue.
dispatch_apply() works for this: just pass ncpu as the number of iterations (apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() while doWork is still true).
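The same shape can also be sketched with standard C++ primitives instead of GCD, which may make the structure clearer (a sketch assuming C++11; Config, Result, and the three functions are the ones from the question):

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

void run_until_done()
{
    std::atomic<bool> doWork(true);
    const unsigned ncpu = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < ncpu; ++t) {
        workers.emplace_back([&doWork] {
            while (doWork.load(std::memory_order_relaxed)) {
                Config config = generateConfigurationForTesting();
                Result result = executeWork(config);
                if (!isDone(result))       // same sense as doWork = isDone(result)
                    doWork = false;        // every worker notices and winds down
            }
        });
    }
    for (auto& w : workers) w.join();      // returns once a worker clears doWork
}

There is no thread counter and no sleeping: the worker count is fixed at ncpu, and the atomic flag replaces the shared doWork/thread_count pair whose unsynchronized access the answer above warns about.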

Providing Concurrency Between Pthreads

I am working on multithreaded programming and I am stuck on something.
In my program there are two tasks and two types of robots for carrying out the tasks:
Task 1 requires any two robots, and
task 2 requires 2 robots of type robot1 and 2 robots of type robot2.
The total numbers of robot1 and robot2, and pointers to these two types, are given at initialization. Threads share these robots, and a robot stays reserved until a thread is done with it.
The actual task is done in the doTask1(robot **) function, which takes a pointer to a robot pointer as its parameter, so I need to pass the robots that I reserved. I want to provide concurrency; obviously, if I lock everything it will not be concurrent. robot1 is of type Robot ** and is used by all threads, so before one thread calls doTask1 or finishes, another can overwrite robot1, which changes things. I know this happens because robot1 is shared by all threads. Could you explain how I can solve this problem? I don't want to pass any arguments to the thread start routine.
rsc is my struct holding the numbers of robots and the pointers that are set in an initialization function.
void *task1(void *arg)
{
    int tid;
    tid = *((int *) arg);
    cout << "TASK 1 with thread id " << tid << endl;

    pthread_mutex_lock(&mutexUpdateRob);
    while (rsc->totalResources < 2)
    {
        pthread_cond_wait(&noResource, &mutexUpdateRob);
    }
    if (rsc->numOfRobotA > 0 && rsc->numOfRobotB > 0)
    {
        rsc->numOfRobotA--;
        rsc->numOfRobotB--;
        robot1[0] = &rsc->robotA[counterA];
        robot1[1] = &rsc->robotB[counterB];
        counterA++;
        counterB++;
        flag1 = true;
        rsc->totalResources -= 2;
    }
    pthread_mutex_unlock(&mutexUpdateRob);

    doTask1(robot1);

    pthread_mutex_lock(&mutexUpdateRob);
    if (flag1)
    {
        rsc->numOfRobotA++;
        rsc->numOfRobotB++;
        rsc->totalResources += 2;
    }
    if (rsc->totalResources >= 2)
    {
        pthread_cond_signal(&noResource);
    }
    pthread_mutex_unlock(&mutexUpdateRob);
    pthread_exit(NULL);
}
If robots are global resources, threads should not dispose of them; that should be the duty of the main thread's exit (or cleanup) function.
Also, there should be a way for threads to locate the robots unambiguously, and to lock their use.
The robot1 array seems to store the robots, and it seems to be a global array. However:
Its access is not protected by a mutex (pthread_mutex_t); it seems you have now taken care of that.
Also, the code in task1 always modifies entries 0 and 1 of this array. If two or more threads execute that code, the entries will be overwritten. I don't think that is what you want. How will that array be used afterwards?
In fact, why does this array need to be global?
The bottom line is this: as long as this array is shared by threads, they will have problems working concurrently. Think about it this way:
You have two companies using robots to work, but they're using the same truck (robot1) to move the robots around. How are these two companies supposed to function properly and efficiently with only one truck?
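A minimal sketch of the fix being suggested, under the assumption that reservations should live in a per-thread local array rather than the shared robot1 (the names Robot, rsc, mutexUpdateRob, noResource, and doTask1 are the question's; the indexing scheme is illustrative):

#include <pthread.h>

void *task1(void *arg)
{
    Robot *mine[2];                     // local, per-thread "truck"

    pthread_mutex_lock(&mutexUpdateRob);
    while (rsc->numOfRobotA < 1 || rsc->numOfRobotB < 1)
        pthread_cond_wait(&noResource, &mutexUpdateRob);
    mine[0] = &rsc->robotA[--rsc->numOfRobotA];   // reserve while holding the lock
    mine[1] = &rsc->robotB[--rsc->numOfRobotB];
    pthread_mutex_unlock(&mutexUpdateRob);

    doTask1(mine);                      // do the work outside the lock

    pthread_mutex_lock(&mutexUpdateRob);
    rsc->numOfRobotA++;                 // return the robots
    rsc->numOfRobotB++;
    pthread_cond_broadcast(&noResource);    // wake every waiting task
    pthread_mutex_unlock(&mutexUpdateRob);
    return NULL;
}

Because mine lives on each thread's own stack, two threads can hold reservations at the same time without overwriting each other, which is exactly what the shared global array prevented.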

Limit number of threads in Groovy

How can I limit the number of threads that are executing at the same time?
Here is sample of my algorithm:
for (i = 0; i < 100000; i++) {
    Thread.start {
        // Do some work
    }
}
I would like to make sure that once the number of threads in my application hits 100, the algorithm will pause/wait until the number of threads in the app goes below 100.
Currently "some work" takes some time to do, and I end up with a few thousand threads in my app. Eventually it runs out of threads and "some work" crashes. I would like to fix this by limiting the number of threads it can use at one time.
Please let me know how to solve my issue.
I believe you are looking for a ThreadPoolExecutor in the Java Concurrency API. The idea here is that you can define a maximum number of threads in a pool and then instead of starting new Threads with a Runnable, just let the ThreadPoolExecutor take care of managing the upper limit for Threads.
Start here: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
import java.util.concurrent.*;
import java.util.*;

def queue = new ArrayBlockingQueue<Runnable>( 50000 )
def tPool = new ThreadPoolExecutor(5, 500, 20, TimeUnit.SECONDS, queue);

for(i = 0; i < 5000; i++) {
    tPool.execute {
        println "Blah"
    }
}
Parameters for the ThreadPoolExecutor constructor: corePoolSize (5) is the number of threads to create and maintain even if the system is idle; maximumPoolSize (500) is the maximum number of threads to create; the 3rd and 4th arguments state that the pool should keep idle threads around for at least 20 seconds; and the queue argument is a blocking queue that stores queued tasks.
What you'll want to play around with is the queue size and also how to handle rejected tasks. If you need to execute 100k tasks, you'll either have to have a queue that can hold 100k tasks, or you'll need a strategy for handling rejected tasks.
