Most efficient way to spawn n pthreads with the same parameters in C - multithreading

I have 32 threads whose input parameters I know ahead of time; nothing changes inside the function (other than the memory buffer that each thread interacts with).
In pseudo-C code, this is my design pattern:
// declare 32 pthreads as global variables
void dispatch_32_threads() {
    for (int i = 0; i < 32; i++) {
        pthread_create(&thread_id[i], NULL, thread_function, (void*) thread_params[i]);
    }
    // wait until all 32 threads are finished
    for (int j = 0; j < 32; j++) {
        pthread_join(thread_id[j], NULL);
    }
}
int main(void) {
    // init 32 pthreads here
    for (int n = 0; n < 4000; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                dispatch_32_threads();
                // modify buffers here
            }
        }
    }
}
I am calling dispatch_32_threads 100*100*4000 = 40,000,000 times. thread_function and (void*) thread_params[i] do not change. I think pthread_create keeps creating and destroying threads. I have 32 cores, and none of them is at 100% utilization; it hovers around 12%. Moreover, when I reduce the number of threads to 10, all 32 cores remain at 5-7% utilization and I see no slowdown in runtime. Running fewer than 10 threads slows things down.
Running 1 thread, however, is extremely slow, so multithreading is helping. I profiled my code, and I know that thread_function is the slow part and that it is parallelizable. This leads me to believe that pthread_create keeps spawning and destroying threads on different cores, that beyond 10 threads I lose efficiency, and that the work in thread_function is in essence cheaper than spawning more than 10 threads.
Is this assessment correct? What is the best way to utilize 100% of all cores?

Thread creation is expensive. The cost depends on several parameters, but it is rarely below 1000 cycles, and thread synchronisation and destruction cost about the same. If the amount of work in your thread_function is not very high, this overhead will largely dominate the computation time.
It is rarely a good idea to create threads in inner loops. Probably the best approach is to create threads that each process a share of the outer loop's iterations. Depending on your program and on what thread_function does, there may be dependencies between iterations, and this may require some rewriting, but a solution could be:
int outer = 4000;
int nthreads = 32;
int perthread = outer / nthreads;   // 125 iterations per thread
// add an integer thread_id to the thread_params struct
void *thread_func(void *arg) {
    whatisrequired *params = (whatisrequired *) arg;
    // runs perthread iterations of the outer loop, beginning at start
    int start = params->thread_id * perthread;
    for (int n = start; n < start + perthread; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                // do the work
            }
        }
    }
    return NULL;
}
int main() {
    for (int i = 0; i < nthreads; i++) {
        thread_params[i]->thread_id = i;
        pthread_create(&thread_id[i], NULL, thread_func,
                       (void*) thread_params[i]);
    }
    // wait until all threads are finished
    for (int j = 0; j < nthreads; j++) {
        pthread_join(thread_id[j], NULL);
    }
}
With this kind of parallelization, you can consider using OpenMP. The parallel for directive lets you easily experiment to find the best parallelization scheme.
If there are dependencies and such an obvious parallelization is not possible, you can create the threads at program start and give them work through a thread pool. Managing queues is less expensive than creating threads (but atomic accesses do have a cost).
Edit: Alternatively, you can
1. put all your loops in the thread function,
2. add a barrier at the start (or the end) of the inner loop to synchronize your threads, which ensures that all threads have finished their job before any of them continues,
3. create all the threads in main and wait for completion.
Barriers are less expensive than thread creation, and the result will be identical.
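A minimal sketch of the barrier variant, written in C++11 with std::thread instead of pthreads (POSIX offers pthread_barrier_t for the same purpose). The Barrier class, run_rounds, and the per-round work are hypothetical stand-ins, not the asker's code. The threads are created once, and thread 0 plays the role of the "modify buffers here" step between rounds:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier built from a mutex and a condition variable.
class Barrier {
public:
    explicit Barrier(std::size_t count)
        : threshold_(count), waiting_(0), generation_(0) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t my_generation = generation_;
        if (++waiting_ == threshold_) {
            // Last thread to arrive: open a new generation and release the rest.
            ++generation_;
            waiting_ = 0;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t threshold_;
    std::size_t waiting_;
    std::size_t generation_;
};

// Hypothetical driver: each thread adds 1 to its own slot per round, then
// thread 0 alone accumulates all slots (the serial step) before the next round.
long run_rounds(int nthreads, int rounds) {
    std::vector<long> slots(static_cast<std::size_t>(nthreads), 0L);
    long total = 0;
    Barrier barrier(static_cast<std::size_t>(nthreads));
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&, t] {
            for (int r = 0; r < rounds; ++r) {
                slots[t] += 1;              // the per-round parallel work
                barrier.arrive_and_wait();  // every thread finished this round
                if (t == 0) {
                    for (long v : slots) total += v;  // serial step
                }
                barrier.arrive_and_wait();  // serial step done, next round may start
            }
        });
    }
    for (auto& th : threads) th.join();
    return total;
}
```

Two barrier waits per round are used here so that no thread starts the next round while thread 0 is still reading the slots; whether the second wait is needed depends on what the serial step touches.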

Related

barrier code and waiting for all threads to reach rendezvous and then enter critical section

semaphore mutex = 1;
semaphore barrier = 0;
int count = 0;
void barrier-done() {
    wait(mutex);
    count++;
    if (count < N ) {
        post(mutex);
        wait(barrier);
    }
    else {
        post(mutex);
        count = 0;
        for (int i = 1; i < N; i++) {
            post(barrier);
        }
    }
}
Does anyone know what the problem with this code is? I'm trying to implement a barrier.
Assume N is the number of threads you expect to wait at the barrier.
For example, with N=10, threads 1 to 9 will find the if condition true and will wait on barrier.
For the 10th thread to call this, the condition will be false, because 10 < 10 does not hold.
So it will go ahead and post barrier 9 times.
I am not sure of the exact situation you want to achieve, but this is what I understood from your code. You may need to tweak the if condition a bit.
I had the same issue. The problem is that you can't use a minus sign in a function name such as "barrier-done"; after fixing this bug, the code will be correct.
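For reference, the pseudocode could be made compilable roughly as follows. This is only a sketch under stated assumptions: the Semaphore class is a hypothetical stand-in for the wait/post primitive (C++20 provides std::counting_semaphore), N is fixed at 4, and the names mutex_sem, barrier_sem, and arrived_count are invented here. It also resets the counter while the mutex is still held, which is a guess at the intent and not something stated in the thread. As written it is a single-use rendezvous; reusing it while threads are still leaving would need a second phase.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counting semaphore offering the wait/post pair of the pseudocode.
class Semaphore {
public:
    explicit Semaphore(int initial) : value_(initial) {}
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return value_ > 0; });
        --value_;
    }
    void post() {
        std::lock_guard<std::mutex> lock(m_);
        ++value_;
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int value_;
};

const int N = 4;            // assumed number of threads expected at the barrier
Semaphore mutex_sem(1);     // protects arrived_count
Semaphore barrier_sem(0);   // early arrivals block here
int arrived_count = 0;

void barrier_done() {       // underscore instead of a minus sign
    mutex_sem.wait();
    ++arrived_count;
    if (arrived_count < N) {
        mutex_sem.post();
        barrier_sem.wait();
    } else {
        arrived_count = 0;  // reset while the mutex is still held
        mutex_sem.post();
        for (int i = 1; i < N; ++i) barrier_sem.post();  // wake the N-1 waiters
    }
}

// Returns true if no thread got past the barrier before all N had arrived.
bool barrier_holds() {
    std::atomic<int> arrived(0);
    std::atomic<bool> ok(true);
    std::vector<std::thread> threads;
    for (int t = 0; t < N; ++t) {
        threads.emplace_back([&] {
            ++arrived;
            barrier_done();
            if (arrived.load() != N) ok = false;
        });
    }
    for (auto& th : threads) th.join();
    return ok.load();
}
```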

How to run CPU and GPU function simultaneously using threads?

I have two functions which I want to run using threads.
1) CPU function, which I can join to thread using:
thread t1(vector_add, p->iNum1, p->iNum2, p->iNumAns, p->flag);
t1.join();
2) and a GPU kernel
vectorAdd_gpu <<<blocksPerGrid, threadsPerBlock >>>(s.a1, s.a2, s.a2, s.flag);
But my problem is how to launch the GPU kernel from a thread and join it, so that it runs simultaneously with the CPU function.
vectorAdd_gpu <<<blocksPerGrid, threadsPerBlock >>>(s.a1, s.a2, s.a2, s.flag);
thread t2(vectorAdd_gpu);
t2.join();
Is there any other way to run a CPU and a GPU function simultaneously using threads?
As talonmies said, put the kernel call into a lambda function:
auto myFunc = [&](){
    cudaStream_t stream2;
    cudaSetDevice(device2);
    cudaStreamCreate (&stream2);
    vectorAdd_gpu <<<blocksPerGrid, threadsPerBlock,0,stream2 >>>(s.a1, s.a2, s.a2, s.flag);
    cudaStreamSynchronize(stream2);
    cudaStreamDestroy(stream2);
};
Then give it to a thread:
thread t2(myFunc);
t2.join();
But instead of this, you can keep using the main thread of your application and use streams to run the GPU work asynchronously alongside the CPU work. I just showed what you wanted to see. Using a single thread asynchronously could be more efficient than re-creating streams and re-joining threads, depending on the size of the work. Re-joining may have more overhead than synchronizing and launching the kernel here. How many kernel calls do you make per second?
The following NVIDIA blog post, https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/, has a nice example of single-threaded asynchronous CUDA:
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset],
                    streamBytes, cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&a[offset], &d_a[offset],
                    streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
This is only one of several ways to do asynchronous stream overlapping.

How to join all threads before deleting the ThreadPool

I am using a MultiThreading class which creates the required number of threads in its own threadpool and deletes itself after use.
std::thread *m_pool; // number of threads according to available cores
std::mutex m_locker;
std::condition_variable m_condition;
std::atomic<bool> m_exit;
int m_processors;
m_pool = new std::thread[m_processors + 1];
void func()
{
    // code
}
for (int i = 0; i < m_processors; i++)
{
    m_pool[i] = std::thread(func);
}
void reset(void)
{
    {
        std::lock_guard<std::mutex> lock(m_locker);
        m_exit = true;
    }
    m_condition.notify_all();
    for (int i = 0; i <= m_processors; i++)
        m_pool[i].join();
    delete[] m_pool;
}
After running through all tasks, the for loop is supposed to join all running threads before delete[] is executed.
But there seems to be one last thread still running while m_pool no longer exists.
This leads to the problem that I can't close my program anymore.
Is there any way to check whether all threads are joined, or to wait for all threads to be joined, before deleting the thread pool?
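A simple typo bug, I think.
The loop condition i <= m_processors is a bug: the loop processes one entry more than the number of threads you started. This is an off-by-one error. Suppose m_processors is 2. You start threads only at indices [0] and [1], yet the loop also tries to join the item at index [2]. Your array is allocated with m_processors + 1 slots, so m_pool[2] exists, but it is a default-constructed std::thread that was never started and is therefore not joinable; calling join() on it throws std::system_error, and if that exception is not caught, std::terminate is called. (Had the array held only m_processors elements, the access would be past the end of the array and undefined behavior.) Either way, reset() never reaches delete[] cleanly.
You likely intended i < m_processors.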
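The behavior of join() on a never-started thread can be checked with a small sketch. Both function names below are hypothetical helpers written for illustration; the second one shows a defensive loop that joins only the slots that actually hold a thread:

```cpp
#include <system_error>
#include <thread>

// Returns true if join() on a default-constructed std::thread throws.
bool join_on_empty_thread_throws() {
    std::thread empty;                  // never started, so not joinable
    if (empty.joinable()) return false;
    try {
        empty.join();
    } catch (const std::system_error&) {
        return true;
    }
    return false;
}

// Joins only the slots that hold a started thread and reports how many.
int join_started(std::thread* pool, int slots) {
    int joined = 0;
    for (int i = 0; i < slots; ++i) {
        if (pool[i].joinable()) {
            pool[i].join();
            ++joined;
        }
    }
    return joined;
}
```

Checking joinable() hides the off-by-one instead of fixing it, so correcting the loop bound is still the better repair.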
The real source of the problem is addressed by Wick's answer. I will extend it with some tips that also solve your problem while improving other aspects of your code.
If you use C++11 for std::thread, then you shouldn't create your thread handles using operator new[]. There are better ways of doing that with other C++ constructs, which will make everything simpler and exception safe (you don't leak memory if an unexpected exception is thrown).
Store your thread objects in a std::vector. It will manage the memory allocation and deallocation for you (no more new and delete). You can use other more flexible containers such as std::list if you insert/delete threads dynamically.
Fill the vector in place with std::generate_n or similar:
std::vector<std::thread> m_pool;
m_pool.reserve(m_processors);
// Fill the vector
std::generate_n( std::back_inserter(m_pool), m_processors,
                 [](){ return std::thread(func); } );
Join all the elements using a range-for loop and delete the handles using the container's functions:
for( std::thread& t: m_pool ) {
    t.join();
}
m_pool.clear();
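Putting these tips together, a self-contained pool might look like the sketch below. MiniPool, worker, and finished are hypothetical names, and the workers do nothing except wait for the exit flag, so this only illustrates the shutdown sequence (set the flag under the lock, notify, join, clear), not task dispatch:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class MiniPool {
public:
    explicit MiniPool(unsigned n) : exit_(false), finished_(0) {
        workers_.reserve(n);
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker(); });
    }
    ~MiniPool() { reset(); }

    // Asks every worker to stop, then joins and drops all handles.
    void reset() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            exit_ = true;
        }
        cv_.notify_all();
        for (std::thread& t : workers_)
            if (t.joinable()) t.join();
        workers_.clear();
    }

    int finished() const { return finished_.load(); }

private:
    void worker() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return exit_; });  // sleep until shutdown
        ++finished_;
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    bool exit_;                 // guarded by mutex_
    std::atomic<int> finished_;
    std::vector<std::thread> workers_;
};
```

Because the vector holds exactly the threads that were started, the join loop cannot run past them, and calling reset() a second time (for example from the destructor) is harmless.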

ways to express concurrency without thread

I am wondering about how concurrency can be expressed without an explicit thread object, not the implementation, which probably would use threads or thread pools, but the language design related issues.
Q1: I wonder what would be lost if there was no thread object, what couldn't be done in such a language?
Q2: I also wonder about how this would be expressed, what ways were proposed or implemented as alternatives or complements to threads?
One possibility is the MPI programming model (GPUs work similarly).
Let's say you have the following code:
for(int i=0; i < 100; i++) {
    work(i);
}
The "normal" thread-based way would be to split the iteration range into multiple subsets, something like this:
Thread-1:
for(int i=0; i < 50; i++) {
    work(i);
}
Thread-2:
for(int i=50; i < 100; i++) {
    work(i);
}
However, in MPI/GPU programming you do something different.
The idea is that every core executes the same (GPU) or at least a similar (MPI) program. The difference is that each core uses a different ID, which changes the behavior of the code.
MPI style (not exactly MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int subset = 100 / size;
for (int i = rank * subset; i < (rank+1)*subset; i++) {
    // each core will use a different range for i
    work(i);
}
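The rank-based split can be tried out on a single machine by letting ordinary threads play the role of the cores. This is only an emulation sketch: run_spmd and the body of work are made up for illustration, and real MPI would obtain the rank and size from the library instead of a loop variable. Ceiling division is used so that the range is still fully covered when the iteration count is not a multiple of the number of ranks:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical work function: here simply the square of the index.
long work(int i) { return static_cast<long>(i) * i; }

// Every "rank" runs the same code and derives its own range from its id.
long run_spmd(int total, int size) {
    std::vector<long> partial(static_cast<std::size_t>(size), 0L);
    std::vector<std::thread> ranks;
    const int chunk = (total + size - 1) / size;  // ceiling division
    for (int rank = 0; rank < size; ++rank) {
        ranks.emplace_back([&, rank] {
            const int begin = std::min(total, rank * chunk);
            const int end = std::min(total, begin + chunk);
            for (int i = begin; i < end; ++i) partial[rank] += work(i);
        });
    }
    for (auto& t : ranks) t.join();
    long sum = 0;
    for (long p : partial) sum += p;  // plays the role of a gather on rank 0
    return sum;
}
```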
The next big thing is communication. With threads, you normally have to handle all of the synchronization manually. MPI is message-based, meaning that it is not perfectly suited to classical shared-memory models (where every core has access to the same memory), but in a cluster system (many cores connected by a network) it works excellently. This is not limited to supercomputers (which use essentially only MPI-style programming): in recent years a new type of core architecture (manycore) has been developed. These chips have a local so-called Network-on-Chip, so each core can send and receive messages without running into synchronization problems.
MPI offers not only simple messages but also higher-level constructs that automatically scatter data to and gather data from every core.
Example (again not MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int data[100];
int result;
int results[size];
if (rank == 0) { // master core only
    fill_with_stuff(data);
}
scatter(0, data); // core 0 will send the data content to all other cores
result = work(rank, data); // every core works on the same data
gather(0, result, results); // get all local results and store them in
                            // the results array of core 0
Another solution is the OpenMP library.
Here you declare parallel blocks; the whole threading part is done by the library itself.
Example:
// this will split the for loop automatically across 4 threads
#pragma omp parallel for num_threads(4)
for(int i=0; i < 100; i++) {
    work(i);
}
The big advantage is that it is fast to write. That's it.
You may get better performance by writing the threads on your own, but that takes a lot more time and knowledge about synchronization.
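When the loop iterations combine into a single result, OpenMP's reduction clause takes care of the synchronization as well. A small sketch follows; parallel_sum is a hypothetical function, and the pragma is guarded so that the code also builds, and then simply runs serially, when OpenMP is not enabled (for example without -fopenmp on GCC or Clang):

```cpp
// Sums 2*i over [0, n). With OpenMP enabled, each thread accumulates a private
// copy of sum and the copies are added together at the end of the loop.
long parallel_sum(int n) {
    long sum = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : sum)
#endif
    for (int i = 0; i < n; ++i) {
        sum += 2L * i;
    }
    return sum;
}
```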

What's the correct way of waiting for detached threads to finish?

Look at this sample code:
void OutputElement(int e, int delay)
{
this_thread::sleep_for(chrono::milliseconds(100 * delay));
cout << e << '\n';
}
void SleepSort(int v[], uint n)
{
for (uint i = 0 ; i < n ; ++i)
{
thread t(OutputElement, v[i], v[i]);
t.detach();
}
}
It starts n new threads and each one sleeps for some time before outputting a value and finishing. What's the correct/best/recommended way of waiting for all threads to finish in this case? I know how to work around this but I want to know what's the recommended multithreading tool/design that I should use in this situation (e.g. condition_variable, mutex etc...)?
And now for the slightly dissenting answer. And I do mean slightly because I mostly agree with the other answer and the comments that say "don't detach, instead join."
First imagine that there is no join(). And that you have to communicate among your threads with a mutex and condition_variable. This really isn't that hard nor complicated. And it allows an arbitrarily rich communication, which can be anything you want, as long as it is only communicated while the mutex is locked.
Now a very common idiom for such communication would simply be a state that says "I'm done". Child threads would set it, and the parent thread would wait on the condition_variable until the child said "I'm done." This idiom would in fact be so common as to deserve a convenience function that encapsulated the mutex, condition_variable and state.
join() is precisely this convenience function.
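To make the idiom concrete, here is a sketch of the "I'm done" protocol spelled out by hand. DoneSignal and run_and_wait are hypothetical names. The parent still calls join() at the end, but only after it has already learned of completion through the condition variable; the join is kept so that the std::thread object is destroyed safely and the signal outlives the child:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The state that join() would otherwise encapsulate.
struct DoneSignal {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

int run_and_wait() {
    DoneSignal signal;
    int result = 0;
    std::thread child([&] {
        result = 42;  // stand-in for the child's real work
        {
            std::lock_guard<std::mutex> lock(signal.m);
            signal.done = true;  // "I'm done"
        }
        signal.cv.notify_one();
    });
    {
        // The parent waits for the message instead of for thread exit.
        std::unique_lock<std::mutex> lock(signal.m);
        signal.cv.wait(lock, [&] { return signal.done; });
    }
    child.join();  // returns almost immediately; keeps object lifetimes safe
    return result;
}
```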
But imho one has to be careful. When one says: "Never detach, always join," that could be interpreted as: Never make your thread communication more complicated than "I'm done."
For a more complex interaction between parent thread and child thread, consider the case where a parent thread launches several child threads to go out and independently search for the solution to a problem. When a solution is first found by any thread, that gets communicated to the parent, and the parent can then take that solution and tell all the other threads that they don't need to search any more.
For example:
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <iterator>
#include <memory>
#include <mutex>
#include <random>
#include <thread>
#include <utility>
#include <vector>
void OneSearch(int id, std::shared_ptr<std::mutex> mut,
               std::shared_ptr<std::condition_variable> cv,
               int& state, int& solution)
{
    std::random_device seed;
    // std::mt19937_64 eng{seed()};
    std::mt19937_64 eng{static_cast<unsigned>(id)};
    std::uniform_int_distribution<> dist(0, 100000000);
    int test = 0;
    while (true)
    {
        for (int i = 0; i < 100000000; ++i)
        {
            ++test;
            if (dist(eng) == 999)
            {
                std::unique_lock<std::mutex> lk(*mut);
                if (state == -1)
                {
                    state = id;
                    solution = test;
                    cv->notify_one();
                }
                return;
            }
        }
        // every 100000000 tries, check whether another thread already won
        std::unique_lock<std::mutex> lk(*mut);
        if (state != -1)
            return;
    }
}
auto findSolution(int n)
{
    std::vector<std::thread> threads;
    auto mut = std::make_shared<std::mutex>();
    auto cv = std::make_shared<std::condition_variable>();
    int state = -1;
    int solution = -1;
    std::unique_lock<std::mutex> lk(*mut);
    for (int i = 0 ; i < n ; ++i)
        threads.push_back(std::thread(OneSearch, i, mut, cv,
                                      std::ref(state), std::ref(solution)));
    while (state == -1)
        cv->wait(lk);
    lk.unlock();
    for (auto& t : threads)
        t.join();
    return std::make_pair(state, solution);
}
int
main()
{
    auto p = findSolution(5);
    std::cout << '{' << p.first << ", " << p.second << "}\n";
}
Above I've created a "dummy problem" where a thread searches for how many times it needs to query a URNG until it comes up with the number 999. The parent thread puts 5 child threads to work on it. The child threads work for awhile, and then every once in a while, look up and see if any other thread has found the solution yet. If so, they quit, else they keep working. The main thread waits until solution is found, and then joins with all the child threads.
For me, using the bash time facility, this outputs:
$ time a.out
{3, 30235588}
real 0m4.884s
user 0m16.792s
sys 0m0.017s
But what if, instead of joining with all the threads, main detached those threads that had not yet found a solution? This might look like:
for (unsigned i = 0; i < n; ++i)
{
    if (i == state)
        threads[i].join();
    else
        threads[i].detach();
}
(in place of the t.join() loop from above). For me this now runs in 1.8 seconds, instead of the 4.9 seconds above. That is, the child threads are not checking with each other that often, so main just detaches the working threads and lets the OS bring them down. This is safe for this example because the child threads own everything they are touching. Nothing gets destructed out from under them. (Strictly speaking, state and solution are locals of findSolution passed by std::ref, so for that to hold in general they would need shared ownership too, for example via std::shared_ptr like mut and cv.)
One final iteration can be realized by noticing that even the thread that finds the solution doesn't need to be joined with. All of the threads could be detached. The code is actually much simpler:
auto findSolution(int n)
{
    auto mut = std::make_shared<std::mutex>();
    auto cv = std::make_shared<std::condition_variable>();
    int state = -1;
    int solution = -1;
    std::unique_lock<std::mutex> lk(*mut);
    for (int i = 0 ; i < n ; ++i)
        std::thread(OneSearch, i, mut, cv,
                    std::ref(state), std::ref(solution)).detach();
    while (state == -1)
        cv->wait(lk);
    return std::make_pair(state, solution);
}
And the performance remains at about 1.8 seconds.
There is still (sort of) an effective join with the solution-finding thread here. But it is accomplished with the condition_variable::wait instead of with join.
thread::join() is a convenience function for the very common idiom that your parent/child thread communication protocol is simply "I'm done." Prefer thread::join() in this common case as it is easier to read, and easier to write.
However don't unnecessarily constrain yourself to such a simple parent/child communication protocol. And don't be afraid to build your own richer protocol when the task at hand needs it. And in this case, thread::detach() will often make more sense. thread::detach() doesn't necessarily imply a fire-and-forget thread. It can simply mean that your communication protocol is more complex than "I'm done."
Don't detach, but instead join:
std::vector<std::thread> ts;
for (unsigned int i = 0; i != n; ++i)
    ts.emplace_back(OutputElement, v[i], v[i]);
for (auto & t : ts)
    t.join();
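A self-contained version of this join-based approach might look like the sketch below. The sleep_sort function is hypothetical: it collects the values into a vector instead of printing them so the result can be inspected, and it uses a 10 ms step per unit purely as an assumption. The order of the output depends on timing and is not guaranteed; what the joins do guarantee is that every thread has finished before the function returns.

```cpp
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> sleep_sort(const std::vector<int>& v) {
    std::vector<int> out;
    std::mutex out_mutex;  // protects out
    std::vector<std::thread> ts;
    ts.reserve(v.size());
    for (int e : v) {
        ts.emplace_back([&out, &out_mutex, e] {
            std::this_thread::sleep_for(std::chrono::milliseconds(10 * e));
            std::lock_guard<std::mutex> lock(out_mutex);
            out.push_back(e);
        });
    }
    // Joining here is what makes it safe for the threads to use out and out_mutex.
    for (auto& t : ts) t.join();
    return out;
}
```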
