How to run CPU and GPU functions simultaneously using threads? - multithreading

I have two functions which I want to run using threads.
1) A CPU function, which I can run on a thread like this:
thread t1(vector_add, p->iNum1, p->iNum2, p->iNumAns, p->flag);
t1.join();
2) and a GPU kernel:
vectorAdd_gpu <<<blocksPerGrid, threadsPerBlock >>>(s.a1, s.a2, s.a2, s.flag);
But my problem is how to launch the GPU kernel from a thread and join it, so that it can run simultaneously with the CPU function:
vectorAdd_gpu <<<blocksPerGrid, threadsPerBlock >>>(s.a1, s.a2, s.a2, s.flag);
thread t2(vectorAdd_gpu);
t2.join();
Is there any other way to run a CPU and a GPU function simultaneously using threads?

As talonmies said, put the kernel launch into a lambda function:
auto myFunc = [&]() {
    cudaStream_t stream2;
    cudaSetDevice(device2);
    cudaStreamCreate(&stream2);
    vectorAdd_gpu <<<blocksPerGrid, threadsPerBlock, 0, stream2 >>>(s.a1, s.a2, s.a2, s.flag);
    cudaStreamSynchronize(stream2);
    cudaStreamDestroy(stream2);
};
then give it to a thread:
thread t2(myFunc);
t2.join();
But instead of this, you can keep using the main thread of your application and issue the GPU work asynchronously on a stream while the CPU work runs; I only showed what you asked for. Using the same thread asynchronously can be more efficient than re-creating streams and re-joining threads, depending on the size of the work: re-joining a thread may cost more than launching the kernel and synchronizing the stream. How many kernel calls do you make per second?
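For example, a minimal sketch of that single-threaded version, reusing the kernel and arguments from the question (error checking omitted):
cudaStream_t stream;
cudaStreamCreate(&stream);
// the launch is asynchronous: control returns to the host immediately
vectorAdd_gpu <<<blocksPerGrid, threadsPerBlock, 0, stream >>>(s.a1, s.a2, s.a2, s.flag);
// the CPU function runs on the host while the kernel runs on the device
vector_add(p->iNum1, p->iNum2, p->iNumAns, p->flag);
// block only when the GPU result is actually needed
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);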
In the following post on the NVIDIA developer blog, https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/ there is a nice example of single-threaded asynchronous CUDA:
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &a[offset],
                    streamBytes, cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&a[offset], &d_a[offset],
                    streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
This is only one of several ways to do asynchronous stream overlapping.

Related

Most efficient way to spawn n pthreads with the same parameters in C

I have 32 threads whose input parameters I know ahead of time; nothing changes inside the function (other than the memory buffer that each thread interacts with).
In pseudo C code, this is my design pattern:
// declare 32 pthreads as global variables
void dispatch_32_threads() {
    for (int i = 0; i < 32; i++) {
        pthread_create(&thread_id[i], NULL, thread_function, (void*) thread_params[i]);
    }
    // wait until all 32 threads are finished
    for (int j = 0; j < 32; j++) {
        pthread_join(thread_id[j], NULL);
    }
}
int main (crap) {
    // init 32 pthreads here
    for (int n = 0; n < 4000; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                dispatch_32_threads();
                // modify buffers here
            }
        }
    }
}
I am calling dispatch_32_threads 100*100*4000 = 40,000,000 times. thread_function and (void*) thread_params[i] do not change. I think pthread_create keeps creating and destroying threads. I have 32 cores and none of them is at 100% utilization; they hover around 12%. Moreover, when I reduce the number of threads to 10, all 32 cores remain at 5-7% utilization, and I see no slowdown in runtime. Running fewer than 10 slows things down.
Running 1 thread, however, is extremely slow, so multithreading is helping. I profiled my code; I know it is thread_func that is slow, and thread_func is parallelizable. This leads me to believe that pthread_create keeps spawning and destroying threads on different cores, that after 10 threads I lose efficiency and it gets slower, and that thread_func is in essence "less complicated" than spawning more than 10 threads.
Is this assessment true? What is the best way to utilize 100% of all cores?
Thread creation is expensive. It depends on several parameters, but is rarely below 1000 cycles, and thread synchronisation and destruction cost about the same. If the amount of work in your thread_function is not very high, this overhead will largely dominate the computation time.
It is rarely a good idea to create threads in inner loops. Probably the best approach is to create threads that each process a share of the outer loop's iterations. Depending on your program and on what thread_function does, there may be dependencies between iterations, and this may require some rewriting, but a solution could be:
int outer = 4000;
int nthreads = 32;
int perthread = outer / nthreads;
// add an integer thread_id field to the thread_params struct
void *thread_func(void *arg) {
    whatisrequired *thread_param = (whatisrequired *) arg;
    // run perthread iterations of the outer loop, starting at this thread's offset
    int start = thread_param->thread_id * perthread;
    for (int n = start; n < start + perthread; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                // do the work
            }
        }
    }
    return NULL;
}
int main() {
    for (int i = 0; i < 32; i++) {
        thread_params[i]->thread_id = i;
        pthread_create(&thread_id[i], NULL, thread_func,
                       (void *) thread_params[i]);
    }
    // wait until all 32 threads are finished
    for (int j = 0; j < 32; j++) {
        pthread_join(thread_id[j], NULL);
    }
}
With this kind of parallelization, you can also consider using OpenMP. The parallel for directive will let you easily experiment with the best parallelization scheme.
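For instance, assuming the iterations are independent, the whole triple loop from the question can be handled by a single directive (a sketch; the inner comment stands in for your thread_function):
// one parallel region for the whole computation: threads are created once
// and the 4000 outer iterations are divided among them
#pragma omp parallel for num_threads(32) schedule(static)
for (int n = 0; n < 4000; n++) {
    for (int x = 0; x < 100; x++) {
        for (int y = 0; y < 100; y++) {
            // do the work previously dispatched to the 32 threads
        }
    }
}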
If there are dependencies and such an obvious parallelization is not possible, you can create the threads at program start and give them work through a thread pool. Managing a queue is less expensive than thread creation (though atomic accesses do have a cost).
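A minimal sketch of that idea, with the job queue reduced to a shared counter (illustrative only; a real pool would sleep on a condition variable while idle):
#include <pthread.h>

enum { NTHREADS = 32, NJOBS = 1000 };

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int next_job = 0;

void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        int job = (next_job < NJOBS) ? next_job++ : -1; // take the next job id
        pthread_mutex_unlock(&lock);
        if (job < 0) return NULL; // no work left: exit
        // ... do the actual work for 'job' here ...
    }
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL); // threads are created and joined only once
    return 0;
}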
Edit: Alternatively, you can
1. put all your loops in the thread function;
2. at the start (or the end) of the inner loop, add a barrier to synchronize your threads; this ensures that all threads have finished their job;
3. in main, create all the threads once and wait for their completion.
Barriers are less expensive than thread creation, and the result will be identical.
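A minimal sketch of that barrier approach (assuming each thread can work on its own slice of the buffers; the loop bounds mirror the question):
#include <pthread.h>
#include <stdint.h>

enum { NTHREADS = 32 };
pthread_barrier_t barrier; // initialized once in main()

void *thread_func(void *arg)
{
    int id = (int)(intptr_t)arg; // selects this thread's slice of the work
    for (int n = 0; n < 4000; n++)
        for (int x = 0; x < 100; x++)
            for (int y = 0; y < 100; y++) {
                // ... thread 'id' does its share of the work for (n, x, y) ...
                pthread_barrier_wait(&barrier); // all threads finished this step
                if (id == 0) { /* modify shared buffers here */ }
                pthread_barrier_wait(&barrier); // buffers ready for the next step
            }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, thread_func, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}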

How to join all threads before deleting the ThreadPool

I am using a MultiThreading class which creates the required number of threads in its own threadpool and deletes itself after use.
std::thread *m_pool; // number of threads according to available cores
std::mutex m_locker;
std::condition_variable m_condition;
std::atomic<bool> m_exit;
int m_processors;
m_pool = new std::thread[m_processors + 1];
void func()
{
    //code
}
for (int i = 0; i < m_processors; i++)
{
    m_pool[i] = std::thread(func);
}
void reset(void)
{
    {
        std::lock_guard<std::mutex> lock(m_locker);
        m_exit = true;
    }
    m_condition.notify_all();
    for (int i = 0; i <= m_processors; i++)
        m_pool[i].join();
    delete[] m_pool;
}
After running through all tasks, the for loop is supposed to join all running threads before delete[] is executed.
But there seems to be one last thread still running while m_pool no longer exists.
This leads to the problem that I can't close my program anymore.
Is there any way to check that all threads are joined, or to wait for all threads to be joined, before deleting the thread pool?
A simple off-by-one bug, I think.
Your loop with the condition i <= m_processors joins one more entry than was ever started. The array was allocated with m_processors + 1 elements, but the creation loop only assigned running threads to indices [0] through [m_processors - 1]. The element at index [m_processors] is a default-constructed std::thread; it is not joinable, and calling join() on it throws std::system_error.
You likely intended i < m_processors.
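That is, join exactly the threads that were created:
for (int i = 0; i < m_processors; i++)
    m_pool[i].join();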
The real source of the problem is addressed by Wick's answer. I will extend it with some tips that also solve your problem while improving other aspects of your code.
If you use C++11's std::thread, you shouldn't create your thread handles with operator new[]. There are better C++ constructs for this, which make everything simpler and exception safe (you don't leak memory if an unexpected exception is thrown).
Store your thread objects in a std::vector. It will manage the memory allocation and deallocation for you (no more new and delete). You can use a more flexible container such as std::list if you insert/delete threads dynamically.
Fill the vector in place with std::generate_n or similar:
std::vector<std::thread> m_pool;
m_pool.reserve(m_processors);
// fill the vector with running threads
std::generate_n(std::back_inserter(m_pool), m_processors,
                [](){ return std::thread(func); });
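Equivalently, a plain loop works too, since emplace_back constructs each std::thread in place:
for (int i = 0; i < m_processors; ++i)
    m_pool.emplace_back(func);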
Join all the elements with a range-based for loop, and let the container release the handles:
for (std::thread &t : m_pool) {
    t.join();
}
m_pool.clear();

OpenMP Multithreads becoming one thread

I am programming with OpenMP to learn about multithreading. Is it possible for any thread (any of the 11 in this case) to reach the return statement at the end while some threads are still working in the for loop? Or do they become one master thread again after line 13?
int np, iam;
#pragma omp parallel private(np, iam) num_threads(11)
{
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    #pragma omp for
    for (int i = 2; i < 100; i++) {
        std::cout << i;
        doStuff(i);
    }
} // line 13: end of the parallel region
// synchronize necessary?
return 0;
There is an implicit barrier at the end of the parallel construct, so no extra synchronization is necessary. Any further code is executed only by the master thread (the one that had thread_num == 0 within the parallel region), and only after all threads have reached the end of the parallel region.
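A quick way to see this (a minimal, self-contained sketch):
#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel num_threads(11)
    {
        std::printf("worker %d running\n", omp_get_thread_num());
    } // implicit barrier: no thread continues past this brace early
    // from here on, only the initial (master) thread executes
    std::printf("after region: thread %d\n", omp_get_thread_num()); // prints 0
    return 0;
}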

ways to express concurrency without thread

I am wondering how concurrency can be expressed without an explicit thread object: not the implementation, which would probably use threads or thread pools, but the related language-design issues.
Q1: What would be lost if there were no thread object? What couldn't be done in such a language?
Q2: How would this be expressed? What ways have been proposed or implemented as alternatives or complements to threads?
One possibility is the MPI programming model (this applies to GPUs as well).
Let's say you have the following code:
for (int i = 0; i < 100; i++) {
    work(i);
}
The "normal" thread-based way would be to separate the iteration range into multiple subsets, something like this:
Thread-1:
for (int i = 0; i < 50; i++) {
    work(i);
}
Thread-2:
for (int i = 50; i < 100; i++) {
    work(i);
}
In MPI/GPU, however, you do something different: every core executes the same (GPU) or at least a similar (MPI) program, but each core uses a different ID, which changes the behavior of the code.
MPI style (not exactly the MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int subset = 100 / size;
// each core uses a different range for i
for (int i = rank * subset; i < (rank + 1) * subset; i++) {
    work(i);
}
The next big thing is communication. Normally you need to handle all of the synchronization manually. MPI is message-based, which means it is not perfectly suited to classical shared-memory models (where every core has access to the same memory), but in cluster systems (many cores combined by a network) it works excellently. This is not limited to supercomputers (which basically run only MPI-style software): in recent years a new type of core architecture (manycore) has been developed, with a local Network-on-Chip, so each core can send and receive messages without the usual synchronization problems.
MPI offers not only simple messages, but also higher-level constructs that automatically scatter and gather data to and from every core.
Example (again, not the exact MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int data[100];
int result;
int results[size];
if (rank == 0) { // master core only
    fill_with_stuff(data);
}
scatter(0, data);           // core 0 sends the data contents to all other cores
result = work(rank, data);  // every core works on the same data
gather(0, result, results); // collect all local results into the
                            // results array on core 0
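For reference, a sketch of what this looks like with the real MPI calls (fill_with_stuff and work are stand-ins for the pseudocode's helpers; assumes size divides the data evenly):
#include <mpi.h>
#include <vector>

void fill_with_stuff(std::vector<int> &d) // stand-in for the pseudocode helper
{ for (int i = 0; i < (int)d.size(); i++) d[i] = i; }

int work(int rank, const std::vector<int> &d) // stand-in for the per-core work
{ int s = 0; for (int v : d) s += v; return s + rank; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 100; // assume size divides N evenly in this sketch
    int chunk = N / size;
    std::vector<int> data(N), local(chunk), results(size);
    if (rank == 0) fill_with_stuff(data); // master core fills the data

    // rank 0 distributes one chunk to every rank (including itself)
    MPI_Scatter(data.data(), chunk, MPI_INT,
                local.data(), chunk, MPI_INT, 0, MPI_COMM_WORLD);
    int result = work(rank, local);
    // collect one int from every rank into results[] on rank 0
    MPI_Gather(&result, 1, MPI_INT,
               results.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}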
Another solution is the OpenMP library.
There you declare parallel blocks, and the whole threading part is done by the library itself.
Example:
// this will split the for loop automatically across 4 threads
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 100; i++) {
    work(i);
}
The big advantage is that it is fast to write. That's it.
You may get better performance by writing the threads on your own, but it takes a lot more time and knowledge about synchronization.

Non-repeatable affinity for pthreads

I am trying to measure the time it takes for a thread to actually start, from its creation.
I'm using POSIX threads on a Debian 6.0 machine with 32 cores (no hyper-threading) and calling pthread_attr_setaffinity_np to set the affinity.
In a loop, I am creating the threads and waiting for them to finish, repeatedly.
So my code looks like the following (thread 0 is running this):
for (ni = 0; ni < n; ni++)
{
    pthread_t *thrds;
    pthread_attr_t attr;
    cpu_set_t cpuset;
    ths = 1; // thread starts from 1
    thrds = malloc(sizeof(pthread_t) * nt); // thrds[0] not used
    assert(!pthread_attr_init(&attr));
    for (i = ths; i < nt; i++)
    {
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset); // setting i as the affinity for thread i
        assert(!pthread_attr_setaffinity_np(&attr,
                                            sizeof(cpu_set_t), &cpuset));
        assert(!pthread_create(thrds + i, &attr, DoWork, (void *)(intptr_t)i));
    }
    pthread_attr_destroy(&attr);
    DoWork(0);
    for (i = ths; i < nt; i++)
    {
        pthread_join(thrds[i], NULL);
    }
    if (thrds) free(thrds);
}
Inside the thread function, I am calling sched_getcpu() to verify that the affinity is working. The problem is that this verification only passes on the first iteration of the outer ni-loop. On the second iteration, thrds[1] gets the affinity of nt-1 (instead of 1), and so on.
Can anyone please explain why? And/or how to fix it?
NOTE: I found a workaround: if I put the master thread to sleep for one second after the joins finish at each iteration, the affinity works correctly. But this sleep duration could be different on other machines, so a real fix for the issue is still needed.
