Enforcing progress thread when overlapping communication and computation - multithreading

This question originates from here: Overlapping communication and computation taking 2.1 times as much time.
I've implemented Cannon's algorithm, which performs distributed-memory tensor-matrix multiplication. While doing so, I thought it would be clever to hide the communication latency by overlapping computation and communication.
I then started micro-benchmarking the components, i.e. the communication, the computation, and the overlapped communication and computation, and something funny came out of it: the overlapped operation takes 2.1 times as long as the slower of the two individual operations. Sending a single message alone took 521639 us, computing on the (same-sized) data alone took 340435 us, but overlapping them took 1111500 us.
After numerous test runs involving independent data buffers, and valuable input in the form of comments, I have come to the conclusion that the problem is caused by MPI's weird concept of progress.
The following is the desired behaviour:
only the thread identified by COMM_THREAD handles the communication and
all the other threads perform the computation.
If the above behaviour can be enforced, then in the above example I would expect the overlapped operation to take ~521639 us.
Information:
The MPI implementation is by Intel as part of OneAPI v2021.6.0.
A single compute node has 2 sockets of Intel Xeon Platinum 8168 (2x 24 = 48 cores).
SMT is not used, i.e. each thread is pinned to its own physical core.
Data is initialized before each experiment and mapped to the corresponding (NUMA) memory nodes as required by the computation.
Benchmarking runs are preceded by 10 warm-up runs.
In the given example, the tensor has size N=600, i.e. 600^3 data points. However, the same behaviour was observed for smaller sizes as well.
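For context, a pattern like the one below requires MPI to be initialized with thread support: since only COMM_THREAD (OpenMP thread 0, i.e. the thread that also calls MPI_Init_thread) ever issues MPI calls, MPI_THREAD_FUNNELED is the minimum required level. A minimal sketch of that setup (argc/argv as received by main):
int provided = MPI_THREAD_SINGLE;
// multiple threads exist, but only the initializing (master) thread calls MPI
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED)
{
    throw std::runtime_error("mpi_insufficient_thread_support");
}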
What I've tried:
Variant 1: just making asynchronous (non-blocking) calls in the overlap:
// ...
#define COMM_THREAD 0
// ...
#pragma omp parallel
{
    if (omp_get_thread_num() == COMM_THREAD)
    {
        // perform the comms.
        auto requests = std::array<MPI_Request, 4>{};
        const auto r1 = MPI_Irecv(tens_recv_buffer_->data(), 2 * tens_recv_buffer_->size(), MPI_DOUBLE,
                                  src_proc_id_tens_, 2, MPI_COMM_WORLD, &requests[0]);
        const auto s1 = MPI_Isend(tens_send_buffer_->data(), 2 * tens_send_buffer_->size(), MPI_DOUBLE,
                                  target_proc_id_tens_, 2, MPI_COMM_WORLD, &requests[1]);
        const auto r2 = MPI_Irecv(mat_recv_buffer_->data(), 2 * mat_recv_buffer_->size(), MPI_DOUBLE,
                                  src_proc_id_mat_, 3, MPI_COMM_WORLD, &requests[2]);
        const auto s2 = MPI_Isend(mat_send_buffer_->data(), 2 * mat_send_buffer_->size(), MPI_DOUBLE,
                                  target_proc_id_mat_, 3, MPI_COMM_WORLD, &requests[3]);
        if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
        {
            throw std::runtime_error("tensor_matrix_mult_mpi_sendrecv_error");
        }
        if (MPI_SUCCESS != MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE))
        {
            throw std::runtime_error("tensor_matrix_mult_mpi_waitall_error");
        }
    }
    else
    {
        const auto work_indices = schedule_thread_work(tens_recv_buffer_->get_n1(), 1);
        shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
    }
}
Variant 2: trying manual progression:
if (omp_get_thread_num() == COMM_THREAD)
{
    // perform the comms.
    auto requests = std::array<MPI_Request, 4>{};
    const auto r1 = MPI_Irecv(tens_recv_buffer_->data(), 2 * tens_recv_buffer_->size(), MPI_DOUBLE,
                              src_proc_id_tens_, 2, MPI_COMM_WORLD, &requests[0]);
    const auto s1 = MPI_Isend(tens_send_buffer_->data(), 2 * tens_send_buffer_->size(), MPI_DOUBLE,
                              target_proc_id_tens_, 2, MPI_COMM_WORLD, &requests[1]);
    const auto r2 = MPI_Irecv(mat_recv_buffer_->data(), 2 * mat_recv_buffer_->size(), MPI_DOUBLE,
                              src_proc_id_mat_, 3, MPI_COMM_WORLD, &requests[2]);
    const auto s2 = MPI_Isend(mat_send_buffer_->data(), 2 * mat_send_buffer_->size(), MPI_DOUBLE,
                              target_proc_id_mat_, 3, MPI_COMM_WORLD, &requests[3]);
    if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
    {
        throw std::runtime_error("tensor_matrix_mult_mpi_sendrecv_error");
    }
    // custom wait-all to ensure COMM_THREAD makes progress happen
    auto comm_done = std::array<int, 4>{0, 0, 0, 0};
    auto all_comm_done = false;
    while (!all_comm_done)
    {
        auto open_comms = 0;
        for (auto request_index = std::size_t{}; request_index < requests.size(); ++request_index)
        {
            if (comm_done[request_index])
            {
                continue;
            }
            MPI_Test(&requests[request_index], &comm_done[request_index], MPI_STATUS_IGNORE);
            ++open_comms;
        }
        all_comm_done = open_comms == 0;
    }
}
else
{
    const auto work_indices = schedule_thread_work(tens_recv_buffer_->get_n1(), 1);
    shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
}
Variant 3: using the environment variables for asynchronous progress control documented here: https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-async-progress-control.html in my job script:
export I_MPI_ASYNC_PROGRESS=1 I_MPI_ASYNC_PROGRESS_THREADS=1 I_MPI_ASYNC_PROGRESS_PIN="0"
and then running the code in variant 1.
All of the above attempts have resulted in the same undesirable behaviour.
Question: How can I force only COMM_THREAD to participate in MPI progression?
Any thoughts, suggestions, speculations and ideas will be greatly appreciated. Thanks in advance.
Note: although the buffers tens_send_buffer_ and mat_send_buffer_ are accessed concurrently during the overlap, this is read-only access.

Related

Linux kernel function set_user_nice

I have a homework assignment where we need to add some features to the Linux kernel, and we're working on Red Hat's 2.4.18 kernel.
I looked in sched.c, at the function set_user_nice:
void set_user_nice(task_t *p, long nice)
{
    unsigned long flags;
    prio_array_t *array;
    runqueue_t *rq;

    if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
        return;
    /*
     * We have to be careful, if called from sys_setpriority(),
     * the task might be in the middle of scheduling on another CPU.
     */
    rq = task_rq_lock(p, &flags);
    if (rt_task(p)) {
        p->static_prio = NICE_TO_PRIO(nice);
        goto out_unlock;
    }
    array = p->array;
    if (array)
        dequeue_task(p, array);
    p->static_prio = NICE_TO_PRIO(nice);
    p->prio = NICE_TO_PRIO(nice);
    if (array) {
        enqueue_task(p, array);
        /*
         * If the task is running and lowered its priority,
         * or increased its priority then reschedule its CPU:
         */
        if ((NICE_TO_PRIO(nice) < p->static_prio) || (p == rq->curr))
            resched_task(rq->curr);
    }
out_unlock:
    task_rq_unlock(rq, &flags);
}
I don't understand what exactly the code checks in the last if statement, because a few lines above it we have this line:
p->static_prio = NICE_TO_PRIO(nice);
and then, in the if statement, we check:
(NICE_TO_PRIO(nice) < p->static_prio)
Am I missing something?
Thanks
OK, so I looked for this function in a newer source tree, and it is now implemented in kernel/sched/core.c.
The part I was talking about:
old_prio = p->prio;
p->prio = effective_prio(p);
delta = p->prio - old_prio;

if (queued) {
    enqueue_task(rq, p, ENQUEUE_RESTORE);
    /*
     * If the task increased its priority or is running and
     * lowered its priority, then reschedule its CPU:
     */
    if (delta < 0 || (delta > 0 && task_running(rq, p)))
        resched_curr(rq);
}
out_unlock:
So it does seem like the diff between the old and the new priority is now calculated properly.
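In other words, the 2.4.18 check compares NICE_TO_PRIO(nice) against a static_prio that was overwritten with that very value a few lines earlier, so the left-hand side of the || can never be true. A sketch of what the old code apparently intended (mirroring the newer kernel, purely illustrative):
/* save the old priority before overwriting it, then compare against it */
int old_prio = p->static_prio;

p->static_prio = NICE_TO_PRIO(nice);
p->prio = NICE_TO_PRIO(nice);
/* reschedule if the priority increased (lower value) or the task is running */
if (NICE_TO_PRIO(nice) < old_prio || p == rq->curr)
    resched_task(rq->curr);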
Is 2.4.18 the kernel version? I'm looking at that source and don't see set_user_nice in sched.c.
Anyway, I think they're handling a race condition there. It's possible that, between setting the new process priority and this check, the process itself has changed its priority. So they're checking whether that's the case and rescheduling the task if so.

Shared counter using combining tree deadlock issue

I am working on a shared counter increment application using the combining tree concept. My goal is to make this application work on 2^n cores, such as 4, 8, 16, 32, etc. This algorithm might err on any thread failure; the assumption is that there are no thread failures and no very slow threads.
Two threads compete at leaf nodes and the latter one arriving goes up the tree.
The first one that arrives waits until the second one goes up the hierarchy and comes down with the correct return value.
The second thread wakes the first thread up.
Each thread gets the correct fetchAndAdd value.
But this algorithm sometimes gets stuck inside the while (nodes[index].isActive == 1) or while (nodes[index].waiting == 1) loop. I don't see any possibility of a deadlock, because only two threads are competing at each node. Could you enlighten me on this problem?
int increment(int threadId, int index, int value) {
    int lastValue = __sync_fetch_and_add(&nodes[index].firstValue, value);
    if (index == 0) return lastValue;
    while (nodes[index].isActive == 1) {
    }
    if (lastValue == 0) {
        while (nodes[index].waiting == 1) {
        }
        nodes[index].waiting = 1;
        nodes[index].isActive = 0;
    } else {
        nodes[index].isActive = 1;
        nodes[index].result = increment(threadId, (index - 1) / 2, nodes[index].firstValue);
        nodes[index].firstValue = 0;
        nodes[index].waiting = 0;
    }
    return nodes[index].result + lastValue;
}
I don't think that will work on 1 core. You loop forever on isActive, because you can never set isActive to 0 unless it is already 0.
I'm not sure if your code has a mechanism to stop this, but here's my best crack at it. Here is an interleaving of threads that causes the problem:
ex)
thread 1: nodes[10].isActive = 1
thread 2: // next run on index 10
thread 2: while (nodes[10].isActive == 1) { /* here is the deadlock */ }
It's hard to understand exactly what's going on here and what you're trying to do, but somehow you need to be able to deactivate nodes[index].isActive. You may want to set it to 0 at the end of the function.
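Separately, if isActive and waiting are plain ints (the struct definition isn't shown, so this is an assumption), the compiler may keep their values cached in registers and never re-read them inside the empty spin loops. A sketch using the GCC/Clang __atomic builtins (the posted __sync_fetch_and_add suggests they're available):
/* spin with an acquire load so the partner thread's store becomes visible */
static void spin_while_one(int *flag)
{
    while (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 1) {
        /* busy-wait */
    }
}

/* publish a flag update with a release store */
static void set_flag(int *flag, int value)
{
    __atomic_store_n(flag, value, __ATOMIC_RELEASE);
}

/* e.g. replace `while (nodes[index].isActive == 1) {}` with
 * `spin_while_one(&nodes[index].isActive);`, and
 * `nodes[index].isActive = 0;` with `set_flag(&nodes[index].isActive, 0);` */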

Multithreading in Windows Forms

I wrote an app, a Caesar cipher in Windows Forms (C++/CLI), with dynamic-link libraries (in C++ and in ASM) containing my enciphering and deciphering algorithms for the model. That part of my app is working.
There is also multithreading via Windows Forms. The user can choose the number of threads (1-64). If they choose 2, the message to encipher (decipher) is divided into two substrings, which are handed to two threads. I want to execute these threads in parallel and thereby reduce the execution time.
When the user pushes the encipher or decipher button, the enciphered or deciphered text is displayed along with the execution times of the C++ and ASM functions. Everything works, except that the times for more than 1 thread aren't smaller, they are bigger.
Here is some code:
/* Function which concatenates the substrings processed by the threads */
array<String^>^ ThreadEncipherFuncCpp(int nThreads, string str2)
{
    // array of per-thread results
    array<String^>^ arrayOfThreads = gcnew array<String^>(nThreads);
    // holds the n-th part of the message to be processed
    string loopSubstring;
    // length of one substring of the message
    int numberOfSubstring = str2.length() / nThreads;
    int isModulo = str2.length() % nThreads;
    array<Thread^>^ xThread = gcnew array<Thread^>(nThreads);
    for (int i = 0; i < nThreads; i++)
    {
        if (i == 0 && numberOfSubstring != 0)
            loopSubstring = str2.substr(0, numberOfSubstring);
        else if ((i == nThreads - 1) && numberOfSubstring != 0) {
            if (isModulo != 0)
                loopSubstring = str2.substr(numberOfSubstring * i, numberOfSubstring + isModulo);
            else
                loopSubstring = str2.substr(numberOfSubstring * i, numberOfSubstring);
        }
        else if (numberOfSubstring == 0) {
            loopSubstring = str2.substr(0, isModulo);
            i = nThreads - 1;
        }
        else
            loopSubstring = str2.substr(numberOfSubstring * i, numberOfSubstring);
        ThreadExample::inputString = gcnew String(loopSubstring.c_str());
        xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
        xThread[i]->Start();
        xThread[i]->Join();
        arrayOfThreads[i] = ThreadExample::outputString;
    }
    return arrayOfThreads;
}
Here is a fragment which is responsible for the calculation of the time for C++:
/*****************C++***************/
auto start = chrono::high_resolution_clock::now();
array<String^>^ arrayOfThreads = ThreadEncipherFuncCpp(nThreads, str2);
auto elapsed = chrono::high_resolution_clock::now() - start;
// note: this duration is in microseconds, not milliseconds
long long micros = chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
this->label4->Text = Convert::ToString(micros) + " microseconds";
String^ str4 = String::Concat(arrayOfThreads);
this->textBox2->Text = str4;
/**********************************/
And an example of it working:
For the input data "Some example text. Some example text2."
the program displays: "Vrph hadpsoh whaw. Vrph hadpsoh whaw2."
Times of execution for 1 thread:
C++ time: 31231us.
Asm time: 31212us.
Times of execution for 2 threads:
C++ time: 62488us.
Asm time: 62505us.
Times of execution for 4 threads:
C++ time: 140254us.
Asm time: 124587us.
Times of execution for 32 threads:
C++ time: 1002548us.
Asm time: 1000020us.
How can I solve this problem?
I need this program structure; this is an academic project.
My CPU has 4 cores.
The reason it's not going any faster is because you aren't letting your threads run in parallel.
xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
xThread[i]->Start();
xThread[i]->Join();
These three lines create the thread, start it running, and then wait for it to finish. You're not getting any parallelism here, you're just adding the overhead of spawning & waiting for threads.
If you want to have a speedup from multithreading, the way to do it is to start all the threads at once, let them all run, and then collect up the results.
In this case, I'd make it so that ThreadEncipher (which you haven't shown us the source of, so I'm making assumptions) takes a parameter, which is used as an array index. Instead of having ThreadEncipher read from inputString and write to outputString, have it read from & write to one index of an array. That way, each thread can read & write at the same time. After you've spawned all these threads, then you can wait for all of them to finish, and you can either process the output array, or since array<String^>^ is already your return type, just return it as-is.
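Sketched out (C++/CLI; a ParameterizedThreadStart-compatible ThreadEncipher(Object^) worker is an assumption, since the original isn't shown):
// start every thread first, join afterwards; each worker uses its index
// to read its own input slot and write its own output slot
array<Thread^>^ xThread = gcnew array<Thread^>(nThreads);
for (int i = 0; i < nThreads; i++)
{
    xThread[i] = gcnew Thread(gcnew ParameterizedThreadStart(&ThreadExample::ThreadEncipher));
    xThread[i]->Start(i);  // boxed index identifies this thread's slot
}
for (int i = 0; i < nThreads; i++)
{
    xThread[i]->Join();    // wait only after all threads are running
}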
Other thoughts:
You've got a mix of unmanaged and managed objects here. It will be better if you pick one and stick with it. Since you're in C++/CLI, I'd recommend that you stick with the managed objects. I'd stop using std::string, and use System::String^ exclusively.
Since your CPU has 4 cores, you're not going to get any speedup by using more than 4 threads. Don't be surprised when 32 threads takes longer than 4, because you're doing 8x the string manipulation, and you've got 32 threads fighting over 4 processor cores.
Your string splitting code is more complex than it needs to be. You've got five different cases in there; I'd have to sit down and think for a while to be sure it's correct. Try this:
int totalLen = str2->Length;
for (int i = 0; i < nThreads; i++)
{
    int startIndex = totalLen * i / nThreads;
    int endIndex = totalLen * (i + 1) / nThreads;
    int substrLen = endIndex - startIndex;
    String^ substr = str2->Substring(startIndex, substrLen);
    ...
}

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and, at the end, have 1 thread compute the total sum.
In pseudo code, I would like to able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added an extra kernel argument for the output. Added a dot product to sum over the float4 vectors.
__kernel void reduction_vector(__global float4* data, __local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);

    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = group_size/2; i > 0; i >>= 1) {
        if (lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
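On the host, the final step is then just adding up one partial sum per work-group; a sketch (C++; the queue, buffer, and group-count names are placeholders):
#include <CL/cl.h>
#include <numeric>
#include <vector>

// read back one float per work-group and finish the reduction serially
float final_sum(cl_command_queue queue, cl_mem output_buffer, size_t num_groups)
{
    std::vector<float> partial(num_groups);
    clEnqueueReadBuffer(queue, output_buffer, CL_TRUE /* blocking */, 0,
                        num_groups * sizeof(float), partial.data(),
                        0, nullptr, nullptr);
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}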
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both its global memory use and its kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct: that algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head over this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also that it's tricky, if not impossible, to debug on the GPU. Eventually, though, I found an answer that worked.
This is a combination of the answer by Bruce and the one from Adam. It copies the source from global memory into local memory, but then reduces by repeatedly folding the top half onto the bottom half, until no data is left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU; alternatively, call it from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank conflict issues by reading from local memory essentially sequentially. Would love confirmation on that from anyone who knows.
Note: the global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code, where it's used as part of an array index.
__kernel void Sum(__global float* A, __global float* output, ulong AOffset, __local float* target) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId + AOffset];

    barrier(CLK_LOCAL_MEM_FENCE);

    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize > 0) {
        if (localId < halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize * 2) < blockSize) { // uneven block division
                if (localId == 0) { // when localId==0
                    target[localId] += target[localId + (blockSize - 1)];
                }
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId == 0) {
        output[get_group_id(0)] = target[0];
    }
}
https://pastebin.com/xN4yQ28N
You can use the new work_group_reduce_add() function for sum reduction inside a single work-group if you have support for OpenCL C 2.0 features.
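A sketch of what that looks like (build with -cl-std=CL2.0; it still produces one partial sum per work-group, which has to be combined as above):
__kernel void reduce_wg(__global const float* data, __global float* output)
{
    // every work-item in the group receives the group-wide sum
    float sum = work_group_reduce_add(data[get_global_id(0)]);
    if (get_local_id(0) == 0) {
        output[get_group_id(0)] = sum;
    }
}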
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid] + arVal[gid + offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change the kernel to do "min" instead of "+" (see the one-line variant below)
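For instance, the "min" variant changes only one line of the kernel (fmin is the built-in float minimum in OpenCL C):
__kernel void foldKernelMin(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = fmin(arVal[gid], arVal[gid + offset]);
}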

How can I make this prime finder operate in parallel

I know prime finding is well studied, and there are a lot of different implementations. My question is, using the provided method (code sample), how can I go about breaking up the work? The machine it will be running on has 4 quad core hyperthreaded processors and 16GB of ram. I realize that there are some improvements that could be made, particularly in the IsPrime method. I also know that problems will occur once the list has more than int.MaxValue items in it. I don't care about any of those improvements. The only thing I care about is how to break up the work.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Prime
{
    class Program
    {
        static List<ulong> primes = new List<ulong>() { 2 };

        static void Main(string[] args)
        {
            ulong reportValue = 10;
            for (ulong possible = 3; possible <= ulong.MaxValue; possible += 2)
            {
                if (possible > reportValue)
                {
                    Console.WriteLine(String.Format("\nThere are {0} primes less than {1}.", primes.Count, reportValue));
                    try
                    {
                        checked
                        {
                            reportValue *= 10;
                        }
                    }
                    catch (OverflowException)
                    {
                        reportValue = ulong.MaxValue;
                    }
                }
                if (IsPrime(possible))
                {
                    primes.Add(possible);
                    Console.Write("\r" + possible);
                }
            }
            Console.WriteLine(primes[primes.Count - 1]);
            Console.ReadLine();
        }

        static bool IsPrime(ulong value)
        {
            foreach (ulong prime in primes)
            {
                if (value % prime == 0) return false;
                if (prime * prime > value) break;
            }
            return true;
        }
    }
}
There are 2 basic schemes I see: 1) using all threads to test a single number, which is probably great for higher primes but I cannot really think of how to implement it, or 2) using each thread to test a single possible prime, which can cause a non-continuous string of primes to be found and run into unused resources problems when the next number to be tested is greater than the square of the highest prime found.
To me it feels like both of these situations are challenging only in the early stages of building the list of primes, but I'm not entirely sure. This is being done for a personal exercise in breaking this kind of work.
If you want, you can parallelize both operations: the checking of a prime, and the checking of multiple primes at once. Though I'm not sure this would help. To be honest, I'd consider removing the threading in main().
I've tried to stay faithful to your algorithm, but to speed it up a lot I've used x*x instead of reportValue; this is something you could easily revert if you wish.
To further improve on my core splitting, you could devise an algorithm that estimates the number of computations required for the divisions based on the size of the numbers, and split the list that way (i.e. smaller numbers take less time to divide by, so make the first partitions larger).
Also, my concept of a thread pool may not exist the way I want to use it.
Here's my go at it (pseudo-ish code):
List<int> primes = {2};
List<int> nextPrimes = {};
int cores = 4;

main()
{
    for (int x = 3; x < MAX; x = x*x) {
        int localmax = x*x;
        for (int y = x; y < localmax; y += 2) {
            thread{ primecheck(y); }
        }
        "wait for all threads to be executed"
        primes.add(nextPrimes);
        nextPrimes = {};
    }
}

void primecheck(int y)
{
    bool primality = true;
    threadpool? pool;
    for (int x = 0; x < cores; x++) {
        pool.add(thread{
            if (!smallcheck(x*primes.length/cores, (x+1)*primes.length/cores, y)) {
                primality = false;
                pool.kill();
            }
        });
    }
    "wait for all threads to be executed or killed"
    if (primality)
        nextPrimes.add(y);
}

bool smallcheck(int a, int b, int y)
{
    foreach (int div in primes[a to b])
        if (y % div == 0)
            return false;
    return true;
}
Edit: I added what I think pooling should look like; look at the revision if you want to see it without.
Use the sieve of Eratosthenes instead. It's not worthwhile to parallelize unless you use a good algorithm in the first place.
Separate the space to sieve into large regions and sieve each in its own thread, or better, use some work-queue concept for the regions.
Use a bit array to represent the prime numbers; it takes less space than representing them explicitly.
See also this answer for a good implementation of a sieve (in Java, no split into regions).
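As a sketch of the split-into-regions idea (standard C++ with OpenMP rather than the question's C#, purely illustrative; the same structure maps onto Parallel.For):
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Find the base primes up to sqrt(n) serially, then sieve independent
// segments in parallel; each thread owns its segment, so no locking is needed.
std::uint64_t count_primes(std::uint64_t n, std::uint64_t segment_size = 1 << 20)
{
    const auto root = static_cast<std::uint64_t>(std::sqrt(static_cast<double>(n)));
    std::vector<char> small_sieve(root + 1, 1);
    std::vector<std::uint64_t> base;
    for (std::uint64_t i = 2; i <= root; ++i)
    {
        if (!small_sieve[i]) continue;
        base.push_back(i);
        for (std::uint64_t j = i * i; j <= root; j += i) small_sieve[j] = 0;
    }

    std::uint64_t count = 0;
    #pragma omp parallel for reduction(+ : count) schedule(dynamic)
    for (long long lo = 2; lo <= static_cast<long long>(n); lo += static_cast<long long>(segment_size))
    {
        const std::uint64_t hi = std::min<std::uint64_t>(lo + segment_size - 1, n);
        std::vector<char> seg(hi - lo + 1, 1);  // thread-private segment
        for (const auto p : base)
        {
            // first multiple of p inside [lo, hi], but never p itself
            std::uint64_t j = std::max(p * p, ((lo + p - 1) / p) * p);
            for (; j <= hi; j += p) seg[j - lo] = 0;
        }
        for (std::uint64_t v = lo; v <= hi; ++v) count += seg[v - lo];
    }
    return count;
}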
