Synchronizing all threads in a grid in Metal


I am trying to write a norm or a squared length function for an n-sized vector in Metal. To do this, I planned on having every thread square each element, then elect one thread to sum all elements.
Here is my current kernel:
#include <metal_stdlib>
#include <metal_compute>
using namespace metal;
kernel void length_squared(const device float *x [[ buffer(0) ]],
device float *s [[ buffer(1) ]],
device float *out [[ buffer(2) ]],
uint gid [[ thread_position_in_grid ]],
uint numElements [[ threads_per_grid ]])
{
s[gid] = x[gid];// * x[gid];
simdgroup_barrier(mem_flags::mem_none);
if(gid == 0){
for(uint i = 0; i < numElements; i++){
*out += s[i];
}
}
}
Unfortunately, this code does not compile; the error is "Use of Undeclared Identifier simdgroup_barrier", even though the function is documented in the Metal Shading Language Specification.
Has anyone encountered this, or does anyone know how to synchronize all threads across a grid? threadgroup_barrier does not achieve total synchronization for me.
Am I approaching this problem incorrectly? What is the best way to synchronize this operation?

A SIMD group is smaller than a threadgroup, so simdgroup_barrier would synchronize even fewer threads than threadgroup_barrier does. It cannot synchronize the whole grid.
Instead, you'll want to use a parallel reduction to sum up the values in parallel. Here is some Metal code I found.
Though, if you don't mind a single thread doing all the summing, you can run a separate kernel with just one thread to do the sum. Of course, this can be very slow.
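To make the idea of a parallel reduction concrete, here is a minimal sketch in plain C rather than Metal. It runs serially, and the function name length_squared_tree and the scratch buffer are made up for illustration. Each pass of the outer loop stands for one step that the GPU threads would execute together, with a threadgroup barrier between passes.

```c
#include <stddef.h>

/* Sum of squares via a tree (pairwise) reduction, written serially.
 * Each pass of the outer loop corresponds to one parallel step on a GPU;
 * the inner loop iterations are independent of each other.
 * scratch must have room for at least n floats. (Hypothetical helper.) */
float length_squared_tree(const float *x, float *scratch, size_t n)
{
    if (n == 0)
        return 0.0f;

    /* Step 1: every "thread" squares its own element. */
    for (size_t i = 0; i < n; i++)
        scratch[i] = x[i] * x[i];

    /* Step 2: repeatedly fold the upper half onto the lower half. */
    for (size_t active = n; active > 1; ) {
        size_t half = (active + 1) / 2;   /* round up for odd sizes */
        for (size_t i = 0; i + half < active; i++)
            scratch[i] += scratch[i + half];
        active = half;
    }
    return scratch[0];
}
```

The number of passes grows with log2(n) instead of n, which is why this pattern is preferred over a single thread walking the whole array.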

Related

Where can PTHRED_MUTEX_ADAPTIVE_NP be specified and how does it work?

I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP which is somehow given as a value to a mutex so that the mutex spins adaptively, meaning that it spins for roughly as long as an immediate wakeup through the kernel would take. But how do I utilize this configuration-macro to a thread ?
I've also developed an improved shared readers-writer lock (at best it needs only one atomic operation, in contrast to the three operations in the Wikipedia solution) with relative writer priority: further readers are stalled when a writer is waiting, while the readers that arrived earlier are allowed to proceed. It could also make use of adaptive spinning, so: how is the number of spin cycles calculated?

I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP
Some pthreads implementations provide a macro PTHREAD_MUTEX_ADAPTIVE_NP (note spelling) that is one of the possible values of the kind_np mutex attribute, but neither that attribute nor the macro is standard. It looks like at least BSD and AIX have them, or at least did at one time, but this is not something you should be using in new code.
But how do I utilize this configuration-macro to a thread ?
You don't. Even if you are using a pthreads implementation that supports it, this is the value of a mutex attribute, not a thread attribute. You obtain a mutex with that attribute value by explicitly requesting it when you initialize the mutex. It would look something like this:
pthread_mutexattr_t attr;
pthread_mutex_t mutex;
int rval;
// Return-value checks omitted for brevity and clarity
rval = pthread_mutexattr_init(&attr);
rval = pthread_mutexattr_setkind_np(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
rval = pthread_mutex_init(&mutex, &attr);
There are other mutex attributes that you can set in analogous ways, which is one of the reasons I wrote this answer. Although you should not be using the kind_np attribute, you can follow this general model for other mutex attributes. There are also thread attributes, which work similarly.

I found the code in glibc. This is the "adaptive" mutex locking code of pthread_mutex_lock in glibc 2.31:
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
== PTHREAD_MUTEX_ADAPTIVE_NP, 1))
{
if (! __is_smp)
goto simple;
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
{
int cnt = 0;
int max_cnt = MIN (max_adaptive_count (),
mutex->__data.__spins * 2 + 10);
do
{
if (cnt++ >= max_cnt)
{
LLL_MUTEX_LOCK (mutex);
break;
}
atomic_spin_nop ();
}
while (LLL_MUTEX_TRYLOCK (mutex) != 0);
mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
}
assert (mutex->__data.__owner == 0);
}
So the spin limit is the smaller of a configured maximum (system configurable, or 1000 if there's no configuration) and twice the stored spin estimate plus 10. After the lock is taken, one eighth of the difference between the actual number of spins and the stored estimate is added to the estimate used for the next lock attempt.
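The arithmetic can be tried in isolation. This is only a sketch of the two formulas from the excerpt; the function names spin_limit and next_spin_estimate are made up here and are not glibc symbols.

```c
/* Upper bound on spins for this lock attempt:
 * min(configured maximum, 2 * stored estimate + 10). */
int spin_limit(int max_adaptive, int spins)
{
    int limit = spins * 2 + 10;
    return limit < max_adaptive ? limit : max_adaptive;
}

/* New stored estimate after an attempt that spun cnt times:
 * move one eighth of the way from the old estimate towards cnt
 * (integer division, as in the excerpt). */
int next_spin_estimate(int spins, int cnt)
{
    return spins + (cnt - spins) / 8;
}
```

The estimate behaves like a slow moving average, so a single unusually long or short wait changes the next limit only a little.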

Was: How does BPF calculate number of CPU for PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out what is the proper way to determine in a program that reads BPF maps the number of elements in a PERCPU_ARRAY used on a system.
Failing that, I think the second-best approach is to pick a buffer for reading that is "large enough." Here, the problem is similar: what is that number, and is there a way to learn it at runtime?

The answer comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
int cpus = libbpf_num_possible_cpus();
if (cpus < 0) {
p_err("Can't get # of possible cpus: %s", strerror(-cpus));
exit(-1);
}
return cpus;
}
int libbpf_num_possible_cpus(void)
{
static const char *fcpu = "/sys/devices/system/cpu/possible";
static int cpus;
int err, n, i, tmp_cpus;
bool *mask;
/* ---8<--- snip */
}
So that's how they do it!
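The file named in that snippet contains a CPU list such as "0-127". As a rough sketch of what counting the entries in such a list can look like, here is a small parser. The function count_cpus_in_list is a made-up helper for illustration and is not the libbpf implementation, whose body is elided above.

```c
#include <stdlib.h>

/* Count the CPUs named in a CPU-list string such as "0-127" or "0-3,8,10-11".
 * A trailing newline is accepted. Returns -1 on malformed input.
 * (Hypothetical helper, not libbpf code.) */
int count_cpus_in_list(const char *s)
{
    int total = 0;

    while (*s != '\0' && *s != '\n') {
        char *end;
        long first = strtol(s, &end, 10);
        if (end == s || first < 0)
            return -1;

        long last = first;
        if (*end == '-') {               /* a range "first-last" */
            s = end + 1;
            last = strtol(s, &end, 10);
            if (end == s || last < first)
                return -1;
        }
        total += (int)(last - first + 1);

        if (*end == ',')
            end++;                       /* another entry follows */
        else if (*end != '\0' && *end != '\n')
            return -1;
        s = end;
    }
    return total;
}
```

In a real program it is probably simpler to call libbpf_num_possible_cpus() directly, as bpftool does, and size the read buffer from its return value.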

"threadgroup_barrier" makes no difference

Currently I'm working with Metal compute shaders and trying to understand how GPU thread synchronization works there.
I wrote some simple code, but it doesn't work the way I expect.
Suppose I have a threadgroup variable, an array to which all threads can write their output simultaneously.
kernel void compute_features(device float &output [[ buffer(0) ]],
ushort2 group_pos [[ threadgroup_position_in_grid ]],
ushort2 thread_pos [[ thread_position_in_threadgroup]],
ushort tid [[ thread_index_in_threadgroup ]])
{
threadgroup short blockIndices[288];
float someValue = 0.0;
// doing some work here which fills someValue...
blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
//wait when all threads are done with calculations
threadgroup_barrier(mem_flags::mem_none);
output += blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x]; // filling out output variable with threads calculations
}
The code above doesn't work. The output variable doesn't contain the calculations of all threads; it contains only the value from the thread that was presumably the last to add its value to output. To me it seems like threadgroup_barrier does absolutely nothing.
Now, the interesting part. The code below works:
blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
threadgroup_barrier(mem_flags::mem_none); //wait when all threads are done with calculations
if (tid == 0) {
for (int i = 0; i < 288; i ++) {
output += blockIndices[i]; // filling out output variable with threads calculations
}
}
And this code works just as well as the previous one:
blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
if (tid == 0) {
for (int i = 0; i < 288; i ++) {
output += blockIndices[i]; // filling out output variable with threads calculations
}
}
To summarize: my code works as expected only when a single GPU thread handles the threadgroup memory, no matter which thread it is; it can be the last thread in the threadgroup as well as the first one. The presence of threadgroup_barrier makes absolutely no difference. I also tried threadgroup_barrier with the mem_threadgroup flag, and the code still doesn't work.
I understand that I might be missing some very important detail, and I would be happy if someone could point out my errors. Thanks in advance!

When you write output += blockIndices[...], all threads will try to perform this operation at the same time. But since output is not an atomic variable, this results in race conditions. It's not a threadsafe operation.
Your second solution is the correct one. You need to have just a single thread to collect the results (although you could split this up across multiple threads too). That it still works OK if you remove the barrier may just be due to luck.
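The same pattern can be shown on the CPU with POSIX threads. This is only an analogy, not Metal code: every worker writes its own slot and nothing else, the joins play the role of the barrier, and a single thread does the summing. NUM_WORKERS, slots, worker and collect_sum are names made up for this sketch, and the return values of pthread_create and pthread_join are not checked, for brevity.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_WORKERS 8

static float slots[NUM_WORKERS];      /* one private slot per worker */

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    slots[id] = (float)(id + 1);      /* stand-in for the real per-thread result */
    return NULL;
}

/* Start the workers, wait for all of them, then sum on the calling thread.
 * No two threads ever write the same location, so there is no race. */
float collect_sum(void)
{
    pthread_t threads[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);   /* acts as the barrier */

    float output = 0.0f;
    for (int i = 0; i < NUM_WORKERS; i++)
        output += slots[i];
    return output;
}
```

If every worker instead executed output += value on a shared variable, the read, add and write of different threads could interleave and updates would be lost, which matches the behaviour described in the question.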

Parallel processing a prime finder with openMP

I am trying to construct a prime finder for a bit of C practice. I've got the algorithm down and I've done a bunch of optimisations to make it faster, I then decided to try to parallelize it because, hey why not! Turns out to be harder than I thought. I can either get all threads running the same process (with same args) or a single thread will run if I try and supply different args to each process. I really have no idea what I'm doing here but you can see some experimental values I'm using in this code:
// gcc -std=c99 -o multithread multithread.c -fopenmp -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
int pf(unsigned int start, unsigned int limit, unsigned int q);
int main(int argc, char *argv[])
{
printf("prime finder\n");
int j, slimits[4] = {1,10000000,20000000,30000000}, elimits[4] = {10000000,20000000,30000000,40000000};
double startTime = omp_get_wtime();
#pragma omp parallel shared(slimits, elimits primes)
{
#pragma omp for
for (j = 0; j < 4; j++)
{
primes += pf(slimits[j], elimits[j], atoi(argv[2]));
}
}
printf("%d prime numbers found in %.2f seconds.\n\n", primes, omp_get_wtime() - startTime);
return 0;
}
I haven't included the pf function as it is quite large, but it works on its own; it returns the number of primes found. I'm sure the issue is here somewhere.
Any help would be greatly appreciated!

You have made at least one obvious (to me) and serious mistake. You've declared primes shared and allowed all the threads in the program to update it. You have, thereby, programmed a data race. Nothing in OpenMP (nor in C if I recall correctly) guarantees that += will be implemented atomically. You haven't actually specified what the problem with your program is, or what the problems are, but this must surely be one of them.
I'll tell you how to fix this later but I think there is a more serious underlying design problem you should address first. You seem to have decided that you would have 4 threads running and that you should divide the range of integers to test for primality into 4 and pass one chunk to each thread. Sure, you can make that work but it's not a smart approach to using OpenMP. Nor is it a smart approach to dividing the work of primality testing.
A smarter approach to OpenMP program design is to start off by making no assumptions about the number of threads that will be available to the executing program. Design for any number of threads, do not design a program whose behaviour depends on the number of threads it gets at run-time. Use OpenMP's facilities, specifically the schedule clause, to distribute the workload at run time.
Turning to primality testing. Draw, or at least think about, a scatter plot of points (i,t(i)), where i is an integer and t(i) is the time it takes to determine whether or not i is prime. The pattern in this plot is about as difficult to discern as the pattern in the plot of the occurrence of primes in the integers. In other words, the time to determine the primality of an integer is very unpredictable. It does tend to rise as the integers increase (well, excluding large even integers which I'm sure your test doesn't consider anyway).
One implication of this unpredictability is that if you divide a range of integers into N sub-ranges and give one sub-range to each of N threads you are not giving the threads the same amount of work to do. Indeed, in the range of integers 1..m (any m) there is one integer which takes much longer to test than any other integer in the range, and this time is the irreducible minimum that your program will take. A naive distribution of the range will produce a seriously unbalanced workload.
Here's what I think you should do to fix your program.
First, write a function which tests the primality of a single integer. This will be the basic task for your computation. Call this is_prime. Next, study the schedule clause for the parallel for construct. OpenMP provides a number of task scheduling options, I won't explain them here, you will find plenty of good documentation online. Finally, study also the reduction clause; this provides the solution to the data race you have programmed.
Applying all this I suggest you change
#pragma omp parallel shared(slimits, elimits primes)
{
#pragma omp for
for (j = 0; j < 4; j++)
{
primes += pf(slimits[j], elimits[j], atoi(argv[2]));
}
}
to
#pragma omp parallel shared(slimits, elimits, max_int_to_test)
{
#pragma omp for reduction(+:primes) schedule (dynamic, 10)
for (j = 3; j < max_int_to_test; j += 2)
{
primes += is_prime(j);
}
}
With any luck my rudimentary C hasn't screwed up the syntax too much.
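For completeness, here is a self-contained sketch of that suggestion. The trial-division is_prime and the wrapper count_primes are assumptions written for this example (the asker's pf function is not shown), and 2 is counted separately because the loop only visits odd numbers. Compile with -fopenmp to run the loop in parallel; without that flag the pragma is ignored and the function still returns the same result serially.

```c
#include <stdbool.h>

/* Trial division; returns true when n is prime. (Illustrative only.) */
static bool is_prime(int n)
{
    if (n < 2)
        return false;
    if (n % 2 == 0)
        return n == 2;
    for (int d = 3; d <= n / d; d += 2)
        if (n % d == 0)
            return false;
    return true;
}

/* Count the primes in [2, limit). */
int count_primes(int limit)
{
    int primes = (limit > 2) ? 1 : 0;   /* account for 2, the only even prime */

    /* Each thread accumulates into a private copy of primes; the copies are
     * added together at the end of the loop, so there is no data race. */
    #pragma omp parallel for reduction(+:primes) schedule(dynamic, 10)
    for (int j = 3; j < limit; j += 2)
        primes += is_prime(j);

    return primes;
}
```

The dynamic schedule hands out chunks of 10 iterations as threads become free, which helps with the uneven cost per integer described above.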

Native mutex implementation

So in my illumination days, I started to think about how the hell Windows/Linux implement the mutex. I've implemented this synchronizer in 100... different ways, on many different architectures, but never thought about how it is really implemented in a big-ass OS. For example, in the ARM world I made some of my synchronizers by disabling interrupts, but I always thought that it wasn't a really good way to do it.
I tried to "swim" through the Linux kernel but, as I expected, I couldn't find anything that satisfies my curiosity. I'm not an expert in threading, but I have a solid grasp of all the basic and intermediate concepts.
So does anyone know how a mutex is implemented?

A quick look at code apparently from one Linux distribution seems to indicate that it is implemented using an interlocked compare and exchange. So, in some sense, the OS isn't really implementing it since the interlocked operation is probably handled at the hardware level.
Edit: As Hans points out, the interlocked exchange does the compare and exchange in an atomic manner. Here is documentation for the Windows version. For fun, I just now wrote a small test to show a really simple example of creating a mutex like that. This is a simple acquire and release test.
#include <windows.h>
#include <assert.h>
#include <stdio.h>
struct homebrew {
LONG *mutex;
int *shared;
int mine;
};
#define NUM_THREADS 10
#define NUM_ACQUIRES 100000
DWORD WINAPI SomeThread( LPVOID lpParam )
{
struct homebrew *test = (struct homebrew*)lpParam;
while ( test->mine < NUM_ACQUIRES ) {
// Test and set the mutex. If it currently has value 0, then it
// is free. Setting 1 means it is owned. This interlocked function does
// the test and set as an atomic operation
if ( 0 == InterlockedCompareExchange( test->mutex, 1, 0 )) {
// this thread now owns the mutex. Increment the shared variable
// without an atomic increment (relying on mutex ownership to protect it)
(*test->shared)++;
test->mine++;
// Release the mutex (4 byte aligned assignment is atomic)
*test->mutex = 0;
}
}
return 0;
}
int main( int argc, char* argv[] )
{
LONG mymutex = 0; // zero means the mutex is free
int shared = 0;
HANDLE threads[NUM_THREADS];
struct homebrew test[NUM_THREADS];
int i;
// Initialize each thread's structure. All share the same mutex and a shared
// counter
for ( i = 0; i < NUM_THREADS; i++ ) {
test[i].mine = 0; test[i].shared = &shared; test[i].mutex = &mymutex;
}
// create the threads and then wait for all to finish
for ( i = 0; i < NUM_THREADS; i++ )
threads[i] = CreateThread(NULL, 0, SomeThread, &test[i], 0, NULL);
for ( i = 0; i < NUM_THREADS; i++ )
WaitForSingleObject( threads[i], INFINITE );
// Verify all increments occurred atomically
printf( "shared = %d (%s)\n", shared,
shared == NUM_THREADS * NUM_ACQUIRES ? "correct" : "wrong" );
for ( i = 0; i < NUM_THREADS; i++ ) {
if ( test[i].mine != NUM_ACQUIRES ) {
printf( "Thread %d cheated. Only %d acquires.\n", i, test[i].mine );
}
}
}
If I comment out the call to InterlockedCompareExchange and just let all threads run the increments in a free-for-all fashion, the results are usually wrong. Running it 10 times, for example, without the interlocked compare call:
shared = 748694 (wrong)
shared = 811522 (wrong)
shared = 796155 (wrong)
shared = 825947 (wrong)
shared = 1000000 (correct)
shared = 795036 (wrong)
shared = 801810 (wrong)
shared = 790812 (wrong)
shared = 724753 (wrong)
shared = 849444 (wrong)
The curious thing is that one time the result showed no ill effects from contention. That might be because there is no "everyone start now" synchronization; maybe all threads started and finished in order in that case. But when I have the InterlockedCompareExchange call in place, it runs without failure (or at least it ran 100 times without failure ... that doesn't prove I didn't write a subtle bug into the example).
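For readers not on Windows, the same compare-and-exchange idea can be sketched with C11 atomics. The type cas_lock and the three functions are names made up for this example. One deliberate difference from the test program above: the release uses an explicit release store rather than a plain assignment, so that writes made inside the critical section are ordered before the lock is seen as free.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* 0 = free, 1 = owned. (Hypothetical type for illustration.) */
typedef struct { atomic_int state; } cas_lock;

/* Try once to take the lock; returns true on success. */
bool cas_trylock(cas_lock *l)
{
    int expected = 0;
    /* Atomically: if state == 0, set it to 1 and report success. */
    return atomic_compare_exchange_strong_explicit(
        &l->state, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Spin until the lock is acquired. */
void cas_lock_acquire(cas_lock *l)
{
    while (!cas_trylock(l))
        ;   /* busy-wait */
}

/* Release the lock. */
void cas_unlock(cas_lock *l)
{
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

Like the original test, this is a pure spinlock: a waiting thread burns CPU time instead of sleeping, which a production mutex avoids by asking the kernel to block the thread after a short spin.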

Here is the discussion from the people who implemented it ... very interesting as it shows the tradeoffs ..
Several posts from Linus T ... of course

In earlier days pre-POSIX etc I used to implement synchronization by using a native mode word (e.g. 16 or 32 bit word) and the Test And Set instruction lurking on every serious processor. This instruction guarantees to test the value of a word and set it in one atomic instruction. This provides the basis for a spinlock and from that a hierarchy of synchronization functions could be built. The simplest is of course just a spinlock which performs a busy wait, not an option for more than transitory sync'ing, then a spinlock which drops the process time slice at each iteration for a lower system impact. Notional concepts like Semaphores, Mutexes, Monitors etc can be built by getting into the kernel scheduling code.
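A minimal modern sketch of the second variant described above, a test-and-set spinlock that gives up its time slice on every failed attempt, might look like this. It assumes C11 atomics and the POSIX sched_yield call; tas_flag, tas_lock and tas_unlock are names invented for the example.

```c
#include <stdatomic.h>
#include <sched.h>

/* Clear = free, set = owned. */
static atomic_flag tas_flag = ATOMIC_FLAG_INIT;

void tas_lock(void)
{
    /* test_and_set returns the previous value: true means someone
     * else already holds the lock, so yield and try again. */
    while (atomic_flag_test_and_set_explicit(&tas_flag, memory_order_acquire))
        sched_yield();   /* drop the time slice instead of spinning hot */
}

void tas_unlock(void)
{
    atomic_flag_clear_explicit(&tas_flag, memory_order_release);
}
```

Removing the sched_yield call turns this into the plain busy-wait spinlock mentioned first.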
As I recall the prime usage was to implement message queues to permit multiple clients to access a database server. Another was a very early real time car race result and timing system on a quite primitive 16 bit machine and OS.
These days I use Pthreads and Semaphores and Windows Events/Mutexes (mutices?) etc. and don't give a thought as to how they work, although I must admit that having been down in the engine room does give one an intuitive feel for better and more efficient multiprocessing.

In windows world.
Before Windows Vista, the mutex was implemented with a compare-exchange (CAS) that changes the state of the mutex from Empty to BeingUsed. For other threads that then wait on the mutex, the CAS will obviously fail, and they must be added to the mutex queue for later notification. Those queue operations (add/remove/check) were protected by a common lock in the Windows kernel.
After Windows XP, the mutex started to use a spin lock for performance reasons, being a self-sufficient object.
In the Unix world I didn't get much further, but it is probably very similar to Windows 7.
Finally, for kernels that work on a single processor, the best way is to disable interrupts when entering the critical section and re-enable them when exiting.
