OpenCL - Sync and Signal?

I want to use a sync/signal scheme in OpenCL to make sure that only one thread can enter a critical kernel section at a time.
Here is the code I have so far:
void sync(int barrierID) {
    int ID = get_global_id(0);
    barrier(CLK_GLOBAL_MEM_FENCE);
    while (ID - barrierID != 0) {
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
// critical part
void signal(int threadCount, int barrierID) {
    barrierID++;
    barrier(CLK_GLOBAL_MEM_FENCE);
    while (barrierID != threadCount) {
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    barrierID = 0;
}
Here threadCount is the number of threads that want to access the critical part, and barrierID is the counter of how many threads have passed it.
Unfortunately, this code does not work in OpenCL.
Does anyone know how to fix it?

You are approaching GPU computing as if it were CPU multithreading, which is the wrong model.
The reason is that in GPU computing all the "threads" (in reality they are work items) run at the same time. A work item cannot enter a zone and run code while the others are doing something else.
Therefore, having any kind of branching on the GPU is a bad idea, since it slows down your application by making the GPU run all the branches, even when some items do not take them.
For your specific case:
You are getting a deadlock in your kernel because you are placing a barrier inside a branch. After a single work item enters, it will wait until all the others have entered as well. If that never happens, you have a deadlock.
Check the barrier command: https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/barrier.html
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.


Is there a data race?

class Test {
    struct hazard_pointer {
        std::atomic<void*> hp;
        std::atomic<std::thread::id> id;
    };
    hazard_pointer hazard_pointers[max_hazard_pointers];
    std::atomic<void*>& get_hazard_pointer_for_current_thread() {
        std::thread::id id = std::this_thread::get_id();
        for (int i = 0; i < max_hazard_pointers; i++) {
            if (hazard_pointers[i].id.load() == id) {
                hazard_pointers[i].id.store(id);
                return hazard_pointers[i].hp;
            }
        }
        std::atomic<nullptr> ptr;
        return ptr;
    }
};
int main() {
    Test* t = new Test();
    std::thread t1([&t]() { while (1) t->get_hazard_pointer_for_current_thread(); });
    std::thread t2([&t]() { while (1) t->get_hazard_pointer_for_current_thread(); });
    t1.join();
    t2.join();
    return 0;
}
The function get_hazard_pointer_for_current_thread can be executed in parallel. Is there a data race? To my eye there is no data race because of the atomic operations, but I am not sure.
So please either reassure me or explain why there is (are) data race(s).
Let's assume that the hazard_pointers array elements are initialized.
There are a few errors in the code:
get_hazard_pointer_for_current_thread may not return any value - undefined behaviour.
hazard_pointers array elements are not initialized.
if(hazard_pointers[i].id.load() == id) hazard_pointers[i].id.store(id); does not make any sense.
And yes, there is a data race. Between statement if(hazard_pointers[i].id.load() == id) and hazard_pointers[i].id.store(id); another thread may change hazard_pointers[i].id. You probably need to use a compare-and-swap instruction.
I don't think you have any C++ UB from concurrent access to non-atomic data, but it looks like you do have the normal kind of race condition in your code.
if (x==a) x = b almost always needs to be an atomic read-modify-write (instead of separate atomic loads and atomic stores) in lock-free algorithms, unless there's some reason why it's ok to still store b if x changed to something other than a between the check and the store.
(In this case, the only thing that can ever be stored is the value that was already there, as @MargaretBloom points out. So there's no "bug", just a bunch of useless stores if this is the only code that touches the array. I'm assuming that you didn't really intend to write a useless example, so I'm considering it a bug.)
Lock-free programming is not easy, even if you do it the low-performance way with the default std::memory_order_seq_cst for all the stores so the compiler has to MFENCE everywhere. Making everything atomic only avoids C++ UB; you still have to carefully design the logic of your algorithm to make sure it's correct even if multiple stores/loads from other thread(s) become visible between every one of your own operations, and stuff like that. (e.g. see Preshing's lock-free hash table.)
Being UB-free is necessary (at least in theory) but definitely not sufficient for code to be correct / safe. Being race-free means no (problematic) races even between atomic accesses. This is a stronger but still not sufficient part of being bug-free.
I say "in theory" because in practice a lot of code with UB happens to compile the way we expect, and will only bite you on other platforms, or with future compilers, or with different surrounding code that exposes the UB during optimization.
Testing can't easily detect all bugs, esp. if you only test on strongly-ordered x86 hardware, but a simple bug like this should be easily detectable with testing.
The problem with your code, in more detail:
You do a non-atomic compare-exchange, with an atomic load and a separate atomic store:
if( hazard_pointers[i].id.load() == id){
    // a store from another thread can become visible here
    hazard_pointers[i].id.store(id);
    return hazard_pointers[i].hp;
}
The .store() should be a std::compare_exchange_strong, so the value isn't modified if a store from another thread changed the value between your load and your store. (Putting it inside an if on a relaxed or acquire load is still a good idea; I think a branch to avoid a lock cmpxchg is a good idea if you expect the value to not match most of the time. That should let the cache lines stay Shared when no thread finds a match on those elements.)

Is it possible to create a binary analysis software which would sort out all possible vulnerabilities and bugs in other software?

I often find myself wondering whether it is possible to design a piece of software that would load up another program, try to emulate all possible outcomes from it, and figure out the bugs and vulnerabilities in the software being analyzed.
Theoretically, it could load any piece of software, keep an internal representation of the underlying system (CPU registers, memory, etc.) like virtual machine software does, and start fetching and emulating the instructions, proceeding linearly until it finds a conditional jump.
When it finds a conditional jump, it would take a snapshot of the current representational state of the system and follow the jump; it would keep evaluating instructions, and at some point it would restore that snapshot and not follow the jump, going past it and evaluating the next instructions instead, and so on.
Such software would be smart enough to emulate user-supplied input.
To make things clearer lets imagine we are analyzing the following (pseudo?) C code:
char* gets(char *s)
{
    int i = 0;
    while( (s[i] = _getche()) != VK_RETURN ) i++;
    s[i] = NULL;
    return s;
}
void main() {
    char buf[8];
    char is_admin = FALSE;
    do {
        gets( buf );
        if( _strcmp(buf, "s3cr3t!") == 0 )
            is_admin = TRUE;
        else
        {
            if( is_admin )
                super_user.exec( buf );
            else
                unprivileged_user.exec( buf );
        }
    } while( _strcmp(buf, "exit") != 0 );
}
It just keeps polling for user commands and executes them until the user inputs "exit". If the user inputs the password "s3cr3t!", then it will execute the following commands as a super user; otherwise it will just impersonate an unprivileged user.
Moving on, we could ask our analysis software to detect and sort out which ways that would be possible to execute commands as a super user on the subject code being analyzed.
By going through each instruction, it will come to conditional jumps and test both cases: when the jump is taken and when it is not. So after a few iterations it would know that if a user inputs the string "s3cr3t!", they will later come to execute commands as a super user. It would not try every possible string combination until it eventually stumbles on "s3cr3t!"; it would be smart enough to see there is a comparison against that string and to see what it changes in the program flow.
Then, it would also be able to see that any user input string that has more than 8 letters would overflow the allocated space for the buf char array, thereby corrupting memory. Which in this particular case, assuming that the stack memory layout for this was that the is_admin variable would be sitting right next to the buf char array, would set is_admin to evaluate to TRUE, and then comes to execute commands as super user.
It would also be able to spot that the overflow in that gets() function could corrupt the stack in a way that changes the return address of a function call. It would figure out that this is an exploitation scenario in which the user inputs shellcode and, by overwriting the return address, makes execution jump to that shellcode, which would again execute commands as a super user.
So... I know I could not go into much detail on the inner workings, but overall I think I made my point. Does anyone see something wrong with that approach or thinks it would not work?
I am thinking about going for an open project on this. I would appreciate any considerations.
If I understand you correctly, there is such a thing. Search for static analysis, control flow graphs and such things. So generally, your idea is good.
However, writing a program that will find all the bugs in another program is impossible. The proof is by reduction from the halting problem. So obviously, it is impossible to use your approach to find them all.
However, it might be possible to find all the bugs of some family.
For example, I can define the "bug family" of crashing within one minute when only one ASCII char is given as input. Of course you can check for this (at least for deterministic programs; for probabilistic programs, a simple check will give the probability that there is no bug).
So for specific bug families your approach might work.
And one last thing: notice that this approach might have high time complexity.

Reading a Global Variable from a Thread and Writing to that Variable from another Thread

My program has two threads and a global int variable. One thread reads from the variable and the other thread writes to it. Should I use a mutex lock in this situation?
These functions execute in the two threads simultaneously and repeatedly in my program.
void thread1()
{
    if ( condition1 )
        iVariable = 1;
    else if ( condition2 )
        iVariable = 2;
}
void thread2()
{
    if ( iVariable == 1 )
        //do something
    else if ( iVariable == 2 )
        //do another thing
}
If you don't use any synchronization then it is entirely unpredictable when the 2nd thread sees the updated value. This ranges somewhere between a handful of nanoseconds and never. The never outcome is particularly troublesome, of course; it can happen on an x86 processor when you don't declare the variable volatile and you run the Release build of your program. It can take a long time on processors with a weak memory model, like ARM cores. The only thing you don't have to worry about is seeing a partially updated value; int updates are atomic.
That's about all that can be said about the posted code. Fine-grained locking rarely works well.
Yes you should (under most circumstances). Mutexes will ensure that the data you are protecting will be correctly visible from multiple contending CPUs. Unless you have a performance problem, you should use a mutex. If performance is an issue, look into lock free data structures.

Why is threading dangerous?

I've always been told to put locks around variables that multiple threads will access. I've always assumed that this was because you want to make sure that the value you are working with doesn't change before you write it back, i.e.:
mutex.lock()
int a = sharedVar
a = someComplexOperation(a)
sharedVar = a
mutex.unlock()
And that makes sense that you would lock that. But in other cases I don't understand why I can't get away with not using Mutexes.
Thread A:
sharedVar = someFunction()
Thread B:
localVar = sharedVar
What could possibly go wrong in this instance? Especially if I don't care that Thread B reads any particular value that Thread A assigns.
It depends a lot on the type of sharedVar, the language you're using, any framework, and the platform. In many cases, it's possible that assigning a single value to sharedVar may take more than one instruction, in which case you may read a "half-set" copy of the value.
Even when that's not the case, and the assignment is atomic, you may not see the latest value without a memory barrier in place.
MSDN Magazine has a good explanation of different problems you may encounter in multithreaded code:
Forgotten Synchronization
Incorrect Granularity
Read and Write Tearing
Lock-Free Reordering
Lock Convoys
Two-Step Dance
Priority Inversion
The code in your question is particularly vulnerable to Read/Write Tearing. But your code, having neither locks nor memory barriers, is also subject to Lock-Free Reordering (which may include speculative writes in which thread B reads a value that thread A never stored) in which side-effects become visible to a second thread in a different order from how they appeared in your source code.
It goes on to describe some known design patterns which avoid these problems:
Immutability
Purity
Isolation
The article is available here
The main problem is that the assignment operator (operator= in C++) is not always guaranteed to be atomic (not even for primitive, built-in types). In plain English, that means that assignment can take more than a single clock cycle to complete. If the thread gets interrupted in the middle of it, the current value of the variable might be corrupted.
Let me build off of your example:
Lets say sharedVar is some object with operator= defined as this:
object& operator=(const object& other) {
    ready = false;
    doStuff(other);
    if (other.value == true) {
        value = true;
        doOtherStuff();
    } else {
        value = false;
    }
    ready = true;
    return *this;
}
If thread A from your example is interrupted in the middle of this function, ready will still be false when thread B starts to run. This could mean that the object is only partially copied over, or is in some intermediate, invalid state when thread B attempts to copy it into a local variable.
For a particularly nasty example of this, think of a data structure with a removed node being deleted, then interrupted before it could be set to NULL.
(For some more information regarding structures that don't need a lock (aka, are atomic), here is another question that talks a bit more about that.)
This could go wrong, because threads can be suspended and resumed by the thread scheduler, so you can't be sure about the order these instructions are executed. It might just as well be in this order:
Thread B:
localVar = sharedVar
Thread A:
sharedVar = someFunction()
In which case localVar will be null or 0 (or some completely unexpected value in an unsafe language), probably not what you intended.
Mutexes actually won't fix this particular issue by the way. The example you supply does not lend itself well for parallelization.

Primitive synchronization primitives -- safe?

On constrained devices, I often find myself "faking" locks between 2 threads with 2 bools. Each is only read by one thread, and only written by the other. Here's what I mean:
bool quitted = false, is_paused = false;
bool should_quit = false, should_pause = false;
void downloader_thread() {
    quitted = false;
    while(!should_quit) {
        fill_buffer(bfr);
        if(should_pause) {
            is_paused = true;
            while(should_pause) sleep(50);
            is_paused = false;
        }
    }
    quitted = true;
}
void ui_thread() {
    // new Thread(downloader_thread).start();
    // ...
    should_pause = true;
    while(!is_paused) sleep(50);
    // resize buffer or something else non-thread-safe
    should_pause = false;
}
Of course on a PC I wouldn't do this, but on constrained devices, it seems reading a bool value would be much quicker than obtaining a lock. Of course I trade off for slower recovery (see "sleep(50)") when a change to the buffer is needed.
The question -- is it completely thread-safe? Or are there hidden gotchas I need to be aware of when faking locks like this? Or should I not do this at all?
Using bool values to communicate between threads can work as you intend, but there are indeed two hidden gotchas as explained in this blog post by Vitaliy Liptchinsky:
Cache Coherency
A CPU does not always fetch memory values from RAM. Fast memory caches on the die are one of the tricks used by CPU designers to work around the Von Neumann bottleneck. On some multi-cpu or multi-core architectures (like Intel's Itanium) these CPU caches are not shared or automatically kept in sync. In other words, your threads may be seeing different values for the same memory address if they run on different CPU's.
To avoid this you need to declare your variables as volatile (C++, C#, java), or do explicit volatile read/writes, or make use of locking mechanisms.
Compiler Optimizations
The compiler or JITter may perform optimizations which are not safe if multiple threads are involved. See the linked blog post for an example. Again, you must make use of the volatile keyword or other mechanisms to inform your compiler.
Unless you understand the memory architecture of your device in detail, as well as the code generated by your compiler, this code is not safe.
Just because it seems that it would work, doesn't mean that it will. "Constrained" devices, like the unconstrained type, are getting more and more powerful. I wouldn't bet against finding a dual-core CPU in a cell phone, for instance. That means I wouldn't bet that the above code would work.
Concerning the sleep call, you could always just do sleep(0) or the equivalent call that pauses your thread letting the next in line a turn.
Concerning the rest, this is thread safe if you know the implementation details of your device.
Answering the questions.
Is this completely thread-safe? I would answer no, this is not thread-safe, and I would just not do this at all. Without knowing the details of your device and compiler, if this is C++, the compiler is free to reorder and optimize things away as it sees fit. E.g. you wrote:
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
but the compiler may choose to reorder this into something like this:
sleep(50);
is_paused = false;
This probably won't work even on a single-core device, as others have said.
Rather than taking a lock, you may try to do better to just do less on the UI thread rather than yield in the middle of processing UI messages. If you think that you have spent too much time on the UI thread then find a way to cleanly exit and register an asynchronous call back.
If you call sleep on a UI thread (or try to acquire a lock, or do anything that may block) you open the door to hangs and glitchy UIs. A 50ms sleep is enough for a user to notice. And if you try to acquire a lock or do any other blocking operation (like I/O), you need to deal with the reality of waiting for an indeterminate amount of time, which tends to turn a glitch into a hang.
This code is unsafe under almost all circumstances. On multi-core processors you cannot count on the cores seeing each other's writes promptly, because plain bool reads and writes carry no synchronization. This means each core is not guaranteed to have the same value in its cache, or even in memory if the cache line from the last write hasn't been flushed.
However, even on resource constrained single core devices this is not safe because you do not have control over the scheduler. Here is an example, for simplicty I'm going to pretend these are the only two threads on the device.
When the ui_thread runs, the following lines of code could be run in the same timeslice.
// new Thread(downloader_thread).start();
// ...
should_pause = true;
The downloader_thread runs next, and in its time slice the following lines are executed:
quitted = false;
while(!should_quit)
{
fill_buffer(bfr);
The scheduler preempts the downloader_thread before fill_buffer returns, and then activates the ui_thread, which runs:
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
The resize-buffer operation is done while the downloader_thread is in the process of filling the buffer. This means the buffer is corrupted and you'll likely crash soon. It won't happen every time, but the fact that you fill the buffer before setting is_paused to true makes it more likely. Even if you switched the order of those two operations in the downloader_thread you would still have a race condition, though you'd likely deadlock instead of corrupting the buffer.
Incidentally, this is a type of spinlock; it just doesn't work here. Spinlocks aren't good for waits that are likely to span many time slices, because they keep the processor spinning. Your implementation sleeps instead, which is a bit nicer, but the scheduler still has to run your thread, and thread context switches aren't cheap. If you wait on a critical section or semaphore, the scheduler doesn't activate your thread again until the resource has become free.
You might be able to get away with this in some form on a specific platform/architecture, but it is really easy to make a mistake that is very hard to track down.
