Does this use of release/acquire semantics contain a potential data race? - multithreading

Consider the following code which I found on http://preshing.com/20120913/acquire-and-release-semantics/ (but I am sure I saw it frequently elsewhere)
Common code:
int A = 0;
std::atomic<int> Ready(0);
Code executed by Thread 1:
A = 42;
Ready.store(1, std::memory_order_release);
Code executed by Thread 2:
int r1 = Ready.load(std::memory_order_acquire);
int r2 = A;
It can then be said that if r1 == 1, then r2 will always be 42, and there is no data race.
My first question is: Does the code nevertheless contain a data race? I mean in the case that r1 == 0. For example, Thread 1 could be preempted half-way through the store to A and then Thread 2 could pick up execution.
I would think that the code for Thread 2 should be rewritten as:
int r1 = Ready.load(std::memory_order_acquire);
int r2 = 0;
if (r1 == 1) {
    r2 = A;
}
to avoid the potential data race.
Is that true?

Yes, your interpretation is correct: the program contains a data race due to the unconditional read of A. The program is a deliberately minimal example to demonstrate the workings of acquire-release for the blog: the author discusses only how the memory ordering enforces that the "reader" thread must read 42 from A if it reads 1 from Ready. He doesn't talk about the alternative case because it's not germane.
In a real program, the "reader" thread would probably wait in a loop on Ready to get the desired semantics:
int A = 0;
std::atomic<int> Ready(0);

void write_thread() {
    A = 42;
    Ready.store(1, std::memory_order_release);
}

void read_thread() {
    while (!Ready.load(std::memory_order_acquire))
        ;
    int r2 = A;
    assert(r2 == 42);
}
Or use a conditional as in your example:
int A = 0;
std::atomic<int> Ready(0);

void write_thread() {
    A = 42;
    Ready.store(1, std::memory_order_release);
}

void read_thread() {
    while (true) {
        if (Ready.load(std::memory_order_acquire)) {
            int r2 = A;
            assert(r2 == 42);
            break;
        }
        std::cout << "Not ready yet - try again later.\n" << std::flush;
        std::this_thread::sleep_for(std::chrono::milliseconds{125});
    }
}
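For completeness, here is that second variant as a full compilable program; the headers, the printout of r2, and main are my additions:

#include <atomic>
#include <cassert>
#include <chrono>
#include <iostream>
#include <thread>

int A = 0;
std::atomic<int> Ready(0);

void write_thread() {
    A = 42;                                    // plain store, published below
    Ready.store(1, std::memory_order_release); // release: makes the store to A visible
}

void read_thread() {
    while (true) {
        if (Ready.load(std::memory_order_acquire)) {
            int r2 = A; // no race: this read happens-after the store to A
            assert(r2 == 42);
            std::cout << "r2 = " << r2 << '\n';
            break;
        }
        std::cout << "Not ready yet - try again later.\n" << std::flush;
        std::this_thread::sleep_for(std::chrono::milliseconds{125});
    }
}

int main() {
    std::thread w(write_thread);
    std::thread r(read_thread);
    w.join();
    r.join();
}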

Related

Is this Peterson's solution correct for N threads?

bool lock[N];
int turn = 0;
int offset = 0;
int M = N - 1;
int val;
int pidToN(int pid); // returns a unique number in [0, N-1] for a given pid; mapping pids

void critical()
{
    int pidn = pidToN(getpid());
    lock[pidn] = true;
    turn = M - pidn;
    if (turn == pidn)
    {
        val = 1;
        turn += val % N;
    }
    else
        val = 0;
    while (lock[M - pidn + val] && turn == (M - pidn + val) &&
           lock[M - pidn - val] && turn == (M - pidn - val));
    // critical section
    lock[pidn] = false;
}
Does this implementation work? Essentially thread[i] tries to pass to thread[N-1-i] and vice versa. If i = N/2 (the middle thread, if it exists, which would pass to itself), then I increment turn by a certain val (1 in this case), and that thread then waits.
Couldn't come up with any race conditions.
Any help would be appreciated.
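For reference, the standard generalization of Peterson's algorithm to N threads is the filter lock. Below is a sketch of the textbook algorithm using C++ atomics; it is my formulation for comparison, not a fix of the code above, and N is assumed known at compile time:

#include <atomic>

constexpr int N = 4; // assumed thread count

// level[i] is the highest "level" thread i has reached; victim[l] is the
// last thread to enter level l, which must wait while any other thread
// is at that level or above. As globals these are zero-initialized.
std::atomic<int> level[N];
std::atomic<int> victim[N];

void lock(int me)
{
    for (int l = 1; l < N; l++) // climb through N-1 levels
    {
        level[me].store(l);
        victim[l].store(me);
        // spin while some other thread is at my level or higher
        // and I am still the victim at this level
        bool wait = true;
        while (wait)
        {
            wait = false;
            for (int k = 0; k < N; k++)
                if (k != me && level[k].load() >= l && victim[l].load() == me)
                    wait = true;
        }
    }
}

void unlock(int me)
{
    level[me].store(0);
}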

How to trigger a race condition?

I am researching fuzzing approaches, and I want to work out which approach is suitable for race condition problems. I therefore have a question about race conditions themselves.
Let's suppose we have a global variable and some threads have access to it without any restriction. How can we trigger the existing race condition? Is it enough to run the function that uses the global variable from several threads? I mean, will just running the function trigger the race condition anyway?
Here I put some code, and I know it has a race condition problem. I want to know which inputs I should give the functions to trigger the corresponding race condition.
#include <thread>
#include <vector>
#include <iostream>
#include <experimental/filesystem>
#include <Windows.h>
#include <atomic>
#include <cstdlib>

using namespace std;
namespace fs = experimental::filesystem;

volatile int totalSum;
//atomic<int> totalSum;
volatile int* numbersArray;

void threadProc(int startIndex, int endIndex)
{
    Sleep(300);
    for (int i = startIndex; i < endIndex; i++)
    {
        totalSum += numbersArray[i];
    }
}

void performAddition(int maxNum, int threadCount)
{
    totalSum = 0;
    numbersArray = new int[maxNum];
    for (int i = 0; i < maxNum; i++)
    {
        numbersArray[i] = i + 1;
    }
    int numbersPerThread = maxNum / threadCount;
    vector<thread> workerThreads;
    for (int i = 0; i < threadCount; i++)
    {
        int startIndex = i * numbersPerThread;
        int endIndex = startIndex + numbersPerThread;
        if (i == threadCount - 1)
            endIndex = maxNum;
        workerThreads.emplace_back(threadProc, startIndex, endIndex);
    }
    for (int i = 0; i < workerThreads.size(); i++)
    {
        workerThreads[i].join();
    }
    delete[] numbersArray;
}

void printUsage(char* progname)
{
    cout << "usage: " << fs::path(progname).filename() << " maxNum threadCount\t with 1<maxNum<=10000, 0<threadCount<=maxNum" << endl;
}

int main(int argc, char* argv[])
{
    if (argc != 3)
    {
        printUsage(argv[0]);
        return -1;
    }
    long int maxNum = strtol(argv[1], nullptr, 10);
    long int threadCount = strtol(argv[2], nullptr, 10);
    if (maxNum <= 1 || maxNum > 10000 || threadCount <= 0 || threadCount > maxNum)
    {
        printUsage(argv[0]);
        return -2;
    }
    performAddition(maxNum, threadCount);
    cout << "Result: " << totalSum << " (expected: " << (maxNum * (maxNum + 1))/2 << ")" << endl;
    return totalSum;
}
Thanks for your help
There may be many kinds of race conditions. One example for your case:

One thread:
reads the commonly accessible variable (1)
increments it (2)
writes the resulting value back to the common variable (2)

A second thread starts just after the first thread read the common value:
it reads the same value (1)
increments the value it read (2)
then writes the calculated value to the common variable at the same time as the first one (2)

As a result, the common variable was incremented only by one (to the value 2), but it should have been incremented by two (to the value 3), since two threads were acting on it.
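This lost update is easy to reproduce in a few lines. A minimal C++ sketch (the iteration count is arbitrary, and the unsynchronized counter is deliberate):

#include <iostream>
#include <thread>

int counter = 0; // shared, deliberately unsynchronized

void hammer()
{
    for (int i = 0; i < 1000000; i++)
        counter++; // read-modify-write: concurrent updates can be lost
}

int main()
{
    std::thread t1(hammer);
    std::thread t2(hammer);
    t1.join();
    t2.join();
    // With the race, the printed value is usually well below 2000000.
    std::cout << counter << " (expected 2000000)" << std::endl;
    return 0;
}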
Testing race conditions:
For your purpose (in the above example) you can detect the race condition when you get a different result than expected.

Triggering:
If you want the described situation to happen reliably for testing purposes, you will need to coordinate the work of the two threads, e.g. as sketched below.
Nevertheless, coordinating the two threads arguably violates the definition of a race condition, if it is defined as: "A race condition or race hazard is the behavior of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events." So you need to know what you want: a race condition is normally unwanted behavior, but in your case you want it to happen, which makes sense for testing purposes.
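Here is a minimal sketch of such coordination (my construction, using a C++20 std::barrier to force both threads to read before either writes; the final unsynchronized writes are formally a data race, which is the point):

#include <barrier>
#include <iostream>
#include <thread>

int x = 0;                  // shared, deliberately unsynchronized
std::barrier sync_point(2); // both threads must arrive before either proceeds

void racer()
{
    int local = x;                // both threads read the same value here
    sync_point.arrive_and_wait(); // neither writes until both have read
    x = local + 1;                // both write read-value + 1: one update is lost
}

int main()
{
    std::thread t1(racer);
    std::thread t2(racer);
    t1.join();
    t2.join();
    std::cout << "x = " << x << " (two increments ran)" << std::endl;
    return 0;
}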
If you are asking generally when a race condition can occur: it depends on your software design (e.g. you may have shared atomic integers, which are fine to use), hardware design (e.g. variables stored in temporary registers), and, generally, luck.
Hope this helps,
Witold

Please explain cache coherence

I've recently learned about false sharing, which in my understanding stems from the CPU's attempt to create cache coherence between different cores.
However, doesn't the following example demonstrate that cache coherence is violated?
The example below launches several threads that increment a global variable x, several threads that assign the value of x to y, and an observer that tests whether y > x. The condition y > x should never happen if there is memory coherence between the cores, as y is only increased after x has been increased. However, according to the results of running this program, the condition does happen. I tested it in Visual Studio, both x64 and x86, both debug and release, with pretty much the same results.
So, does memory coherence only happen when it's bad and never when it's good? :)
Please explain how cache coherence works and how it doesn't work. If you can guide me to a book that explains the subject I'll be grateful.
edit: I've added mfence wherever possible; still there is no memory coherence (presumably due to stale caches).
Also, I know the program has a data race; that's the whole point. My question is: Why is there a data race if the CPU maintains cache coherence? (And if it isn't maintaining cache coherence, then what is false sharing and how does it happen?) Thank you.
#include <intrin.h>
#include <windows.h>
#include <iostream>
#include <thread>
#include <atomic>
#include <list>
#include <chrono>
#include <ratio>

#define N 1000000
#define SEPARATE_CACHE_LINES 0
#define USE_ATOMIC 0

#pragma pack(1)
struct
{
    __declspec (align(64)) volatile long x;
#if SEPARATE_CACHE_LINES
    __declspec (align(64))
#endif
    volatile long y;
} data;

volatile long &g_x = data.x;
volatile long &g_y = data.y;
int g_observed;
std::atomic<bool> g_start;

void Observer()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
        _mm_mfence();
        long y = g_y;
        _mm_mfence();
        long x = g_x;
        _mm_mfence();
        if (y > x)
        {
            ++g_observed;
        }
    }
}

void XIncreaser()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
#if USE_ATOMIC
        InterlockedAdd(&g_x, 1);
#else
        _mm_mfence();
        int x = g_x + 1;
        _mm_mfence();
        g_x = x;
        _mm_mfence();
#endif
    }
}

void YAssigner()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
#if USE_ATOMIC
        long x = g_x;
        InterlockedExchange(&g_y, x);
#else
        _mm_mfence();
        int x = g_x;
        _mm_mfence();
        g_y = x;
        _mm_mfence();
#endif
    }
}

int main()
{
    using namespace std::chrono;
    g_x = 0;
    g_y = 0;
    g_observed = 0;
    g_start = false;
    const int NAssigners = 4;
    const int NIncreasers = 4;
    std::list<std::thread> threads;
    for (int i = 0; i < NAssigners; ++i)
    {
        threads.emplace_back(YAssigner);
    }
    for (int i = 0; i < NIncreasers; ++i)
    {
        threads.emplace_back(XIncreaser);
    }
    threads.emplace_back(Observer);
    auto tic = high_resolution_clock::now();
    g_start = true;
    for (std::thread& t : threads)
    {
        t.join();
    }
    auto toc = high_resolution_clock::now();
    std::cout << "x = " << g_x << " y = " << g_y << " number of times y > x = " << g_observed << std::endl;
    std::cout << "&x = " << (int*)&g_x << " &y = " << (int*)&g_y << std::endl;
    std::chrono::duration<double> t = toc - tic;
    std::cout << "time elapsed = " << t.count() << std::endl;
    std::cout << "USE_ATOMIC = " << USE_ATOMIC << " SEPARATE_CACHE_LINES = " << SEPARATE_CACHE_LINES << std::endl;
    return 0;
}
Example output:
x = 1583672 y = 1583672 number of times y > x = 254
&x = 00007FF62BE95800 &y = 00007FF62BE95804
time elapsed = 0.187785
USE_ATOMIC = 0 SEPARATE_CACHE_LINES = 0
False sharing is mainly related to performance, not coherence or program order. The CPU cache works at a granularity of typically 16, 32, 64, ... bytes, called a cache line. If two independent data items are close together in memory, they will experience each other's cache operations. Specifically, if &a / CACHE_LINE_SIZE == &b / CACHE_LINE_SIZE (integer division, with lines aligned to their size), then they share a cache line.
For example, if CPUs 0 and 1 are fighting over a, and CPUs 2 and 3 are fighting over b, the cache line containing a and b will thrash between all four caches. This is the effect of false sharing, and it causes a large performance drop.
False sharing happens because the coherence algorithm in the caches demands a consistent view of memory. A good way to examine it is to put two atomic counters in a structure, spaced well apart (the pad below is 8 KB):
struct counters {
    long a;
    long pad[1024];
    long b;
};
and find a nice little machine-language function to do an atomic increment. Then cut loose NCPU/2 threads incrementing a and NCPU/2 threads incrementing b until they reach a big number.
Then repeat with the pad array commented out, and compare the times.
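A portable approximation of that experiment, with std::atomic fetch_add standing in for the hand-written machine-language increment (the two-thread setup and iteration count are my arbitrary choices):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct Counters {
    std::atomic<long> a{0};
    long pad[1024];        // comment this out to put a and b on one cache line
    std::atomic<long> b{0};
};

Counters counters;

void spin(std::atomic<long>& n)
{
    for (long i = 0; i < 10000000; i++)
        n.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    auto tic = std::chrono::steady_clock::now();
    std::thread t1(spin, std::ref(counters.a));
    std::thread t2(spin, std::ref(counters.b));
    t1.join();
    t2.join();
    auto toc = std::chrono::steady_clock::now();
    // Expect a noticeably larger time when the pad is removed (false sharing).
    std::cout << std::chrono::duration<double>(toc - tic).count() << " s" << std::endl;
    return 0;
}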
When you are trying to get at machine details, clarity and precision are your friends; C++ and weird attribute declarations aren’t.

Concurrency in Linux: alternating access of two threads to a critical section

I'm new to the concepts of concurrency and threads in Linux, and I tried to solve a relatively simple problem. I create two threads which run the same function, which increments a global variable. What I really want from my program is for the threads to increment that variable alternately: in each step, the thread that increments the variable prints a message to the screen, so the expected output should look like:
Thread 1 is incrementing variable count... count = 1
Thread 2 is incrementing variable count... count = 2
Thread 1 is incrementing variable count... count = 3
Thread 2 is incrementing variable count... count = 4
and so on.
I tried an implementation with a semaphore which ensures mutual exclusion, but nonetheless the result resembles this:
Thread 2 is incrementing variable count... count = 1
Thread 2 is incrementing variable count... count = 2
Thread 2 is incrementing variable count... count = 3
Thread 2 is incrementing variable count... count = 4
Thread 2 is incrementing variable count... count = 5
Thread 1 is incrementing variable count... count = 6
Thread 1 is incrementing variable count... count = 7
Thread 1 is incrementing variable count... count = 8
Thread 1 is incrementing variable count... count = 9
Thread 1 is incrementing variable count... count = 10
My code looks like:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>

int count = 0;
sem_t mutex;

void *function1(void *arg)
{
    int i = 0;
    int *a = (int*) arg;
    while (i < 10)
    {
        sem_wait(&mutex);
        count++;
        i++;
        printf("From the function : %d count is %d\n", *a, count);
        sem_post(&mutex);
    }
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    int a = 1;
    int b = 2;
    sem_init(&mutex, 0, 1);
    pthread_create(&t1, NULL, function1, &a);
    pthread_create(&t2, NULL, function1, &b);
    pthread_join(t2, NULL);
    pthread_join(t1, NULL);
    sem_destroy(&mutex);
    return 0;
}
My big question now is: how do I achieve this alternation between threads? I got mutual exclusion, but the alternation is still missing. Maybe I lack a good insight into semaphore usage, but I would be very grateful if someone would explain it to me. (I have read several courses on this topic, namely Linux semaphores, concurrency and threads, but the information there wasn't satisfactory enough.)
Mutex locks don't guarantee any fairness. This means that if thread 1 releases the lock and then immediately attempts to get it again, there is no guarantee that it won't succeed right away - a mutex just guarantees that no two pieces of code are running in the protected block at the same time.
EDIT: Removed a previous C-style solution because it was probably incorrect. The question asks for a synchronization solution.
If you really want to do this correctly, you would use something called a monitor and a guard (or a condition variable). I'm not too familiar with C and pthreads, so you would need to look up how to do it there, but in Java (using Guava's Monitor) it would look something like:
public class IncrementableInteger {
    public int value;

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}

@Test
public void testThreadAlternating() throws InterruptedException {
    IncrementableInteger i = new IncrementableInteger();
    Monitor m = new Monitor();
    Monitor.Guard modIsZeroGuard = new Monitor.Guard(m) {
        @Override public boolean isSatisfied() {
            return i.value % 2 == 0;
        }
    };
    Monitor.Guard modIsOneGuard = new Monitor.Guard(m) {
        @Override public boolean isSatisfied() {
            return i.value % 2 == 1;
        }
    };
    Thread one = new Thread(() -> {
        while (true) {
            m.enterWhenUninterruptibly(modIsZeroGuard);
            try {
                if (i.value >= 10) return;
                i.value++;
                System.out.println("Thread 1 inc: " + String.valueOf(i));
            } finally {
                m.leave();
            }
        }
    });
    Thread two = new Thread(() -> {
        while (true) {
            m.enterWhenUninterruptibly(modIsOneGuard);
            try {
                if (i.value >= 10) return;
                i.value++;
                System.out.println("Thread 2 inc: " + String.valueOf(i));
            } finally {
                m.leave();
            }
        }
    });
    one.start();
    two.start();
    one.join();
    two.join();
}
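Since the answer defers on the C/pthreads version, here is a minimal sketch of the same turn-taking pattern using a mutex and condition variable, shown with C++'s std::condition_variable (the pthread equivalents are pthread_mutex_lock, pthread_cond_wait, and pthread_cond_broadcast); this is my formulation of the standard idiom, not the answerer's:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int count = 0;
std::mutex m;
std::condition_variable turn;

void worker(int id) // id is 1 or 2
{
    for (int i = 0; i < 5; i++)
    {
        std::unique_lock<std::mutex> lk(m);
        // wait until it is this thread's turn: even count for thread 1, odd for thread 2
        turn.wait(lk, [id] { return count % 2 == id - 1; });
        ++count;
        std::cout << "Thread " << id << " is incrementing variable count... count = " << count << std::endl;
        turn.notify_all(); // wake the other thread so it can take its turn
    }
}

int main()
{
    std::thread t1(worker, 1);
    std::thread t2(worker, 2);
    t1.join();
    t2.join();
    return 0;
}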

correct usage of compare_exchange_weak

#include <atomic>
#include <array>
#include <cstdlib>
#include <thread>
#include <vector>

const int SIZE = 20;

struct Node { Node* next; };

std::atomic<Node*> head (nullptr);

void push (void* p)
{
    Node* n = (Node*) p;
    n->next = head.load ();
    while (!head.compare_exchange_weak (n->next, n));
}

void* pop ()
{
    Node* n = head.load ();
    while (n &&
           !head.compare_exchange_weak (n, n->next));
    return n ? n : malloc (SIZE);
}

void thread_fn()
{
    std::array<char*, 1000> pointers;
    for (int i = 0; i < 1000; i++) pointers[i] = nullptr;

    for (int i = 0; i < 10000000; i++)
    {
        int r = random() % 1000;
        if (pointers[r] != nullptr) // allocated earlier
        {
            push (pointers[r]);
            pointers[r] = nullptr;
        }
        else
        {
            pointers[r] = (char*) pop (); // allocate
            // stamp the memory
            for (int i = 0; i < SIZE; i++)
                pointers[r][i] = 0xEF;
        }
    }
}

int main(int argc, char *argv[])
{
    int N = 8;
    std::vector<std::thread*> threads;
    threads.reserve (N);
    for (int i = 0; i < N; i++)
        threads.push_back (new std::thread (thread_fn));
    for (int i = 0; i < N; i++)
        threads[i]->join();
}
What is wrong with this usage of compare_exchange_weak? The above code crashes about 1 run in 5 using clang++ (macOS).
At the time of the crash, head.load() holds 0xEFEFEFEFEF. pop is like malloc and push is like free. Each of the 8 threads randomly allocates or deallocates memory from head.
It could be a nice lock-free allocator, but the ABA problem arises:
A: Suppose some thread1 executes pop(), which reads the current value of head into the variable n, but is preempted immediately afterwards. A concurrent thread2 then executes a full pop() call, that is, it reads the same value from head and performs a successful compare_exchange_weak.
B: Now the object referred to by n in thread1 no longer belongs to the list and can be modified by thread2. So n->next is garbage in general: reading it can return any value. For example, it can be 0xEFEFEFEFEF, where the low 5 bytes are the stamp (EF) written by thread2 and the top 3 bytes are still 0 from the nullptr (the total value is interpreted numerically in little-endian fashion). It would seem that, because the value of head has changed, thread1 will fail its compare_exchange_weak call, but...
A: ...the concurrent thread2 push()es the popped pointer back onto the list. So thread1 sees the initial value of head again and performs a successful compare_exchange_weak, which writes the garbage value into head. The list is corrupted.
Note that the problem is more than the possibility that another thread can modify the contents of n->next. The problem is that the value of n->next is no longer coupled to the list. So even if it is not modified concurrently, it becomes invalid (as a replacement for head) when, for example, other threads pop() two elements from the list but push() back only the first of them: n->next then points to the second element, which no longer belongs to the list.
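One common mitigation is to pair the head pointer with a generation counter, so that a pop/push cycle by another thread changes the compared value and the stale CAS fails. A sketch, assuming popped nodes stay readable (true here, since the memory is recycled rather than unmapped); on x86-64 this needs a lock-free 16-byte CAS (e.g. compile with -mcx16), and otherwise std::atomic falls back to a lock, which is still correct:

#include <atomic>
#include <cstdint>

struct Node { Node* next; };

struct TaggedHead
{
    Node*    ptr;
    uint64_t tag; // bumped on every successful update, so reuse of ptr is detected
};

std::atomic<TaggedHead> head { TaggedHead{nullptr, 0} };

void push(Node* n)
{
    TaggedHead old = head.load();
    do {
        n->next = old.ptr;
    } while (!head.compare_exchange_weak(old, TaggedHead{n, old.tag + 1}));
}

Node* pop()
{
    TaggedHead old = head.load();
    while (old.ptr)
    {
        // old.ptr->next may be stale if another thread popped this node,
        // but then old.tag no longer matches and the CAS below fails,
        // so the stale value is never installed.
        TaggedHead desired{old.ptr->next, old.tag + 1};
        if (head.compare_exchange_weak(old, desired))
            return old.ptr;
        // a failed CAS reloaded `old`; retry with the fresh snapshot
    }
    return nullptr;
}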
