Race condition and atomic operations in Julia and other languages - multithreading

I have several questions about atomic operations and multithreading.
There is a function for which a race condition occurs (julia lang):
    function counter(n)
        counter = 0
        for i in 1:n
            counter += i
        end
        return counter
    end
If atomic operations are used to change the global variable "counter", would that help get rid of the race condition?
Does the cache coherence protocol have any real effect on performance? Virtual machines like the JVM can use their own architectures to support parallel computing.
Do atomic arithmetic and similar operations require more or less resources than ordinary arithmetic?
I don't quite understand your example: the variable counter seems to be local, so there can be no race condition in it.
Anyway, yes, atomic operations will ensure that race conditions do not occur. There are 2 or 3 ways to do that.
1. Your counter can be an Atomic{Int}:
    using .Threads
    const counter = Atomic{Int}(0)
    function updatecounter(i)
        atomic_add!(counter, i)
    end
This is described in the manual: https://docs.julialang.org/en/v1/manual/multi-threading/#Atomic-Operations
2. You can use a field in a struct declared as @atomic:
    mutable struct Counter
        @atomic c::Int
    end
    const counter = Counter(0)
    function updatecounter(i)
        @atomic counter.c += i
    end
This is described here: https://docs.julialang.org/en/v1/base/multi-threading/#Atomic-operations
It seems the details of the semantics haven't been written yet, but it's the same as in C++.
3. You can use a lock:
    counter = 0
    countlock = ReentrantLock()
    function updatecounter(i)
        @lock countlock global counter += i
    end
1. and 2. are more or less the same. The lock approach is slower, but can be used if several operations must be done serially. No matter how you do it, there will be a performance degradation relative to non-atomic arithmetic. The atomic primitives in 1. and 2. must do a memory fence to ensure the correct ordering, so cache coherence will matter, depending on the hardware.
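Since the question also asks about other languages, here is a rough C++ sketch of the same trade-off, assuming std::atomic as the counterpart of Atomic{Int} and std::mutex as the counterpart of ReentrantLock. The function names sum_with_atomic and sum_with_mutex are made up for this illustration.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Analogue of approaches 1 and 2: an atomic counter.
// Hypothetical helper: every thread adds 1..n to the shared counter.
long sum_with_atomic(int nthreads, int n) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&counter, n] {
            for (int i = 1; i <= n; ++i) {
                counter.fetch_add(i);  // atomic read-modify-write, seq_cst by default
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}

// Analogue of approach 3: a plain variable guarded by a lock.
long sum_with_mutex(int nthreads, int n) {
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&counter, &m, n] {
            for (int i = 1; i <= n; ++i) {
                std::lock_guard<std::mutex> guard(m);  // released at end of scope
                counter += i;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Both versions return the same total; the lock version is the one to reach for when several updates have to happen together as one step.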


Why can a race condition occur when filling an array in parallel?

There is a function in the Julia language that fills an array with random values in parallel and calculates its sum:
    function thread_test(v)
        Threads.@threads for i = 1:length(v)
            @inbounds v[i] = rand()
        end
        return sum(v)
    end
@inbounds is a macro that disables bounds checking on array accesses; it is safe here because the index always lies within the array.
Why might a race condition occur when executing this code?
rand is generally not thread-safe in most languages, including some versions of Julia. This means calling rand() from multiple threads can cause undefined behaviour (in practice, the generator state is typically written by different threads at the same time, which hurts both performance and the quality of the random numbers). The Julia documentation explicitly states:
In a multi-threaded program, you should generally use different RNG objects from different threads or tasks in order to be thread-safe. However, the default RNG is thread-safe as of Julia 1.3 (using a per-thread RNG up to version 1.6, and per-task thereafter).
Besides this, the code is fine.
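For languages where the default generator is not thread-safe, the "different RNG objects from different threads" advice could look roughly like this in C++. This is only a sketch: fill_random is a made-up name, and seeding each thread with base_seed + t is a simplification that does not guarantee statistically independent streams.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// Hypothetical helper: fill v in parallel. Each thread owns a private
// generator and writes only to its own disjoint slice of v.
// Assumes nthreads >= 1.
void fill_random(std::vector<double>& v, unsigned nthreads, unsigned base_seed) {
    std::vector<std::thread> workers;
    const std::size_t chunk = (v.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t lo = std::min(v.size(), t * chunk);
        const std::size_t hi = std::min(v.size(), lo + chunk);
        const unsigned seed = base_seed + t;  // simplistic per-thread seed (assumption)
        workers.emplace_back([&v, lo, hi, seed] {
            std::mt19937_64 rng(seed);  // private to this thread, never shared
            std::uniform_real_distribution<double> dist(0.0, 1.0);
            for (std::size_t i = lo; i < hi; ++i) v[i] = dist(rng);
        });
    }
    for (auto& w : workers) w.join();
}
```

No generator state is shared, and no two threads touch the same element, so there is nothing left to race on.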
Because multiple threads are accessing the same variable (v) at the same time, which can lead to unexpected results.

Peterson's solution using just one variable

For Pi:
    do {
        turn = i;            // prepare enter section
        while (turn == j);   // wait
        // critical section
        turn = j;            // exit section
    } while (true);
For Pj:
    do {
        turn = j;            // prepare enter section
        while (turn == i);   // wait
        // critical section
        turn = i;            // exit section
    } while (true);
In this simplified algorithm, if process i wants to enter its critical section, it sets "turn = i" (unlike Peterson's solution, which sets "turn = j"). This algorithm does not seem to cause deadlock or starvation, so why isn't Peterson's algorithm simplified like this?
Another question: as far as I know, mutual exclusion mechanisms such as the semaphore P/V operations require atomicity (P must test sem.value and decrement it as one indivisible step). Why does the algorithm above, which uses just the one variable turn, not seem to require atomicity (the assignment turn = i and the test turn == j are not performed atomically)?
Before you ask whether the algorithm avoids deadlock and starvation, you first have to verify that it still locks. With your version, even assuming sequential consistency, the operations could be sequenced like this:
    Pi                                          Pj
    turn = i;
    while (turn == j); // exits immediately
                                                turn = j;
                                                while (turn == i); // exits immediately
    // critical section                         // critical section
and you have a lock violation.
To your second question: it depends on what you mean by "atomicity". You do need it to be the case that when one thread stores turn = i; then the other thread loading turn will only read i or j and not anything else. On some machines, depending on the type of turn and the values of i and j, you could get tearing and load an entirely different value. So whatever language you are using may require you to declare turn as "atomic" in some fashion to avoid this. In C++ in particular, if turn isn't declared std::atomic, then any concurrent read/write access is a data race, and the behavior of the entire program becomes undefined (that's bad).
Besides the need to avoid tearing and data races, Peterson's algorithm also requires strict memory ordering (sequential consistency), which on many systems / languages is not guaranteed unless specially requested, again perhaps by declaring the variable as atomic in some fashion.
It is true that unlike more typical lock algorithms, Peterson doesn't require an atomic read-modify-write, only atomic sequentially consistent loads and stores. That's precisely what makes it an interesting and clever algorithm. But there's a substantial tradeoff in complexity and performance, especially if you want more than two threads, and most real-life systems do have reasonably efficient atomic RMW instructions, so Peterson is rarely used in practice.
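For comparison with the one-variable version in the question, here is a sketch of the full two-thread Peterson lock in C++, assuming std::atomic with its default sequentially consistent ordering is what provides the "atomic sequentially consistent loads and stores" mentioned above. The names PetersonLock and run_peterson are made up for this example.

```cpp
#include <atomic>
#include <thread>

// Two-thread Peterson lock built only from atomic loads and stores
// (no read-modify-write). Thread ids must be 0 and 1.
struct PetersonLock {
    std::atomic<bool> interested[2];
    std::atomic<int> turn;

    PetersonLock() {
        interested[0].store(false);
        interested[1].store(false);
        turn.store(0);
    }

    void lock(int self) {
        const int other = 1 - self;
        interested[self].store(true);  // announce intent
        turn.store(other);             // give way to the other thread
        // Wait only while the other thread wants in AND it is its turn.
        while (interested[other].load() && turn.load() == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int self) { interested[self].store(false); }
};

// Hypothetical driver: two threads increment a plain (non-atomic) variable
// that is protected only by the Peterson lock.
long run_peterson(int iterations) {
    PetersonLock lock;
    long shared = 0;
    auto worker = [&lock, &shared, iterations](int self) {
        for (int k = 0; k < iterations; ++k) {
            lock.lock(self);
            ++shared;  // critical section
            lock.unlock(self);
        }
    };
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return shared;
}
```

The second variable (the interested flags) is what the simplified version is missing: turn alone cannot tell a thread whether the other one is currently inside.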

What is the minimum hardware support required for mutual exclusion of competing threads from a critical section?

When several threads share common data, mutual exclusion is required to avoid race conditions while the data is being modified. This can be implemented if the hardware supports an atomic test-and-set instruction.
But can we go even simpler? By having just atomic read operation and atomic write operation, is it possible to achieve mutual exclusion? Dekker's algorithm and Peterson's algorithm are some of the algorithms that can achieve mutual exclusion between just 2 processes if there exists atomic read and atomic write operations.
I have seen that Peterson's algorithm can be extended to involve N processes. The algorithm for that is like this:
lock (for Process i):
    /* repeat for all partners */
    for (count = 0; count < (NUMPROCS-1); count++) {
        flags[i] = count;    // I think I'm in position "count" in the queue
        turn[count] = i;     // and I'm the most recent process to think I'm in position "count"
        "wait until                              // wait until
         (for all k != i, flags[k] < count)      // everyone thinks they're behind me
         or (turn[count] != i)"                  // or someone later than me thinks they're in position "count"
        // now I can update my estimated position to "count"+1
    }   // now I'm at the head of the queue so I can start my critical section
unlock (for Process i):
    /* tell everyone we are finished */
    flags[i] = -1;   // I'm not in the queue anymore
As far as I can tell, this algorithm only requires atomic reads and atomic writes. But the algorithm above assumes N is known in advance. It cannot be extended to a dynamic N, because growing the flags and turn arrays concurrently would itself have to be protected by mutual exclusion.
So, is there any known algorithm that can provide mutual exclusion among a dynamic number N of threads, in a preemptive, multi-core environment with no test-and-set instruction? What if starvation freedom is not required? Or has it been proven that this cannot be done without atomic test-and-set?
A sequentially consistent memory model is assumed, but please mention if even this is not required. I believe all hardware provides some way to write a sequentially consistent program.
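To make the fixed-N version concrete, the pseudocode above might be written in C++ roughly as follows. This is a sketch, assuming std::atomic loads and stores with the default sequentially consistent ordering; FilterLock and run_filter are made-up names, and flags and turn from the pseudocode are renamed level_ and victim_.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// N-thread generalisation of Peterson's lock, using only atomic loads and
// stores. N is fixed at construction, matching the limitation noted above.
class FilterLock {
public:
    explicit FilterLock(int n) : n_(n), level_(n), victim_(n) {
        for (auto& l : level_) l.store(-1);   // -1: not in the queue
        for (auto& v : victim_) v.store(-1);
    }

    void lock(int me) {
        for (int count = 0; count < n_ - 1; ++count) {
            level_[me].store(count);    // flags[i] = count
            victim_[count].store(me);   // turn[count] = i
            // Wait while I am still the latest arrival at this position
            // and some other thread is at this position or further ahead.
            while (victim_[count].load() == me && someone_at_or_above(me, count)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock(int me) { level_[me].store(-1); }

private:
    bool someone_at_or_above(int me, int count) const {
        for (int k = 0; k < n_; ++k) {
            if (k != me && level_[k].load() >= count) return true;
        }
        return false;
    }

    int n_;
    std::vector<std::atomic<int>> level_;
    std::vector<std::atomic<int>> victim_;
};

// Hypothetical driver: nthreads threads increment a plain variable that is
// protected only by the filter lock.
long run_filter(int nthreads, int iterations) {
    FilterLock lock(nthreads);
    long shared = 0;
    std::vector<std::thread> workers;
    for (int id = 0; id < nthreads; ++id) {
        workers.emplace_back([&lock, &shared, id, iterations] {
            for (int k = 0; k < iterations; ++k) {
                lock.lock(id);
                ++shared;  // critical section
                lock.unlock(id);
            }
        });
    }
    for (auto& w : workers) w.join();
    return shared;
}
```

Both arrays are sized once in the constructor, which is exactly where the dynamic-N problem shows up: resizing them while other threads are spinning on their elements would need its own protection.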

pthreads: If I increment a global from two different threads, can there be sync issues?

Suppose I have two threads A and B that are both incrementing a global variable "count". Each thread runs a for loop like this one:
    for(int i=0; i<1000; i++)
        count++; //alternatively, count = count + 1;
i.e. each thread increments count 1000 times, and let's say count starts at 0. Can there be sync issues in this case? Or will count correctly equal 2000 when the execution is finished? I guess since the statement "count = count + 1" may break down into TWO assembly instructions, there is potential for the other thread to be swapped in between these two instructions? Not sure. What do you think?
Yes there can be sync issues in this case. You need to either protect the count variable with a mutex, or use a (usually platform specific) atomic operation.
Example using pthread mutexes
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    for(int i=0; i<1000; i++) {
        pthread_mutex_lock(&mutex); count++; pthread_mutex_unlock(&mutex);
    }
Using atomic ops
There is a prior discussion of platform specific atomic ops here:
UNIX Portable Atomic Operations
If you only need to support GCC, this approach is straightforward. If you're supporting other compilers, you'll probably have to make some per-platform decisions.
Count clearly needs to be protected with a mutex or other synchronization mechanism.
At a fundamental level, the count++ statement breaks down to:
load count into register
increment register
store count from register
A context switch could occur before/after any of those steps, leading to situations like:
Thread 1: load count into register A (value = 0)
Thread 2: load count into register B (value = 0)
Thread 1: increment register A (value = 1)
Thread 1: store count from register A (value = 1)
Thread 2: increment register B (value = 1)
Thread 2: store count from register B (value = 1)
As you can see, both threads completed one iteration of the loop, but the net result is that count was only incremented once.
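The interleaving above can be replayed deterministically in a single thread, which makes the lost update easy to see without relying on timing. The sketch below is purely illustrative; Step, replay, lost_update_demo and serial_demo are made-up names.

```cpp
#include <vector>

// One step of the load / increment / store breakdown, tagged with the
// thread (0 or 1) that performs it.
struct Step {
    int thread;
    char op;  // 'L' = load, 'I' = increment register, 'S' = store
};

// Replays a schedule against a shared "count" and one register per thread.
int replay(const std::vector<Step>& schedule) {
    int count = 0;
    int reg[2] = {0, 0};
    for (const Step& s : schedule) {
        switch (s.op) {
            case 'L': reg[s.thread] = count; break;
            case 'I': reg[s.thread] += 1; break;
            case 'S': count = reg[s.thread]; break;
        }
    }
    return count;
}

// The schedule listed above: both threads load 0 before either stores.
int lost_update_demo() {
    return replay({{0, 'L'}, {1, 'L'}, {0, 'I'}, {0, 'S'}, {1, 'I'}, {1, 'S'}});
}

// For contrast: thread 0 finishes completely before thread 1 starts.
int serial_demo() {
    return replay({{0, 'L'}, {0, 'I'}, {0, 'S'}, {1, 'L'}, {1, 'I'}, {1, 'S'}});
}
```

lost_update_demo() returns 1 and serial_demo() returns 2, so the only difference between a correct and an incorrect result is the order in which the six steps happen to run.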
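You probably would also want to make count volatile to force loads & stores to go to memory, since a good optimizer would likely keep count in a register unless otherwise told. Note that volatile by itself does not make the increment atomic, so it is not a substitute for the mutex.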
Also, I would suggest that if this is all the work that's going to be done in your threads, performance will dramatically drop from all the mutex locking/unlocking required to keep it consistent. Threads should have much bigger work units to perform.
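One common way to give each thread a bigger work unit is to accumulate into a thread-local variable and take the lock once at the end. A sketch in C++ (std::mutex standing in for the pthread mutex; count_in_batches is a made-up name):

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each thread counts privately and publishes its subtotal once, so the
// mutex is taken nthreads times in total instead of once per increment.
long count_in_batches(int nthreads, int iterations) {
    long count = 0;  // shared total, guarded by count_mutex
    std::mutex count_mutex;
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&count, &count_mutex, iterations] {
            long local = 0;  // private to this thread, no locking needed
            for (int i = 0; i < iterations; ++i) ++local;
            std::lock_guard<std::mutex> guard(count_mutex);
            count += local;  // single synchronized update per thread
        });
    }
    for (auto& w : workers) w.join();
    return count;
}
```

With two threads and 1000 iterations each this still yields 2000, but with only two lock acquisitions.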
Yes, there can be sync problems.
As an example of the possible issues, there is no guarantee that an increment itself is an atomic operation.
In other words, if one thread reads the value for increment then gets swapped out, the other thread could come in and change it, then the first thread will write back the wrong value:
| 0 | Value stored in memory (0).
| 0 | Thread 1 reads value into register (r1 = 0).
| 0 | Thread 2 reads value into register (r2 = 0).
| 1 | Thread 2 increments r2 and writes back.
| 1 | Thread 1 increments r1 and writes back.
So you can see that, even though both threads have tried to increment the value, it's only increased by one.
This is just one of the possible problems. It may also be that the write itself is not atomic and one thread may update only part of the value before being swapped out.
If you have atomic operations that are guaranteed to work in your implementation, you can use them. Otherwise, use mutexes. That's what pthreads provides for synchronisation (and guarantees will work), so it is the safest approach.
I guess since the statement "count = count + 1" may break down into TWO assembly instructions, there is potential for the other thread to be swapped in between these two instructions? Not sure. What do you think?
Don't think like this. You're writing C code and pthreads code. You don't have to ever think about assembly code to know how your code will behave.
The pthreads standard does not define the behavior when one thread accesses an object while another thread is, or might be, modifying it. So unless you're writing platform-specific code, you should assume this code can do anything -- even crash.
The obvious pthreads fix is to use mutexes. If your platform has atomic operations, you can use those.
I strongly urge you not to delve into detailed discussions about how it might fail or what the assembly code might look like. Regardless of what you might or might not think compilers or CPUs might do, the behavior of the code is undefined. And it's too easy to convince yourself you've covered every way you can think of that it might fail and then you miss one and it fails.

When should the Win32 InterlockedExchange function be used?

I came across the function InterlockedExchange and was wondering when I should use this function. In my opinion, setting a 32-bit value on an x86 processor should always be atomic?
In the case where I want to use the function, the new value does not depend on the old value (it is not an increment operation).
Could you provide an example where this method is mandatory? (I'm not looking for InterlockedCompareExchange.)
InterlockedExchange is both a write and a read -- it returns the previous value.
This is necessary to ensure another thread didn't write a different value just after you did. For example, say you're trying to increment a variable. You can read the value, add 1, then set the new value with InterlockedExchange. The value returned by InterlockedExchange must match the value you originally read, otherwise another thread probably incremented it at the same time, and you need to loop around and try again.
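The read, add 1, publish, check-and-retry loop described here is the shape usually associated with compare-and-swap (InterlockedCompareExchange on Win32), where the new value is only written if the variable still holds the value that was read. As a hedged illustration, here is that loop with C++ std::atomic rather than the Win32 API; increment_with_cas and run_cas_increments are made-up names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Increment via a compare-and-swap retry loop: the store only happens if
// the variable still holds the value we originally observed.
int increment_with_cas(std::atomic<int>& value) {
    int observed = value.load();
    while (!value.compare_exchange_weak(observed, observed + 1)) {
        // On failure, observed has been refreshed with the current value;
        // loop around and try again with the new observed + 1.
    }
    return observed + 1;
}

// Hypothetical driver: several threads increment the same variable.
int run_cas_increments(int nthreads, int iterations) {
    std::atomic<int> value{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&value, iterations] {
            for (int i = 0; i < iterations; ++i) increment_with_cas(value);
        });
    }
    for (auto& w : workers) w.join();
    return value.load();
}
```

The difference from a plain exchange is that a failed attempt leaves the variable untouched, so nothing has to be undone before retrying.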
As well as writing the new value, InterlockedExchange also reads and returns the previous value; this whole operation is atomic. This is useful for lock-free algorithms.
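A case where the new value does not depend on the old one, yet the returned old value still matters, is a "first caller wins" flag. The sketch below uses std::atomic<bool>::exchange as a portable stand-in for InterlockedExchange (an assumption for illustration, not the Win32 API itself); count_winners is a made-up name.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Every thread writes the same value (true). Only the thread whose
// exchange returned false was the first to set the flag.
int count_winners(int nthreads) {
    std::atomic<bool> claimed{false};
    std::atomic<int> winners{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&claimed, &winners] {
            // Atomically set the flag and learn what it held before.
            if (!claimed.exchange(true)) {
                winners.fetch_add(1);  // the run-once work would go here
            }
        });
    }
    for (auto& w : workers) w.join();
    return winners.load();
}
```

With a plain write followed by a separate read, two threads could both observe "not yet claimed"; making the write and the read of the previous value one indivisible step is what guarantees exactly one winner.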
(Incidentally, 32-bit writes are not guaranteed to be atomic. Consider the case where the write is unaligned and straddles a cache boundary, for instance.)
In a multi-processor or multi-core machine each core has its own cache, so each core has its own, potentially different, "view" of what the content of the system memory is.
Thread synchronization mechanisms take care of synchronizing between cores. For more information, look at http://blogs.msdn.com/oldnewthing/archive/2008/10/03/8969397.aspx or google for acquire and release semantics.
Setting a 32-bit value is atomic, but only if you're setting a literal.
b = a is 2 operations:
    mov eax,dword ptr [a]
    mov dword ptr [b],eax
Theoretically there could be some interruption between the first and second operation.
Writing a value is never atomic by default. When you write a value to a variable, several machine instructions are generated. With modern, preemptive OSes, the OS might switch to another thread between the individual operations of the write.
This is even more a problem on multi-processor machines, where several threads could be executing at the same time, and trying to write to a single memory location simultaneously.
Interlocked operations avoid this by using specialized instructions to make the write (x86 has dedicated instructions for this kind of situation), which do the read-modify-write in one instruction. These instructions also lock the memory bus of all processors, to ensure that no other executing thread could be writing to the value at the same time.
InterlockedExchange makes sure that the change of a variable and the return of its original value are not interrupted by other threads.
So, if 'i' is an int, these calls (taken individually) do not need InterlockedExchange around 'i':
    a = i;
    i = 9;
    i = a;
    i = a + 9;
    a = i + 9;
    if(0 == i)
None of these statements rely upon BOTH the initial AND final values of 'i'. But these following calls DO need InterlockedExchange around 'i':
    a = i++; //a = InterlockedExchange(&i, i + 1);
Without it, two threads running through this same code might get the same value of 'i' assigned to 'a' or 'a' may unexpectedly skip two or more numbers.
    if(0 == i++) //if(0 == InterlockedExchange(&i, i + 1))
Two threads may both execute the code that is only supposed to happen once.
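For the a = i++ pattern specifically, an atomic fetch-and-add is the closest match, because the argument i + 1 in the commented-out calls is still computed from an ordinary read of i before the exchange takes place. On Win32 that would be InterlockedIncrement or InterlockedExchangeAdd; the sketch below uses std::atomic::fetch_add instead, and tickets_are_unique is a made-up name.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread repeatedly does the atomic equivalent of "a = i++" and records
// the values it received. Returns true if no value was handed out twice and
// none was skipped.
bool tickets_are_unique(int nthreads, int per_thread) {
    std::atomic<int> next{0};
    std::vector<std::vector<int>> taken(nthreads);  // one private list per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&next, &taken, t, per_thread] {
            for (int k = 0; k < per_thread; ++k) {
                taken[t].push_back(next.fetch_add(1));  // returns the old value
            }
        });
    }
    for (auto& w : workers) w.join();

    const int total = nthreads * per_thread;
    std::vector<bool> seen(total, false);
    for (const auto& list : taken) {
        for (int ticket : list) {
            if (ticket < 0 || ticket >= total || seen[ticket]) return false;
            seen[ticket] = true;
        }
    }
    return true;
}
```

Because the read of the old value and the write of old + 1 happen as one step, every thread receives a distinct number.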
Wow, so many conflicting answers. Hard to sift through who's right, who's wrong, and what information is misleading.
I'm unsure of the answer too, given the above half-answers, but I think it works like this. I may be wrong, and it will be interesting to find out if I am:
32-bit reads and writes ARE atomic, but depending on your code, that may not mean much.
Don't worry about non-aligned reads/writes. ALL 32-bit writes to a 32-bit variable have to be aligned or the machine page-faults.
Don't worry about a write wrapping around the end of a cached page; that can't happen.
If you need to write-then-read on one thread, and you're writing on another thread, then you need to use InterlockedExchange. If you're simply reading the value on one thread, and writing it on another, then you don't need to use it, but those values may be wiggly because of multithreading.
