Why can a race condition occur when filling an array in parallel?

There is a function in the Julia language that fills an array with random values in parallel and calculates its sum:
function thread_test(v)
    Threads.@threads for i = 1:length(v)
        @inbounds v[i] = rand()
    end
    sum(v)
end
@inbounds is a macro that disables bounds checking on array indexing; it is safe here because the index always lies within the array's bounds.
Why might a race condition occur when executing this code?

rand is generally not thread-safe in most languages, including some versions of Julia. This means calling rand() from multiple threads can cause undefined behaviour (in practice, the seed is typically written by different threads at the same time, degrading both performance and the quality of the random numbers). The Julia documentation explicitly states:
In a multi-threaded program, you should generally use different RNG objects from different threads or tasks in order to be thread-safe. However, the default RNG is thread-safe as of Julia 1.3 (using a per-thread RNG up to version 1.6, and per-task thereafter).
Besides this, the code is fine.
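On Julia older than 1.3, where the default RNG was not thread-safe, a common workaround was to give each thread its own RNG object. A minimal sketch of that pattern (the function name thread_test_rngs is mine; note that on Julia 1.7 and later tasks can migrate between threads, so indexing by threadid() has caveats of its own):
using Random, Base.Threads

function thread_test_rngs(v)
    # one independently seeded RNG per thread, so no shared RNG state
    rngs = [MersenneTwister(i) for i in 1:nthreads()]
    @threads for i in eachindex(v)
        @inbounds v[i] = rand(rngs[threadid()])
    end
    sum(v)
end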

Because multiple threads are accessing the same variable (v) at the same time, which can lead to unexpected results.

Related

Race condition and atomic operations in Julia and other languages

I have a several questions about atomic operations and multithreading.
There is a function for which a race condition occurs (julia lang):
function counter(n)
    counter = 0
    for i in 1:n
        counter += i
    end
    return counter
end
If atomic operations are used to change the global variable "counter", would that help get rid of the race condition?
Does the cache coherence protocol have any real effect on performance? Virtual machines like the JVM can use their own architectures to support parallel computing.
Do atomic arithmetic and similar operations require more or less resources than ordinary arithmetic?
It's difficult for me now. Hope for your help.
I don't quite understand your example: the variable counter is local to the function, so there is no race condition in your example as written.
Anyway, yes, atomic operations will ensure that race conditions do not occur. There are 2 or 3 ways to do that.
1. Your counter can be an Atomic{Int}:
using .Threads
const counter = Atomic{Int}(0)
...
function updatecounter(i)
    atomic_add!(counter, i)
end
This is described in the manual: https://docs.julialang.org/en/v1/manual/multi-threading/#Atomic-Operations
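As a complete, runnable illustration of this first approach (the function name atomic_counter is mine):
using Base.Threads

function atomic_counter(n)
    counter = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(counter, i)   # indivisible read-modify-write: no race
    end
    return counter[]              # read the final value back out
end

atomic_counter(1000) == sum(1:1000)   # true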
2. You can use a field in a struct declared as @atomic:
mutable struct Counter
    @atomic c::Int
end
const counter = Counter(0)
...
function updatecounter(i)
    @atomic counter.c += i
end
This is described here: https://docs.julialang.org/en/v1/base/multi-threading/#Atomic-operations
It seems the details of the semantics haven't been written yet, but it's the same as in C++.
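A complete sketch of this second approach, assuming Julia 1.7 or later (the names Counter2 and field_counter are mine); note that a field declared @atomic must also be read with @atomic:
mutable struct Counter2
    @atomic c::Int
end

function field_counter(n)
    ctr = Counter2(0)
    Threads.@threads for i in 1:n
        @atomic ctr.c += i   # atomic read-modify-write on the field
    end
    return @atomic ctr.c     # atomic fields must be read atomically too
end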
3. You can use a lock:
counter = 0
countlock = ReentrantLock()
...
function updatecounter(i)
    @lock countlock global counter += i
end
1. and 2. are more or less the same. The lock approach is slower, but it can be used if several operations must be done serially. No matter how you do it, there will be a performance penalty relative to non-atomic arithmetic. The atomic primitives in 1. and 2. must issue a memory fence to ensure the correct ordering, so cache coherence will matter, depending on the hardware.
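A self-contained version of the lock variant (locked_counter is my name for it), keeping the counter local and protecting each update with the lock:
function locked_counter(n)
    counter = 0
    countlock = ReentrantLock()
    Threads.@threads for i in 1:n
        lock(countlock) do
            counter += i   # the lock serializes the read-modify-write
        end
    end
    return counter
end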

Julia: Macro threads and parallel

As we know, Julia supports parallelism, and this is something rooted in the language, which is very good.
I recently saw that Julia supports threads, but it seems to be experimental. I noticed that when using the Threads.@threads macro there is no need for SharedArrays, which is perhaps a computational advantage since no copies of the objects are made. I also saw that there is the advantage of not having to declare all functions with @everywhere.
Can anyone tell me the advantage of using the @parallel macro instead of the @threads macro?
Below are two simple examples of using non-synchronized macros for parallelism.
Using the @threads macro
addprocs(Sys.CPU_CORES)

function f1(b)
    b + 1
end

function f2(c)
    f1(c)
end

result = Vector(10)
@time Threads.@threads for i = 1:10
    result[i] = f2(i)
end
0.015273 seconds (6.42 k allocations: 340.874 KiB)
Using the @parallel macro
addprocs(Sys.CPU_CORES)

@everywhere function f1(b)
    b + 1
end

@everywhere function f2(c)
    f1(c)
end

result = SharedArray{Float64}(10)
@time @parallel for i = 1:10
    result[i] = f2(i)
end
0.060588 seconds (68.66 k allocations: 3.625 MiB)
It seems to me that for Monte Carlo simulations, where the loop iterations are mathematically independent and a lot of computational performance is needed, the @threads macro is more convenient. What do you think are the advantages and disadvantages of using each of the macros?
Best regards.
Here is my experience:
Threads
Pros:
shared memory
low cost of spawning Julia with many threads
Cons:
constrained to a single machine
number of threads must be specified at Julia start
possible problems with false sharing (https://en.wikipedia.org/wiki/False_sharing)
often you have to use locking or atomic operations for the program to work correctly; in particular, many functions in Julia are not thread-safe, so you have to be careful using them
not guaranteed to stay in the current form past Julia 1.0
Processes
Pros:
better scaling (you can spawn them e.g. on a cluster of multiple machines)
you can add processes while Julia is running
Cons:
low efficiency when you have to pass a lot of data between processes
slower to start
you have to explicitly share code and data to/between workers
Summary
Processes are much easier to work with and scale better. In most situations they give you enough performance. If you have large data transfers between parallel jobs, threads will be better, but they are much more delicate to use and tune correctly.
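For reference, in Julia 1.0 and later the @parallel macro was replaced by @distributed in the Distributed standard library. A minimal sketch of both styles in current Julia (the worker count of 4 is arbitrary):
using Distributed
addprocs(4)

@everywhere f2(c) = c + 1
@everywhere using SharedArrays

result = SharedArray{Float64}(10)
@sync @distributed for i = 1:10
    result[i] = f2(i)
end

# Threaded equivalent (start Julia with e.g. `julia -t 4`):
result2 = Vector{Float64}(undef, 10)
Threads.@threads for i = 1:10
    result2[i] = f2(i)
end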

Is reading/writing to different elements of a module array thread-safe?

As long as a program does not allow simultaneous writes to the same elements of a shared data structure that is stored in a module, is it thread-safe? I know this is a noob question, but couldn't find it explicitly addressed anywhere. Here's the situation:
At the beginning of a program, data is initialized and stored in a module-level allocatable array (FIELDVARS) which then becomes accessible to any subroutine where the module is referenced by a USE statement.
Suppose now that the program enters a multi-threaded and/or multi-core computational phase, and FIELDVARS is accessed for "read/write" operations during repeated multiple simultaneous calls to subroutine (COMPUTE).
Once the computational phase is complete, the program returns to a single-threaded phase and FIELDVARS must be used in a subsequent subroutine (POST). However, FIELDVARS cannot be added to the input args of COMPUTE or POST because these are called from a closed-source main program. Therefore the module-level array is used to pass the additional data between subroutines.
Assume that FIELDVARS and COMPUTE have been designed so that each call to COMPUTE will always give access to a set of unique elements of FIELDVARS, which are guaranteed to be different than for any other call, so that simultaneous "write" operations on the same elements will never occur. For example:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... ] <-- FIELDVARS
^---call 1---^ ^---call 2---^ ... <-- Each call to COMPUTE is guaranteed to access a specific set of elements of FIELDVARS.
Question: Is this scenario considered "thread-safe", "conditionally safe", or "not thread-safe"? If it is not safe, what is the specific danger and what would you suggest to handle it?
Other relevant details:
The main program controls threading.
The source code of the main program is not available and cannot be changed.
The main program controls how/when COMPUTE, POST, and other subroutines are called, as well as what args can be passed in. This is why the module-level array is used to pass data between different subroutines rather than as an arg.
! DEMO MODULE W/ ALLOCATABLE INTEGER ARRAY
module DATA_MODULE
    integer, dimension(:), allocatable :: FIELDVARS !<-- allocated/populated elsewhere, prior to calling COMPUTE
end module DATA_MODULE

! DEMO COMPUTE SUBROUTINE (THREADED PHASE W/ MULTIPLE SIMULTANEOUS CALLS)
subroutine COMPUTE(x, y, x_idx, y_idx, flag)
    use DATA_MODULE
    logical :: flag
    integer :: x, y, x_idx, y_idx !<-- different for every call to COMPUTE
    if (flag .eqv. .false.) then !<-- read data only
        ...
        x = FIELDVARS(x_idx)
        y = FIELDVARS(y_idx)
        ...
    else if (flag .eqv. .true.) then !<-- write data
        ...
        FIELDVARS(x_idx) = 0
        FIELDVARS(y_idx) = 0
        ...
    endif
end subroutine COMPUTE
It is fine and many programs depend on that fact. In OpenMP you often loop over arrays and different threads may easily work with elements which are close to each other in memory, especially on the boundaries of the blocks assigned to each thread.
On modern CPUs this is a non-issue. See also https://en.wikipedia.org/wiki/Cache_coherence
What is a real problem is false sharing: two or more threads working with elements of memory belonging to the same cache line will be competing for a shared resource, and it may be very slow.
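To make false sharing concrete, here is a sketch in Julia (the language used earlier in this thread; function names are mine). Both functions are correct, but in the first the per-thread accumulators are packed into adjacent memory and so share cache lines, while the second pads them one cache line apart:
function sums_adjacent(x)
    nt = Threads.nthreads()
    acc = zeros(nt)                    # slots packed together: false sharing
    Threads.@threads for t in 1:nt
        for i in t:nt:length(x)
            acc[t] += x[i]
        end
    end
    return sum(acc)
end

function sums_padded(x)
    nt = Threads.nthreads()
    pad = 8                            # 8 Float64 = 64 bytes, one cache line
    acc = zeros(nt * pad)              # each slot on its own cache line
    Threads.@threads for t in 1:nt
        for i in t:nt:length(x)
            acc[(t - 1) * pad + 1] += x[i]
        end
    end
    return sum(acc)
end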

Parallel computation

All,
I would like to use ILNumerics for computations to be run in parallel. They are completely uncoupled. I would need it for:
1) random restarts for an optimiser (especially a stochastic optimiser, e.g. simulated annealing): solving the same optimisation problem in parallel, starting from different points:
e.g.: argmin_x f(x) starting from x0_h, h = 1, 2, ..., K
2) the same optimisation run over sets of uncoupled data; as an example, consider the following unconstrained optimisation problem:
given a function f : R^d x R^p --> R of x \in R^d and parameters p \in R^p,
solve argmin_x f(x, p_h), h = 1, 2, ..., K.
I hope the notation is clear enough.
Would it be possible to run this loop in parallel, executing each time some lambda expression involving ILNumerics objects, leveraging multicore architectures?
Thanks in advance, as usual,
GL
It depends: ILNumerics automatically parallelizes mathematical expressions like
C = A + B[":;2"] / 0.4 * pinv(C) ...
By attempting to run multiple instances of such expressions in parallel, using multiple threads from the thread pool, you would end up producing a lot of contention, with too many threads competing for the CPU time slots. As a result, your algorithm may run slower than without parallelizing it.
So, in that case you may disable the internal automatic parallelization ILNumerics does transparently for you:
Settings.MaxNumberThreads = 1;
Expressions like the one above will get evaluated within a single thread afterwards. However, now you are responsible for distributing computational tasks over multiple threads. Moreover, you will have to lock your arrays accordingly - because ILNumerics is not thread-safe in general! This allows you to write concurrently to your output arrays, but it also brings the burden of having to implement a correct locking scheme...

Is it ok to have multiple threads writing the same values to the same variables?

I understand race conditions and how, with multiple threads accessing the same variable, updates made by one can be ignored and overwritten by others. But what if each thread is writing the same value (not different values) to the same variable; can even this cause problems? Could this code:
GlobalVar.property = 11;
(assuming that property will never be assigned anything other than 11), cause problems if multiple threads execute it at the same time?
The problem comes when you read that state back, and do something about it. Writing is a red herring - it is true that as long as this is a single word most environments guarantee the write will be atomic, but that doesn't mean that a larger piece of code that includes this fragment is thread-safe. Firstly, presumably your global variable contained a different value to begin with - otherwise if you know it's always the same, why is it a variable? Second, presumably you eventually read this value back again?
The issue is that presumably, you are writing to this bit of shared state for a reason - to signal that something has occurred? This is where it falls down: when you have no locking constructs, there is no implied order of memory accesses at all. It's hard to point to what's wrong here because your example doesn't actually contain the use of the variable, so here's a trivialish example in neutral C-like syntax:
int x = 0, y = 0;

// thread A does:
x = 1;
y = 2;
if (y == 2)
    print(x);

// thread B does, at the same time:
if (y == 2)
    print(x);
Thread A will always print 1, but it's completely valid for thread B to print 0. The order of operations in thread A is only required to be observable from code executing in thread A - thread B is allowed to see any combination of the state. The writes to x and y may not actually happen in order.
This can happen even on single-processor systems, where most people do not expect this kind of reordering - your compiler may reorder it for you. On SMP even if the compiler doesn't reorder things, the memory writes may be reordered between the caches of the separate processors.
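(To tie this back to the language used earlier in this thread: below is a sketch in Julia 1.7+ of how explicit release/acquire ordering rules out that reordering; the struct and names are mine, not from the original answer.)
mutable struct Shared
    x::Int          # plain data
    @atomic y::Int  # flag with atomic ordering semantics
end
const s = Shared(0, 0)

# Thread A:
s.x = 1
@atomic :release s.y = 2    # writes before this store become visible...

# Thread B:
if (@atomic :acquire s.y) == 2   # ...to any thread whose acquire-load sees y == 2
    print(s.x)                   # now guaranteed to print 1
end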
If that doesn't seem to answer it for you, include more detail of your example in the question. Without the use of the variable it's impossible to definitively say whether such a usage is safe or not.
It depends on the work actually done by that statement. There can still be some cases where Something Bad happens - for example, if a C++ class has overloaded the = operator, and does anything nontrivial within that statement.
I have accidentally written code that did something like this with POD types (builtin primitive types), and it worked fine -- however, it's definitely not good practice, and I'm not confident that it's dependable.
Why not just lock the memory around this variable when you use it? In fact, if you somehow "know" this is the only write statement that can occur at some point in your code, why not just use the value 11 directly, instead of writing it to a shared variable?
(edit: I guess it's better to use a constant name instead of the magic number 11 directly in the code, btw.)
If you're using this to figure out when at least one thread has reached this statement, you could use a semaphore that starts at 1, and is decremented by the first thread that hits it.
I would expect the result to be undetermined, in that it would vary from compiler to compiler, language to language, OS to OS, etc. So no, it is not safe.
Why would you want to do this, though? Obtaining a mutex lock is only one or two lines of code in most languages, and it would remove any possibility of a problem. If that is going to be too expensive, then you need to find an alternative way of solving the problem.
In general, this is not considered a safe thing to do unless your system provides atomic operations (operations that are guaranteed to be executed in a single cycle).
The reason is that while the "C" statement looks simple, often there are a number of underlying assembly operations taking place.
Depending on your OS, there are a few things you could do:
Take a mutual exclusion semaphore (mutex) to protect access
In some OSes, you can temporarily disable preemption, which guarantees your thread will not be swapped out.
Some OSes provide reader/writer semaphores, which are more performant than a plain old mutex.
Here's my take on the question.
You have two or more threads running that write to a variable - like a status flag or something - where you only want to know if one or more of them set it to true. Then in another part of the code (after the threads complete) you want to check whether at least one thread set that status... for example:
bool flag = false
threadContainer tc
threadInputs inputs

check(input)
{
    ...do stuff to input
    if(success)
        flag = true
}

start multiple threads
foreach(i in inputs)
    t = startthread(check, i)
    tc.add(t) // Keep track of all the threads started

foreach(t in tc)
    t.join( ) // Wait until each thread is done

if(flag)
    print "One of the threads was successful"
else
    print "None of the threads were successful"
I believe the above code would be OK, assuming you're fine with not knowing which thread set the status to true, and you can wait for all the multi-threaded stuff to finish before reading that flag. I could be wrong though.
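A sketch of the same pattern in Julia (the language from earlier in this thread; any_success and check are my names): the implicit barrier at the end of @threads plays the role of the joins, and an Atomic{Bool} sidesteps any visibility question:
function any_success(check, inputs)
    flag = Threads.Atomic{Bool}(false)
    Threads.@threads for i in eachindex(inputs)
        if check(inputs[i])
            flag[] = true            # every writer stores the same value
        end
    end                              # implicit join of all threads here
    return flag[]                    # safe: read after all writers finished
end

any_success(iseven, [1, 3, 5, 8])    # true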
If the operation is atomic, you should be able to get by just fine. But I wouldn't do that in practice. It is better just to acquire a lock on the object and write the value.
Assuming that property will never be assigned anything other than 11, I don't see a reason for the assignment in the first place. Just make it a constant.
Assignment only makes sense when you intend to change the value, unless the act of assignment itself has other side effects - like volatile writes have memory-visibility side effects in Java. And if you change state shared between multiple threads, then you need to synchronize or otherwise "handle" the problem of concurrency.
When you assign a value, without proper synchronization, to some state shared between multiple threads, there are no guarantees about when the other threads will see that change. And no visibility guarantees means it is possible that the other threads will never see the assignment.
Compilers, JITs, CPU caches: they are all trying to make your code run as fast as possible, and if you don't make any explicit requirements for memory visibility, they will take advantage of that. If not on your machine, then on somebody else's.
