Julia threadsafe loop parallelism for matrix construction - multithreading

My Julia for loop is of the "embarrassingly parallel" form
M = Array{Float64}(undef, 200, 100, 100)
for i in 1:200
    M[i, :, :] = construct_row(i)
end
where construct_row(i) is some function returning a 100x100 matrix.
I would like to use the 64 cores available to me (in fact 272 hardware threads because of hyperthreading) to parallelize this loop by having each value of i run on its own thread. So I preface the for loop with Threads.@threads.
As far as I understand, this is an obviously thread-safe situation; no synchronization is necessary. However, after reading https://discourse.julialang.org/t/poor-performance-while-multithreading-julia-1-0/20325/9, I am concerned about the comment by foobar_lv2:
The most deadly pattern would be e.g. a 4xN matrix, where thread k reads and writes into M[k, :]: Obviously threadsafe for humans, very non-obviously threadsafe for your poor CPU that will run in circles.
So: Will multithreading work in the way I described for my for loop, or am I missing some major issue here?

Julia arrays are column-major.
Indeed, you will get the best performance when each thread mutates adjacent memory cells.
Hence the best performance will be obtained via:
@inbounds for i in 1:100
    (@view M[:, :, i]) .= construct_row(i)
end
To illustrate, let's test on a Julia session running 4 threads (@btime is from BenchmarkTools):
julia> const m = Array{Float64}(undef, 100, 100, 100);

julia> @btime Threads.@threads for i in 1:100
           (@view m[:, :, i]) .= reshape(1.0:10_000.0, 100, 100)
       end
  572.500 μs (19 allocations: 2.39 KiB)

julia> @btime Threads.@threads for i in 1:100
           (@view m[i, :, :]) .= reshape(1.0:10_000.0, 100, 100)
       end
  1.051 ms (21 allocations: 2.45 KiB)
You can see that mutating adjacent memory cells yields roughly a 2x performance improvement.
With such a huge number of threads you should also consider multiprocessing (the Distributed standard library) together with SharedArrays. My experience is that above 16 or 32 threads, multiprocessing can yield better performance than multithreading. This is, however, case-specific and needs appropriate benchmarks.
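For reference, here is a minimal sketch of that multiprocessing route, assuming construct_row can be made available on every worker; the addprocs count and the placeholder body of construct_row are illustrative, not from the question:
using Distributed
addprocs(4)                          # worker count is illustrative; pick one for your machine

@everywhere using SharedArrays
@everywhere construct_row(i) = fill(Float64(i), 100, 100)   # placeholder for the real function

# Put the loop index last so each worker writes a contiguous, column-major slice.
M = SharedArray{Float64}(100, 100, 200)

@sync @distributed for i in 1:200
    M[:, :, i] = construct_row(i)
end
Whether this beats Threads.@threads depends on how expensive construct_row is relative to the inter-process overhead, so benchmark both.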


Why can a race condition occur when filling an array in parallel?

There is a function in the Julia language that fills an array with random values in parallel and calculates its sum:
function thread_test(v)
    Threads.@threads for i = 1:length(v)
        @inbounds v[i] = rand()
    end
    sum(v)
end
@inbounds is a macro that disables bounds checking for the array accesses, since in this case the index will always lie within the array's bounds.
Why might a race condition occur when executing this code?
rand is generally not thread-safe in many languages, including some versions of Julia. This means calling rand() from multiple threads can cause undefined behaviour (in practice, the RNG state is typically written by different threads at the same time, decreasing both performance and the quality of the generated random numbers). The Julia documentation explicitly states:
In a multi-threaded program, you should generally use different RNG objects from different threads or tasks in order to be thread-safe. However, the default RNG is thread-safe as of Julia 1.3 (using a per-thread RNG up to version 1.6, and per-task thereafter).
Besides this, the code is fine.
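If you do want explicit control, a common pattern is to give each task its own RNG object, as the quoted documentation suggests. A minimal sketch (the chunking and function name are illustrative, not from the original code):
using Random

# One RNG object per task: split the indices into chunks and give each chunk
# its own task and its own MersenneTwister.
function thread_test_rngs(v)
    chunks = Iterators.partition(eachindex(v), cld(length(v), Threads.nthreads()))
    tasks = map(collect(chunks)) do idxs
        Threads.@spawn begin
            rng = MersenneTwister()      # private to this task
            for i in idxs
                @inbounds v[i] = rand(rng)
            end
        end
    end
    foreach(wait, tasks)
    sum(v)
end

thread_test_rngs(zeros(1_000_000))
On recent Julia versions plain rand() inside Threads.@threads is already safe, as the quoted documentation says, so the explicit-RNG version is mainly useful on older releases or when you want to control seeding.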
Because multiple threads are accessing the same variable (v) at the same time, which can lead to unexpected results.

Safe multithreading Julia

I want to download data using 3 API codes, and I would like to multi-thread the APIs (one thread per API). Would something like this work and be safe from data races?
# DataKeys: a DataFrame object with keys to search for
len_data = size(DataKeys)[1]
array_split = [1, floor(Int, len_data/3), floor(Int, 2*len_data/3), floor(Int, len_data)]
api_keys = [api1, api2, api3]
data_slices = [DataFrame(), DataFrame(), DataFrame()]
Threads.@threads for k in 1:3
    key = api_keys[k]
    for i in array_split[k]:array_split[k+1]
        loc = DataKeys[i, :index]
        r = HTTP.get(url(loc), headers)
        json_r = JSON3.read(String(r.body))
        temp = DataFrame(json_r[:data])
        global data_slices[k] = vcat(data_slices[k], temp)
    end
end
On the one hand, I feel I should be safe since every thread works on a different element of data_slices, but on the other hand they're all part of the same vector.
Update: added more details on the function within the threaded for loop.
Things do seem to work fine on this simple example:
using DataFrames
len = 120
array_split = [1, floor(Int, len/3), floor(Int, 2*len/3), floor(Int, len)]
data_slices = [DataFrame(), DataFrame(), DataFrame()]
Threads.@threads for k in 1:3
    for loc in array_split[k]:array_split[k+1]
        temp = DataFrame(k=k, squared=loc^2, half=loc/2, thrd=Threads.threadid())
        global data_slices[k] = vcat(data_slices[k], temp)
    end
end
data_fin = vcat(data_slices...)
although the order in which the threads ran is [1, 3, 2], which is a bit odd. Also odd: for this simple example, I ran both a threaded and a non-threaded loop, and the non-threaded one was faster (len = 120000):
Threaded: 14.789920 seconds (20.25 M allocations: 72.492 GiB, 28.02% gc time, 3.78% compilation time)
Non-threaded: 9.614164 seconds (19.14 M allocations: 72.434 GiB, 11.00% gc time)
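One way to make the "each thread works on its own slice" intent explicit is to let each task build and return its own DataFrame, combining them only afterwards. A minimal sketch of that pattern (the bounds/ranges splitting below is illustrative, not taken from the question):
using DataFrames

len = 120
bounds = round.(Int, range(0, len; length=4))          # [0, 40, 80, 120]
ranges = [bounds[k]+1:bounds[k+1] for k in 1:3]        # non-overlapping index ranges

tasks = map(1:3) do k
    Threads.@spawn begin
        rows = DataFrame[]                             # owned by this task only
        for loc in ranges[k]
            push!(rows, DataFrame(k=k, squared=loc^2, half=loc/2, thrd=Threads.threadid()))
        end
        reduce(vcat, rows)                             # one DataFrame per task
    end
end

data_fin = reduce(vcat, fetch.(tasks))                 # combine after all tasks finish
Because each task returns its own DataFrame and nothing is shared, no locking (and no global) is needed; the combination step only runs after every task has finished.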

Julia: Macro threads and parallel

As we know, Julia supports parallelism, and this being rooted in the language is very good.
I recently saw that Julia supports threads, but it seems to me to be experimental. I noticed that when using the Threads.@threads macro there is no need for SharedArrays, which is perhaps a computational advantage since no copies of the objects are made. I also saw that there is the advantage of not having to declare all functions with @everywhere.
Can anyone tell me the advantage of using the @parallel macro instead of the @threads macro?
Below are two simple examples of using these macros for parallelism without synchronization.
Using the @threads macro
addprocs(Sys.CPU_CORES)
function f1(b)
    b + 1
end
function f2(c)
    f1(c)
end
result = Vector(10)
@time Threads.@threads for i = 1:10
    result[i] = f2(i)
end
0.015273 seconds (6.42 k allocations: 340.874 KiB)
Using the @parallel macro
addprocs(Sys.CPU_CORES)
@everywhere function f1(b)
    b + 1
end
@everywhere function f2(c)
    f1(c)
end
result = SharedArray{Float64}(10)
@time @parallel for i = 1:10
    result[i] = f2(i)
end
0.060588 seconds (68.66 k allocations: 3.625 MiB)
It seems to me that for Monte Carlo simulations, where the loop iterations are mathematically independent and a lot of computational performance is needed, using the @threads macro is more convenient. What do you think are the advantages and disadvantages of using each of the macros?
Best regards.
Here is my experience:
Threads
Pros:
shared memory
low cost of spawning Julia with many threads
Cons:
constrained to a single machine
number of threads must be specified at Julia start
possible problems with false sharing (https://en.wikipedia.org/wiki/False_sharing)
often you have to use locking or atomic operations for the program to work correctly; in particular, many functions in Julia are not threadsafe, so you have to be careful using them (see the sketch at the end of this answer)
not guaranteed to stay in the current form past Julia 1.0
Processes
Pros:
better scaling (you can spawn them e.g. on a cluster of multiple machines)
you can add processes while Julia is running
Cons:
low efficiency when you have to pass a lot of data between processes
slower to start
you have to explicitly share code and data to/between workers
Summary
Processes are much easier to work with and scale better. In most situations they give you enough performance. If you have large data transfers between parallel jobs, threads will be better, but they are much more delicate to use and tune correctly.
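To make the locking/atomics point above concrete, here is a minimal sketch in current Julia syntax (the counter and the Dict are illustrative, not from the question): an atomic counter needs no lock, while a structure like a Dict must be protected by one.
using Base.Threads

counter = Atomic{Int}(0)
seen = Dict{Int,Int}()
dict_lock = ReentrantLock()

Threads.@threads for i in 1:1_000
    atomic_add!(counter, 1)          # atomic update: safe without a lock
    lock(dict_lock) do
        seen[i % 10] = i             # Dict mutation must be serialized with a lock
    end
end

@assert counter[] == 1_000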

Speed-up from multi-threading

I have a highly parallelizable problem. Hundreds of separate problems need to be solved by the same function. The problems each take an average of perhaps 120 ms (0.12 s) on a single core, but there is substantial variation, and some extreme and rare ones may take 10 times as long. Each problem needs memory, but this is allocated ahead of time. The problems do not need disk I/O, and they do not pass back and forth any variables once they are running. They do access different parts (array elements) of the same global struct, though.
I have C++ code, based on someone else's code, that works. (The global array of structs is not shown.) It runs 20 problems (for instance) and then returns. I think 20 is enough to even out the variability on 4 cores. I see the execution time flattening out from about 10 already.
There is a Win32 and an OpenMP version, and they behave almost identically in terms of execution time. I run the program on a 4-core Windows system. I include some OpenMP code below since it is shorter. (I changed names etc. to make it more generic and I may have made mistakes -- it won't compile stand-alone.)
The speed-up over the single-threaded version flattens out at about a factor of 2.3. So if it takes 230 seconds single-threaded, it takes 100 s multi-threaded. I am surprised that the speed-up is not a lot closer to 4, the number of cores.
Am I right to be disappointed?
Is there anything I can do to get closer to my theoretical expectation?
int split_bigtask(Inputs * inputs, Outputs * results)
{
    for (int k = 0; k < MAXNO; k++)
        results->solved[k].value = 0;

    #pragma omp parallel shared(inputs, results)
    {
        #pragma omp for schedule(dynamic)
        for (int k = 0; k < inputs->no; k++)
        {
            /* res is declared inside the loop so each thread gets its own copy */
            int res = bigtask(inputs->values[k],
                              results->solved[k],
                              omp_get_thread_num());
        }
    }
    return TRUE;
}
I assume that there is no synchronization done within bigtask() (obvious, but I'd still check it first).
It's possible that you run into a "dirty cache" problem: if you manipulate data that is close together (e.g. in the same cache line!) from multiple cores, each manipulation will mark the cache line as dirty (which means the processor needs to signal this to all other processors, which in turn involves synchronization again...).
It's also possible that you create too many threads (spawning a thread carries quite some overhead, so creating one thread per core is a lot more efficient than creating several threads per core).
I personally would assume that you are in the dirty-cache case (the "Big Global Array").
Solution to the problem (if it's indeed the dirty-cache case):
Write the results to a local array which is merged into the "Big Global Array" by the main thread after the end of the work (sketched below)
Split the global array into several smaller arrays (and give each thread one of these arrays)
Ensure that the records within the structure align on cache-line boundaries (this is a bit of a hack since cache-line sizes may change on future processors)
You may want to try to create a local copy of the array for each thread (at least for the results)
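The first suggestion, writing to thread-local buffers that the main thread merges afterwards, can be sketched in Julia (the language used for most of this page) roughly as follows; solve_all, the chunking, and the squaring stand-in for bigtask() are all illustrative:
# Each task fills a buffer it owns; the main task merges the buffers at the end,
# so threads never write into neighbouring elements of one shared array.
function solve_all(inputs::Vector{Float64})
    chunks = collect(Iterators.partition(eachindex(inputs),
                                         cld(length(inputs), Threads.nthreads())))
    buffers = [zeros(length(c)) for c in chunks]       # one private buffer per chunk
    tasks = map(1:length(chunks)) do t
        Threads.@spawn begin
            for (j, i) in enumerate(chunks[t])
                buffers[t][j] = inputs[i]^2            # stand-in for the real bigtask()
            end
        end
    end
    foreach(wait, tasks)
    reduce(vcat, buffers)                              # merge on the main thread
end
Because each buffer is a separate allocation, different threads are no longer writing into adjacent elements of a single shared array.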

overriding default Parallel Collections behavior in scala

I have a large batched parallel computation that I use a parallel map for in Scala. I have noticed that there appears to be a gradual downstepping of CPU usage as the workers finish. It all comes down to a call inside of the Map object:
scala.collection.parallel.thresholdFromSize(length, tasksupport.parallelismLevel)
Looking at the code, I see this:
def thresholdFromSize(sz: Int, parallelismLevel: Int) = {
  val p = parallelismLevel
  if (p > 1) 1 + sz / (8 * p)
  else sz
}
My calculation works great on a large number of cores, and now I understand why.
thresholdFromSize(1000000, 24) = 5209
thresholdFromSize(1000000, 4) = 31251
If I have an array of length 1000000 on 24 CPUs, it will partition all the way down to 5209 elements. If I pass that same array into the parallel collections on my 4-CPU machine, it will stop partitioning at 31251 elements.
It should be noted that the runtime of my calculations is not uniform. Runtime per unit can be as much as 0.1 seconds. At 31251 items, that's 3100 seconds, or 52 minutes of time where the other workers could be stepping in and grabbing work, but are not. I have observed exactly this behavior while monitoring CPU utilization during the parallel computation. Obviously I'd love to run on a large machine, but that's not always possible.
My question is this: Is there any way to influence the parallel collections to give it a smaller threshold number that is more suited to my problem? The only thing I can think of is to make my own implementation of the class 'Map', but that seems like a very non-elegant solution.
You want to read up on Configuring Scala parallel collections. In particular, you probably need to provide a custom TaskSupport.
I think all you need to do is something like this:
yourCollection.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(24))
The parallelism parameter defaults to the number of CPU cores that you have, but you can override it like above. This is shown in the source for ParIterableLike as well.
0.1 seconds is long enough to handle each unit separately. Wrap the processing of each unit (or of 10 units) in a separate Runnable and submit all of them to a FixedThreadPool. Another approach is to use a ForkJoinPool; then it is easier to control the end of all the computations.
