I have a loop that I want to run multi-threaded. It depends on variables (arrays) that were preallocated and must be private to each thread. How do I do this preallocation in each thread? Alternatively, is there a macro to run a task in every thread?
The only solution I could think of relies on metaprogramming, which adds overhead when converting existing code (I have many arrays to preallocate). Here is what I have that works:
Threads.@threads for t = 1:Threads.nthreads()
# pre-alloc arrays for each thread
eval(Meta.parse("zl$(t) = Array{Float64}(undef, ($(ns),$(ns)))"))
end
Threads.@threads for i = 1:N
t = Threads.threadid()
zl = eval(Meta.parse("zl$(t)"))
# do things...
end
I was hoping for a solution similar to what you would write with OpenMP in C:
#pragma omp parallel
{
double* zl = malloc(ns * ns * sizeof(double));
#pragma omp for
for (size_t i = 0; i < N; i++) {
// do things...
}
}
The rule in Julia is simple: if you do not know how to do something, metaprogramming is never a good approach :-)
The pattern you need is to create a Vector of matrices and provide each matrix to each thread.
ns = 3
zls = [Matrix{Float64}(undef,ns,ns) for t in 1:Threads.nthreads()]
Threads.@threads for i = 1:N
zl = zls[Threads.threadid()]
# do things with zl....
end
If you want to preallocate the memory for zls in parallel, try this (although for all scenarios I can think of, I doubt it is worth doing):
zls = Vector{Matrix{Float64}}(undef, Threads.nthreads())
Threads.@threads for i = 1:Threads.nthreads()
zls[i] = Matrix{Float64}(undef,ns,ns)
end
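For comparison with the OpenMP pattern in the question (allocate inside the parallel region, then loop), here is a minimal Python sketch of the same structure: each worker allocates its scratch array once and then processes its own share of the iterations. It is not Julia code, and `process_rows`, `run`, `N`, `ns`, and `nworkers` are placeholder names chosen for the illustration. CPython threads will not speed up CPU-bound work, so this shows the shape only.

```python
from concurrent.futures import ThreadPoolExecutor


def process_rows(worker_id, nworkers, N, ns):
    # Scratch buffer allocated once per worker and never shared.
    zl = [[0.0] * ns for _ in range(ns)]
    total = 0.0
    # Static split: worker k handles i = k, k + nworkers, k + 2*nworkers, ...
    for i in range(worker_id, N, nworkers):
        zl[0][0] = float(i)  # placeholder for "do things with zl"
        total += zl[0][0]
    return total


def run(N=100, ns=3, nworkers=4):
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        futures = [pool.submit(process_rows, k, nworkers, N, ns)
                   for k in range(nworkers)]
        return sum(f.result() for f in futures)
```

Because the buffer is a local variable of the worker function, no indexing by thread ID is needed to keep it private.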
I've been asked a very interesting question about threads and how to implement them, specifically recursive or iterative. This is in the context of sorting algorithms like quicksort.
When you have an array of elements that need sorting, would you rather implement a tree structure of threads (so recursive) that keep spawning new threads until the sorting size threshold is reached, or would you rather divide the array into even chunks from the very beginning and spawn threads for them?
Example recursive pseudocode:
void sort(int array[], int start, int end){
if(array.size > THRESHOLD){
//partition logic with calls to sort()
}
//sorting logic
}
Example iterative pseudocode:
void sort(int array[], int start, int end){
if(array.size > THRESHOLD){
int numberOfChunks = array.size / 1000;
for(int i = 0; i < numberOfChunks; i++){
//spawn thread for every chunk with calls to sort(), technically also recursion but only once and can be rewritten easily
}
}
//sorting logic
}
Assume the calls to sort are separate threads. I didn't want to clutter the examples with boilerplate. Try and look past the multitude of errors.
What I've been taught in college for quicksort is to use the recursive method. But (and this is my opinion and nothing else) I think recursion has a tendency to make code unreadable and complex. Sure, it looks fancy and works well, but it's harder to read.
What is the recommended way of doing things here?
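The chunked ("iterative") variant from the pseudocode could be sketched as follows. Note that this is a chunk-sort-then-merge scheme, not quicksort: each chunk is sorted independently in a worker, and the sorted chunks are combined with a k-way merge. `chunked_sort`, `chunk_size`, and `max_workers` are names made up for the sketch, and Python threads are used only to show the structure (they give no speedup for CPU-bound sorting in CPython).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked_sort(data, chunk_size=1000, max_workers=4):
    # Small inputs are not worth splitting.
    if len(data) <= chunk_size:
        return sorted(data)
    # Split once, up front, into fixed-size chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Sort every chunk in a worker; the pool bounds the number of threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # Combine the sorted chunks with a k-way merge.
    return list(heapq.merge(*sorted_chunks))
```

The trade-off the question hints at is visible here: the chunked version has a flat, easy-to-read control flow, but it needs an explicit merge step that the recursive quicksort formulation does not.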
I am wondering about how concurrency can be expressed without an explicit thread object, not the implementation, which probably would use threads or thread pools, but the language design related issues.
Q1: I wonder what would be lost if there was no thread object, what couldn't be done in such a language?
Q2: I also wonder about how this would be expressed, what ways were proposed or implemented as alternatives or complements to threads?
One possibility is the MPI programming model (GPUs work similarly).
Let's say you have the following code:
for(int i=0; i < 100; i++) {
work(i);
}
The "normal" thread-based way would be to split the iteration range into multiple subsets, something like this:
Thread-1:
for(int i=0; i < 50; i++) {
work(i);
}
Thread-2:
for(int i=50; i < 100; i++) {
work(i);
}
In MPI or on a GPU, however, you do something different. The idea is that every core executes the same (GPU) or at least a similar (MPI) program. The difference is that each core has a different ID, which changes the behavior of the code.
MPI-style (not exactly MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int subset = 100 / size;
for (int i = rank * subset; i < (rank + 1) * subset; i++) {
//each core will use a different range for i
work(i);
}
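One detail the snippet above glosses over is what happens when the iteration count is not divisible by the number of cores: with integer division, the last few iterations would be dropped. A minimal sketch of a rank-to-range mapping that hands the remainder to the lowest ranks is shown below; `rank_range` is a hypothetical helper, not part of MPI.

```python
def rank_range(rank, size, n):
    """Return the iterations owned by `rank` when n items are split over `size` ranks."""
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional iteration.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)
```

For 100 iterations on 3 ranks this gives the ranges 0..33, 34..66, and 67..99, so every index is covered exactly once.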
The next big thing is communication. With threads you normally have to handle all the synchronization manually. MPI is message-based, meaning that it is not perfectly suited to classical shared-memory models (where every core has access to the same memory), but in a cluster system (many cores connected by a network) it works excellently. This is not limited to supercomputers (which use basically only MPI-style code): in recent years a new type of core architecture (manycore) has been developed. These chips have a local so-called network-on-chip, so each core can send and receive messages without the usual synchronization problems.
MPI offers not only simple messages but also higher-level constructs that automatically scatter data to, and gather data from, every core.
Example (again, not MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int data[100];
int result;
int results[size];
if (rank == 0) { //master-core only
fill_with_stuff(data);
}
scatter(0, data); //core-0 will send the data-content to all other cores
result = work(rank, data); // every core works on the same data
gather(0,result,results); //get all local results and store them in
//the results-array of core-0
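The shape of this scatter/work/gather pattern can be imitated in Python, with pool workers standing in for ranks. This is only an analogy: real MPI ranks are separate processes that exchange messages, whereas here the data is simply shared. `work` and `scatter_work_gather` are hypothetical names, and the per-rank computation (summing every `size`-th element) is an arbitrary placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def work(rank, size, data):
    # Placeholder computation: each rank sums the elements rank, rank + size, ...
    return sum(data[rank::size])


def scatter_work_gather(data, size=4):
    # "Scatter": every rank is handed the data together with its own rank.
    # "Gather": pool.map returns the per-rank results in rank order.
    with ThreadPoolExecutor(max_workers=size) as pool:
        results = list(pool.map(lambda rank: work(rank, size, data), range(size)))
    return results
```

Note that no thread object appears in the calling code, which is the point of the question: the caller only states what each rank computes and where the results go.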
Another solution is the OpenMP library. Here you declare parallel blocks; the whole thread part is done by the library itself.
Example:
//this will split the for-loop automatically in 4 threads
#pragma omp parallel for num_threads(4)
for(int i=0; i < 100; i++) {
work(i);
}
The big advantage is that it is fast to write; that's it. You may get better performance by writing the threads yourself, but that takes a lot more time and knowledge about synchronization.
I would like to use OpenMP to make my program run faster. Unfortunately, the opposite is the case. My code looks something like this:
const int max_iterations = 10000;
int num_interation = std::numeric_limits<int>::max();
#pragma omp parallel for
for(int i = 0; i < std::min(num_interation, max_iterations); i++)
{
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
num_interation = update_iterations(...);
}
For some reason, many more iterations are processed than required. Without OpenMP, it takes 500 iterations on average. However, even when setting the number of threads to one (omp_set_num_threads(1)), it computes more than one thousand iterations. The same happens if I use multiple threads, and also when using a write lock when updating num_interation.
I would assume that it has something to do with memory bandwidth or a race condition. But those problems should not appear with omp_set_num_threads(1).
Therefore, I assume that it could have something to do with the scheduling and the chunk size. However, I am really not sure about this.
Can somebody give me a hint?
A quick answer for the behaviour you experience is given by the OpenMP standard page 56:
The iteration count for each associated loop is computed before entry
to the outermost loop. If execution of any associated loop changes any
of the values used to compute any of the iteration counts, then the
behavior is unspecified.
In essence, this means that you cannot modify the bounds of your loop once you have entered it. Although according to the standard the behaviour is "unspecified", in your case what happens is quite clear: as soon as you switch OpenMP on in your code, it computes the number of iterations you specified initially.
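The same effect can be reproduced outside OpenMP. In the Python sketch below (an analogy, not OpenMP), a `for` loop over `range(n)` fixes its trip count before the first iteration, just as an OpenMP worksharing loop does, while a `while` loop re-reads the bound on every pass. `count_fixed`, `count_rechecked`, and `update` are made-up names; `update` plays the role of `update_iterations`.

```python
def count_fixed(limit, update):
    n = limit
    done = 0
    for i in range(n):  # range(n) is evaluated once, before the loop starts
        n = min(n, update(i))  # shrinking n has no effect on the trip count
        done += 1
    return done


def count_rechecked(limit, update):
    n = limit
    i = 0
    while i < n:  # the bound is re-read on every iteration
        n = min(n, update(i))
        i += 1
    return i
```

With `update = lambda i: 500 if i >= 10 else 10**9`, `count_fixed(10000, update)` runs all 10000 iterations, while `count_rechecked(10000, update)` stops after 500.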
So you have to take another approach to this problem.
This is one possible solution (among many others), which I hope scales OK. It has the drawback of potentially allowing more iterations to happen than the number you intended (up to OMP_NUM_THREADS-1 more iterations than expected, assuming that //do sth. is balanced, and many more if not). Also, it assumes that update_iterations(...) is thread safe and can be called in parallel without unwanted side effects... This is a very strong assumption, which you'd better enforce!
num_interation = std::min(num_interation, max_iterations);
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( i < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(...);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
}
}
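To make the control flow of this strided loop easier to follow, here is a rough Python transcription using plain threads and a lock in place of the critical section. It is a sketch under the same assumption as above (that `update_iterations` is safe to call concurrently); `run_until_converged`, `state`, and `processed` are names invented for the illustration.

```python
import threading


def run_until_converged(max_iterations, update_iterations, nthreads=4):
    state = {"limit": max_iterations}  # shared bound, can only shrink
    lock = threading.Lock()
    processed = [[] for _ in range(nthreads)]  # one private list per worker

    def worker(t):
        i = t  # worker t handles i = t, t + nthreads, ...
        while True:
            with lock:
                if i >= state["limit"]:
                    return
            new_limit = update_iterations(i)  # "do sth." and get the new bound
            with lock:  # plays the role of "#pragma omp critical"
                state["limit"] = min(state["limit"], new_limit)
            processed[t].append(i)
            i += nthreads

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    done = sorted(i for per_thread in processed for i in per_thread)
    return state["limit"], done
```

Every index below the final bound is guaranteed to be processed, but a worker that read the old bound just before it shrank may still process a few indices beyond it, which is the overshoot described above.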
A more synchronised solution, if the //do sth. isn't so balanced and not doing too many extra iterations is important, could be:
num_interation = std::min(num_interation, max_iterations);
int nb_it_done = 0;
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( nb_it_done < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(i);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
#pragma omp single
nb_it_done += nbth;
}
}
Another weird thing here is that, since you didn't show what i is used for, it isn't clear if iterating somewhat randomly into the domain is a problem. If it isn't, the first solution should work well, even for unbalanced //do sth.. But if it is a problem, then you'd better stick with the second solution (and even potentially reinforce the synchronism).
But at the end of the day, there is no way (that I can think of, and with decent parallelism) to avoid potentially doing extra work, since the number of iterations can change along the way.
I recently wrote a small number-crunching program that basically loops over an N-dimensional grid and performs some calculation at each point.
for (int i1 = 0; i1 < N; i1++)
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
histogram[bin_index(i1, i2, i3, i4)] += 1; // see bottom of question
It worked fine, yadda yadda yadda, lovely graphs resulted ;-) But then I thought, I have 2 cores on my computer, why not make this program multithreaded so I could run it twice as fast?
Now, my loops run a total of, let's say, around a billion calculations, and I need some way to split them up among threads. I figure I should group the calculations into "tasks" - say each iteration of the outermost loop is a task - and hand out the tasks to threads. I've considered
just giving thread #n all iterations of the outermost loop where i1 % nthreads == n - essentially predetermining which tasks go to which threads
trying to set up some mutex-protected variable which holds the parameter(s) (i1 in this case) of the next task that needs executing - assigning tasks to threads dynamically
What reasons are there to choose one approach over the other? Or another approach I haven't thought about? Does it even matter?
By the way, I wrote this particular program in C, but I imagine I'll be doing the same kind of thing again in other languages as well so answers need not be C-specific. (If anyone knows a C library for Linux that does this sort of thing, though, I'd love to know about it)
EDIT: in this case bin_index is a deterministic function which doesn't change anything except its own local variables. Something like this:
int bin_index(int i1, int i2, int i3, int i4) {
// w, d, h are constant floats
float x1 = i1 * w / N, x2 = i2 * w / N, y1 = i3 * d / N, y2 = i4 * d / N;
float l = sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + h * h);
float th = acos(h / l);
// th_max is a constant float (previously computed as a function of w, d, h)
return (int)(th / th_max);
}
(although I appreciate all the comments, even those which don't apply to a deterministic bin_index)
The first approach is simple. It is also sufficient if you expect the load to be balanced evenly over the threads. In some cases, especially if the complexity of bin_index depends heavily on the parameter values, one of the threads could end up with a much heavier task than the rest. Remember: the job is finished only when the last thread finishes.
The second approach is a bit more complicated, but balances the load more evenly if the tasks are fine-grained enough (the number of tasks is much larger than the number of threads).
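The two approaches can be put side by side in a short Python sketch. `run_static` predetermines which task indices each thread gets (thread t takes every index with `k % nthreads == t`), while `run_dynamic` lets threads pull the next index from a shared thread-safe queue. All names are made up for the illustration, and Python threads are used only to show the structure.

```python
import queue
import threading


def _start_and_join(target, args_list):
    threads = [threading.Thread(target=target, args=a) for a in args_list]
    for th in threads:
        th.start()
    for th in threads:
        th.join()


def run_static(tasks, work, nthreads=2):
    results = [None] * len(tasks)

    def worker(t):
        # Fixed assignment: indices t, t + nthreads, t + 2*nthreads, ...
        for k in range(t, len(tasks), nthreads):
            results[k] = work(tasks[k])

    _start_and_join(worker, [(t,) for t in range(nthreads)])
    return results


def run_dynamic(tasks, work, nthreads=2):
    results = [None] * len(tasks)
    pending = queue.Queue()  # thread-safe; replaces the mutex-protected counter
    for k in range(len(tasks)):
        pending.put(k)

    def worker():
        while True:
            try:
                k = pending.get_nowait()  # grab whatever task is next
            except queue.Empty:
                return
            results[k] = work(tasks[k])

    _start_and_join(worker, [() for _ in range(nthreads)])
    return results
```

Both return the same results; they differ only in which thread ends up doing which task, and therefore in how well they cope with tasks of uneven cost.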
Note that you may have issues putting the calculations in separate threads. Make sure that bin_index works correctly when multiple threads execute it simultaneously. Beware of the use of global or static variables for intermediate results.
Also, "histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread, causing the result to be incorrect (if the assignment fetches the value, increments it and stores the resulting value in the array). You could introduce a local histogram for each thread and combine the results to a single histogram when all threads have finished. You could also make sure that only one thread is modifying the histogram at the same time, but that may cause the threads to block each other most of the time.
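The "local histogram per thread, combined at the end" idea could look like the following Python sketch. The `bin_index` used here is a hypothetical stand-in (the real one from the question depends on constants that are not given), and `NBINS`, `partial_histogram`, and `histogram` are names chosen for the example. Python threads illustrate the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

NBINS = 8  # assumed number of bins, for illustration only


def bin_index(i1, i2, i3, i4):
    # Hypothetical stand-in for the question's bin_index: pure and deterministic.
    return (i1 + 2 * i2 + 3 * i3 + 5 * i4) % NBINS


def partial_histogram(i1_values, N):
    local = [0] * NBINS  # private to this call, so no locking is needed
    for i1 in i1_values:
        for i2 in range(N):
            for i3 in range(N):
                for i4 in range(N):
                    local[bin_index(i1, i2, i3, i4)] += 1
    return local


def histogram(N, nthreads=2):
    # Thread t gets the outer-loop values t, t + nthreads, ...
    shares = [range(t, N, nthreads) for t in range(nthreads)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(lambda share: partial_histogram(share, N), shares))
    # Combine the private histograms bin by bin once all workers are done.
    return [sum(counts) for counts in zip(*partials)]
```

The only shared step is the final bin-by-bin sum, which costs `NBINS * nthreads` additions regardless of how many grid points were counted.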
The first approach is enough. No need for complication here. If you start playing with mutexes, you risk introducing hard-to-detect errors.
Don't start complicating things unless you really see that you need to. Synchronization issues (especially with many threads instead of many processes) can be really painful.
As I understand it, OpenMP was made just for what you are trying to do, although I have to admit I have not used it yet myself. Basically it seems to boil down to just including a header and adding a pragma clause.
You could probably also use Intel's Threading Building Blocks library.
If you have never coded a multithreaded application, I urge you to begin with OpenMP:
the library is now included in gcc by default
it is very easy to use
In your example, you should just have to add a pragma:
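#pragma omp parallel for shared(histogram)
for (int i1 = 0; i1 < N; i1++)
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
{
int bin = bin_index(i1, i2, i3, i4);
// several threads may hit the same bin, so the increment must be atomic
#pragma omp atomic
histogram[bin] += 1;
}
With the parallel for pragma, the compiler adds the instructions that create the threads, launch them, and split the iterations of the outer loop among them. It does not protect shared data for you: a plain parallel region without "for" would make every thread run the whole loop, and without the atomic (or a reduction) concurrent increments of histogram can be lost. There are a lot of options, but well-chosen pragmas do most of the work for you. Basically, how simple it is depends on the data dependencies.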
Of course, the result will probably not be as fast as if you had coded everything by hand. But if you don't have a load-balancing problem, you may approach a 2x speedup. After all, this only writes into an array, with no spatial dependency.
I would do something like this:
void HistogramThread(int i1, Action<int[]> HandleResults)
{
int[] histogram = new int[HistogramSize];
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
histogram[bin_index(i1, i2, i3, i4)] += 1;
HandleResults(histogram);
}
int[] CalculateHistogram()
{
int[] histogram = new int[HistogramSize];
ThreadPool pool; // I don't know syntax off the top of my head
for (int i1=0; i1<N; i1++)
{
pool.AddNewThread(HistogramThread, i1, delegate(int[] h)
{
lock (histogram)
{
for (int i=0; i<HistogramSize; i++)
histogram[i] += h[i];
}
});
}
pool.WaitForAllThreadsToFinish();
return histogram;
}
This way you don't need to share any memory, until the end.
If you ever do it in .NET, use the Parallel Extensions.
If you want to write multithreaded number crunching code (and you are going to be doing a lot of it in the future) I would suggest you take a look at using a functional language like OCaml or Haskell.
Due to the lack of side effects and lack of shared state in functional languages (well, mostly) making your code run across multiple threads is a LOT easier. Plus, you'll probably find that you end up with a lot less code.
I agree with Sharptooth that your first approach seems like the only plausible one.
Your single threaded app is continuously assigning to memory. To get any speedup, your several threads would need to also be continuously assigning to memory. If only one thread is assigning at a time, you would get no speedup at all. So if your assignments are guarded, the whole exercise would fail.
This would be a dangerous approach, since you are assigning to shared memory without a guard. But it seems to be worth the danger (if a 2x speedup matters). If you can be sure that all the values of bin_index(i1, i2, i3, i4) are different in your division of the loop, then it should work, since the array assignments would go to different locations in your shared memory. Still, one should always look long and hard at approaches like this.
I assume you would also produce a test routine to compare the results of the two versions.
Edit:
Looking at your bin_index(i1, i2, i3, i4), I suspect your process could not be parallelized without considerable effort.
The only way to divide up the work of calculation in your loop is, again, to be sure that your threads will not access the same areas in memory. However, it looks like bin_index(i1, i2, i3, i4) will likely repeat values quite often. You might divide up the iteration into the cases where bin_index is higher than a cutoff and where it is lower than the cutoff. Or you could divide it arbitrarily and see whether the increment is implemented atomically. But any complex threading approach looks unlikely to provide improvement if you only have two cores to work with to start with.