Following up on this answer, I actually have more complicated code with three nested loops:
!$omp parallel
!$omp do
do i=1,4                  ! can be parallelized
   ...
   do k=1,1000            ! to be executed sequentially
      ...
      do j=1,4            ! can be parallelized
         call job(i,j)
The iterations of the outer loop finish quickly except for i=4. So I want to put the threads to work on the innermost loop, while keeping the k-loop sequential within each i-iteration. In fact, k loops over the changing states of a random number generator, so it cannot be parallelized.
How can I collapse only the i and j loops? I suspect the ordered clause might be useful here, but I'm afraid it would affect the inner loop as well, and I'm still unsure of the syntax.
I can't imagine how that could work. In any case, the collapse clause definitely does not support it.
If you have a load-balancing issue, think about reordering your loops, using dynamic scheduling, OpenMP tasks, or nested parallelism. There is not enough code here to tell which of these might be applicable.
If 1..4 are the actual bounds you use in the outer loop, then I suggest parallelizing only the inner loop (the one that can be parallelized), since that adds little overhead.
Another suggestion is to swap the k and i loops, if that is possible, so that the outer loop runs over k and the two new inner loops over i and j can be parallelized together using collapse; see the sketch just below.
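For illustration, assuming the bounds from the question, the swapped structure would look roughly like this (it is only valid if the per-k work really can be hoisted out of the i-loop, which your RNG usage may forbid):

do k = 1, 1000                      ! stays strictly sequential
   ! ... advance the RNG state here, once per k (assumption) ...
   !$omp parallel do collapse(2)
   do i = 1, 4
      do j = 1, 4
         call job(i, j)             ! the 16 (i,j) pairs are shared among the threads
      end do
   end do
   !$omp end parallel do
end do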
A lightweight and uniform approach for this case is to use OpenMP tasks.
You can use tasks for both parallel loops or just for the inner one. In the latter case you get a combination of the worksharing do and task constructs. This solution exploits nested parallelism but avoids the pitfalls of nested parallel regions. The taskloop construct is an equivalent and more automated approach.
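A minimal sketch of the taskloop variant, assuming the bounds and the job(i,j) interface from the question (taskloop requires OpenMP 4.5 or later). The j-tasks generated during the long i=4 iteration can be executed by the threads that have already finished their own i-iterations and are idling at the end of the worksharing loop:

!$omp parallel
!$omp do
do i = 1, 4                         ! outer parallel loop
   do k = 1, 1000                   ! stays sequential within this i-iteration
      !$omp taskloop
      do j = 1, 4
         call job(i, j)             ! j-iterations become tasks any thread may run
      end do
      !$omp end taskloop            ! implicit taskgroup: all j-tasks finish before the next k
   end do
end do
!$omp end do
!$omp end parallel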
Related
In the multithreaded versions of merge sort that I've seen, the multithreading is typically done during the recursion on the left and right subarray (i.e., each thread is assigned their own subarray to work on) and the merge operation is done by the master thread after each thread completes their individual work.
I am wondering if there's a nice way to multithread the final merge operation where you're merging 2 sorted subarrays? If so, how can this be done?
Actually there is a way to split the merging task among 2 concurrent threads:
once both subarrays are sorted,
assign one thread the task to merge the elements from the beginning of the sorted subarrays to the first half of the target array and
assign the other thread a different but complementary task: merging from the end of the sorted subarrays to the second half of the target array, starting from the end.
You must write these merging functions carefully so that the sort stays stable; each thread will write only its half of the target array, potentially reading the same elements from the sorted subarrays but selecting different ones.
I have not seen this approach mentioned in the literature about multithread merge sort. I wonder if it performs better than classic implementations.
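For concreteness, here is a sketch of that two-direction merge. The element type, container, and function names are my own choices for illustration, not an established implementation; a production version would be templated and wired into the surrounding merge sort:

#include <cstddef>
#include <thread>
#include <vector>

// Fill out[0 .. count-1] by merging from the fronts of a and b.
// Stable: on ties the element from a (the "left" run) is taken first.
static void merge_front(const std::vector<int>& a, const std::vector<int>& b,
                        std::vector<int>& out, std::size_t count)
{
    std::size_t i = 0, j = 0;
    for (std::size_t k = 0; k < count; ++k) {
        if (j == b.size() || (i < a.size() && a[i] <= b[j]))
            out[k] = a[i++];
        else
            out[k] = b[j++];
    }
}

// Fill out[count .. out.size()-1] by merging from the backs of a and b.
// Stable: on ties the element from b is placed later, matching merge_front.
static void merge_back(const std::vector<int>& a, const std::vector<int>& b,
                       std::vector<int>& out, std::size_t count)
{
    std::size_t i = a.size(), j = b.size();
    for (std::size_t k = out.size(); k > count; --k) {
        if (j == 0 || (i > 0 && a[i - 1] > b[j - 1]))
            out[k - 1] = a[--i];
        else
            out[k - 1] = b[--j];
    }
}

// Merge two sorted runs with two concurrent workers: a spawned thread fills
// the front half of the target while the current thread fills the back half.
std::vector<int> parallel_merge(const std::vector<int>& a, const std::vector<int>& b)
{
    std::vector<int> out(a.size() + b.size());
    const std::size_t half = out.size() / 2;
    std::thread front(merge_front, std::cref(a), std::cref(b), std::ref(out), half);
    merge_back(a, b, out, half);
    front.join();
    return out;
}

Both workers may read the same source elements, but each writes only its own half of out, so no synchronization is needed beyond the final join.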
My problem is a fluid flow simulation, but I will try to make the question as generic as possible. I have gone through the OpenMP API manual and OpenMP for F95. But as I am only five days into multithreading, I seek your help after being baffled by the smorgasbord of options for optimising the code. I am using an Intel Xeon CPU E5-2630 v4 @ 2.20GHz with one socket and 10 cores in that socket (20 logical CPUs with hyperthreading).
My whole simulation is basically filled with two kinds of nested loops as in (i) and (ii) below.
i) Where an array element (C(I,J,K) and D(I,J,K) below) depends on the previous K-1 grid point, and hence I can't parallelise the outermost loop, e.g.,
Nx=256, Ny=209, Nz=64
DO K = 2, NY-1
   !$OMP PARALLEL DO
   DO J = 1, NZ
      DO I = 1, NX/2+1
         C(I,J,K) = C(I,J,K)/(A(I,J,K)*C(I,J,K-1))
         D(I,J,K) = (D(I,J,K)-D(I,J,K-1))/(C(I,J,K-1))
      END DO
   END DO
   !$OMP END PARALLEL DO
END DO
A(:,:,1:NY) is already calculated in a different subroutine and hence
is available as a shared variable to the OpenMP threads.
ii) Where the update variable (A) does not depend on other grid points, and hence I can parallelise all the loops, like the following:
!$OMP PARALLEL DO
DO K = 1, NY
   DO J = 1, NZ
      DO I = 1, NX
         A(I,J,K) = (B(I,J,K)-B(I,J,K-1))/C(K-1)
      END DO
   END DO
END DO
!$OMP END PARALLEL DO
B(:,:,1:NY) and C(:,:,1:NY) are already calculated in a different subroutine
Question (a): Do the above nested-loops have a race condition?
Question (b): The output is correct and matches the serial code, but:
b(i): are there any loopholes in the codes that can make them work incorrectly in certain situations?
b(ii): can the output be correct with a race condition?
Question (c): Are there any ways to optimise this code further? There are many options in the above-mentioned manuals, but some help pointing me in the right direction would be highly appreciated.
I run the codes with
$ ulimit -s unlimited
$ export OMP_NUM_THREADS=16
$ gfortran -O3 mycode.f90 -fopenmp -o mycode
With 16 threads it takes about 80 time units, while with 6, 10 and 20 threads it takes 105, 101 and 100 time units respectively.
Question (d): I know there could be many reasons for the above, but is there a rule of thumb for choosing the right number of threads (other than trial and error, as somewhat implied in answers to this question)?
Question (e): Is ulimit -s unlimited a good option? (without it I get a segmentation fault (core dumped) error)
Thanks.
(a) You have a race condition only if multiple threads perform accesses to the same location (without synchronization) and at least one of those is a write.
The second code snippet does not have a race condition because you only write to each location of A exactly once and never read from it.
Similarly, the first code snippet does not have a race condition as you only read/write from/to each location in the K slices of C and D once and then don't read it again within the same parallel section (because K is fixed within each parallel region). Reading from the K-1 slice is of course not a race.
(b)
(bi) Have you looked at the numerics? There seems to be a lot of room for catastrophic cancellation. But that's not threading-related.
(bii) Theoretically, yes. Really depends on how egregious it is. There is no race condition here though.
(c) Profile! Are you memory bound or CPU bound (presumably the former)? Do you get a lot of cache misses? Is there false sharing (I doubt it)? Can you rearrange your data and/or loops to improve cache behavior? Are your strided accesses aligned critically? There are many of these kinds of performance gotchas, it'll take time and experience to understand and recognize them. My advice would be to become particularly familiar with the impact of caching, that's at the heart of most performance questions.
(d) If you care about performance, you have to profile and compare. Modern CPUs are so fiendishly complex that you have little chance to predict performance of any but the most trivial snippets. As mentioned in the answers you linked, if you're memory bound then more threads tend to make things worse, but your own results show that performance still improves by having slightly more than one thread per physical core (presumably because the divisions are a bit slow?).
(e) That sounds like you are allocating large arrays on the stack. That itself is probably a bad idea precisely because you tend to run out of stack space. But I'm not familiar with Fortran best practices so I can't help you much there.
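For reference, the usual Fortran-side fix is to make the big arrays ALLOCATABLE, so they live on the heap and no longer need an unlimited stack (as far as I know, gfortran's -fopenmp implies -frecursive, which pushes local arrays onto the stack, so the segfault appearing with OpenMP would fit that picture). A minimal sketch, with array names and extents taken from the question but the surrounding subroutine purely assumed:

subroutine allocate_fields()                     ! name is illustrative only
   implicit none
   integer, parameter :: NX = 256, NY = 209, NZ = 64
   real, allocatable :: A(:,:,:), B(:,:,:), C(:,:,:), D(:,:,:)

   allocate(A(NX,NZ,NY), B(NX,NZ,NY), C(NX,NZ,NY), D(NX,NZ,NY))   ! heap, not stack; extents illustrative

   ! ... the loop nests from the question operate on these arrays ...

   deallocate(A, B, C, D)
end subroutine allocate_fields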
I am surprised that the Linux kernel has an infinite loop in the 'do_select' function implementation. Is that normal practice?
I am also interested in how file-change monitoring is implemented in the Linux kernel. Is it an infinite loop again?
select.c source code
This is not an infinite loop; that term is reserved for loops with no exit condition at all. This loop has its exit condition in the middle: http://lxr.linux.no/#linux+v3.9/fs/select.c#L482 This is a very common idiom in C. It's called "loop and a half" and there's a simple pseudocode example here: https://stackoverflow.com/a/10767975/388520 which clearly illustrates why you would want to do this. (That question talks about Java but that's not important; this is a general structured-programming idiom.)
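For reference, here is the idiom in its smallest self-contained form (plain C, unrelated to the kernel code itself): some work has to happen before the exit test can even be evaluated, so the test sits in the middle of the body rather than at the top.

#include <stdio.h>

/* Sum numbers from stdin until EOF: a classic "loop and a half". */
int main(void)
{
    long sum = 0;
    for (;;) {
        long x;
        if (scanf("%ld", &x) != 1)   /* work before the test: attempt a read          */
            break;                   /* exit condition in the middle of the body      */
        sum += x;                    /* work after the test: use the value just read  */
    }
    printf("%ld\n", sum);
    return 0;
}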
I'm not a kernel expert, but this particular loop appears to have been written this way because the logic of the inner loop needs to run both before and after the call to poll_schedule_timeout at the very bottom of the outer loop. That code is checking whether there are any events to return; if there are already events to return when select is invoked, it's supposed to return immediately; if there aren't any initially, there will be when poll_schedule_timeout returns. So in normal operation the outer loop should cycle either 0.5 or 1.5 times. (There may be edge-case circumstances where the outer loop cycles more times than that.) I might have chosen to pull the inner loop out to its own function, but that might involve passing pointers to too many local variables around.
This is also not a spin loop, by which I mean, the CPU is not wasting electricity checking for events over and over again until one happens. If there are no events to report when control reaches the call to poll_schedule_timeout, that function (by, ultimately, calling __schedule) will cause the calling thread to block -- the CPU is taken away from that thread and assigned to another process that can do something useful with it. (If there are no processes that need the CPU, it'll be put into a low-power "halt" until the next interrupt fires.) When one of the events happens, or the timeout, the thread that called select will get "woken up" and poll_schedule_timeout will return.
On a larger note, operating system kernels often do things that would be considered strange, poor style, or even flat-out wrong, in the service of other engineering goals (efficiency, code reuse, avoidance of race conditions that can only occur on some CPUs, ...). They are written by people who know exactly what they are doing and exactly how far they can get away with bending the rules. You can learn a lot from reading through OS code, but you probably shouldn't try to imitate it until you have a bit more experience. You wouldn't try to pastiche the style of James Joyce as your first exercise in creative writing, no? Same deal.
I have a matrix of a big size, say 20000*20000, and this matrix changes every iteration. The matrix is produced in Fortran, and Fortran calls a C++ function that processes the matrix into block-diagonal form. I would like to have the C++ function create two threads (using C++11), where each handles a 10000*10000 block. I can easily break the matrix into two parts since it is a special matrix. Because the matrix elements change every iteration, creating and joining (killing) the two threads every iteration makes the overhead far too expensive, and the point of the multithreading approach is lost. So I decided to do the iterations inside the threads; however, I am not sure whether I can keep the threads waiting for the updated matrix in order to solve it in the next iteration (we need to go back to Fortran to calculate the new matrix in between).
The point I am stuck at is the following:
When I create the two threads from the function in C++, that function will return to Fortran and the function instance is destroyed (right?). What happens to the two threads that are then waiting for the new matrix?
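For what it's worth, what you are describing is the usual persistent-worker pattern: create the threads once, let them sleep on a condition variable between iterations, and keep the objects that own them in static or heap storage so they outlive each call from Fortran (a joinable std::thread must not be destroyed while its thread is still running). A minimal sketch, where every name is illustrative and a real version would also signal completion back to the caller:

#include <condition_variable>
#include <mutex>
#include <thread>

struct Worker {
    std::mutex m;
    std::condition_variable cv;
    bool work_ready = false;
    bool quit = false;
    std::thread th;

    void start() {
        th = std::thread([this] {
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return work_ready || quit; });
                if (quit) return;              // clean shutdown
                work_ready = false;
                lk.unlock();
                process_block();               // hypothetical: handle this worker's 10000*10000 block
            }
        });
    }

    void submit() {                            // call after Fortran has produced the new matrix
        { std::lock_guard<std::mutex> lk(m); work_ready = true; }
        cv.notify_one();
    }

    void shutdown() {                          // call once, at the very end of the simulation
        { std::lock_guard<std::mutex> lk(m); quit = true; }
        cv.notify_one();
        th.join();
    }

    void process_block() { /* work on this worker's half of the matrix */ }
};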
If I have two datasets (with equal numbers of rows and columns) and I wish to run a piece of code that I have written, then there are obviously two options: sequential execution or parallel execution.
Now, the algorithm (code) that I have written is a big one and consists of multiple for loops. I wish to ask: is there any way to use it directly on both of them, or will I have to transform the code in some way? A heads-up would be great.
To answer your question: you do not have to transform the code to run it on two datasets in parallel, it should work fine like it is.
The need for parallel processing usually arises in two ways (for most users, I would imagine):
You have code that runs fine sequentially on each dataset, and you would simply like to run those independent jobs in parallel.
You have a function that is taking very long to execute on a large dataset, and you would like to run it in parallel to speed it up.
For the first case, you do not have to do anything, you can just execute it in parallel using one of the libraries designed for it, or just run two instances of R on the same computer and run the same code but with different datasets in each of them.
It doesn't matter how many for loops you have in there, and you don't even need to have the same number of rows and columns in the datasets.
If the code runs fine sequentially on each dataset, there is no dependence between the parallel chains and thus no problem.
Since your question falls in the first case, you can run it in parallel.
If you have the second case, you can sometimes turn it into the first case by splitting your dataset into pieces (where you can run each of the pieces sequentially) and then you run it in parallel. This is easier said than done, and won't always be possible. It is also why not all functions just have a run.in.parallel=TRUE option: it is not always obvious how you should split the data, nor is it always possible.
So you have already done most of the work by writing the functions, and splitting the data.
Here is a general way of doing parallel processing with one function, on two datasets:
library(doParallel)

cl <- makeCluster(2)          # for 2 processors, i.e. 2 parallel chains
registerDoParallel(cl)

datalist <- list(mydataset1, mydataset2)

# now start the chains
nchains <- 2                  # for two processors
results_list <- foreach(i = 1:nchains,
                        .packages = c('packages_you_need')) %dopar% {
  result <- find.string(datalist[[i]])
  return(result)
}
The result will be a list with two elements, each containing the results from a chain. You can then combine it as you wish, or use a .combine function. See the foreach help for details.
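For example, if each chain returns a data frame with the same columns, a .combine function such as rbind stacks the two results directly (reusing the hypothetical names from above):

results_df <- foreach(i = 1:nchains, .combine = rbind,
                      .packages = c('packages_you_need')) %dopar% {
  find.string(datalist[[i]])
}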
You can use this code any time you have a case like number 1 described above. Most of the time you can also use it for cases like number 2, if you spend some time thinking about how you want to divide the data, and then combine the results. Think of it as a "parallel wrapper".
It should work in Windows, GNU/Linux, and Mac OS, but I haven't tested it on all of them.
I keep this script handy whenever I need a quick speed-up, but I still always start out by writing code I can run sequentially. Thinking in parallel hurts my brain.