From briefly looking online, I couldn't find a method for reading from or writing to a file in Fortran 90 using OpenMP in parallel. I want to know if it is possible to do something like
!$OMP PARALLEL
do i=1,Nx
  do j=1,Ny
    do k=1,Nz
      read(1,*) B(i,j,k)
    enddo
  enddo
enddo
!$OMP END PARALLEL

!$OMP PARALLEL
do i=1,Nx
  do j=1,Ny
    do k=1,Nz
      write(1,*) B(i,j,k)
    enddo
  enddo
enddo
!$OMP END PARALLEL
where additional clauses would be added to ensure that this is done correctly. I imagine there may be an easy way to do this for the read, but the write seems less obvious. Thanks for any and all help/suggestions!
Since you're doing formatted I/O, think about how thread N figures out the correct offset in the file to read/write.
Additionally, since the question is tagged gfortran: in gfortran there is a per-unit mutex which is held for the duration of an I/O statement.
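To illustrate the offset point, here is a minimal, hedged sketch (not from the question): with unformatted stream access every element occupies a fixed number of bytes, so each thread can compute the position of the element it wants, which is exactly what list-directed formatted records do not allow. The file name, element ordering, and 8-byte reals are assumptions, and the thread safety relies on the gfortran per-unit lock mentioned above rather than on any standard guarantee, so the accesses are still serialized and there is little real speedup to be had.

program parallel_stream_read
  use omp_lib
  implicit none
  integer, parameter :: Nx = 4, Ny = 4, Nz = 4
  real(8) :: B(Nx,Ny,Nz)
  integer :: i, j, k, offset

  ! hypothetical file holding Nx*Ny*Nz real(8) values in i-j-k order
  open(unit=1, file='b.dat', access='stream', form='unformatted', &
       status='old', action='read')
!$omp parallel do collapse(3) private(offset)
  do i = 1, Nx
    do j = 1, Ny
      do k = 1, Nz
        ! byte position of element (i,j,k): fixed-size values make the
        ! offset computable, unlike list-directed formatted records
        offset = ((i-1)*Ny*Nz + (j-1)*Nz + (k-1)) * 8 + 1
        ! each read positions and transfers atomically because gfortran
        ! holds the per-unit lock for the whole statement
        read(1, pos=offset) B(i,j,k)
      end do
    end do
  end do
!$omp end parallel do
  close(1)
end program parallel_stream_read

For a formatted file the record lengths vary, so no such offset can be computed; in that case reading the file serially (for example with a single whole-array read) and parallelizing only the computation is usually the simpler choice.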
I have an array A_p which is defined as threadprivate for each thread. The code is complicated and does some calculations. Finally, I want to reduce all the arrays to one shared array A.
DO J=1,Y
  DO I=1,X
    a=0
    !$omp parallel reduction(+:a)
    a = A_p(I,J)
    !$omp end parallel
    A(I,J) = A(I,J) + a
  END DO
END DO
This solution works, but the problem is that the threads are probably created at every iteration, which incurs a huge overhead. I would like to find a way to keep the threads alive between iterations, so they are created only once.
I have also tried the following solution:
!$omp parallel reduction(+:A)
A = A_p
!$omp end parallel
but it seems to create a certain overhead for initializing a private copy of A for each thread (which, by the way, is redundant, because the threadprivate arrays already exist and we do not really need more private arrays). Of course the overhead here is smaller than the overhead observed in the previous solution, but still not good enough for me.
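One commonly suggested alternative, shown here as a hedged sketch rather than the poster's actual code, is to keep a single parallel region and let each thread fold its whole threadprivate copy into the shared array under a critical section. This avoids both the per-iteration region creation of the first solution and the extra private copy of A introduced by reduction(+:A):

program threadprivate_reduce
  use omp_lib
  implicit none
  integer, parameter :: X = 100, Y = 100
  real(8) :: A(X,Y)
  real(8), save :: A_p(X,Y)
!$omp threadprivate(A_p)

  A = 0.0d0
!$omp parallel
  A_p = 1.0d0              ! stand-in for the real per-thread computation
!$omp critical
  A = A + A_p              ! each thread adds its whole copy exactly once
!$omp end critical
!$omp end parallel
  print *, sum(A)          ! expect X*Y*(number of threads)
end program threadprivate_reduce

The additions are serialized across threads, but there are only as many of them as there are threads, each being a whole-array operation, so the cost is usually small compared with the computation that fills A_p.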
Also, I would like to ask about the way OpenMP implements the reduction. For example, in the first solution I presented, is the reduction of the variable a serial, or is it implemented in a tree-combining fashion (achieving a logarithmic running time for the reduction phase)?
In my current code I am using two OpenMP sections, with only two threads, as follows:
!$omp parallel num_threads(2)
!$omp sections
!$omp section
  call work1()        ! the more time-consuming piece
!$omp section
  call work2()
!$omp end sections
!$omp end parallel
work1 is more time-consuming than work2, and because the sections construct requires both threads to synchronize at its end, work1 becomes the limiting step and one thread is wasted most of the time.
I want to use a more flexible construct such as task, but I don't know if the following is possible and, if so, how to do it. First, start with two threads (one task per thread, as in the sections construct), one solving work1 and the other work2. Second, as soon as the easier task work2 is finished, the free thread could be used to speed up work1.
Thanks.
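Whether this pays off depends on whether work1 can be split into independent chunks. If it can, a hedged sketch of the task-based idea looks like the following, where do_work2 and do_work1_chunk are hypothetical placeholder routines and chunk/nchunks are assumed integers declared elsewhere; the thread that finishes work2 simply picks up the remaining work1 tasks from the pool:

!$omp parallel num_threads(2) private(chunk)
!$omp single
!$omp task
  call do_work2()                  ! the cheaper job, a single task
!$omp end task
  do chunk = 1, nchunks            ! the expensive job, split into tasks
!$omp task firstprivate(chunk)
    call do_work1_chunk(chunk)
!$omp end task
  end do
!$omp end single
!$omp end parallel

With this structure the sections construct is not needed at all; the implicit barrier at the end of the single construct guarantees that all tasks have completed before the parallel region ends.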
I have a code that is already parallelized with MPI. Every MPI process has to do a lot of calculations on the grid. The code looks something like this:
do i=starti,stopi
  call function1(input1(i),output1)
  call function1(input2(i+1),output2)
  call function1(input3(i+2),output3)
  call solve(output1,output2,output3,OUTPUT(i))
end do
where different MPI processes get different ranges of starti and stopi. For example, for 2 processes I have:
process 1 starti=1 , stopi=100
process 2 starti=101 , stopi=200
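For reference, here is a minimal sketch of how such ranges are typically derived from the MPI rank (the names N, rank, nprocs, chunk, ierr are assumptions, not the poster's code); with N = 200 and 2 processes it reproduces the ranges above:

call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
chunk  = N / nprocs          ! assumes N divides evenly, for simplicity
starti = rank * chunk + 1    ! rank 0 -> 1,   rank 1 -> 101
stopi  = (rank + 1) * chunk  ! rank 0 -> 100, rank 1 -> 200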
This is working very nicely, but I want to use OpenMP to speed things up a little bit. The modified code looks like this:
do i=starti,stopi
  !$OMP PARALLEL SECTIONS
  !$OMP SECTION
  call function1(input1(i),output1)
  !$OMP SECTION
  call function1(input2(i+1),output2)
  !$OMP SECTION
  call function1(input3(i+2),output3)
  !$OMP END PARALLEL SECTIONS
  call solve(output1,output2,output3,OUTPUT(i))
end do
But when I run this code with 2 MPI processes and 3 OpenMP threads, it is 2x slower than pure MPI. The CPU is not the limiting factor, as I have a 4-core/8-thread CPU.
Why is that?
I have already seen several posts on this site which talk about this issue. However, I think my serious codes, where overhead due to thread creation and the like should not be a big issue, have become much slower with OpenMP! I am using a quad-core machine with gfortran 4.6.3 as my compiler. Below is an example test code.
Program test
  use omp_lib
  integer*8 i,j,k,l
!$omp parallel
!$omp do
  do i = 1,20000
    do j = 1, 1000
      do k = 1, 1000
        l = i
      enddo
    enddo
  enddo
!$omp end do nowait
!$omp end parallel
End program test
This code takes around 80 seconds if I run it without OpenMP; however, with OpenMP it takes around 150 seconds. I have seen the same issue with my other serious codes, whose runtime is around 5 minutes or so in serial mode. In those codes I am taking care that there are no dependencies from thread to thread. Then why should these codes become slower instead of faster?
Thanks in advance.
You have a race condition: multiple threads are writing to the same shared variable l. The program is therefore invalid; l should be private. It also leads to a slowdown because the threads keep invalidating the cache contents of the other cores, so they have to reload the data from memory all the time. A similar thing happens when multiple threads use the same cache line; that is known as false sharing.
You also probably don't use any compiler optimizations. Enable them with -O2, -O3, or -Ofast. You will then see that the program takes essentially 0 seconds, because the compiler optimizes everything away.
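To make both points concrete, here is a hedged sketch of a corrected version of the test program: l is made private so the race disappears, and the result is accumulated and printed so that an optimizing compiler cannot simply discard the program's output (it may still simplify the loops themselves):

program test_fixed
  use omp_lib
  implicit none
  integer(8) :: i, j, k, l, total

  total = 0
!$omp parallel do private(j, k, l) reduction(+:total)
  do i = 1, 20000
    do j = 1, 1000
      do k = 1, 1000
        l = i              ! now a private copy: no race, no cache ping-pong
      end do
    end do
    total = total + l
  end do
!$omp end parallel do
  print *, total           ! using the result keeps the work observable
end program test_fixed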
I have a quick question regarding OpenMP. Usually one can run sections in parallel like this (written in Fortran, with two sections):
!$OMP PARALLEL SECTIONS
!$OMP SECTION
  < Fortran code block A >
!$OMP SECTION
  < Fortran code block B >
!$OMP END PARALLEL SECTIONS
Now what I really want is to run Fortran code blocks A and B within a do loop, which itself should not be parallelized, because this do loop is time-dependent: every new step depends on the previous step's results. Before the parallel section, I also need to run some serial code (let's call it block C). Blocks A, B, and C are all functions of the loop variable t. Naively, one might propose to simply embed the parallel sections within the do loop:
do t=1,tmax
  < Fortran serial code block C >
  !$OMP PARALLEL SECTIONS
  !$OMP SECTION
  < Fortran code block A >
  !$OMP SECTION
  < Fortran code block B >
  !$OMP END PARALLEL SECTIONS
end do
However, it is obvious that the overhead of creating the threads at every iteration will slow this down considerably, possibly even making it slower than a standard serial code. Therefore, one might come up with a smarter idea to solve this.
I was wondering whether you could give me some hints on how to do this. What's the best approach (fastest computation) for this?
I concur with both comments that it is not at all obvious how much the OpenMP overhead would be compared to the computation. If you find it (after performing the corresponding measurements) to be really high, then the typical way to handle this case is to put the loop inside a parallel region:
!$OMP PARALLEL PRIVATE(t)
do t=1,tmax
  !$OMP SINGLE
  < Fortran code block C >
  !$OMP END SINGLE
  !$OMP SECTIONS
  !$OMP SECTION
  < Fortran code block A >
  !$OMP SECTION
  < Fortran code block B >
  !$OMP END SECTIONS
end do
!$OMP END PARALLEL
Each thread will loop independently. The SECTIONS construct has an implicit barrier at its end, so the threads are synchronised before the next loop iteration. If there is additional code inside the loop after the sections that does not synchronise, an explicit barrier has to be inserted just before end do.
The SINGLE construct is used to isolate block C such that it gets executed by one thread only.
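As a hedged illustration of the barrier remark, in the same placeholder style as the code above: if some extra code without an implied barrier is placed inside the time loop after the sections, an explicit barrier before end do keeps the iterations in step.

!$OMP PARALLEL PRIVATE(t)
do t=1,tmax
  !$OMP SINGLE
  < Fortran code block C >
  !$OMP END SINGLE
  !$OMP SECTIONS
  !$OMP SECTION
  < Fortran code block A >
  !$OMP SECTION
  < Fortran code block B >
  !$OMP END SECTIONS
  < additional code with no implied barrier, e.g. a DO ... END DO NOWAIT loop >
  !$OMP BARRIER
end do
!$OMP END PARALLEL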