No speedup with OpenMP when using MATLAB MEX on Linux

I'm using OpenMP to speed up Fortran code in a MATLAB MEX-file. However, I find that OpenMP does not seem to work on Linux, although it works on Windows. I attach the code below:
1) MATLAB script that builds and runs the MEX-file:
clc; clear all; close all; tic
FLAG_SYS = 0; % 0 for Windows; 1 for Linux
%--------------------------------------------------------------------------
% Mex Fortran code
%--------------------------------------------------------------------------
if FLAG_SYS == 0
mex COMPFLAGS="-Qopenmp $COMPFLAGS"...
LINKFLAGS="/Qopenmp $LINKFLAGS"...
OPTIMFLAGS="/Qopenmp $OPTIMFLAGS"...
'-IC:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.5.267\windows\mkl\include'...
'-LC:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.5.267\windows\mkl\lib\intel64'...
-lmkl_intel_ilp64.lib -lmkl_intel_thread.lib -lmkl_core.lib libiomp5md.lib...
Test_OpenMP_Mex.f90...
-output Test_OpenMP_Mex
elseif FLAG_SYS == 1
mex COMPFLAGS="-fopenmp $COMPFLAGS"...
LINKFLAGS="-fopenmp $LINKFLAGS"...
FFLAGS='$FFLAGS -fdec-math -cpp' ...
'-I${MKLROOT}/include'...
'-L${MKLROOT}/lib'...
-lmkl_avx2 -lmkl_gf_ilp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl...
Test_OpenMP_Mex.f90...
-output Test_OpenMP_Mex
end
Test_OpenMP_Mex;
2) Fortran code
#include "fintrf.h"
!GATEWAY ROUTINE
SUBROUTINE MEXFUNCTION(NLHS, PLHS, NRHS, PRHS)
!DECLARATIONS
IMPLICIT NONE
!MEXFUNCTION ARGUMENTS:
MWPOINTER PLHS(*), PRHS(*)
INTEGER NLHS, NRHS
!FUNCTION DECLARATIONS:
MWPOINTER MXCREATEDOUBLEMATRIX
MWPOINTER MXGETM, MXGETN
INTEGER MXISNUMERIC
!POINTERS TO INPUT MXARRAYS:
MWPOINTER MIV1, MIV2
!POINTERS TO OUTPUT MXARRAYS:
MWPOINTER MOV1, MOV2
!CALL FORTRAN CODE
CALL TEST_OPENMP
RETURN
END
!-----------------------------------------------------------------------
SUBROUTINE TEST_OPENMP
USE OMP_LIB
IMPLICIT NONE
INTEGER I, J, K, STEP
REAL*8 STARTTIME, ENDTIME,Y
OPEN(1,FILE='1.TXT')
!COUNT ELAPSED TIME START
STARTTIME = OMP_GET_WTIME()
DO I = 1,1000000
DO J = 1,50000
DO K = 1,1000
Y=(I+10)*J-SQRT(789.1)+SQRT(789.1)-(I+10)*J
END DO
END DO
END DO
ENDTIME = OMP_GET_WTIME()
WRITE(1,*) ENDTIME-STARTTIME
!COUNT ELAPSED TIME START
STARTTIME = OMP_GET_WTIME()
!$OMP PARALLEL
!$OMP DO PRIVATE(I,J)
DO I = 1,1000000
DO J = 1,50000
DO K = 1,1000
Y=(I+10)*J-SQRT(789.1)+SQRT(789.1)-(I+10)*J
END DO
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
ENDTIME = OMP_GET_WTIME()
WRITE(1,*) ENDTIME-STARTTIME
!$OMP PARALLEL
! GET THE NUMBER OF THREADS
WRITE(1,*) OMP_GET_THREAD_NUM(), OMP_GET_NUM_THREADS()
!$OMP END PARALLEL
CLOSE(1)
RETURN
END SUBROUTINE TEST_OPENMP
The output on Windows is:
1.09620520001044
4.50355500000296
0 6
1 6
3 6
5 6
2 6
4 6
and the output on Linux is:
0.0000
0.0000
0 1
It's obvious that OpenMP works on Windows: the calculation time drops from 4.5 s to about 1.1 s, and I can see that six threads are used for the calculation. On Linux, however, no calculation seems to be executed (both timings are 0.0), and only one thread is used (the machine has 36 hardware threads, but only one of them is used).
Any suggestions are welcome!
You can directly download code from this link:
https://www.dropbox.com/sh/crkuwhu22407sjs/AAAQrtzAvTmFOmAxv_jpTCBaa?dl=0

When compiling MEX-files under Linux (and macOS), the COMPFLAGS variable is ignored; it is Windows-specific. You need to use CFLAGS for C, CXXFLAGS for C++, or FFLAGS for Fortran, plus LDFLAGS for the linker. These are the standard Unix environment variables that control compilation.
Your compile command will look like this:
mex LDFLAGS='-fopenmp $LDFLAGS'...
FFLAGS='-fopenmp -fdec-math -cpp $FFLAGS' ...
'-I${MKLROOT}/include'...
'-L${MKLROOT}/lib'...
-lmkl_avx2 -lmkl_gf_ilp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl...
Test_OpenMP_Mex.f90...
-output Test_OpenMP_Mex
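To confirm that OpenMP is actually enabled after rebuilding, you can run mex with -v and check that -fopenmp appears in both the compile and link lines, or add a quick runtime check next to TEST_OPENMP. A minimal sketch (CHECK_OPENMP is a hypothetical helper you would call from the gateway, following the same file-writing pattern as the question's code):
SUBROUTINE CHECK_OPENMP
USE OMP_LIB
IMPLICIT NONE
OPEN(2,FILE='THREADS.TXT')
! These report what the OpenMP runtime inside MATLAB will actually use
WRITE(2,*) 'OMP_GET_MAX_THREADS() =', OMP_GET_MAX_THREADS()
WRITE(2,*) 'OMP_GET_NUM_PROCS()   =', OMP_GET_NUM_PROCS()
CLOSE(2)
END SUBROUTINE CHECK_OPENMP
If this still reports only one thread after fixing the flags, check the OMP_NUM_THREADS environment variable in the shell that launched MATLAB.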
References:
MATLAB documentation
GCC environment variables

There is one note you shouldn't miss when linking against the Intel MKL ILP64 versions of the libraries:
you need to compile with 64-bit default integers (the ILP64 interface expects 8-byte integers, e.g. -i8 with ifort or -fdefault-integer-8 with gfortran); otherwise you may see unexpected segfaults. Please refer to the MKL Link Line Advisor for more details: https://software.intel.com/content/www/us/en/develop/articles/intel-mkl-link-line-advisor.html
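For example, with the ILP64 interface every integer argument passed to an MKL routine is 8 bytes wide. A minimal sketch (assuming the standard BLAS DDOT interface; the link libraries are taken from the question's Linux line and are illustrative only):
PROGRAM ILP64_CHECK
IMPLICIT NONE
INTEGER(8) :: N, INCX, INCY   ! ILP64: BLAS/LAPACK integer arguments are 8-byte
REAL(8)    :: X(4), Y(4), RES
REAL(8), EXTERNAL :: DDOT
N = 4; INCX = 1; INCY = 1
X = 1.0D0; Y = 2.0D0
! Passing default 4-byte integers here is what typically produces the segfault
RES = DDOT(N, X, INCX, Y, INCY)
PRINT *, RES                  ! expect 8.0
END PROGRAM ILP64_CHECK
Built, for instance, with:
gfortran -fdefault-integer-8 ilp64_check.f90 -L${MKLROOT}/lib -lmkl_gf_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl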

Related

Openmp: use of parallel do with omp_get_thread_num()

I'm splitting a do loop using parallel do and the private clause. In this loop I add a variable to itself. Why do I get errors, given that I don't need a critical block or atomic statement in this case?
How can I fix it?
program trap
use omp_lib
implicit none
double precision::suma=0.d0 ! sum is a scalar
double precision:: h,x,lima,limb
integer::n,i, istart, iend, thread_num=4, total_threads, ppt
integer(kind=8):: tic, toc, rate
double precision:: time
double precision, dimension(4):: pi= 0.d0
call system_clock(count_rate = rate)
call system_clock(tic)
lima=0.0d0; limb=1.0d0; suma=0.0d0; n=100000000
h=(limb-lima)/n
suma=h*(f(lima)+f(limb))*0.5d0 !first and last points
ppt= n/total_threads
!$ call omp_set_num_threads(total_threads)
!$omp parallel do private (istart, iend, thread_num, i)
thread_num = omp_get_thread_num()
!$ istart = thread_num*ppt +1
!$ iend = min(thread_num*ppt + ppt, n)
do i=istart,iend ! this will control the loop in different threads
x=lima+i*h
suma=suma+f(x)
pi(thread_num+1)=suma
enddo
!$omp end parallel do
suma=sum(pi)
suma=suma*h
print *,"The value of pi is= ",suma ! print once from the first image
call system_clock(toc)
time = real(toc-tic)/real(rate)
print*, 'Time ', time, 's'
contains
double precision function f(y)
double precision:: y
f=4.0d0/(1.0d0+y*y)
end function f
end program trap
I get the following errors:
test.f90:23:35:
23 | thread_num = omp_get_thread_num()
Error: Unexpected assignment statement at (1)
test.f90:24:31:
24 | !$ istart = thread_num*ppt +1
Error: Unexpected assignment statement at (1)
test.f90:25:40:
25 | !$ iend = min(thread_num*ppt + ppt, n)
Error: Unexpected assignment statement at (1)
Compiled with:
gfortran -fopenmp -Wall -Wextra -O2 -Wall -o prog.exe test.f90
./prog.exe
I don't understand why you are manually splitting up the loop when the worksharing constructs in OpenMP, such as !$omp do, can do this automatically for you. Below is how I would do it:
ian@eris:~/work/stack$ cat thread.f90
program trap
Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
use omp_lib
implicit none
Real( wp ) ::suma=0.0_wp ! sum is a scalar
Real( wp ) :: h,x,lima,limb
integer(li):: tic, toc, rate
Real( wp ) :: time
Real( wp ) :: pi
Integer :: i, n
call system_clock(count_rate = rate)
call system_clock(tic)
lima=0.0_wp; limb=1.0_wp; suma=0.0_wp; n=100000000
h=(limb-lima)/n
suma=h*(f(lima)+f(limb))*0.5_wp !first and last points
pi = 0.0_wp
!$omp parallel default( None ) private( i, x ) &
!$omp shared( pi, n, h, lima )
!$omp do reduction( +:pi )
do i= 1, n
x = lima + i * h
pi = pi + f( x )
enddo
!$omp end do
!$omp end parallel
print *,"The value of pi is= ", pi / n
call system_clock(toc)
time = real(toc-tic)/real(rate)
print*, 'Time ', time, 's on ', omp_get_max_threads(), ' threads'
contains
function f(y)
Real( wp ) :: f
Real( wp ) :: y
f=4.0_wp/(1.0_wp+y*y)
end function f
end program trap
ian@eris:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ian@eris:~/work/stack$ export OMP_NUM_THREADS=1
ian@eris:~/work/stack$ ./a.out
The value of pi is= 3.1415926435902248
Time 1.8548842668533325 s on 1 threads
ian@eris:~/work/stack$ export OMP_NUM_THREADS=2
ian@eris:~/work/stack$ ./a.out
The value of pi is= 3.1415926435902120
Time 0.86763000488281250 s on 2 threads
ian@eris:~/work/stack$ export OMP_NUM_THREADS=4
ian@eris:~/work/stack$ ./a.out
The value of pi is= 3.1415926435898771
Time 0.54704123735427856 s on 4 threads
ian@eris:~/work/stack$
Instead of
!$omp parallel do private (istart, iend, thread_num, i)
thread_num = omp_get_thread_num()
!$ istart = thread_num*ppt +1
!$ iend = min(thread_num*ppt + ppt, n)
try the following:
!$omp parallel private (istart, iend, thread_num, i)
thread_num = omp_get_thread_num()
!$ istart = thread_num*ppt +1
!$ iend = min(thread_num*ppt + ppt, n)
....
!$omp end parallel
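If you do keep the manual split, the loop body from the question goes inside that parallel region, and suma and x must also be private so that each thread accumulates its own partial sum into its slot of pi. A hedged completion, using only variables already declared in the question:
!$omp parallel private(istart, iend, thread_num, i, x, suma)
thread_num = omp_get_thread_num()
istart = thread_num*ppt + 1
iend   = min(thread_num*ppt + ppt, n)
suma   = 0.d0
do i = istart, iend
   x = lima + i*h
   suma = suma + f(x)
end do
pi(thread_num+1) = suma   ! one slot per thread, so no race on pi
!$omp end parallel
Note that total_threads (and hence ppt) must be set before the region for this to be meaningful.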

Efficiency of Fortran stream access vs. MPI-IO

I have a parallel section of the code where I write out n large arrays (representing a numerical mesh) in blocks that are later read back in different sized blocks. To do this I use stream access so each processor writes its block independently, but I've seen inconsistent timings ranging from 0.5 to 4 seconds in this section, testing with 2 processor groups.
I am aware you can do something similar with MPI-IO, but I'm not sure what the benefits would be since there is no synchronization necessary. I would like to know if there is a way to either improve performance of my writes, or if there is a reason MPI-IO would be a better choice for this section.
Here is a sample of the code section where I create the files to write norb arrays using two groups (mygroup = 0 or 1):
do irbsic=1,norb
[various operations]
blocksize=int(nmsh_tot/ngroups)
OPEN(unit=iunit,FILE='ZPOT',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) POT(1:nmsh)
CLOSE(iunit)
OPEN(unit=iunit,FILE='RHOI',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) RHOG(1:nmsh,1,1)
CLOSE(iunit)
[various operations]
end do
(As discussed in the comments) I would strongly recommend against using Fortran stream access for this. Standard Fortran I/O is only guaranteed to work if the file is being accessed by a single process, and in my own work I have seen random corruption of files when multiple processes try to write to them at once, even if the processes are writing to different parts of the file. MPI-I/O, or a library such as HDF5 or NetCDF which uses MPI-I/O, is the only sensible way to achieve this. Below is a simple program illustrating the use of mpi_file_write_at_all:
ian@eris:~/work/stack$ cat at.f90
Program write_at
Use mpi
Implicit None
Integer, Parameter :: n = 4
Real, Dimension( 1:n ) :: a
Real, Dimension( : ), Allocatable :: all_of_a
Integer :: me, nproc
Integer :: handle
Integer :: i
Integer :: error
! Set up MPI
Call mpi_init( error )
Call mpi_comm_size( mpi_comm_world, nproc, error )
Call mpi_comm_rank( mpi_comm_world, me , error )
! Provide some data
a = [ ( i, i = n * me, n * ( me + 1 ) - 1 ) ]
! Open the file
Call mpi_file_open( mpi_comm_world, 'stuff.dat', &
mpi_mode_create + mpi_mode_wronly, mpi_info_null, handle, error )
! Describe how the processes will view the file - in this case
! simply a stream of mpi_real
Call mpi_file_set_view( handle, 0_mpi_offset_kind, &
mpi_real, mpi_real, 'native', &
mpi_info_null, error )
! Write the data using a collective routine - generally the most efficient
! but as collective all processes within the communicator must call the routine
Call mpi_file_write_at_all( handle, Int( me * n,mpi_offset_kind ) , &
a, Size( a ), mpi_real, mpi_status_ignore, error )
! Close the file
Call mpi_file_close( handle, error )
! Read the file on rank zero using Fortran to check the data
If( me == 0 ) Then
Open( 10, file = 'stuff.dat', access = 'stream' )
Allocate( all_of_a( 1:n * nproc ) )
Read( 10, pos = 1 ) all_of_a
Write( *, * ) all_of_a
End If
! Shut down MPI
Call mpi_finalize( error )
End Program write_at
ian@eris:~/work/stack$ mpif90 --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ian@eris:~/work/stack$ mpif90 -Wall -Wextra -fcheck=all -std=f2008 at.f90
ian@eris:~/work/stack$ mpirun -np 2 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000
ian@eris:~/work/stack$ mpirun -np 5 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000 8.00000000 9.00000000 10.0000000 11.0000000 12.0000000 13.0000000 14.0000000 15.0000000 16.0000000 17.0000000 18.0000000 19.0000000
ian@eris:~/work/stack$
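Applied to the layout in the question, the stream POS arithmetic maps directly onto an MPI-IO offset. Below is a hedged sketch: write_block_mpiio is a hypothetical helper, the argument names mirror the question's variables, and because mpi_file_write_at_all is collective every rank of the communicator that opened the file must call it (use a per-group communicator if only one group writes a given file):
Subroutine write_block_mpiio( fname, irbsic, mygroup, nmsh_tot, blocksize, pot, nmsh )
  Use mpi
  Implicit None
  Character( Len = * ), Intent( In ) :: fname
  Integer, Intent( In ) :: irbsic, mygroup, nmsh_tot, blocksize, nmsh
  Real( 8 ), Intent( In ) :: pot( nmsh )
  Integer :: fh, ierr
  Integer( mpi_offset_kind ) :: offset
  Call mpi_file_open( mpi_comm_world, fname, &
       mpi_mode_create + mpi_mode_wronly, mpi_info_null, fh, ierr )
  ! View the file as a stream of double precision values
  Call mpi_file_set_view( fh, 0_mpi_offset_kind, mpi_double_precision, &
       mpi_double_precision, 'native', mpi_info_null, ierr )
  ! Same arithmetic as the stream POS=, but zero based and counted in elements, not bytes
  offset = Int( irbsic - 1, mpi_offset_kind ) * nmsh_tot &
         + Int( mygroup,    mpi_offset_kind ) * blocksize
  Call mpi_file_write_at_all( fh, offset, pot, nmsh, &
       mpi_double_precision, mpi_status_ignore, ierr )
  Call mpi_file_close( fh, ierr )
End Subroutine write_block_mpiio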

Fortran/OpenMP comparison on 2 platforms (GCC and PGI compilers). Unexpected execution times

I compiled (with GCC and PGI compilers) and ran a small Fortran/OpenMP program on two different platforms (Haswell- and Skylake-based), just to get a feeling for the difference in performance. I do not know how to interpret the results; they are a mystery to me.
Here is the small program (taken from the Nvidia Developer website and slightly adapted).
PROGRAM main
use, intrinsic :: iso_fortran_env, only: sp=>real32, dp=>real64
use, intrinsic :: omp_lib
implicit none
real(dp), parameter :: tol = 1.0d-6
integer, parameter :: iter_max = 1000
real(dp), allocatable :: A(:,:), Anew(:,:)
real(dp) :: error
real(sp) :: cpu_t0, cpu_t1
integer :: it0, it1, sys_clock_rate, iter, i, j
integer :: N, M
character(len=8) :: arg
call get_command_argument(1, arg)
read(arg, *) N !!! N = 8192 provided from command line
call get_command_argument(2, arg)
read(arg, *) M !!! M = 8192 provided from command line
allocate( A(N,M), Anew(N,M) )
A(1,:) = 1
A(2:N,:) = 0
Anew(1,:) = 1
Anew(2:N,:) = 0
iter = 0
error = 1
call cpu_time(cpu_t0)
call system_clock(it0)
do while ( (error > tol) .and. (iter < iter_max) )
error = 0
!$omp parallel do reduction(max: error) private(i)
do j = 2, M-1
do i = 2, N-1
Anew(i,j) = (A(i+1,j)+A(i-1,j)+A(i,j-1)+A(i,j+1)) / 4
error = max(error, abs(Anew(i,j)-A(i,j)))
end do
end do
!$omp end parallel do
!$omp parallel do private(i)
do j = 2, M-1
do i = 2, N-1
A(i,j) = Anew(i,j)
end do
end do
!$omp end parallel do
iter = iter + 1
end do
call cpu_time(cpu_t1)
call system_clock(it1, sys_clock_rate)
write(*,'(a,f8.3,a)') "...cpu time :", cpu_t1-cpu_t0, " s"
write(*,'(a,f8.3,a)') "...wall time:", real(it1 it0)/real(sys_clock_rate), " s"
END PROGRAM
The two platforms I used are:
Intel i7-4770 @ 3.40GHz (Haswell), 32 GB RAM / Ubuntu 16.04.2 LTS
Intel i7-6700 @ 3.40GHz (Skylake), 32 GB RAM / Linux Mint 18.1 (~ Ubuntu 16.04)
On each platform I compiled the Fortran program with
GCC gfortran 6.2.0
PGI pgfortran 16.10 community edition
I obviously compiled the program independently on each platform (I only moved the .f90 file; I did not move any binary file)
I ran 5 times each of the 4 executables (2 for each platform), collecting the wall times measured in seconds (as printed out by the program). (Well, I ran the whole test several times, and the timings below are definitely representative)
Sequential execution. Program compiled with:
gfortran -Ofast main.f90 -o gcc-seq
pgfortran -fast main.f90 -o pgi-seq
Timings (best of 5):
Haswell > gcc-seq: 150.955, pgi-seq: 165.973
Skylake > gcc-seq: 277.400, pgi-seq: 121.794
Multithread execution (8 threads). Program compiled with:
gfortran -Ofast -fopenmp main.f90 -o gcc-omp
pgfortran -fast -mp=allcores main.f90 -o pgi-omp
Timings (best of 5):
Haswell > gcc-omp: 153.819, pgi-omp: 151.459
Skylake > gcc-omp: 113.497, pgi-omp: 107.863
When compiling with OpenMP, I checked the number of threads in the parallel regions with omp_get_num_threads(), and there are actually always 8 threads, as expected.
There are several things I don't get:
Using the GCC compiler: why does OpenMP give a substantial benefit on Skylake (277 vs 113 s), while on Haswell it gives no benefit at all (150 vs 153 s)? What is happening on Haswell?
Using the PGI compiler: why does OpenMP give such a small benefit (if any) on both platforms?
Focusing on the sequential runs, why are there such huge differences in execution times between Haswell and Skylake (especially when the program is compiled with GCC)? And why is the difference still so large, but with the roles of Haswell and Skylake reversed, when OpenMP is enabled?
Also, when OpenMP is enabled and GCC is used, the CPU time is always much larger than the wall time (as I expect), but when PGI is used the CPU and wall times are always the same, even though the program uses multiple threads.
How can I make some sense out of these results?

Code takes much more time to finish with more than 1 thread

I want to benchmark some Fortran code with OpenMP threads and a critical section. To simulate a realistic environment I tried to generate some load before this critical section.
! Compile command: gfortran -fopenmp -o minExample.x minExample.f90
PROGRAM minExample
USE omp_lib
IMPLICIT NONE
INTEGER :: n_chars, real_alloced
INTEGER :: nx,ny,nz,ix,iy,iz, idx
INTEGER :: nthreads, lasteinstellung,i
INTEGER, PARAMETER :: dp = kind(1.0d0)
REAL (KIND = dp) :: j
CHARACTER(LEN=32) :: arg
nx = 2
ny = 2
nz = 2
lasteinstellung= 10000
CALL getarg(1, arg)
READ(arg,*) nthreads
CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"
n_chars = 0
idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp shared(nx,ny,nz,n_chars) &
!$omp private(ix,iy,iz, idx) &
!$omp private(lasteinstellung,j) !&
DO iz=-nz,nz
DO iy=-ny,ny
DO ix=-nx,nx
! WRITE(*,*) ix,iy,iz
j = 0.0d0
DO i=1,lasteinstellung
j = j + real(i)
END DO
!$omp critical
n_chars = n_chars + 1
idx = n_chars
!$omp end critical
END DO
END DO
END DO
END PROGRAM
I compiled this code with gfortran -fopenmp -o test.x test.f90 and executed it with time ./test.x THREAD.
Executing this code gives some strange behaviour depending on the thread count (set with OMP_SET_NUM_THREADS): compared with one thread (6 ms), execution with more threads costs a lot more time (2 threads: 16000 ms, 4 threads: 9000 ms) on my multicore machine.
What could cause this behaviour? Is there a better (but still easy) way to generate load without running into cache effects or related things?
Edit: strange behaviour: if I keep the write in the nested loops, the execution speeds up dramatically with 2 threads. If it's commented out, the execution with 2 or 3 threads takes forever (the write shows very slow incrementation of the loop variables), but not with 1 or 4 threads. I also tried this code on another multicore machine; there it takes forever with 1 and 3 threads, but not with 2 or 4 threads.
If the code you are showing is really complete, you are missing a definition of lasteinstellung inside the parallel section, in which it is declared private. A private variable is undefined on entry to the region, so the loop
DO i=1,lasteinstellung
j = j + real(i)
END DO
can take a completely arbitrary number of iterations.
Since the value is defined before the parallel region, you probably want firstprivate instead of private.
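Concretely, only the data-sharing clauses on the directive from the question need to change (a sketch; everything else stays as posted):
!$omp parallel do default(none) collapse(3) &
!$omp shared(nx,ny,nz,n_chars) &
!$omp private(ix,iy,iz,idx,j) &
!$omp firstprivate(lasteinstellung)
With firstprivate each thread starts from the value lasteinstellung had before the region (10000), instead of an undefined private copy.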

Fortran MPI fails with more than 2 processes on a 4-processor computer

I'm working on a parallelized dynamic programming problem. My weird MPI problem is that when I run in parallel with more than 2 processes, I get this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
When I'm using 2 processes, the code runs, but with problems in the results (I'll specify these after the code that replicates the problem). Here is the compiler and OS info:
OS: Linux Ubuntu 14.04
Fortran compiler: gfortran 4.8.4 + OpenMPI(mpif90) 1.10.0
And the simplified code, starting with the main program:
PROGRAM main
USE UPDATE_VFUN
IMPLICIT NONE
INTEGER, PARAMETER :: nz=2,nb=2,nk=2,nxi=2,nnb=2,nnk=2,nnb_b=2,nnk_b=2,itmax=4
INTEGER:: myid, extra,numprocs,chunksize,indtop,indbot
INTEGER,PARAMETER:: N=2*nb*nk*nxi*nz*nz
REAL(8):: vc_ub(N), vc_b(N),vx(N),dist
CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
chunksize=int(real(N)/real(numprocs))
extra=N-chunksize*numprocs
IF (myid<extra) THEN
indtop=myid*(chunksize+1)+1
indbot=(myid+1)*(chunksize+1)
ELSE
indtop=extra*(chunksize+1)+(myid-extra)*chunksize+1
indbot=extra*(chunksize+1)+(myid-extra+1)*chunksize
END IF
DO i=1,itmax
IF (dist>stp_rule) THEN
CALL vupdate(vc_ub,vc_b,vx,dist,indtop,indbot)
PRINT *, "Current Iteration", i, "in thread", myid
PRINT *, "DISTANCE",dist
ELSE
EXIT
END IF
END DO
CALL MPI_FINALIZE(ierr)
END PROGRAM main
And the external MODULE UPDATE_VFUN:
MODULE UPDATE_VFUN
USE MPI
IMPLICIT NONE
CONTAINS
SUBROUTINE vupdate(vc_ub,vc_b,vx,dist,indtop,indbot)
REAL(8), INTENT(INOUT):: vc_ub(:), vc_b(:), vx(:)
INTEGER, INTENT(IN) :: indtop,indbot
REAL(8), INTENT(OUT) :: dist
INTEGER :: mychunk,N
REAL(8) :: vc_ub_temp(size(vc_ub)),vc_b_temp(size(vc_b)),&
vx_temp(size(vx))
REAL(8),DIMENSION(indbot-indtop+1) :: myvc_ub,myvc_b,myvx
N=size(vc_ub)
mychunk=indbot-indtop+1
myvc_ub=1.0d3
myvc_b=1.2d4
myvx=1.2d2
CALL MPI_ALLGATHER(myvc_ub,mychunk,MPI_REAL,vc_ub_temp,N,MPI_REAL,&
MPI_COMM_WORLD,ierr)
CALL MPI_ALLGATHER(myvc_b,mychunk,MPI_REAL,vc_b_temp,N,MPI_REAL,&
MPI_COMM_WORLD,ierr)
CALL MPI_ALLGATHER(myvx,mychunk,MPI_REAL,vx_temp,N,MPI_REAL,&
MPI_COMM_WORLD,ierr)
!Calculate distance
dist=max(maxval(abs(vc_ub-vc_ub_temp)),maxval(abs(vc_b- &
vc_b_temp)),maxval(abs(vx-vx_temp)))
!Updating
vc_ub=vc_ub_temp
vc_b=vc_b_temp
vx=vx_temp
END SUBROUTINE vupdate
END MODULE UPDATE_VFUN
If I specify
mpirun -np 4
in the makefile, the above error happens. If I specify -np 2, the program runs. But in the printouts:
vc_ub vc_b vx
[1] 1000 12000 120
[2] 1000 12000 120
... ... ...
[16] 1000 12000 120
[17] machine zero machine zero machine zero
... ... ...
[32] machine zero machine zero machine zero
The same pattern repeats for entries 33-48 and 49-64.
It seems that the computer takes chunksize=16 as if there are 4 processors. But with -np 2, shouldn't that be 32?
To sum up, my two questions are:
(1) Why, in this case, does -np 4 not work while -np 2 does?
(2) Why does -np 2 produce such odd results?
Many thanks for your thoughts and comments!
