Efficiency of Fortran stream access vs. MPI-IO

I have a parallel section of the code where I write out n large arrays (representing a numerical mesh) in blocks that are later read in different sized blocks. To do this I used stream access so that each processor writes its block independently, but I've seen inconsistent timings, ranging from 0.5 to 4 seconds for this section when testing with 2 processor groups.
I am aware you can do something similar with MPI-IO, but I'm not sure what the benefits would be since there is no synchronization necessary. I would like to know if there is a way to either improve performance of my writes, or if there is a reason MPI-IO would be a better choice for this section.
Here is a sample of the code section where I create the files to write norb arrays using two groups (mygroup = 0 or 1):
do irbsic=1,norb
[various operations]
blocksize=int(nmsh_tot/ngroups)
OPEN(unit=iunit,FILE='ZPOT',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) POT(1:nmsh)
CLOSE(iunit)
OPEN(unit=iunit,FILE='RHOI',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) RHOG(1:nmsh,1,1)
CLOSE(iunit)
[various operations]
end do
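A minimal sketch of the offset arithmetic used above, assuming 8-byte values and 32-bit default integers (both are assumptions; the question does not state the declarations). With default integers, (IRBSIC-1)*nmsh_tot*8 can exceed the 32-bit range for large meshes, so the hypothetical helper stream_pos below performs the same calculation in 64-bit integers. The module, function, and argument names are illustrative and not part of the original code.

```fortran
module stream_offsets
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! 1-based stream POS (in bytes) of the block that group `mygroup`
  ! writes for array number `irbsic`. Hypothetical helper.
  pure function stream_pos(irbsic, mygroup, nmsh_tot, ngroups, bytes_per_value) result(pos)
    integer, intent(in) :: irbsic, mygroup, nmsh_tot, ngroups, bytes_per_value
    integer(int64) :: pos
    integer(int64) :: blocksize
    ! Same integer division as blocksize=int(nmsh_tot/ngroups) in the question
    blocksize = int(nmsh_tot, int64) / int(ngroups, int64)
    ! Convert every factor to int64 before multiplying to avoid overflow
    pos = 1_int64 &
        + int(irbsic - 1, int64) * int(nmsh_tot, int64) * int(bytes_per_value, int64) &
        + int(mygroup, int64) * blocksize * int(bytes_per_value, int64)
  end function stream_pos
end module stream_offsets

program stream_pos_demo
  use stream_offsets, only: stream_pos
  implicit none
  ! Small mesh: second array, second of two groups
  print '(i0)', stream_pos(2, 1, 1000, 2, 8)        ! prints 12001
  ! Large mesh: the result no longer fits in a 32-bit integer
  print '(i0)', stream_pos(3, 0, 200000000, 2, 8)   ! prints 3200000001
end program stream_pos_demo
```

The result can be passed directly as POS= in a stream WRITE, provided the variable receiving it is also declared integer(int64).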

(As discussed in the comments) I would strongly recommend against using Fortran stream access for this. Standard Fortran I/O is only guaranteed to work if the file is being accessed by a single process, and in my own work I have seen random corruption of files when multiple processes try to write to them at once, even if the processes are writing to different parts of the file. MPI-I/O, or a library such as HDF5 or NetCDF that uses MPI-I/O, is the only sensible way to achieve this. Below is a simple program illustrating the use of mpi_file_write_at_all:
ian@eris:~/work/stack$ cat at.f90
Program write_at
Use mpi
Implicit None
Integer, Parameter :: n = 4
Real, Dimension( 1:n ) :: a
Real, Dimension( : ), Allocatable :: all_of_a
Integer :: me, nproc
Integer :: handle
Integer :: i
Integer :: error
! Set up MPI
Call mpi_init( error )
Call mpi_comm_size( mpi_comm_world, nproc, error )
Call mpi_comm_rank( mpi_comm_world, me , error )
! Provide some data
a = [ ( i, i = n * me, n * ( me + 1 ) - 1 ) ]
! Open the file
Call mpi_file_open( mpi_comm_world, 'stuff.dat', &
mpi_mode_create + mpi_mode_wronly, mpi_info_null, handle, error )
! Describe how the processes will view the file - in this case
! simply a stream of mpi_real
Call mpi_file_set_view( handle, 0_mpi_offset_kind, &
mpi_real, mpi_real, 'native', &
mpi_info_null, error )
! Write the data using a collective routine - generally the most efficient
! but as collective all processes within the communicator must call the routine
Call mpi_file_write_at_all( handle, Int( me * n,mpi_offset_kind ) , &
a, Size( a ), mpi_real, mpi_status_ignore, error )
! Close the file
Call mpi_file_close( handle, error )
! Read the file on rank zero using Fortran to check the data
If( me == 0 ) Then
Open( 10, file = 'stuff.dat', access = 'stream' )
Allocate( all_of_a( 1:n * nproc ) )
Read( 10, pos = 1 ) all_of_a
Write( *, * ) all_of_a
End If
! Shut down MPI
Call mpi_finalize( error )
End Program write_at
ian@eris:~/work/stack$ mpif90 --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ian@eris:~/work/stack$ mpif90 -Wall -Wextra -fcheck=all -std=f2008 at.f90
ian@eris:~/work/stack$ mpirun -np 2 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000
ian@eris:~/work/stack$ mpirun -np 5 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000 8.00000000 9.00000000 10.0000000 11.0000000 12.0000000 13.0000000 14.0000000 15.0000000 16.0000000 17.0000000 18.0000000 19.0000000
ian@eris:~/work/stack$

Related

GFortran unformatted I/O throughput on NVMe SSDs

Please help me understand how I can improve sequential, unformatted I/O throughput with (G)Fortran, especially when working on NVMe SSDs.
I wrote a little test program, see bottom of this post. What this does is open one or more files in parallel (OpenMP) and write an array of random numbers into it. Then it flushes system caches (root required, otherwise the read test will most likely read from memory) opens the files, and reads from them. Time is measured in wall time (trying to include only I/O-related times), and performance numbers are given in MiB/s. The program loops until aborted.
The hardware I am using for testing is a Samsung 970 Evo Plus 1TB SSD, connected via 2 PCIe 3.0 lanes. So in theory, it should be capable of ~1500MiB/s sequential reads and writes.
Testing beforehand with "dd if=/dev/zero of=./testfile bs=1G count=1 oflag=direct" results in ~750MB/s. Not too great, but still better than what I get with Gfortran. And depending on who you ask, dd should not be used for benchmarking anyway. This is just to make sure that the hardware is in theory capable of more.
Results with my code tend to get better with larger file size, but even with 1GiB it caps out at around 200MiB/s write, 420MiB/s read. Using more threads (e.g. 4) increases write speeds a bit, but only to around 270MiB/s.
I made sure to keep the benchmark runs short, and give the SSD time to relax between tests.
I was under the impression that it should be possible to saturate 2 PCIe 3.0 lanes worth of bandwidth, even with only a single thread. At least when using unformatted I/O.
The code does not seem to be CPU limited, top shows less than 50% usage on a single core if I move the allocation and initialization of the "values" field out of the loop. Which still does not bode well for overall performance, considering that I would like to see numbers that are at least 5 times higher.
I also tried to use access=stream for the open statements, but to no avail.
So what seems to be the problem?
Is my code wrong/unoptimized? Are my expectations too high?
Platform used:
Opensuse Leap 15.1, Kernel 4.12.14-lp151.28.36-default
2x AMD Epyc 7551, Supermicro H11DSI, Samsung 970 Evo Plus 1TB (2xPCIe 3.0)
gcc version 8.2.1, compiler options: -ffree-line-length-none -O3 -ffast-math -funroll-loops -flto
MODULE types
implicit none
save
INTEGER, PARAMETER :: I8B = SELECTED_INT_KIND(18)
INTEGER, PARAMETER :: I4B = SELECTED_INT_KIND(9)
INTEGER, PARAMETER :: SP = KIND(1.0)
INTEGER, PARAMETER :: DP = KIND(1.0d0)
END MODULE types
MODULE parameters
use types
implicit none
save
INTEGER(I4B) :: filesize ! file size in MiB
INTEGER(I4B) :: nthreads ! number of threads for parallel execution
INTEGER(I4B) :: alloc_size ! size of the allocated data field
END MODULE parameters
PROGRAM iometer
use types
use parameters
use omp_lib
implicit none
CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
CHARACTER(LEN=40) :: dummy_char1
CHARACTER(LEN=110) :: filename
CHARACTER(LEN=10) :: filenumber
INTEGER(I4B) :: thread, tunit, n
INTEGER(I8B) :: counti, countf, count_rate
REAL(DP) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
REAL(SP), DIMENSION(:), ALLOCATABLE :: values
call system_clock(counti,count_rate)
call getarg(1,directory_char)
dummy_char1 = ' directory to test:'
write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
call getarg(2,filesize_char)
dummy_char1 = ' file size (MiB):'
read(filesize_char,*) filesize
write(*,'(A40,I12)') dummy_char1, filesize
call getarg(3,nthreads_char)
dummy_char1 = ' number of parallel threads:'
read(nthreads_char,*) nthreads
write(*,'(A40,I12)') dummy_char1, nthreads
alloc_size = filesize * 262144
dummy_char1 = ' allocation size:'
write(*,'(A40,I12)') dummy_char1, alloc_size
mib_written = real(alloc_size,kind=dp) * real(nthreads,kind=dp) / 1048576.0_dp
mib_read = mib_written
CALL OMP_SET_NUM_THREADS(nthreads)
do while(.true.)
!$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
thread = omp_get_thread_num()
write(filenumber,'(I0.10)') thread
filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
allocate(values(alloc_size))
call random_seed()
call RANDOM_NUMBER(values)
tunit = thread + 100
!$OMP BARRIER
!$OMP MASTER
call system_clock(counti)
!$OMP END MASTER
!$OMP BARRIER
open(unit=tunit, file=trim(adjustl(filename)), status='replace', action='write', form='unformatted')
write(tunit) values
close(unit=tunit)
!$OMP BARRIER
!$OMP MASTER
call system_clock(countf)
telapsed_write = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
write_speed = mib_written/telapsed_write
!write(*,*) 'write speed (MiB/s): ', write_speed
call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
call system_clock(counti)
!$OMP END MASTER
!$OMP BARRIER
open(unit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted')
read(tunit) values
close(unit=tunit)
!$OMP BARRIER
!$OMP MASTER
call system_clock(countf)
telapsed_read = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
read_speed = mib_read/telapsed_read
write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
!$OMP END MASTER
!$OMP BARRIER
deallocate(values)
!$OMP END PARALLEL
call sleep(1)
end do
END PROGRAM iometer
The mistake in your code is that in your calculation of mib_written you have forgotten to take into account the size of a real(sp) variable (4 bytes). Thus your results are a factor of 4 too low. E.g. calculate it as
mib_written = filesize * nthreads
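As a worked check of that factor of 4, here is a small sketch. The helper mib_transferred and the sample inputs (a 1024 MiB file, 4 threads) are hypothetical. It converts an element count to MiB using the element size reported by storage_size, and compares the result with the original formula, which divides the element count alone by 1048576.

```fortran
module io_units
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! MiB moved when each of `nthreads` threads transfers `n_values`
  ! elements of `bits_per_value` bits each. Hypothetical helper.
  pure function mib_transferred(n_values, bits_per_value, nthreads) result(mib)
    integer(int64), intent(in) :: n_values
    integer, intent(in) :: bits_per_value, nthreads
    real(real64) :: mib
    mib = real(n_values, real64) * real(bits_per_value / 8, real64) &
        * real(nthreads, real64) / 1048576.0_real64
  end function mib_transferred
end module io_units

program mib_check
  use, intrinsic :: iso_fortran_env, only: int64, real32, real64
  use io_units, only: mib_transferred
  implicit none
  integer, parameter :: filesize = 1024, nthreads = 4   ! sample inputs (assumed)
  integer(int64) :: alloc_size
  real(real64) :: mib_original
  alloc_size = int(filesize, int64) * 262144_int64      ! number of 4-byte reals
  ! Original formula: counts elements, not bytes
  mib_original = real(alloc_size, real64) * real(nthreads, real64) / 1048576.0_real64
  print '(a,f0.1)', 'original formula : ', mib_original
  ! Corrected: element count times bytes per element
  print '(a,f0.1)', 'with element size: ', &
      mib_transferred(alloc_size, storage_size(1.0_real32), nthreads)
end program mib_check
```

For these inputs the corrected value equals filesize * nthreads (4096 MiB), while the original formula gives a quarter of that.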
Some minor nits, some specific to GFortran:
Don't repeatedly call random_seed, particularly not from each thread. If you want to call it, call it once in the beginning of the program.
You can use open(newunit=tunit, ...) to let the compiler runtime allocate a unique unit number for each file.
If you want the 'standard' 64-bit integer/floating point kinds, you can use the variables int64 and real64 from the iso_fortran_env intrinsic module.
For testing with larger files, you need to make alloc_size of kind int64.
Use the standard get_command_argument intrinsic instead of the nonstandard getarg.
access='stream' is slightly faster than the default (sequential) as there's no need to handle the record length markers.
Your test program with these fixes (and the parameters module folded into the main program) below:
PROGRAM iometer
use iso_fortran_env
use omp_lib
implicit none
CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
CHARACTER(LEN=40) :: dummy_char1
CHARACTER(LEN=110) :: filename
CHARACTER(LEN=10) :: filenumber
INTEGER :: thread, tunit
INTEGER(int64) :: counti, countf, count_rate
REAL(real64) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
REAL, DIMENSION(:), ALLOCATABLE :: values
INTEGER :: filesize ! file size in MiB
INTEGER :: nthreads ! number of threads for parallel execution
INTEGER(int64) :: alloc_size ! size of the allocated data field
call system_clock(counti,count_rate)
call get_command_argument(1, directory_char)
dummy_char1 = ' directory to test:'
write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
call get_command_argument(2, filesize_char)
dummy_char1 = ' file size (MiB):'
read(filesize_char,*) filesize
write(*,'(A40,I12)') dummy_char1, filesize
call get_command_argument(3, nthreads_char)
dummy_char1 = ' number of parallel threads:'
read(nthreads_char,*) nthreads
write(*,'(A40,I12)') dummy_char1, nthreads
alloc_size = filesize * 262144_int64
dummy_char1 = ' allocation size:'
write(*,'(A40,I12)') dummy_char1, alloc_size
mib_written = filesize * nthreads
dummy_char1 = ' MiB written:'
write(*, '(A40,g0)') dummy_char1, mib_written
mib_read = mib_written
CALL OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
do while (.true.)
thread = omp_get_thread_num()
write(filenumber,'(I0.10)') thread
filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
if (.not. allocated(values)) then
allocate(values(alloc_size))
call RANDOM_NUMBER(values)
end if
open(newunit=tunit, file=filename, status='replace', action='write', form='unformatted', access='stream')
!$omp barrier
!$omp master
call system_clock(counti)
!$omp end master
!$omp barrier
write(tunit) values
close(unit=tunit)
!$omp barrier
!$omp master
call system_clock(countf)
telapsed_write = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
write_speed = mib_written/telapsed_write
call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
!$OMP END MASTER
open(newunit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted', access='stream')
!$omp barrier
!$omp master
call system_clock(counti)
!$omp end master
!$omp barrier
read(tunit) values
close(unit=tunit)
!$omp barrier
!$omp master
call system_clock(countf)
telapsed_read = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
read_speed = mib_read/telapsed_read
write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
!$OMP END MASTER
call sleep(1)
end do
!$OMP END PARALLEL
END PROGRAM iometer

Fortran/OpenMP comparison on 2 platforms (GCC and PGI compilers). Unexpected execution times

I compiled (with the GCC and PGI compilers) and ran a small Fortran/OpenMP program on two different platforms (Haswell- and Skylake-based), just to get a feeling for the difference in performance. I do not know how to interpret the results; they are a mystery to me.
Here is the small program (taken from Nvidia Developer website and slightly adapted).
PROGRAM main
use, intrinsic :: iso_fortran_env, only: sp=>real32, dp=>real64
use, intrinsic :: omp_lib
implicit none
real(dp), parameter :: tol = 1.0d-6
integer, parameter :: iter_max = 1000
real(dp), allocatable :: A(:,:), Anew(:,:)
real(dp) :: error
real(sp) :: cpu_t0, cpu_t1
integer :: it0, it1, sys_clock_rate, iter, i, j
integer :: N, M
character(len=8) :: arg
call get_command_argument(1, arg)
read(arg, *) N !!! N = 8192 provided from command line
call get_command_argument(2, arg)
read(arg, *) M !!! M = 8192 provided from command line
allocate( A(N,M), Anew(N,M) )
A(1,:) = 1
A(2:N,:) = 0
Anew(1,:) = 1
Anew(2:N,:) = 0
iter = 0
error = 1
call cpu_time(cpu_t0)
call system_clock(it0)
do while ( (error > tol) .and. (iter < iter_max) )
error = 0
!$omp parallel do reduction(max: error) private(i)
do j = 2, M-1
do i = 2, N-1
Anew(i,j) = (A(i+1,j)+A(i-1,j)+A(i,j-1)+A(i,j+1)) / 4
error = max(error, abs(Anew(i,j)-A(i,j)))
end do
end do
!$omp end parallel do
!$omp parallel do private(i)
do j = 2, M-1
do i = 2, N-1
A(i,j) = Anew(i,j)
end do
end do
!$omp end parallel do
iter = iter + 1
end do
call cpu_time(cpu_t1)
call system_clock(it1, sys_clock_rate)
write(*,'(a,f8.3,a)') "...cpu time :", cpu_t1-cpu_t0, " s"
write(*,'(a,f8.3,a)') "...wall time:", real(it1-it0)/real(sys_clock_rate), " s"
END PROGRAM
The two platforms I used are:
Intel i7-4770 @ 3.40GHz (Haswell), 32 GB RAM / Ubuntu 16.04.2 LTS
Intel i7-6700 @ 3.40GHz (Skylake), 32 GB RAM / Linux Mint 18.1 (~ Ubuntu 16.04)
On each platform I compiled the Fortran program with
GCC gfortran 6.2.0
PGI pgfortran 16.10 community edition
I obviously compiled the program independently on each platform (I only moved the .f90 file; I did not move any binary file)
I ran 5 times each of the 4 executables (2 for each platform), collecting the wall times measured in seconds (as printed out by the program). (Well, I ran the whole test several times, and the timings below are definitely representative)
Sequential execution. Program compiled with:
gfortran -Ofast main.f90 -o gcc-seq
pgfortran -fast main.f90 -o pgi-seq
Timings (best of 5):
Haswell > gcc-seq: 150.955, pgi-seq: 165.973
Skylake > gcc-seq: 277.400, pgi-seq: 121.794
Multithread execution (8 threads). Program compiled with:
gfortran -Ofast -fopenmp main.f90 -o gcc-omp
pgfortran -fast -mp=allcores main.f90 -o pgi-omp
Timings (best of 5):
Haswell > gcc-omp: 153.819, pgi-omp: 151.459
Skylake > gcc-omp: 113.497, pgi-omp: 107.863
When compiling with OpenMP, I checked the number of threads in the parallel regions with omp_get_num_threads(), and there are actually always 8 threads, as expected.
There are several things I don't get:
Using the GCC compiler: why does OpenMP bring a substantial benefit on Skylake (277 vs 113 s), while on Haswell it brings no benefit at all (150 vs 153 s)? What's happening on Haswell?
Using the PGI compiler: why does OpenMP bring such a small benefit (if any) on both platforms?
Focusing on the sequential runs, why are there such huge differences in execution times between Haswell and Skylake (especially when the program is compiled with GCC)? Why is this difference still so large, but with the roles of Haswell and Skylake reversed, when OpenMP is enabled?
Also, when OpenMP is enabled and GCC is used, the cpu time is always much larger than the wall time (as I expect), but when PGI is used, the cpu and wall times are always the same, even though the program used multiple threads.
How can I make some sense out of these results?

MPI loop increases memory usage/ memory leak

I am working on a Fortran program using MPI where an array is split up into strips, each strip is sent to a rank for calculations to be done and then the edge of each array is sent to the rank next to it to update the next timestep. It is an iterative process so each edge is passed to its neighboring rank many times. It works fine, however, as I have started to run the program for larger arrays and more time steps I noticed (via top) that there appears to be a memory leak in that each process is continually increasing the amount of memory it is using. If I run the program long enough eventually it uses up all the system memory and crashes my machine.
All I am trying to do is send a string of data from one rank to another many times in a row, and I don't know why the memory usage should increase from passing this information. Below is an example which exhibits the behavior. Am I somehow not releasing the passed data from memory, causing it to build up over time?
(EDIT: changed example code to reflect comment below)
Program main
use mpi
REAL(KIND=KIND(0.0D0)) ::putbuf ( 1000 ), getbuf ( 1000 )
INTEGER, PARAMETER :: from = 2, to = 3,fromtag = 123, totag = 456
INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status
INTEGER :: error, rank
! Initialize MPI.
call MPI_Init ( error )
! Get the number of processes.
call MPI_Comm_size ( MPI_COMM_WORLD, num_procs, error )
! Get the individual process ID.
call MPI_Comm_rank ( MPI_COMM_WORLD, rank, error )
do i = 1,50000000
putbuf = i
if (rank == 0) then
CALL MPI_Sendrecv ( putbuf, 1000,MPI_DOUBLE_PRECISION, 1, 123,getbuf, 1000, MPI_DOUBLE_PRECISION,&
1, 456,MPI_COMM_WORLD, status, error )
else
CALL MPI_Sendrecv ( putbuf, 1000,MPI_DOUBLE_PRECISION, 0, 456,getbuf, 1000, MPI_DOUBLE_PRECISION,&
0, 123,MPI_COMM_WORLD, status, error )
endif
enddo
call MPI_Finalize ( error )
endprogram

Fortran MPI fails when more than 2 threads on a 4 processors computer

I'm working on a parallelized dynamic programming problem. My odd MPI problem is that when I run on more than 2 processors, I get this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
When I'm using 2 processors, the code can run with some problem in results (I'll specify after the code that can replicate the problem). Here's a list of compilers, OS info:
OS: Linux Ubuntu 14.04
Fortran compiler: gfortran 4.8.4 + OpenMPI(mpif90) 1.10.0
And the simplified code: main:
PROGRAM main
USE UPDATE_VFUN
IMPLICIT NONE
INTEGER, PARAMETER :: nz=2,nb=2,nk=2,nxi=2,nnb=2,nnk=2,nnb_b=2,nnk_b=2,itmax=4
INTEGER:: myid, extra,numprocs,chunksize,indtop,indbot
INTEGER,PARAMETER:: N=2*nb*nk*nxi*nz*nz
REAL(8):: vc_ub(N), vc_b(N),vx(N),dist
CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
chunksize=int(real(N)/real(numprocs))
extra=N-chunksize*numprocs
IF (myid<extra) THEN
indtop=myid*(chunksize+1)+1
indbot=(myid+1)*(chunksize+1)
ELSE
indtop=extra*(chunksize+1)+(myid-extra)*chunksize+1
indbot=extra*(chunksize+1)+(myid-extra+1)*chunksize
END IF
DO i=1,itmax
IF (dist>stp_rule) THEN
CALL vupdate(vc_ub,vc_b,vx,dist,indtop,indbot)
PRINT *, "Current Iteration", i, "in thread", myid
PRINT *, "DISTANCE",dist
ELSE
EXIT
END IF
END DO
CALL MPI_FINALIZE(ierr)
END PROGRAM main
And the external MODULE UPDATE_VFUN:
MODULE UPDATE_VFUN
USE MPI
IMPLICIT NONE
CONTAINS
SUBROUTINE vupdate(vc_ub,vc_b,vx,dist,indtop,indbot)
REAL(8), INTENT(INOUT):: vc_ub(:), vc_b(:), vx(:)
INTEGER, INTENT(IN) :: indtop,indbot
REAL(8), INTENT(OUT) :: dist
INTEGER :: mychunk,N
REAL(8) :: vc_ub_temp(size(vc_ub)),vc_b_temp(size(vc_b)),&
vx_temp(size(vx))
REAL(8),DIMENSION(indbot-indtop+1) :: myvc_ub,myvc_b,myvx
N=size(vc_ub)
mychunk=indbot-indtop+1
myvc_ub=1.0d3
myvc_b=1.2d4
myvx=1.2d2
CALL MPI_ALLGATHER(myvc_ub,mychunk,MPI_REAL,vc_ub_temp,N,MPI_REAL,&
MPI_COMM_WORLD,ierr)
CALL MPI_ALLGATHER(myvc_b,mychunk,MPI_REAL,vc_b_temp,N,MPI_REAL,&
MPI_COMM_WORLD,ierr)
CALL MPI_ALLGATHER(myvx,mychunk,MPI_REAL,vx_temp,N,MPI_REAL,&
MPI_COMM_WORLD,ierr)
!Calculate distance
dist=max(maxval(abs(vc_ub-vc_ub_temp)),maxval(abs(vc_b- &
vc_b_temp)),maxval(abs(vx-vx_temp)))
!Updating
vc_ub=vc_ub_temp
vc_b=vc_b_temp
vx=vx_temp
END SUBROUTINE vupdate
END MODULE UPDATE_VFUN
If I specify
mpirun -np 4
in makefile, the above error happens. If I specify -np 2, the program can run. But in print outs:
vc_ub vc_b vx
[1] 1000 12000 120
[2] 1000 12000 120
... ... ...
[16] 1000 12000 120
[17] machine zero machine zero machine zero
... ... ...
[32] machine zero machine zero machine zero
SAME PATTERN FOR 33-48, AND 49-64.
It seems that the computer takes chunksize=16 as if there are 4 processors. But with -np 2, shouldn't that be 32?
To sum up, my two questions are:
(1) why in this case, -np 4 cannot work, while -np 2 can?
(2) why -np 2 produces such goofy results?
Many thanks for your thoughts and comments!
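One possible reading of these symptoms, offered as a sketch rather than a verified diagnosis: in MPI_ALLGATHER the receive count is the number of elements received from each rank, not the size of the whole receive buffer, and MPI_REAL describes a default REAL, whereas REAL(8) under gfortran is 8 bytes and corresponds to MPI_DOUBLE_PRECISION. With N passed as the receive count and a 4-byte type, the data from rank r would land N*4 bytes further along for each rank, that is, N/2 double-precision elements. For N=64 that places rank 1's data at element 33 with -np 2, and places the data from ranks 2 and 3 past the end of the buffer with -np 4, which would be consistent with both the printed pattern and the segmentation fault.

When N does not divide evenly, MPI_ALLGATHERV takes a per-rank count array and a zero-based displacement array. The hypothetical helpers below compute both using the same chunking rule as the indtop/indbot code above. The module and function names are illustrative only.

```fortran
module block_partition
  implicit none
contains
  ! Number of elements owned by `rank` (zero-based) when n elements
  ! are split over numprocs ranks; the first mod(n,numprocs) ranks get one extra.
  pure function block_count(n, numprocs, rank) result(cnt)
    integer, intent(in) :: n, numprocs, rank
    integer :: cnt
    cnt = n / numprocs + merge(1, 0, rank < mod(n, numprocs))
  end function block_count

  ! Zero-based displacement of the first element owned by `rank`.
  pure function block_displ(n, numprocs, rank) result(displ)
    integer, intent(in) :: n, numprocs, rank
    integer :: displ
    displ = rank * (n / numprocs) + min(rank, mod(n, numprocs))
  end function block_displ
end module block_partition

program partition_demo
  use block_partition, only: block_count, block_displ
  implicit none
  integer, parameter :: n = 10, numprocs = 4   ! sample sizes (assumed)
  integer :: counts(0:numprocs-1), displs(0:numprocs-1), r
  counts = [(block_count(n, numprocs, r), r = 0, numprocs - 1)]
  displs = [(block_displ(n, numprocs, r), r = 0, numprocs - 1)]
  print '(a,4(1x,i0))', 'counts:', counts   ! prints counts: 3 3 2 2
  print '(a,4(1x,i0))', 'displs:', displs   ! prints displs: 0 3 6 8
end program partition_demo
```

In the subroutine above, these arrays would be passed as the recvcounts and displs arguments of MPI_ALLGATHERV, with MPI_DOUBLE_PRECISION as both the send and receive type; indtop corresponds to the displacement plus one.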

What is the best way to transfer data (Real and Integer arrays) between two runnings fortran programs on the same machine?

We are currently using file I/O but need a better/faster way. Sample code would be appreciated.
By using files for transfer, you're already implementing a form of message passing, and so I think that would be the most natural fit for this sort of program. Now, you could write something yourself that uses shared memory when available and something like TCP/IP when not - or you could just use a library that already does that, like MPI, which is widely available, works, will take advantage of shared memory if you are running on the same machine, but would then also extend to letting you run them on different machines entirely without you changing your code.
So as a simple example of one program sending data to a second and then waiting for data back, we'd have two programs as follows; first.f90
program first
use protocol
use mpi
implicit none
real, dimension(n,m) :: inputdata
real, dimension(n,m) :: processeddata
integer :: rank, comsize, ierr, otherrank
integer :: rstatus(MPI_STATUS_SIZE)
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, comsize, ierr)
if (comsize /= 2) then
print *,'Error: this assumes n=2!'
call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
endif
!! 2 PEs; the other is 1 if we're 0, or 0 if we're 1.
otherrank = comsize - (rank+1)
inputdata = 1.
inputdata = exp(sin(inputdata))
print *, rank, ': first: finished computing; now sending to second.'
call MPI_SEND(inputdata, n*m, MPI_REAL, otherrank, firsttag, &
MPI_COMM_WORLD, ierr)
print *, rank, ': first: Now waiting for return data...'
call MPI_RECV(processeddata, n*m, MPI_REAL, otherrank, backtag, &
MPI_COMM_WORLD, rstatus, ierr)
print *, rank, ': first: recieved data from partner.'
call MPI_FINALIZE(ierr)
end program first
and second.f90:
program second
use protocol
use mpi
implicit none
real, dimension(n,m) :: inputdata
real, dimension(n,m) :: processeddata
integer :: rank, comsize, ierr, otherrank
integer :: rstatus(MPI_STATUS_SIZE)
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, comsize, ierr)
if (comsize /= 2) then
print *,'Error: this assumes n=2!'
call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
endif
!! 2 PEs; the other is 1 if we're 0, or 0 if we're 1.
otherrank = comsize - (rank+1)
print *, rank, ': second: Waiting for initial data...'
call MPI_RECV(inputdata, n*m, MPI_REAL, otherrank, firsttag, &
MPI_COMM_WORLD, rstatus, ierr)
print *, rank, ': second: adding 1 and sending back.'
processeddata = inputdata + 1
call MPI_SEND(processeddata, n*m, MPI_REAL, otherrank, backtag, &
MPI_COMM_WORLD, ierr)
print *, rank, ': second: completed'
call MPI_FINALIZE(ierr)
end program second
For clarity, stuff that the two programs must agree on could be in a module they both use, here protocol.f90:
module protocol
!! shared information like tag ids, etc goes here
integer, parameter :: firsttag = 1
integer, parameter :: backtag = 2
!! size of problem
integer, parameter :: n = 10, m = 20
end module protocol
(A makefile for building the executables follows:)
all: first second
FFLAGS=-g -Wall
F90=mpif90
%.mod: %.f90
$(F90) -c $(FFLAGS) $^
%.o: %.f90
$(F90) -c $(FFLAGS) $^
first: protocol.mod first.o
$(F90) -o $@ first.o protocol.o
second: protocol.mod second.o
$(F90) -o $@ second.o protocol.o
clean:
rm -rf *.o *.mod
and then you run the two programs as follows:
$ mpiexec -n 1 ./first : -n 1 ./second
1 : second: Waiting for initial data...
0 : first: finished computing; now sending to second.
0 : first: Now waiting for return data...
1 : second: adding 1 and sending back.
1 : second: completed
0 : first: recieved data from partner.
$
We could certainly give you a more relevant example if you give us more information about the workflow between the two programs.
Are you using binary (unformatted) file I/O? Unless the data quantity is huge, that should be fast.
Otherwise you could use interprocess communication, but it would be more complicated. You might find code in C, which you could call from Fortran using the ISO C Binding.

Resources