intel MKL pardiso won't run parallel in fortran - multithreading

I am trying to get the intel MKL version of pardiso to work with multiple cores. Im using it to solve a structurally symmetric system (mtype=1) with around 60K equations.
iparm= 0
iparm(1) = 1 !
iparm(2) = 3 !
iparm(3) = omp_get_max_threads() !
iparm(4) = 0 !
iparm(5) = 0 !
iparm(6) = 0 !
iparm(7) = 0 !
iparm(8) = 9 !
iparm(9) = 0 !
iparm(10) = 13
iparm(11) = 1
iparm(12) = 0
iparm(13) = 0
iparm(14) = 0
iparm(15) = 0
iparm(16) = 0
iparm(17) = 0
iparm(18) = -1
iparm(19) = -1
iparm(20) = 0
These are the my ipram parameters. When compiling I have
F90FLAGS = ${F77FLAGS} -I${SOLIDroot} -openmp -mkl=parallel -d-lines -debug
Before calling pardiso I also set the number of threads available to MKL and openmp
call mkl_set_num_threads(3)
call omp_set_num_threads(3)
call mkl_set_dynamic(0) ! disabling dynamic adjustment of the number of threads
As far as I have understood, all MKL functions will try to use multiple threads if allowed or enabled for "sufficiently" large problems. I already have some parallelism using OMP and the code runs on several cores. The region from which I call pardiso is serial. My question is, what else is needed to make pardiso work with multiple cores?
Tried with the default values for iparm, ie iparm(1)=0 and there was no change

I cannot add a comment (not enough reputations).
You can try setting the environment variable OMP_NUM_THREADS before you run the code and see if that works.

Related

Efficiency of Fortran stream access vs. MPI-IO

I have a parallel section of the code where I write out n large arrays (representing a numerical mesh) in blocks that are later read in different sized blocks. To do this I used Stream access so each processor writes their block independently, but I've seen inconsistent timings taking from 0.5-4 seconds in this section testing with 2 processor groups.
I am aware you can do something similar with MPI-IO, but I'm not sure what the benefits would be since there is no synchronization necessary. I would like to know if there is a way to either improve performance of my writes, or if there is a reason MPI-IO would be a better choice for this section.
Here is a sample of the code section where I create the files to write norb arrays using two groups (mygroup = 0 or 1]:
do irbsic=1,norb
[various operations]
blocksize=int(nmsh_tot/ngroups)
OPEN(unit=iunit,FILE='ZPOT',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) POT(1:nmsh)
CLOSE(iunit)
OPEN(unit=iunit,FILE='RHOI',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) RHOG(1:nmsh,1,1)
CLOSE(iunit)
[various operations]
end do
(As discussed in the comments) I would strongly recommend against using Fortran stream access for this. Standard Fortran I/O is only guaranteed to work if the file is being accessed by a single process, and in my own work I have seen random corruptions of files when multiple processes try to write to them at once, even if the processes are writing to different parts of the file. MPI-I/O, or a library such as HDF5 or NetCDF which uses MPI-I/O is the only sensible way to achieve this. Below is a simple program illustrating the use of mpi_file_write_at_all
ian#eris:~/work/stack$ cat at.f90
Program write_at
Use mpi
Implicit None
Integer, Parameter :: n = 4
Real, Dimension( 1:n ) :: a
Real, Dimension( : ), Allocatable :: all_of_a
Integer :: me, nproc
Integer :: handle
Integer :: i
Integer :: error
! Set up MPI
Call mpi_init( error )
Call mpi_comm_size( mpi_comm_world, nproc, error )
Call mpi_comm_rank( mpi_comm_world, me , error )
! Provide some data
a = [ ( i, i = n * me, n * ( me + 1 ) - 1 ) ]
! Open the file
Call mpi_file_open( mpi_comm_world, 'stuff.dat', &
mpi_mode_create + mpi_mode_wronly, mpi_info_null, handle, error )
! Describe how the processes will view the file - in this case
! simply a stream of mpi_real
Call mpi_file_set_view( handle, 0_mpi_offset_kind, &
mpi_real, mpi_real, 'native', &
mpi_info_null, error )
! Write the data using a collective routine - generally the most efficent
! but as collective all processes within the communicator must call the routine
Call mpi_file_write_at_all( handle, Int( me * n,mpi_offset_kind ) , &
a, Size( a ), mpi_real, mpi_status_ignore, error )
! Close the file
Call mpi_file_close( handle, error )
! Read the file on rank zero using Fortran to check the data
If( me == 0 ) Then
Open( 10, file = 'stuff.dat', access = 'stream' )
Allocate( all_of_a( 1:n * nproc ) )
Read( 10, pos = 1 ) all_of_a
Write( *, * ) all_of_a
End If
! Shut down MPI
Call mpi_finalize( error )
End Program write_at
ian#eris:~/work/stack$ mpif90 --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ian#eris:~/work/stack$ mpif90 -Wall -Wextra -fcheck=all -std=f2008 at.f90
ian#eris:~/work/stack$ mpirun -np 2 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000
ian#eris:~/work/stack$ mpirun -np 5 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000 8.00000000 9.00000000 10.0000000 11.0000000 12.0000000 13.0000000 14.0000000 15.0000000 16.0000000 17.0000000 18.0000000 19.0000000
ian#eris:~/work/stack$

Fortran/OpenMP comparison on 2 platforms (GCC and PGI compilers). Unexpected execution times

I compiled (with GCC and PGI compilers) and run a small Fortran/OpenMP program on two different platforms (Haswell- and Skylake-based), just to get a feeling of the difference of the performance. I do not know how to interpret the results - they are a mistery to me.
Here is the small program (taken from Nvidia Developer website and slightly adapted).
PROGRAM main
use, intrinsic :: iso_fortran_env, only: sp=>real32, dp=>real64
use, intrinsic :: omp_lib
implicit none
real(dp), parameter :: tol = 1.0d-6
integer, parameter :: iter_max = 1000
real(dp), allocatable :: A(:,:), Anew(:,:)
real(dp) :: error
real(sp) :: cpu_t0, cpu_t1
integer :: it0, it1, sys_clock_rate, iter, i, j
integer :: N, M
character(len=8) :: arg
call get_command_argument(1, arg)
read(arg, *) N !!! N = 8192 provided from command line
call get_command_argument(2, arg)
read(arg, *) M !!! M = 8192 provided from command line
allocate( A(N,M), Anew(N,M) )
A(1,:) = 1
A(2:N,:) = 0
Anew(1,:) = 1
Anew(2:N,:) = 0
iter = 0
error = 1
call cpu_time(cpu_t0)
call system_clock(it0)
do while ( (error > tol) .and. (iter < iter_max) )
error = 0
!$omp parallel do reduction(max: error) private(i)
do j = 2, M-1
do i = 2, N-1
Anew(i,j) = (A(i+1,j)+A(i-1,j)+A(i,j-1)+A(i,j+1)) / 4
error = max(error, abs(Anew(i,j)-A(i,j)))
end do
end do
!$omp end parallel do
!$omp parallel do private(i)
do j = 2, M-1
do i = 2, N-1
A(i,j) = Anew(i,j)
end do
end do
!$omp end parallel do
iter = iter + 1
end do
call cpu_time(cpu_t1)
call system_clock(it1, sys_clock_rate)
write(*,'(a,f8.3,a)') "...cpu time :", cpu_t1-cpu_t0, " s"
write(*,'(a,f8.3,a)') "...wall time:", real(it1 it0)/real(sys_clock_rate), " s"
END PROGRAM
The two platforms I used are:
Intel i7-4770 # 3.40GHz (Haswell), 32 GB RAM / Ubuntu 16.04.2 LTS
Intel i7-6700 # 3.40GHz (Skylake), 32 GB RAM / Linux Mint 18.1 (~ Ubuntu 16.04)
On each platform I compiled the Fortran program with
GCC gfortran 6.2.0
PGI pgfortran 16.10 community edition
I obviously compiled the program independently on each platform (I only moved the .f90 file; I did not move any binary file)
I ran 5 times each of the 4 executables (2 for each platform), collecting the wall times measured in seconds (as printed out by the program). (Well, I ran the whole test several times, and the timings below are definitely representative)
Sequential execution. Program compiled with:
gfortran -Ofast main.f90 -o gcc-seq
pgfortran -fast main.f90 -o pgi-seq
Timings (best of 5):
Haswell > gcc-seq: 150.955, pgi-seq: 165.973
Skylake > gcc-seq: 277.400, pgi-seq: 121.794
Multithread execution (8 threads). Program compiled with:
gfortran -Ofast -fopenmp main.f90 -o gcc-omp
pgfortran -fast -mp=allcores main.f90 -o pgi-omp
Timings (best of 5):
Haswell > gcc-omp: 153.819, pgi-omp: 151.459
Skylake > gcc-omp: 113.497, pgi-omp: 107.863
When compiling with OpenMP, I checked the number of threads in the parallel regions with omp_get_num_threads(), and there are actually always 8 threads, as expected.
There are several things I don't get:
Using the GCC compiler: why on Skylake OpenMP has a substantial benefit (277 vs 113 s), while on Haswell it has no benefit at all? (150 vs 153 s) What's happening on Haswell?
Using the PGI compiler: Why OpenMP has such a small benefit (if any) on both platforms?
Focusing on the sequential runs, why are there such huge differences in execution times between Haswell and Skylake (especially when the program is compiled with GCC)? Why this difference is still so relevant - but with the role of Haswell and Skylake reversed! - when OpenMP is enabled?
Also, when OpenMP is enabled and GCC is used, the cpu time is always much larger than the wall time (as I expect), but when PGI is used, the cpu and wall times are always the same, also then the program used multiple threads.
How can I make some sense out of these results?

Is MATLAB cellfun GPU-friendly

I have a loop in MATLAB which solves a linear system of equation. The total number of equations is about 500,000. Each system has 4 unknowns but the number of equations varies between 10 to 50. In addition, this system of equations needs to be solved over time (about 3000 timesteps)
for ivertex = 1 : nVertices
b = obj.f( obj.neighIDs{ ivertex } );
x = obj.coeffMatrix{ ivertex } \ b;
obj.solution( ivertex , : ) = x( 1 : 3 );
end
I tried to use parfor to accelerate this loop but I did not see any improvement. I also tried using cellfun to avoid looping through equations using MATLAB for:
tmpSolution = cellfun( ...
#(x, y) x \ obj.f( y ), ...
obj.coeffMatrix, ...
obj.neighIDs , ...
'UniformOutput', false );
But each timestep takes about 11 seconds (3000×11 = 9 hours). Is there a way to make this faster? Is this problem GPU friendly? My MATLAB version is 2013b in Windows 7 64 bit with 32 GB RAM.

in linux kernel - init_task is per cpu?

Is init_task thread is per cpu? or just one ?
If Im iterating from Init_task ill get to all of the threads in all of cpus ?
for example , using the following Macro which define at sched.h :
#define do_each_thread(g, t) \
for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
will iterate over all threads ?
thanks !

Determining no of processors and cores

I have a Dell Inspiron 620. The System control panel says 1 Intel i3-2100. Intel says (http://ark.intel.com/products/53422/) it has 2 cores and four threads. Three questions: In system environment variables the no_of_processors=4; so is that 4 threads?
Regarding the cores, when I run this code:
// get a list of all processor devices
deviceList = SetupDiGetClassDevs(ref processorGuid, "ACPI", IntPtr.Zero, (int)DIGCF.PRESENT);
// attempt to process each item in the list
for (int deviceNumber = 0; ; deviceNumber++)
{
SP_DEVINFO_DATA deviceInfo = new SP_DEVINFO_DATA();
deviceInfo.cbSize = Marshal.SizeOf(deviceInfo);
// attempt to read the device info from the list, if this fails, we're at the end of the list
if (!SetupDiEnumDeviceInfo(deviceList, deviceNumber, ref deviceInfo))
{
deviceCount = deviceNumber - 1;
break;
}
}
I get 3 as the number of cores, not 2.
Also, in terms of the number of threads this system will support adequately, is that cores x processors?
Thanks.
This is pure supposition but could it be:
2 Core with 4 threads implies hyper threading
System properties are referring 4 processors => its reading the number of hyper threads available as 'logical single thread processors'
Your loop starts at 0 so you're still seeing 4.

Resources