MPI loop increases memory usage/ memory leak

MPI loop increases memory usage/ memory leak - memory-leaks

I am working on a Fortran program using MPI where an array is split up into strips, each strip is sent to a rank for calculations to be done and then the edge of each array is sent to the rank next to it to update the next timestep. It is an iterative process so each edge is passed to its neighboring rank many times. It works fine, however, as I have started to run the program for larger arrays and more time steps I noticed (via top) that there appears to be a memory leak in that each process is continually increasing the amount of memory it is using. If I run the program long enough eventually it uses up all the system memory and crashes my machine.
All I am trying to do is send a string of data from one rank to another many times in a row, I don't know why the memory should be increasing from passing this information. Below is an example which exhibits the behavior. Am I somehow not releasing the passed data from memory which is causing it to build up over time?
(EDIT: changed example code to reflect comment below)
Program main
use mpi
REAL(KIND=KIND(0.0D0)) ::putbuf ( 1000 ), getbuf ( 1000 )
INTEGER, PARAMETER :: from = 2, to = 3,fromtag = 123, totag = 456
INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status
INTEGER :: error, rank
! Initialize MPI.
call MPI_Init ( error )
! Get the number of processes.
call MPI_Comm_size ( MPI_COMM_WORLD, num_procs, error )
! Get the individual process ID.
call MPI_Comm_rank ( MPI_COMM_WORLD, rank, error )
do i = 1,50000000
putbuf = i
if (rank == 0) then
CALL MPI_Sendrecv ( putbuf, 1000,MPI_DOUBLE_PRECISION, 1, 123,getbuf, 1000, MPI_DOUBLE_PRECISION,&
1, 456,MPI_COMM_WORLD, status, error )
else
CALL MPI_Sendrecv ( putbuf, 1000,MPI_DOUBLE_PRECISION, 0, 456,getbuf, 1000, MPI_DOUBLE_PRECISION,&
0, 123,MPI_COMM_WORLD, status, error )
endif
enddo
call MPI_Finalize ( error )
endprogram

Related

Efficiency of Fortran stream access vs. MPI-IO

I have a parallel section of the code where I write out n large arrays (representing a numerical mesh) in blocks that are later read in different sized blocks. To do this I used Stream access so each processor writes their block independently, but I've seen inconsistent timings taking from 0.5-4 seconds in this section testing with 2 processor groups.
I am aware you can do something similar with MPI-IO, but I'm not sure what the benefits would be since there is no synchronization necessary. I would like to know if there is a way to either improve performance of my writes, or if there is a reason MPI-IO would be a better choice for this section.
Here is a sample of the code section where I create the files to write norb arrays using two groups (mygroup = 0 or 1]:
do irbsic=1,norb
[various operations]
blocksize=int(nmsh_tot/ngroups)
OPEN(unit=iunit,FILE='ZPOT',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) POT(1:nmsh)
CLOSE(iunit)
OPEN(unit=iunit,FILE='RHOI',STATUS='UNKNOWN',ACCESS='STREAM')
mypos = 1 + (IRBSIC-1)*nmsh_tot*8 ! starting point for writing IRBSIC
mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
WRITE(iunit,POS=mypos) RHOG(1:nmsh,1,1)
CLOSE(iunit)
[various operations]
end do

(As discussed in the comments) I would strongly recommend against using Fortran stream access for this. Standard Fortran I/O is only guaranteed to work if the file is being accessed by a single process, and in my own work I have seen random corruptions of files when multiple processes try to write to them at once, even if the processes are writing to different parts of the file. MPI-I/O, or a library such as HDF5 or NetCDF which uses MPI-I/O is the only sensible way to achieve this. Below is a simple program illustrating the use of mpi_file_write_at_all
ian#eris:~/work/stack$ cat at.f90
Program write_at
Use mpi
Implicit None
Integer, Parameter :: n = 4
Real, Dimension( 1:n ) :: a
Real, Dimension( : ), Allocatable :: all_of_a
Integer :: me, nproc
Integer :: handle
Integer :: i
Integer :: error
! Set up MPI
Call mpi_init( error )
Call mpi_comm_size( mpi_comm_world, nproc, error )
Call mpi_comm_rank( mpi_comm_world, me , error )
! Provide some data
a = [ ( i, i = n * me, n * ( me + 1 ) - 1 ) ]
! Open the file
Call mpi_file_open( mpi_comm_world, 'stuff.dat', &
mpi_mode_create + mpi_mode_wronly, mpi_info_null, handle, error )
! Describe how the processes will view the file - in this case
! simply a stream of mpi_real
Call mpi_file_set_view( handle, 0_mpi_offset_kind, &
mpi_real, mpi_real, 'native', &
mpi_info_null, error )
! Write the data using a collective routine - generally the most efficent
! but as collective all processes within the communicator must call the routine
Call mpi_file_write_at_all( handle, Int( me * n,mpi_offset_kind ) , &
a, Size( a ), mpi_real, mpi_status_ignore, error )
! Close the file
Call mpi_file_close( handle, error )
! Read the file on rank zero using Fortran to check the data
If( me == 0 ) Then
Open( 10, file = 'stuff.dat', access = 'stream' )
Allocate( all_of_a( 1:n * nproc ) )
Read( 10, pos = 1 ) all_of_a
Write( *, * ) all_of_a
End If
! Shut down MPI
Call mpi_finalize( error )
End Program write_at
ian#eris:~/work/stack$ mpif90 --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ian#eris:~/work/stack$ mpif90 -Wall -Wextra -fcheck=all -std=f2008 at.f90
ian#eris:~/work/stack$ mpirun -np 2 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000
ian#eris:~/work/stack$ mpirun -np 5 ./a.out
0.00000000 1.00000000 2.00000000 3.00000000 4.00000000 5.00000000 6.00000000 7.00000000 8.00000000 9.00000000 10.0000000 11.0000000 12.0000000 13.0000000 14.0000000 15.0000000 16.0000000 17.0000000 18.0000000 19.0000000
ian#eris:~/work/stack$

Getting a Hard Fault when trying to list all tasks using vTaskList()

I am trying to list the state of all the tasks that are currently running using vTaskList().Whenever I call the function I get a HardFault and I have no idea where it faults. I tried increasing the Heap size and stack size. This causes the vTaskList() to work once but for the second time it throws a hard fault again.
Following is how I am using vTaskList() in osThreadList()
osStatus osThreadList (uint8_t *buffer)
{
#if ( ( configUSE_TRACE_FACILITY == 1 ) && ( configUSE_STATS_FORMATTING_FUNCTIONS == 1 ) )
vTaskList((char *)buffer);
#endif
return osOK;
}
Following is how i use osThreadList() to print all the tasks on my serial terminal.
uint8_t TskBuf[1024];
bool IOParser::TSK(bool print_help)
{
if(print_help)
{
uart_printf("\nTSK: Display list of tasks.\r\n");
}
else
{
uart_printf("\r\nName State Priority Stack Num\r\n" );
uart_printf("---------------------------------------------\r\n");
/* The list of tasks and their status */
osThreadList(TskBuf);
uart_printf( (char *)TskBuf);
uart_printf("---------------------------------------------\r\n");
uart_printf("B : Blocked, R : Ready, D : Deleted, S : Suspended");
}
return true;
}
When I comment out any one of the tasks I am able to get it working. I am guessing it is something related to memory but I havent been able to find a solution.

vTaskList is dependent on sprintf. So, your guess about memory and heap is right. But you have to use malloc and pass that block instead of what you do. Use pvPortmalloc and after you finish, free it up using vportfree.
Also, it is worthwhile noting that vTaskList is a blocking function.
I do not have a working code example to show this as at now, but this should work.
Hard Faults are often (almost all the time) happens due to uninitialised pointer. Above approach will eliminate that.

why does a a nodejs array shift/push loop run 1000x slower above array length 87369?

Why is the speed of nodejs array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in .05 seconds; the latter, in 80 seconds -- 1600 times slower. (observed on 32-bit debian linux with node v0.10.29)
q = [];
// preload the queue with some data
for (i=0; i<87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i=0; i<100000; i++) {
q.shift();
q.push({});
if (i%10000 === 0) process.stdout.write(".");
}
64-bit debian linux v0.10.29 crawls starting at 85984 and runs in .06 / 56 seconds. Node v0.11.13 has similar breakpoints, but at different array sizes.

Shift is a very slow operation for arrays as you need to move all the elements but V8 is able to use a trick to perform it fast when the array contents fit in a page (1mb).
Empty arrays start with 4 slots and as you keep pushing, it will resize the array using formula 1.5 * (old length + 1) + 16.
var j = 4;
while (j < 87369) {
j = (j + 1) + Math.floor(j / 2) + 16
console.log(j);
}
Prints:
23
51
93
156
251
393
606
926
1406
2126
3206
4826
7256
10901
16368
24569
36870
55322
83000
124517
So your array size ends up actually being 124517 items which makes it too large.
You can actually preallocate your array just to the right size and it should be able to fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (i=0; i<87369; i++) q[i] = {};
If you need larger than that, use the right data structure

I started digging into the v8 sources, but I still don't understand it.
I instrumented deps/v8/src/builtins.cc:MoveElemens (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only 1000 shifts per second because each one takes 1ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
of which the memmove is 0.000040 seconds, the bulk is the heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
if (!InNewSpace(address)) {
for (int i = 0; i < len; i++) {
store_buffer_.Mark(address + start + i * kPointerSize);
}
}
}
which is (store-buffer-inl.h)
void StoreBuffer::Mark(Address addr) {
ASSERT(!heap_->cell_space()->Contains(addr));
ASSERT(!heap_->code_space()->Contains(addr));
Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
*top++ = addr;
heap_->public_set_store_buffer_top(top);
if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
ASSERT(top == limit_);
Compact();
} else {
ASSERT(top < limit_);
}
}
when the code is running slow, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements isn't called until a handful of times at the end, and just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it's not falling in place for me yet.
Edit: forget that last edit about output buffering artifacts, I was filtering duplicates.

this bug had been reported to google, who closed it without studying the issue.
https://code.google.com/p/v8/issues/detail?id=3059
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
the response:
he GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
own.
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?

What is openCL equivalent for this cuda "cudaMallocPitch "code.?

My PC has an AMD processor with an ATI 3200 GPU which doesn't support OpenCL. The rest of the codes all running by "Falling back to CPU itself".
I am converting one of the code from CUDA to OpenCL but stuck in some particular part for which there is no exact conversion code in OpenCL. since i have less experience in OpenCL I can't make out this, please suggest me some solution if any of you think will work,
The CUDA code is,
size_t pitch = 0;
cudaError error = cudaMallocPitch((void**)&gpu_data, (size_t*)&pitch,
instances->cols * sizeof(float), instances->rows);
for( int i = 0; i < instances->rows; i++ ){
error = cudaMemcpy((void*)(gpu_data + (pitch/sizeof(float))*i),
(void*)(instances->data + (instances->cols*i)),
instances->cols * sizeof(float) ,cudaMemcpyHostToDevice);
If I remove the pitch value from the above I end up with an problem which doesn't write to the device memory "gpu_data".
Somebody please convert this code to OpenCL and reply. I have converted it to OpenCL, but its not working and the data is not written to "gpu_data". My converted OpenCL code is
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE, ((instances->cols)*(instances->rows))*sizeof(float), NULL, &ret);
for( int i = 0; i < instances->rows; i++ ){
ret = clEnqueueWriteBuffer(command_queue, gpu_data, CL_TRUE, 0, ((instances->cols)*(instances->rows))*sizeof(float),(void*)(instances->data + (instances->cols*i)) , 0, NULL, NULL);
Sometimes it runs well for this code and gets stuck in the reading part i.e.
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0,sizeof( float ) * instances->cols* 1 , instances->data, 0, NULL, NULL);
overhere. And it gives error like
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
when break is pressed , it gives:
No symbols are loaded for any call stack frame. The source code cannot be displayed.
while debugging. In the call stack it is displaying:
OCL8CA9.tmp.dll!10001098()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]
amdocl.dll!5c39de16()
I really dont know what it means. someone please help me to rid of this problem.

First of all, in the CUDA code you're doing a horribly inefficient thing to copy the data. The CUDA runtime has the function cudaMemcpy2D that does exactly what you are trying to do by looping over different rows.
What cudaMallocPitch does is to compute an optimal pitch (= distance in byte between rows in a 2D array) such that each new row begins at an address that is optimal for coalescing, and then allocates a memory area as large as pitch times the number of rows you specify. You can emulate the same thing in OpenCL by first computing the optimal pitch and then doing the allocation of the correct size.
The optimal pitch is computed by (1) getting the base address alignment preference for your card (CL_DEVICE_MEM_BASE_ADDR_ALIGN property with clGetDeviceInfo: note that the returned value is in bits, so you have to divide by 8 to get it in bytes); let's call this base (2) find the largest multiple of base that is no less than your natural data pitch (sizeof(type) times number of columns); this will be your pitch.
You then allocate pitch times number of rows bytes, and pass the pitch information to kernels.
Also, when copying data from the host to the device and converesely, you want to use clEnqueue{Read,Write}BufferRect, that are specifically designed to copy 2D data (they are the counterparts to cudaMemcpy2D).

Lost messages between PVM processes?

I'm trying to parallelize an algorithm using PVM for a University assignment. I've got the algorithm sorted, but parallelization only almost works - the process intermittently gets stuck for no apparent reason. I can see no pattern, a run with the same parameters might work 10 times and then just gets stuck on the next effort...
None of the pvm functions (in the master or any child process) are returning any error codes, the children seem to complete successfully, no errors are reaching the console. It really does just look like the master isn't receiving every communication from the children - but only on occasional runs.
Oddly, though, I don't think it's just skipping a message - I've yet to have a result missing from a child that then successfully sent over a complete signal (that is to say I've not had a run reach completion and return an unexpected result) - it's as though the child just becomes disconnected, and all messages from a certain point cease arriving.
Batching the results up and sending less, but larger, messages seems to improve reliability, at least it feels like it's sticking less often - I don't have hard numbers to back this up...
Is it normal, common or expected that PVM will lose messages sent via pvm_send and it's friends? Please note the error occurs if all processes run on a single host or multiple hosts.
Am I doing something wrong? Is there something I can do to help prevent this?
Update
I've reproduced the error on a very simple test case, code below, which just spawns four children sends a single number to each, each child multiplies the number it receives by five and sends it back. It works almost all the time, but occasionally we freeze with only three numbers printed out - with one child's result missing (and said child will have completed).
Master:
int main()
{
pvm_start_pvmd( 0 , NULL , 0 );
int taskIDs[global::taskCount];
pvm_spawn( "/path/to/pvmtest/child" , NULL , 0 , NULL , global::taskCount , taskIDs );
int numbers[constant::taskCount] = { 5 , 10 , 15 , 20 };
for( int i=0 ; i<constant::taskCount ; ++i )
{
pvm_initsend( 0 );
pvm_pkint( &numbers[i] , 1 , 1 );
pvm_send( taskIDs[i] , 0 );
}
int received;
for( int i=0 ; i<global::taskCount ; ++i )
{
pvm_recv( -1 , -1 );
pvm_upkint( &received , 1 , 1 );
std::cout << recieved << std::endl;
}
pvm_halt();
}
Child:
int main()
{
int number;
pvm_recv( -1 , -1 );
pvm_upkint( &number , 1 , 1 );
number *= 10;
pvm_initsend( 0 );
pvm_pkint( &number , 1 , 1 );
pvm_send( pvm_parent() , 0 );
}

Not really an answer, but two things have changed together and the problem seems to have subsided:
I added pvm_exit() a call to the end of the slave binary, which apparently is best to do.
The configuration of PVM over the cluster changed ... somehow ... I don't have any specifics, but a few nodes were previously unable to take part in PVM operations and can now can. Other things may have changed as well.
I suspect something within the second changed also happened to fix my problem.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

MPI loop increases memory usage/ memory leak - memory-leaks

Related

Efficiency of Fortran stream access vs. MPI-IO

Getting a Hard Fault when trying to list all tasks using vTaskList()

why does a a nodejs array shift/push loop run 1000x slower above array length 87369?

What is openCL equivalent for this cuda "cudaMallocPitch "code.?

Lost messages between PVM processes?

Categories

Resources