MPI_REDUCE causing memory leak - memory-leaks

I have recently encountered a weir behavior. If I run the following code on my machine (using the most recent version of cygwin, Open MPI version 1.8.6) I get a linearly growing memory usage that quickly overwhelms my pc.
program memoryTest
use mpi
implicit none
integer :: ierror,errorStatus ! error codes
integer :: my_rank ! rank of process
integer :: p ! number of processes
integer :: i,a,b
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, p, ierror)
b=0
do i=1,10000000
a=1*my_rank
call MPI_REDUCE(a,b,1,MPI_INTEGER,MPI_MAX,0,MPI_COMM_WORLD,errorStatus)
end do
call MPI_Finalize(ierror)
stop
end program memoryTest
Any idea what the problem might be? The code looks fine to my beginner's eyes. The compilation line is
mpif90 -O2 -o memoryTest.exe memoryTest.f90

This has been discussed in a related thread here.
The problem is that the root process needs to receive data from other processes and perform the reduction while other processes only need to send the data to the root process. So the root process is running slower and it could be overwhelmed by the number of incoming messages. If you insert at MPI_BARRIER call after the MPI_REDUCE call then the code should run without a problem.
The relevant part of the MPI specification says: "Collective operations can (but are not required to) complete as soon as the caller's
participation in the collective communication is finished. A blocking operation is complete
as soon as the call returns. A nonblocking (immediate) call requires a separate completion
call (cf. Section
3.7
). The completion of a collective operation indicates that the caller is free
to modify locations in the communication buffer. It does not indicate that other processes
in the group have completed or even started the operation (unless otherwise implied by the
description of the operation). Thus, a collective communication operation may, or may not,
have the effect of synchronizing all calling processes. This statement excludes, of course,
the barrier operation."

To add a bit more support for macelee's answer: if you run this program to completion under MPICH with MPICH's internal memory leak tracing/reporting turned on, you see no leaks. Furthermore, valgrind's leak-check reports
==12866== HEAP SUMMARY:
==12866== in use at exit: 0 bytes in 0 blocks
==12866== total heap usage: 20,001,601 allocs, 20,000,496 frees, 3,369,410,210 bytes allocated
==12866==
==12866== All heap blocks were freed -- no leaks are possible
==12866==

Related

Double check locking in fortran with openmp

So i have some code with a double checked locking solution for reading data files in multi threaded (with openmp) application, which looks something like:
logical, dimension(10,10) :: is_data_loaded
is_data_loaded=.false.
! Other code
subroutine load(i,j)
integer,intent(in) :: i,j ! Indexes into array is_data_loaded
if(is_data_loaded(i,j)) return
!$OMP CRITICAL(load data)
if(.not.is_data_loaded(i,j)) then
call load_single_file(i,j)
is_data_loaded(i,j) = .true.
endif
!$OMP END CRITICAL(load_data)
end subroutine
Where I'm worried that if two threads get to the critical region at the same time (with the same i,j index) the second gets blocked by the first one entering the region but once the first finishes the second thread may start executing the critical block before seeing the updated is_data_loaded flag and thus we get into a problem with two threads updating the same data.
So firstly is this an issue with opemp critical blocks? I'm unsure of the semantics and whether the standard says something like "everything must be consistent across threads before the next thread runs in a critical block" or not. And if it is a problem, would just wrapping the read/writes to is_data_loaded in an omp atomic statement be sufficient?
I think the code is wrong, as the threads might indeed not see the updates of is_data_loaded after another thread has set it from the critical region. While the critical region will ensure that the corresponding memory flushes occur, the thread executing the if(is_data_loaded(i,j)) return might not see the update, as this statement might still see outdated data.
I think adding !$omp flush before the if(is_data_loaded(i,j)) return is needed to ensure that all data has been flushed and is_data_loaded(i,j) is loaded with the most recent data.

Compare and swap with and without garbage collector

How does CAS works? How does it work with garbage collector? Where is the problem and how does it work without garbage collector?
I was reading a presentation about CAS and using it on "write rarely, read many" problem and there was said, that use of CAS is convenient while you can use garbage collector, but there is problem (not specified) while you can not use garbage collector.
Can you tell me something about this? If you can sum up principle of CAS at first, it would be appreciated.
Ok, so CAS is an atomic instruction, that is there is special hardware support for it.
Its main use is to not use locks at all when implementing your data structures and other operations, since using locks, if a thread takes a page fault, a cache miss or is being descheduled by the OS for instance the thread takes the lock with it and all the rest of the threads are blocked. This obviously yields serious performance issues.
CAS is the core of lock-free programming and here and here.
CAS basically is the following:
CAS(CURRENT_VALUE, OLD_VALUE, NEW_VALUE) <=>
if CURRENT_VALUE==OLD_VALUE then CURRENT_VALUE = NEW_VALUE
You have a variable (e.g. class variable) and you have no clue if it was modified or not by other threads in the time you read from it and you want to write to it.
CAS helps you here on the write part since this CAS is done atomically (in hardware) and no lock is being implemented there, thus even if your thread goes to sleep the rest of the threads can operate on your data structure.
The issue with CAS on non-GC systems is the ABA problem and an example is the following:
You have a single linked list: HEAD->A->X->Y->Z
Thread 1: let's read A: localA = A; localA_Value = A.Value (let's say 5)
Thread 2: let's delete A: HEAD->A->X->Y->Z
Thread 3: let's add a new node at start (the malloc will find the right spot right were old A was): HEAD->A'->X->Y->Z (A'.Value = 10)
Thread 1 resumes and wants to swap A with B: CAS(localA, A', B) => but this thread expects that if CAS passes the value of A to be 5; wrong: since CAS passes given that localA and A' have the same memory location but localA.Value!=A'.Value => thus the operation shouldn't be performed.
The thing is that in GC enabled systems this will never happen since localA holds a reference to that memory location and thus A' will never get allocated to that memory location.

When does the garbage collector run when calling Haskell exports from C?

When exporting a Haskell function to be called from C, when does Haskell's garbage get collected? If C owns main then there is no way to predict the next call in to Haskell. This question is especially pertinent when running single-threaded Haskell or without parallel GC.
When you initialize the ghc runtime, you can pass rts flags to it via argc and argv like so:
RtsConfig conf = defaultRtsConfig;
conf.rts_opts_enabled = RtsOptsAll;
hs_init_ghc(&argc, &argv, conf);
This lets you set options to, for example fix a smaller maximum heap size or use a compaction algorithm on the nursery to further reduce allocation. Further, note there is an idle GC whose interval can be set (or disabled), and if you link the threaded runtime, that should run whether or not you ever yield back to a Haskell call.
Edit: I haven't actually performed experimentation to verify the following, but if we look at the source of hs_init_ghc we see that it initializes signal handlers, which should include the timer handlers that respond on SIGVTALRM and indeed it also starts the time, which calls (on POSIX) timer_create that should throw those signals on regular intervals. In turn, this periodically should "wake up" the RTS whether or not anything is happening, which in turn should mean that it will run idle GC whether or not the system yields back to Haskell from C. But again, I have only read the code and commentary, not tested this myself.

If one thread writes to a location and another thread is reading, can the second thread see the new value then the old?

Start with x = 0. Note there are no memory barriers in any of the code below.
volatile int x = 0
Thread 1:
while (x == 0) {}
print "Saw non-zer0"
while (x != 0) {}
print "Saw zero again!"
Thread 2:
x = 1
Is it ever possible to see the second message, "Saw zero again!", on any (real) CPU? What about on x86_64?
Similarly, in this code:
volatile int x = 0.
Thread 1:
while (x == 0) {}
x = 2
Thread 2:
x = 1
Is the final value of x guaranteed to be 2, or could the CPU caches update main memory in some arbitrary order, so that although x = 1 gets into a CPU's cache where thread 1 can see it, then thread 1 gets moved to a different cpu where it writes x = 2 to that cpu's cache, and the x = 2 gets written back to main memory before x = 1.
Yes, it's entirely possible. The compiler could, for example, have just written x to memory but still have the value in a register. One while loop could check memory while the other checks the register.
It doesn't happen due to CPU caches because cache coherency hardware logic makes the caches invisible on all CPUs you are likely to actually use.
Theoretically, the write race you talk about could happen due to posted write buffering and read prefetching. Miraculous tricks were used to make this impossible on x86 CPUs to avoid breaking legacy code. But you shouldn't expect future processors to do this.
Leaving aside for a second tricks done by the compiler (even ones allowed by language standards), I believe you're asking how the micro-architecture could behave in such scenario. Keep in mind that the code would most likely expand into a busy wait loop of cmp [x] + jz or something similar, which hides a load inside it. This means that [x] is likely to live in the cache of the core running thread 1.
At some point, thread 2 would come and perform the store. If it resides on a different core, the line would first be invalidated completely from the first core. If these are 2 threads running on the same physical core - the store would immediately affect all chronologically younger loads.
Now, the most likely thing to happen on a modern out-of-order machine is that all the loads in the pipeline at this point would be different iterations of the same first loop (since any branch predictor facing so many repetitive "taken" resolution is likely to assume the branch will continue being taken, until proven wrong), so what would happen is that the first load to encounter the new value modified by the other thread will cause the matching branch to simply flush the entire pipe from all younger operations, without the 2nd loop ever having a chance to execute.
However, it's possible that for some reason you did get to the 2nd loop (let's say the predictor issue a not-taken prediction just at the right moment when the loop condition check saw the new value) - in this case, the question boils down to this scenario:
Time -->
----------------------------------------------------------------
thread 1
cmp [x],0 execute
je ... execute (not taken)
...
cmp [x],0 execute
jne ... execute (not taken)
Can_We_Get_Here:
...
thread2
store [x],1 execute
In other words, given that most modern CPUs may execute instructions out of order, can a younger load be evaluated before an older one to the same address, allowing the store (from another thread) to change the value so it may be observed inconsistently by the loads.
My guess is that the above timeline is quite possible given the nature of out-of-order execution engines today, as they simply arbitrate and perform whatever operation is ready. However, on most x86 implementations there are safeguards to protect against such a scenario, since the memory ordering rules strictly say -
8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations
Such mechanisms may detect this scenario and flush the machine to prevent the stale/wrong values becoming visible. So The answer is - no, it should not be possible, unless of course the software or the compiler change the nature of the code to prevent the hardware from noticing the relation. Then again, memory ordering rules are sometimes flaky, and i'm not sure all x86 manufacturers adhere to the exact same wording, but this is a pretty fundamental example of consistency, so i'd be very surprised if one of them missed it.
The answer seems to be, "this is exactly the job of the CPU cache coherency." x86 processors implement the MESI protocol, which guarantee that the second thread can't see the new value then the old.

Win32 SendMessage between threads

I have a worker thread which collects data from an external device. The worker thread keeps the main thread with the UI informed about its state.
For this I'm using variations of:
SendMessage( hwndParentThread, WM_NOTIFY, 0, TEXT("Connection successful.")).
Now, the debugger complains about a memory leak. As I'm not really sure about what happens with the memory allocated for the strings I'm wondering whether the leak stems from the strings I pass between the threads (e.g. TEXT("Connection successful.")).
If anyone could point me into the right direction I would really appreciate it.
In C++ Literal text constants have static location (e.g. memory for "Connection successful." string is not allocated during function call). See this answer for details https://stackoverflow.com/a/349031/1025209.
I can't see any problems with your line of code. Is the memory leak exactly in this line?

Resources