When is memory allocated by malloc garbage collected?

I am guessing (hoping) that the answer is never, and that such memory must be explicitly freed.
For example, if I wrote:
julia> x = Libc.malloc(1_000_000)
Ptr{Void} @0x0000000002f6bd80
julia> x = nothing
have I just leaked ~1MB of memory?
However, I am not 100% certain this is true, because the docs don't mention it at all.
help?> Libc.malloc(3)
malloc(size::Integer) -> Ptr{Void}
Call malloc from the C standard library.

Yes, you are correct.
Julia is designed to seamlessly interoperate with C at a low level, so when you use the C wrapper libraries, you get C semantics and no garbage collection.
The docs for Libc.malloc are not written to teach C, but they could be improved to mention Libc.free, in case anyone gets confused.
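
Since Libc.malloc simply forwards to the C library, the semantics you get are exactly C's: every allocation must eventually be paired with an explicit Libc.free. A minimal sketch of those semantics, in C:

#include <stdlib.h>

int main(void)
{
    void *x = malloc(1000000); /* what Libc.malloc(1_000_000) does */
    /* Rebinding the Julia variable (x = nothing) merely discards the
       pointer value; the block itself stays allocated -> a leak. */
    free(x);                   /* what Libc.free(x) does: the only way back */
    return 0;
}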

Yet one more answer
Yes, you leaked ~1 MB of memory. But there is a mechanism that implements ownership transfer:
struct MyStruct
...
end
n = 10
x = Base.Libc.malloc(n * sizeof(MyStruct)) # returns Ptr{Nothing}
xtyped = convert(Ptr{MyStruct}, x) # something like reinterpret cast
vector = unsafe_wrap(Array, xtyped, n; own = true) # returns Vector{MyStruct}
N.B. The last line transfers ownership of the memory to Julia; from this moment on, avoid using x and xtyped, as they may end up pointing to already-freed memory.
Such low-level kung fu can prove helpful when dealing with binary files, especially together with the function unsafe_read.
Alternatively, as mentioned above, you can use Base.Libc.free(x) to free the memory manually.
P.S. However, it is often better to rely on Julia's built-in memory management. By default, immutable structs are allocated on the stack where possible, which improves performance.

Related

How to shrink a Vec or String from an offset without reallocating it?

Multiple structures in Rust, such as Vec and String, have shrink_to or shrink_to_fit methods. But apparently there's nothing like shrink_from_to.
Why would I want that?
Assume I have an XY-gigabyte string or vector in memory and know the exact start and end positions of the part I am interested in (which occupies only Z GB from start to end, somewhere in the middle). I could call truncate and then shrink_to_fit, effectively freeing the memory past end.
However, I still have gigabytes of memory occupied by [0..start], which are of no relevance for my further processing.
Question
Is there any way to free this memory too without reallocating and copying the relevant parts there?
Note that shrink_to_fit does reallocate by copying into a smaller buffer and freeing the old buffer. Your best bet is probably just converting the slice you care about into an owned Vec and then dropping the original Vec.
fn main() {
    let v1 = (0..1000).collect::<Vec<_>>(); // needs to be dropped
    println!("{}", v1.capacity()); // 1000
    let v2 = v1[100..150].to_owned(); // don't drop this!
    println!("{}", v2.capacity()); // 50
    drop(v1);
}
Is there any way to free this memory too without reallocating and copying the relevant parts there?
Move the segment you want to keep to the start of the collection (e.g. replace_range, drain, copy_within, rotate, ...), then truncate, then shrink.
APIs like realloc and mremap work in terms of "memory blocks" (i.e., whole allocations returned by malloc/mmap); they don't work in terms of arbitrary interior pointers. So a hypothetical shrink_from_to would just be doing that under the covers, since you can't really resize an allocation from both ends.
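
To make that concrete, here is roughly what a hypothetical shrink_from_to would have to do anyway, sketched in C: since realloc can only trim an allocation at its end, the kept range must first be moved to the front.

#include <stdlib.h>
#include <string.h>

/* Sketch: keep only bytes [start, end) of the block buf,
   shrinking it in place as far as the allocator allows. */
void *shrink_from_to(void *buf, size_t start, size_t end)
{
    size_t kept = end - start;
    memmove(buf, (char *)buf + start, kept); /* move kept range to front */
    void *p = realloc(buf, kept);            /* allocator trims the tail */
    return p != NULL ? p : buf;              /* shrinking rarely fails */
}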

Is there inherent "cost of carry" of garbage thunks in Haskell?

I often see a high number of cycles spent in GC when running GHC-compiled programs.
These numbers tend to be an order of magnitude higher than my JVM experience suggests they should be. In particular, the number of bytes "copied" by GC seems to be vastly larger than the amounts of data I'm computing.
Is such a difference between non-strict and strict languages fundamental?
tl;dr: Most of the stuff that the JVM does in stack frames, GHC does on the heap. If you wanted to compare GHC heap/GC stats with the JVM equivalent, you'd really need to account for some portion of the bytes/cycles the JVM spends pushing arguments on the stack or copying return values between stack frames.
Long version:
Languages targeting the JVM typically make use of its call stack. Each invoked method has an active stack frame that includes storage for the parameters passed to it, additional local variables, and temporary results, plus room for an "operand stack" used for passing arguments to and receiving results from other methods it calls.
As a simple example, if the Haskell code:
bar :: Int -> Int -> Int
bar a b = a * b
foo :: Int -> Int -> Int -> Int
foo x y z = let u = bar y z in x + u
were compiled to JVM, the byte code would probably look something like:
public static int bar(int, int);
Code:
stack=2, locals=2, args_size=2
0: iload_0 // push a
1: iload_1 // push b
2: imul // multiply and push result
3: ireturn // pop result and return it
public static int foo(int, int, int);
Code:
stack=2, locals=4, args_size=3
0: iload_1 // push y
1: iload_2 // push z
2: invokestatic bar // call bar, pushing result
5: istore_3 // pop and save to "u"
6: iload_0 // push x
7: iload_3 // push u
8: iadd // add and push result
9: ireturn // pop result and return it
Note that calls to built-in primitives like imul and user-defined methods like bar involve copying/pushing the parameter values from local storage to the operand stack (using iload instructions) and then invoking the primitive or method. Return values then need to be saved/popped to local storage (with istore) or returned to the caller with ireturn; occasionally, a return value can be left on the stack to serve as an operand for another method invocation. Also, while it's not explicit in the byte code, the ireturn instruction involves a copy, from the callee's operand stack to the caller's operand stack. Of course, in actual JVM implementations, various optimizations are presumably possible to reduce copying.
When something else eventually calls foo to produce a computation, for example:
some_caller t = foo (1+3) (2+4) t + 1
the (unoptimized) code might look like:
iconst_1
iconst_3
iadd // put 1+3 on the stack
iconst_2
iconst_4
iadd // put 2+4 on the stack
iload_0 // put t on the stack
invokestatic foo
iconst_1
iadd
ireturn
Again, subexpressions are evaluated with a lot of pushing and popping on the operand stack. Eventually, foo is invoked with its arguments pushed on the stack and its result popped off for further processing.
All of this allocation and copying takes place on the stack, so there's no heap allocation involved in this example.
Now, what happens if that same code is compiled with GHC 8.6.4 (without optimization and on an x86_64 architecture for the sake of concreteness)? Well, the pseudocode for the generated assembly is something like:
foo [x, y, z] =
u = new THUNK(sat_u) // thunk, 32 bytes on heap
jump: (+) x u
sat_u [] = // saturated closure for "bar y z"
push UPDATE(sat_u) // update frame, 16 bytes on stack
jump: bar y z
bar [a, b] =
jump: (*) a b
The calls/jumps to the (+) and (*) "primitives" are actually more complicated than I've made them out to be because of the typeclass that's involved. For example, the jump to (+) looks more like:
push CONTINUATION(\f -> f x u) // continuation, 24 bytes on stack
jump: (+) dNumInt // get the right (+) from typeclass instance
If you turn on -O2, GHC optimizes away this more complicated call, but it also optimizes away everything else that's interesting about this example, so for the sake of argument, let's pretend the pseudocode above is accurate.
Again, foo isn't of much use until someone calls it. For the some_caller example above, the portion of code that calls foo will look something like:
some_caller [t] =
...
foocall = new THUNK(sat_foocall) // thunk, 24 bytes on heap
...
sat_foocall [] = // saturated closure for "foo (1+3) (2+4) t"
...
v = new THUNK(sat_v) // thunk "1+3", 16 bytes on heap
w = new THUNK(sat_w) // thunk "2+4", 16 bytes on heap
push UPDATE(sat_foocall) // update frame, 16 bytes on stack
jump: foo sat_v sat_w t
sat_v [] = ...
sat_w [] = ...
Note that nearly all of this allocation and copying takes place on the heap, rather than the stack.
Now, let's compare these two approaches. At first blush, it looks like the culprit really is lazy evaluation. We're creating these thunks all over the place that wouldn't be necessary if evaluation was strict, right? But let's look at one of these thunks more carefully. Consider the thunk for sat_u in the definition of foo. It's 32 bytes / 4 words with the following contents:
// THUNK(sat_u)
word 0: ptr to sat_u info table/code
1: space for return value
// variables we closed over:
2: ptr to "y"
3: ptr to "z"
The creation of this thunk isn't fundamentally different than the JVM code:
0: iload_1 // push y
1: iload_2 // push z
2: invokestatic bar // call bar, pushing result
5: istore_3 // pop and save to "u"
Instead of pushing y and z onto the operand stack, we loaded them into a heap-allocated thunk. Instead of popping the result off the operand stack into our stack frame's local storage and managing stack frames and return addresses, we left space for the result in the thunk and pushed a 16-byte update frame onto the stack before transferring control to bar.
Similarly, in the call to foo in some_caller, instead of evaluating the argument subexpressions by pushing constants on the stack and invoking primitives to push results on the stack, we created thunks on the heap, each of which included a pointer to info table / code for invoking primitives on those arguments and space for the return value; an update frame replaced the stack bookkeeping and result copying implicit in the JVM version.
Ultimately, thunks and update frames are GHC's replacement for stack-based parameter and result passing, local variables, and temporary workspace. A lot of activity that takes place in JVM stack frames takes place in the GHC heap.
Now, most of the stuff in JVM stack frames and on the GHC heap quickly becomes garbage. The main difference is that in the JVM, stack frames are automatically tossed out when a function returns, after the runtime has copied the important stuff out (e.g., return values). In GHC, the heap needs to be garbage collected. As others have noted, the GHC runtime is built around the idea that the vast majority of heap objects will immediately become garbage: a fast bump allocator is used for initial heap object allocation, and instead of copying out the important stuff every time a function returns (as for the JVM), the garbage collector copies it out when the bump heap gets kind of full.
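To see why allocation in such a scheme is so cheap, note that a bump allocator is essentially a single pointer increment. A toy sketch of the idea in C (not GHC's actual allocator):

#include <stddef.h>

/* Toy bump allocator: one pointer into a pre-reserved nursery.
   There is no per-object free; the GC evacuates live objects
   elsewhere and resets next_free back to the start. */
static char nursery[1 << 20];
static char *next_free = nursery;

static void *bump_alloc(size_t n)
{
    if (next_free + n > nursery + sizeof nursery)
        return NULL; /* nursery full: time to run a GC */
    void *p = next_free;
    next_free += n;  /* allocation is just a pointer bump */
    return p;
}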
Obviously, the above toy example is ridiculous. In particular, things are going to get much more complicated when we start talking about code that operates on Java objects and Haskell ADTs, rather than Ints. However, it serves to illustrate the point that a direct comparison of heap usage and GC cycles between GHC and JVM doesn't make a whole lot of sense. Certainly, an exact accounting doesn't really seem possible as the JVM and GHC approaches are too fundamentally different, and the proof would be in real-world performance. At the very least, an apples-to-apples comparison of GHC heap usage and GC stats needs to account for some portion of the cycles the JVM spends pushing, popping, and copying values between operand stacks. In particular, at least some fraction of JVM return instructions should count towards GHC's "bytes copied".
As for the contribution of "laziness" to heap usage (and heap "garbage" in particular), it seems hard to isolate. Thunks really play a dual role as a replacement for stack-based operand passing and as a mechanism for deferred evaluation. Certainly a switch from laziness to strictness can reduce garbage -- instead of first creating a thunk and then eventually evaluating it to another closure (e.g., a constructor), you can just create the evaluated closure directly -- but that just means that instead of your simple program allocating a mind-blowing 172 gigabytes on the heap, maybe the strict version "only" allocates a modest 84 gigabytes.
As far as I can see, the specific contribution of lazy evaluation to "bytes copied" should be minimal -- if a closure is important at GC time, it will need to be copied. If it's still an unevaluated thunk, the thunk will be copied. If it's been evaluated, just the final closure will need to be copied. If anything, since thunks for complicated structures are much smaller than their evaluated versions, laziness should typically reduce bytes copied. Instead, the usual big win with strictness is that it allows certain heap objects (or stack objects) to become garbage faster so we don't end up with space leaks.
No, laziness does not inherently lead to a large amount of copying in GC. The programmer's failure to manage laziness properly, however, can certainly do so. For example, if a persistent data structure ends up full of chains of thunks due to lazy modification, then it will end up badly bloated.
Another major issue you may be encountering, as Daniel Wagner mentioned, is the cost of immutability. While it is certainly possible to program with mutable structures in Haskell, it is much more idiomatic to work with immutable ones when possible. Immutable structure designs have various trade-offs. For example, ones designed for high performance when used persistently tend to have low branching factors to increase sharing, which leads to some bloat when they're used ephemerally.

Using aligned memory for Fortran FFTs (FFTW) without memory leaks

I want to use the modern Fortran interface of FFTW, but in a way that allows simple function calls like ifftshift(fft_c2c(vec)*exp(vec)) et cetera. This is my understanding of how to do this (I also understand that making a new plan for every call is not the most efficient thing). Currently this code is functional (returns correct results); however, there is a memory leak, so repeated calls result in losses. I'm not quite sure where, though! I had hoped that the association of the return variable fft with the only unfreed memory would result in no leaks, but this is evidently not true. What am I missing, and how can I better structure what I want to do with proper modern Fortran? Thanks!
function fft_c2c(x) result(fft)
integer :: N
type(C_PTR) :: plan
complex(C_DOUBLE_COMPLEX), pointer :: fft(:)
complex(C_DOUBLE_COMPLEX), dimension(:), intent(in) :: x
! Use an auxiliary array that is allocated with fftw_alloc_complex
! to ensure memory alignment for performance, see FFTW docs
complex(C_DOUBLE_COMPLEX), pointer :: x_align(:)
type(C_PTR) :: p
N = size(x)
p = fftw_alloc_complex(int(N, C_SIZE_T))
call c_f_pointer(p, fft, [N]);
p = fftw_alloc_complex(int(N, C_SIZE_T))
call c_f_pointer(p, x_align, [N]);
plan = fftw_plan_dft_1d(N, x_align, fft, FFTW_FORWARD, FFTW_MEASURE);
! FFTW overwrites x_align and fft during planning process, so assign
! data here
x_align = x
call fftw_execute_dft(plan, x_align, fft);
call fftw_free(p);
end function fft_c2c
You can't do that easily. You are forcing your notion of "modern" = "everything is a function" onto Fortran, where it does not fit that well (or not at all).
For the memory leaks the rule is simple: deallocate all the pointers. Using them for the result variable is a guarantee of a memory leak. If you need locally allocated aligned memory, you need to allocate it locally, copy the data in, copy the data out, and deallocate it.
Every pointer in Fortran needs explicit deallocation; there is no reference counting or garbage collection to deallocate them for you.
You might also consider just using non-aligned memory with the appropriate flags and measuring the difference; you do not seem to care about top performance anyway.
Finally, doing FFTW_MEASURE before every transform is not just "not the most efficient thing", it is an absolute performance disaster. You should, at the very least, use FFTW_ESTIMATE to mitigate it.
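
As a sketch of that advice in FFTW's C API (assuming the usual fftw3 interface; the caller provides the output buffer here), every FFTW allocation and the plan are released before returning, so nothing leaks:

#include <string.h>
#include <fftw3.h>

/* Sketch: leak-free complex-to-complex FFT into a caller buffer. */
void fft_c2c(const fftw_complex *x, fftw_complex *out, int n)
{
    fftw_complex *in_a  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out_a = fftw_malloc(sizeof(fftw_complex) * n);
    /* FFTW_ESTIMATE does not overwrite the arrays during planning,
       and avoids the cost of FFTW_MEASURE on every call. */
    fftw_plan plan = fftw_plan_dft_1d(n, in_a, out_a,
                                      FFTW_FORWARD, FFTW_ESTIMATE);
    memcpy(in_a, x, sizeof(fftw_complex) * n);
    fftw_execute(plan);
    memcpy(out, out_a, sizeof(fftw_complex) * n);
    fftw_destroy_plan(plan); /* plans must be freed, too */
    fftw_free(in_a);
    fftw_free(out_a);
}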

malloc/realloc/free capacity optimization

When you have a dynamically allocated buffer that varies in size at runtime in unpredictable ways (for example a vector or a string), one way to optimize its allocation is to resize the backing store only at powers of 2 (or some other set of boundaries/thresholds) and leave the extra space unused. This helps to amortize the cost of searching for new free memory and copying the data across, at the expense of a little extra memory use. For example, the interface specification (reserve vs. resize vs. trim) of many C++ STL containers has such a scheme in mind.
My question is does the default implementation of the malloc/realloc/free memory manager on Linux 3.0 x86_64, GLIBC 2.13, GCC 4.6 (Ubuntu 11.10) have such an optimization?
void* p = malloc(N);
... // time passes, stuff happens
void* q = realloc(p,M);
Put another way, for what values of N and M (or in what other circumstances) will p == q?
From the realloc implementation in glibc trunk at http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=malloc/malloc.c;h=12d2211b0d6603ac27840d6f629071d1c78586fe;hb=HEAD
First, if the memory has been obtained via mmap() instead of sbrk(), which glibc malloc does for large requests, >= 128 kB by default IIRC:
if (chunk_is_mmapped(oldp))
{
void* newmem;
#if HAVE_MREMAP
newp = mremap_chunk(oldp, nb);
if(newp) return chunk2mem(newp);
#endif
/* Note the extra SIZE_SZ overhead. */
if(oldsize - SIZE_SZ >= nb) return oldmem; /* do nothing */
/* Must alloc, copy, free. */
newmem = public_mALLOc(bytes);
if (newmem == 0) return 0; /* propagate failure */
MALLOC_COPY(newmem, oldmem, oldsize - 2*SIZE_SZ);
munmap_chunk(oldp);
return newmem;
}
(Linux has mremap(), so in practice this is what is done).
For smaller requests, a few lines below we have
newp = _int_realloc(ar_ptr, oldp, oldsize, nb);
where _int_realloc is a bit big to copy-paste here, but you'll find it starting at line 4221 in the link above. AFAICS, it does NOT do the constant-factor size increase that e.g. C++ std::vector does, but rather allocates exactly the amount requested by the user (rounded up to the next chunk boundary, plus alignment and so on).
I suppose the idea is that if the user wants a factor-of-2 size increase (or any other constant-factor increase, in order to guarantee a logarithmic number of reallocations when growing repeatedly), then the user can implement it on top of the facility provided by the C library.
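For illustration, a minimal sketch of such a user-level scheme on top of realloc (the struct and helper are hypothetical, not from any library):

#include <stdlib.h>

/* Hypothetical growable buffer with constant-factor growth. */
struct buf { char *data; size_t len, cap; };

int buf_reserve(struct buf *b, size_t need)
{
    if (need <= b->cap)
        return 0;      /* spare capacity already available */
    size_t cap = b->cap ? b->cap : 16;
    while (cap < need)
        cap *= 2;      /* double, so reallocations stay logarithmic */
    char *p = realloc(b->data, cap);
    if (p == NULL)
        return -1;     /* propagate failure; old buffer still valid */
    b->data = p;
    b->cap = cap;
    return 0;
}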
Perhaps you can use malloc_usable_size (google for it) to find the answer experimentally. This function, however, seems undocumented, so you will need to check whether it is still available on your platform.
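A minimal experiment along those lines might look like this (note that inspecting the old pointer value after a successful realloc is formally indeterminate; it is good enough for exploration):

#include <malloc.h> /* malloc_usable_size, glibc-specific */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(1000);
    printf("usable size: %zu\n", malloc_usable_size(p));
    void *q = realloc(p, 1008);      /* does it stay within the slack? */
    printf("same block: %s\n", p == q ? "yes" : "no");
    free(q);
    return 0;
}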
See also How to find how much space is allocated by a call to malloc()?

Memory leak in Ada.Strings.Unbounded?

I have a curious memory leak, it seems that the library function to_unbounded_string is leaking!
Code snippets:
procedure Parse (Str : in String;
... do stuff...
declare
New_Element : constant Ada.Strings.Unbounded.Unbounded_String :=
Ada.Strings.Unbounded.To_Unbounded_String (Str); -- this leaks
begin
valgrind output:
==6009== 10,276 bytes in 1 blocks are possibly lost in loss record 153 of 153
==6009== at 0x4025BD3: malloc (vg_replace_malloc.c:236)
==6009== by 0x42703B8: __gnat_malloc (in /usr/lib/libgnat-4.4.so.1)
==6009== by 0x4269480: system__secondary_stack__ss_allocate (in /usr/lib/libgnat-4.4.so.1)
==6009== by 0x414929B: ada__strings__unbounded__to_unbounded_string (in /usr/lib/libgnat-4.4.so.1)
==6009== by 0x80F8AD4: syntax__parser__dash_parser__parseXn (token_parser_g.adb:35)
Where token_parser_g.adb:35 is listed above as the "-- this leaks" line.
Other info: gnatmake version 4.4.5, gcc version 4.4, valgrind version valgrind-3.6.0.SVN-Debian; valgrind options: -v --leak-check=full --read-var-info=yes --show-reachable=no
Any help or insights appreciated,
NWS.
Valgrind clearly says that there is possibly a memory leak; that doesn't necessarily mean there is one. For example, if the first call to that function allocates a pool of memory that is reused during the lifetime of the program but never freed, Valgrind will report it as a possible memory leak, even though it is not one: this is a common practice, and the memory will be returned to the OS upon process termination.
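The pattern described above, sketched in C (get_scratch is a hypothetical helper):

#include <stdlib.h>

/* A pool grabbed on first use, reused for the program's lifetime,
   and deliberately never freed (the OS reclaims it at exit). If only
   interior pointers into the pool remain when the program ends,
   valgrind classifies the block as "possibly lost". */
static char *pool = NULL;

char *get_scratch(void)
{
    if (pool == NULL)
        pool = malloc(10 * 1024); /* one-time allocation */
    return pool;
}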
Now, if you think there is a real memory leak, call this function in a loop and see if memory continues to grow. If it does, file a bug report, or even better, try to find and fix the leak and send a patch along with the bug report.
Hope it helps.
Was trying to keep this to comments, but what I was saying got too long and started to need formatting.
In Ada, string objects are generally assumed to be perfectly sized. The language provides functions to return the size and bounds of any string. Because of this, string handling in Ada is very different from C, and in fact more resembles how you'd do it in a functional language like Lisp.
But the basic principle is that, except in some very unusual situations, if you find yourself using Ada.Strings.Unbounded, you are going about things the wrong way.
The one case where you really can't get around using a variable-length string (or perhaps a buffer with a separate valid_length variable), is when reading strings as input from some external source. As you say, your parsing example is such a situation.
However, even here you should only have that situation on the initial buffer. Your call to your Parse routine should look something like this:
Ada.Text_IO.Get_Line (Buffer, Buffer_Len);
Parse (Buffer(Buffer'first..Buffer'first + Buffer_Len - 1));
Now inside the Parse routine you have a perfectly-sized constant Ada string to work with. If for some reason you need to pull out a subslice, you would do the following:
... -- Code to find start and end indices of my subslice
New_Element : constant String := Str(Element_Start..Element_End);
If you don't actually need to make a copy of that data for some reason though, you are better off just finding Element_Start and Element_End and working with a slice of the original string buffer. Eg:
if Str(Element_Start..Element_End) = "MyToken" then
I know this doesn't answer your question about Ada.Strings.Unbounded possibly leaking. But even if it doesn't leak, that code is relatively wasteful of machine resources (CPU and memory), and probably shouldn't be used for string manipulation unless you really need it.
Are bounded strings scoped?
Expanding on @T.E.D.'s comments, Ada.Strings.Bounded "objects should not be implemented by implicit pointers and dynamic allocation." Instead, the maximum size is fixed when the generic is instantiated. As an implementation detail, GNAT uses a discriminant to specify the maximum size of the string and a record to store the current size and contents.
In contrast, Ada.Strings.Unbounded requires that "No storage associated with an Unbounded_String object shall be lost upon assignment or scope exit." As an implementation detail, GNAT uses a buffered implementation derived from Ada.Finalization.Controlled. As a result, the memory used by an Unbounded_String may appear to be a leak until the object is finalized, for example when control returns to an enclosing scope.
