So I tried to test whether the D garbage collector works properly by running this program on Windows.
DMD 2.057 and 2.058 beta both give the same result, whether or not I specify -release, -inline, -O, etc.
The code:
import core.memory, std.stdio;
extern(Windows) int GlobalMemoryStatusEx(ref MEMORYSTATUSEX lpBuffer);
struct MEMORYSTATUSEX
{
    uint Length, MemoryLoad;
    ulong TotalPhys, AvailPhys, TotalPageFile, AvailPageFile;
    ulong TotalVirtual, AvailVirtual, AvailExtendedVirtual;
}
void testA(size_t count)
{
    size_t[] a;
    foreach (i; 0 .. count)
        a ~= i;
    //delete a;
}
void main()
{
    MEMORYSTATUSEX ms;
    ms.Length = ms.sizeof;
    foreach (i; 0 .. 32)
    {
        testA(16 << 20);
        GlobalMemoryStatusEx(ms);
        stderr.writefln("AvailPhys: %s MiB", ms.AvailPhys >>> 20);
    }
}
The output was:
AvailPhys: 3711 MiB
AvailPhys: 3365 MiB
AvailPhys: 3061 MiB
AvailPhys: 2747 MiB
AvailPhys: 2458 MiB
core.exception.OutOfMemoryError
When I uncommented the delete a; statement, the output was
AvailPhys: 3714 MiB
AvailPhys: 3702 MiB
AvailPhys: 3701 MiB
AvailPhys: 3702 MiB
AvailPhys: 3702 MiB
...
So I guess the question is obvious... does the GC actually work?
The problem here is false pointers. D's garbage collector is conservative, meaning it doesn't always know what's a pointer and what isn't. It sometimes has to assume that a bit pattern is a pointer simply because it would point into GC-allocated memory if interpreted as one. This is mostly a problem for large allocations, since large blocks are a bigger target for false pointers.
You're allocating about 48 MB each time you call testA(). In my experience this is enough to almost guarantee there will be a false pointer into the block on a 32-bit system. You'll probably get better results if you compile your code in 64-bit mode (supported on Linux, OSX and FreeBSD but not Windows yet) since 64-bit address space is much more sparse.
As for my GC optimizations (I'm the David Simcha that CyberShadow mentions), there were two batches. One has been in for more than 6 months and hasn't caused any problems. The other is still in review as a pull request and isn't in the main druntime tree yet. These probably aren't the problem.
Short term, the solution is to manually free these huge blocks. Long term, we need to add precise scanning, at least for the heap. (Precise stack scanning is a much harder problem.) I wrote a patch to do this a couple years ago but it was rejected because it relied on templates and compile time function evaluation to generate pointer offset information for each datatype. Hopefully this information will eventually be generated directly by the compiler and I can re-create my precise heap scanning patch for the garbage collector.
This looks like a regression - it doesn't happen in D1 (DMD 1.069). David Simcha has been optimizing the GC lately, so it might have something to do with that. Please file a bug report.
P.S. If you rebuild Druntime with DFLAGS set to -debug=PRINTF in the makefile, you will get information on when the GC allocates/deallocates via console. :)
It does work. The current implementation just never releases memory to the operating system. The GC does reuse the memory it has acquired, though, so it's not really a leak.
Related
Symptoms:
I allocate TLS key with a destructor, create a bundle of threads and pass the TLS key to each thread. Each thread allocates memory and sets its pointer in TLS, the TLS destructor deallocates memory. I wait for threads to finish before app exits.
The app is run under valgrind/massif that reports this memory not deallocated.
int main(int argc, char **argv)
{
    pthread_key_t* key = new pthread_key_t();
    pthread_key_create(key, my_destructor);
    pthread_t threads[32000];
    for(int i=0; i<32000; ++i)
        pthread_create(&threads[i], NULL, my_thread, key);
    for(int i=0; i<32000; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
In the thread runner I allocate the memory and set it up in the TLS:
extern "C" void* my_thread(void* p)
{
    pthread_setspecific(*(pthread_key_t*)p, malloc(100));
    return NULL;
}
In the TLS destructor, I release the memory:
extern "C" void my_destructor(void *p)
{
    free(p);
}
I run this under valgrind/massif 3.19 with the following options:
--tool=massif
--heap=yes
--pages-as-heap=yes
--log-file=/tmp/my.log
--massif-out-file=/tmp/my.massif.log
Then I run ms_print /tmp/my.massif.log. I am getting the leaks reported like the following:
| ->01.75% (67,108,864B) 0x76F92D0: new_heap (in /usr/lib64/libc-2.17.so)
| | ->01.75% (67,108,864B) 0x76F98D3: arena_get2.isra.3 (in /usr/lib64/libc-2.17.so)
| | ->01.75% (67,108,864B) 0x76FF77D: malloc (in /usr/lib64/libc-2.17.so)
| | ->01.75% (67,108,864B) 0x410300: my_thread (threadsT.cpp:136)
| | ...
| | <skipped by author>
| | ...
| |
| ->00.00% (73,728B) in 1+ places, all below ms_print's threshold (01.00%)
...while I would not expect anything reported leaked at all.
I added the instrumentation to my_destructor and manually verified that:
it is invoked, indeed
it deallocates the memory, as it is supposed to do
Is there something apparent I am doing wrong here that makes valgrind/massif report these?
Is it a valgrind/massif limitation that it cannot detect the memory deallocation when invoked from TLS destructors?
Building and running that with gcc 4.9.4 on Red Hat Enterprise Linux Server release 7.9 (Maipo).
A second answer, this time concentrating on the 'leak' aspect.
Massif isn't really a leak detector. It's for profiling heap use.
If I compile the example (with 320 threads) then at the end I get about 89 million bytes still allocated. That is made up of
75% the arena used by malloc called from start_thread
9% pthread_create
15% loading shared libraries
None of that looks like much of a concern to me. I assume that the start_thread memory is the pthread stack cache.
If I use massif for profiling malloc/new, then the last sample is
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
73 2,929,610 2,360 2,308 52 0
You should check the return status for your thread creation. It's unlikely that you are succeeding in creating 32000 threads.
A bit of Valgrind source:
coregrind/pub_core_options.h:#define MAX_THREADS_DEFAULT 500
coregrind/m_scheduler/scheduler.c: VG_(printf)("Use --max-threads=INT to specify a larger number of threads\n"
Assuming that this is amd64 Linux, I believe that the default pthread stack size is 8 MiB. That means you need roughly 250 GiB (32000 × 8 MiB) for stack memory. Does your machine have that much?
Please try the following
Use pthread_attr_setstacksize to set the stack sizes to PTHREAD_STACK_MIN (16k).
Run Valgrind with --max-threads=32001
Even with the above you may still hit some Valgrind limits such as VG_N_SEGMENTS.
If you see a message like
"Valgrind: FATAL: VG_N_SEGMENTS is too low.
Increase it and rebuild.
Exiting now."
Then you will need to rebuild Valgrind with an increased limit.
I am trying to reproduce a problem.
My C code raises SIGABRT; I traced it back to line 3174 of:
https://elixir.bootlin.com/glibc/glibc-2.27/source/malloc/malloc.c
/* Little security check which won't hurt performance: the allocator
never wrapps around at the end of the address space. Therefore
we can exclude some size values which might appear here by
accident or by "design" from some intruder. We need to bypass
this check for dumped fake mmap chunks from the old main arena
because the new malloc may provide additional alignment. */
if ((__builtin_expect ((uintptr_t) oldp > (uintptr_t) -oldsize, 0)
|| __builtin_expect (misaligned_chunk (oldp), 0))
&& !DUMPED_MAIN_ARENA_CHUNK (oldp))
malloc_printerr ("realloc(): invalid pointer");
My understanding is that memory is allocated when I call calloc, and that when I then call realloc to grow that memory area, the heap is not available for some reason, which gives SIGABRT.
My other question is: how can I limit the heap area to some bytes, say 10 bytes, to replicate the problem? On Stack Overflow, RLIMIT and setrlimit are mentioned, but no sample code is given. Can you provide sample code where the heap size is 10 bytes?
How can I limit the heap area to some bytes say, 10 bytes
Can you provide sample code where heap size is 10 Bytes ?
From How to limit heap size for a c code in linux, you could do:
You could use (inside your program) setrlimit(2), probably with RLIMIT_AS (as cited by Ouah's answer).
#include <sys/resource.h>
int main() {
    setrlimit(RLIMIT_AS, &(struct rlimit){10,10});
}
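Better yet, make your shell do it. With bash it is the ulimit builtin; note that -v takes its value in KiB, so the command below sets a 10 KiB limit rather than 10 bytes.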
$ ulimit -v 10
$ ./your_program.out
to replicate the problem
Most probably, limiting the heap size will produce a different failure, one caused by the limit itself, that is unrelated to your original problem and will not help you debug it. Instead, I would suggest looking into AddressSanitizer and Valgrind.
The manual page told me a lot, and through it I learned much of the background of memory management in "glibc".
But I am still confused. Does "malloc_trim(0)" (note the zero parameter) mean that (1) all the memory in the "heap" section will be returned to the OS, or (2) just the "unused" memory in the topmost region of the heap will be returned to the OS?
If the answer is (1), what about memory in the heap that is still in use? If the heap has used memory somewhere, will it be discarded, or will the function fail?
And if the answer is (2), what about the "holes" at places other than the top of the heap? They are no longer used, but the topmost region of the heap is still in use; will the call still work effectively?
Thanks.
The man page of malloc_trim was committed here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and, as I understand it, it was written from scratch by the man-pages project maintainer, kerrisk, in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As far as I can tell from grepping glibc's git, there are no man pages in glibc, and no commit to the malloc_trim man page to document this patch. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
There are malloc_trim comments from malloc/malloc.c:
Additional functions:
malloc_trim(size_t pad);
/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);
Freeing from the middle of the chunk is not documented as text in malloc/malloc.c, nor in the man-pages project. The man page from 2012 may be the first man page of the function, and it was not written by the authors of glibc. The Info page of glibc only mentions M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and doesn't list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (and it also doesn't document memusage/memusagestat/libmemusage.so).
In December 2007 there was a commit https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc by Ulrich Drepper (it is part of glibc 2.9 and newer) which changed the mtrim implementation (but it didn't change any documentation or man page, as there are no man pages in glibc):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
Unused parts of chunks (anywhere, including chunks in the middle) that are aligned on the page size and larger than a page may be marked as MADV_DONTNEED https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=malloc/malloc.c;h=c54c203cbf1f024e72493546221305b4fd5729b7;hp=1e716089a2b976d120c304ad75dd95c63737ad75;hb=68631c8eb92ff38d9da1ae34f6aa048539b199cc;hpb=52386be756e113f20502f181d780aecc38cbb66a
INTERNAL_SIZE_T size = chunksize (p);
if (size > psm1 + sizeof (struct malloc_chunk))
  {
    /* See whether the chunk contains at least one unused page. */
    char *paligned_mem = (char *) (((uintptr_t) p
                                    + sizeof (struct malloc_chunk)
                                    + psm1) & ~psm1);
    assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
    assert ((char *) p + size > paligned_mem);
    /* This is the size we could potentially free. */
    size -= paligned_mem - (char *) p;
    if (size > psm1)
      madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
  }
This is one of only two uses of madvise with MADV_DONTNEED in glibc now: one for the top part of heaps (shrink_heap) and the other for marking any chunk (mtrim): http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c:643: __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c:4535: __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
We can test the malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
int main()
{
    int *m1,*m2,*m3,*m4;
    printf("%s\n","Test started");
    m1=(int*)malloc(20000);
    m2=(int*)malloc(40000);
    m3=(int*)malloc(80000);
    m4=(int*)malloc(10000);
    // check that all arrays are allocated on the heap and not with mmap
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
    // free 40000 bytes in the middle
    free(m2);
    // call trim (same result with 2000 or 2000000 argument)
    malloc_trim(0);
    // call some syscall to find this point in the strace output
    sleep(1);
    free(m1);
    free(m3);
    free(m4);
    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
Compile with gcc test_malloc_trim.c -o test_malloc_trim, then run strace ./test_malloc_trim:
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
...
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So there was a madvise call with MADV_DONTNEED for 9 pages (36864 bytes) after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap.
The man page for malloc_trim says it releases free memory, so if there is allocated memory in the heap, it won't release the whole heap. The parameter is there if you know you're still going to need a certain amount of memory, so freeing more than that would cause glibc to have to do unnecessary work later.
As for holes, this is a standard problem with memory management and returning memory to the OS. The primary low-level heap management available to the program is brk and sbrk, and all they can do is extend or shrink the heap area by changing the top. So there's no way for them to return holes to the operating system; once the program has called sbrk to allocate more heap, that space can only be returned if the top of that space is free and can be handed back.
Note that there are other, more complex ways to allocate memory (with anonymous mmap, for example), which may have different constraints than sbrk-based allocation.
When you have a dynamically allocated buffer that varies its size at runtime in unpredictable ways (for example a vector or a string), one way to optimize its allocation is to only resize its backing store on powers of 2 (or some other set of boundaries/thresholds), and leave the extra space unused. This helps to amortize the cost of searching for new free memory and copying the data across, at the expense of a little extra memory use. For example, the interface specification (reserve vs resize vs trim) of many C++ STL containers has such a scheme in mind.
My question is does the default implementation of the malloc/realloc/free memory manager on Linux 3.0 x86_64, GLIBC 2.13, GCC 4.6 (Ubuntu 11.10) have such an optimization?
void* p = malloc(N);
... // time passes, stuff happens
void* q = realloc(p,M);
Put another way, for what values of N and M (or in what other circumstances) will p == q?
From the realloc implementation in glibc trunk at http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=malloc/malloc.c;h=12d2211b0d6603ac27840d6f629071d1c78586fe;hb=HEAD
First, if the memory has been obtained via mmap() instead of sbrk(), which glibc malloc does for large requests, >= 128 kB by default IIRC:
if (chunk_is_mmapped(oldp))
{
  void* newmem;
#if HAVE_MREMAP
  newp = mremap_chunk(oldp, nb);
  if(newp) return chunk2mem(newp);
#endif
  /* Note the extra SIZE_SZ overhead. */
  if(oldsize - SIZE_SZ >= nb) return oldmem; /* do nothing */
  /* Must alloc, copy, free. */
  newmem = public_mALLOc(bytes);
  if (newmem == 0) return 0; /* propagate failure */
  MALLOC_COPY(newmem, oldmem, oldsize - 2*SIZE_SZ);
  munmap_chunk(oldp);
  return newmem;
}
(Linux has mremap(), so in practice this is what is done).
For smaller requests, a few lines below we have
newp = _int_realloc(ar_ptr, oldp, oldsize, nb);
where _int_realloc is a bit big to copy-paste here, but you'll find it starting at line 4221 in the link above. AFAICS, it does NOT do the constant factor optimization increase that e.g. the C++ std::vector does, but rather allocates exactly the amount requested by the user (rounded up to the next chunk boundaries + alignment stuff and so on).
I suppose the idea is that if the user wants this factor of 2 size increase (or any other constant factor increase in order to guarantee logarithmic efficiency when resizing multiple times), then the user can implement it himself on top of the facility provided by the C library.
Perhaps you can use malloc_usable_size (google for it) to find the answer experimentally. This function, however, seems to be undocumented, so you will need to check whether it is available on your platform.
See also How to find how much space is allocated by a call to malloc()?
I am trying to put a head tag and a foot tag inside struct malloc_chunk, which looks like this:
struct malloc_chunk {
    INTERNAL_SIZE_T prev_size;
    INTERNAL_SIZE_T size;
};
Here is what I did:
1.
struct malloc_chunk {
    INTERNAL_SIZE_T foot_tag;
    INTERNAL_SIZE_T prev_size;
    INTERNAL_SIZE_T size;
};
Before putting the head_tag inside malloc_chunk, I added only the foot_tag and compiled glibc. I made a small test program that mallocs 60 bytes from the system and then frees it. Though I can see that malloc returned properly, free complained, saying "invalid pointer". The pointer malloc returned was 0x9313010, which makes the pointer to the start of the malloc_chunk 0x9313004.
So when free is passed 0x9313010, it converts it to 0x9313004 through mem2chunk and then checks it for alignment. Since my word size is 4, this alignment check is where I am getting problems. Can you please tell me whether the mem pointer (returned by malloc) needs to be strictly double-word aligned? (Here it may not satisfy that criterion, as the difference between the pointer returned by malloc and the start of the chunk is 12 bytes, not 8.)
To overcome issue 1, I added a head_tag to the structure as well, so that it becomes
struct malloc_chunk {
    INTERNAL_SIZE_T foot_tag;
    INTERNAL_SIZE_T prev_size;
    INTERNAL_SIZE_T size;
    INTERNAL_SIZE_T head_tag;
};
Now the difference between the pointer returned by malloc and the start of the chunk will be 16, which will always be double-word aligned. But here I am facing a bigger problem: when the first malloc is called, the arena is set up and the bins are initialised, and the size of the victim is not coming out as zero, as it should in the normal case. The problem is that victim->size actually comes out to be the place where 'top' is stored rather than 'last_remainder'. I'd like to ask for your opinion on whether there is any other way, workaround or solution to overcome this arena initialization issue.
Thanks and Regards,
Kapil
From whatever I have learnt, try not to modify the malloc_chunk structure. If you have to, then make sure that the top pointer is initially 0, otherwise the heap will never be formed. It's the sYSMALLOC() function in _int_malloc() that first MMAPs a bigger chunk from the kernel. It will be invoked only if top is 0 initially.