Does Linux guarantee freeing malloc'd unfreed memory on program exit? - memory-leaks

I used to believe it does for certain but... I can't find it explicitly stated.
man 3 exit and man 2 _exit verbosely specify the effects of process termination, but don't mention memory leaks.
Posix comes closer: it mentions this:
Memory mappings that were created in the process shall be unmapped before the process is destroyed.
[TYM] [Option Start] Any blocks of typed memory that were mapped in the calling process shall be unmapped, as if munmap() was implicitly called to unmap them. [Option End]
Intermixing this with man 3 malloc:
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2).
So we could conclude that if malloc called mmap then process termination might make a corresponding call to munmap, BUT... (a) this "optional feature" tags in this POSIX specification are kind of worrying, (b) This is mmap but what about sbrk? (c) Linux isn't 100% POSIX conformant so I'm uncertain if intermixing Linux docu with Posix specs is mandated
The reason I'm asking is this... When a library call fails, am I allowed to just exit?
if(somecall() == -1) {
error(EXIT_FAILURE, errno, "Big fat nasty error.\n");
}
Or do I have to go up the stack making sure everything all the way up to main() is free()'d and only call exit or error in main()?
The former is so much simpler. But to feel easy going with the former, I'd like to find it in the docs explicitly mentioned that this is not an error and that doing this is safe. As I said, the fact that the docs care to explicitly mention a number of guarantees of what will surely be cleaned up BUT fail to mention this specific guarantee unsettles me. (Isn't it the most common and most obvious case? Wouldn't this be mentioned in the first place?)

This "freeing" is done at kernel level. So you are unlikely to find anything direct in POSIX API or C specifications as virtual memory is well "below" them. So you would hardly find anything relevant - let alone guarantees.
On Linux, the kernel reclaims the memory on process exit (both sbrk and mmap), which is guaranteed. See source code of mm.
When a library call fails, am I allowed to just exit?
Yes. This is fine to do.
However, note that there may be other considerations you need to think of, such uncleaned temporary files, open database/network connections and so on. E.g., if your program leaves a database connection open and exits, the server side may not know when to close the connection.
You can read more about Virtual Memory Manager (it's based on older kernel but the idea is still applicable).

I am certain that the POSIX committee intended that all memory allocated by malloc should be deallocated as one of the "consequences of process termination" listed in the specification of _exit that you linked to. I can also tell you that in practice every implementation of Unix I've ever used has done this.
However, I think you have found a genuine lacuna in the specifications. POSIX doesn't say anything about memory allocated by sbrk, because it doesn't specify sbrk at all. Its specification for malloc is taken more-or-less verbatim from the C standard, and the C standard intentionally doesn't say that all memory allocated by malloc should be deallocated upon "normal termination" because there exist embedded environments that don't do that. And, as you point out, "memory mappings that were created in the process" could be read to apply only to allocations made directly by mmap, shmat, and similar. It might be worth filing an interpretation request with the Austin Group.

Related

mmap: will the mapped file be loaded into memory immediately?

From the manual, I just know that mmap() maps a file to a virtual address space, so the file can be randomly accessed. But, it is unclear to me that whether the mapped file is loaded into memory immediately? I guess that kernel manages the mapped memory by pages, and they are loaded on demand, if I only do a few of reads and writes, only a few pages are loaded. Is it correct?
No, yes, maybe. It depends.
Calling mmap generally only means that to your application, the mapped file's contents are mapped to its address space as if the file was loaded there. Or, as if the file really existed in memory, as if they were one and the same (which includes changes being written back to disk, assuming you have write access).
No more, no less. It has no notion of loading something, nor does the application know what this means.
An application does not truly have knowledge of any such thing as memory, although the virtual memory system makes it appear like that. The memory that an application can "see" (and access) may or may not correspond to actual physical memory, and this can in principle change at any time, without prior warning, and without an obvious reason (obvious to your application).
Other than possibly experiencing a small delay due to a page fault, an application is (in principle) entirely unaware of any such thing happening and has little or no control over it1.
Applications will, generally, load pages from mapped files (including the main executable!) on demand, as a consequence of encountering a fault. However, an operating system will usually try to speculatively prefetch data to optimize performance.
In practice, calling mmap will immediately begin to (asynchronously) prefetch pages from the beginning of the mapping, up to a certain implementation-specified size. Which means, in principle, for small files the answer would be "yes", and for larger files it would be "no".
However, mmap does not block to wait for completion of the readahead, which means that you have no guarantee that any of the file is in RAM immediately after mmap returns (not that you have that guarantee at any time anyway!). Insofar, the answer is "maybe".
Under Linux, last time I looked, the default prefetch size was 31 blocks (~127k) -- but this may have changed, plus it's a tuneable parameter. As pages near or at the end of the prefetched area are touched, more pages are being prefetched asynchronously.
If you have hinted MADV_RANDOM to madvise, prefetching is "less likely to happen", under Linux this completely disables prefetch.
On the other hand, giving the MADV_SEQUENTIAL hint will asynchronously prefetch "more aggressively" beginning from the beginning of the mapping (and may discard accessed pages quicker). Under Linux, "more aggressively" means twice the normal amount.
Giving the MADV_WILLNEED hint suggests (but does not guarantee) that all pages in the given range are loaded as soon as possible (since you're saying you're going to access them). The OS may ignore this, but under Linux, it is treated rather as an order than a hint, up to the process' maximum RSS limit, and an implementation-specified limit (if I remember correctly, 1/2 the amount of physical RAM).
Note that MADV_DONTNEED is arguably implemented wrongly under Linux. The hint is not interpreted in the way specified by POSIX, i.e. you're OK with pages being paged out for the moment, but rather that you mean to discard them. Which makes no big difference for readonly mapped pages (other than a small delay, which you said would be OK), but it sure does matter for everything else.
In particular, using MADV_DONTNEED thinking Linux will release unneeded pages after the OS has written them lazily to disk is not how things work! You must explicitly sync, or prepare for a surprise.
Having called readahead on the file descriptor prior to calling mmap (or alternatively, having had read/written the file previously), the file's contents will in practice indeed be in RAM immediately.
This is, however, only an implementation detail (unified virtual memory system), and subject to memory pressure on the system.
Calling mlock will -- assuming it succeeds2 -- immediately load the requested pages into RAM. It blocks until all pages are physically present, and you have the guarantee that the pages will stay in RAM until you unlock them.
1 There exist functionality to query (mincore) whether any or all of the pages in a particular range are actually present at the very moment, and functionality to hint the OS about what you would like to see happening without any hard guarantees (madvise), and finally functionality to force a limited subset of pages to be present in memory (mlock) for privilegued processes.
2 It might not, both for lack of privilegues and for exceeding quotas or the amount of physical RAM present.
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
By default, mmap() only configure the mapping and returns (fast).
Linux (at least) has the option MAP_POPULATE (see 'man mmap') that does exactly what your question is about.
Yes. The whole point of mmap is that is manages memory more efficiently than just slurping everything into memory.
Of course, any given implementation may in some situations decide that it's more efficient to read in the whole file in one go, but that should be transparent to the program calling mmap.

Does madvise(___, ___, MADV_DONTNEED) instruct the OS to lazily write to disk?

Hypothetically, suppose I want to perform sequential writing to a potentially very large file.
If I mmap() a gigantic region and madvise(MADV_SEQUENTIAL) on that entire region, then I can write to the memory in a relatively efficient manner. This I have gotten to work just fine.
Now, in order to free up various OS resources as I am writing, I occasionally perform a munmap() on small chunks of memory that have already been written to. My concern is that munmap() and msync()will block my thread, waiting for the data to be physically committed to disk. I cannot slow down my writer at all, so I need to find another way.
Would it be better to use madvise(MADV_DONTNEED) on the small, already-written chunk of memory? I want to tell the OS to write that memory to disk lazily, and not to block my calling thread.
The manpage on madvise() has this to say, which is rather ambiguous:
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the
application is finished with the given range, so the kernel can free
resources associated with it.) Subsequent accesses of pages in this
range will succeed, but will result either in re-loading of the memory
contents from the underlying mapped file (see mmap(2)) or
zero-fill-on-demand pages for mappings without an underlying file.
No!
For your own good, stay away from MADV_DONTNEED. Linux will not take this as a hint to throw pages away after writing them back, but to throw them away immediately. This is not considered a bug, but a deliberate decision.
Ironically, the reasoning is that the functionality of a non-destructive MADV_DONTNEED is already given by msync(MS_INVALIDATE|MS_ASYNC), MS_ASYNC on the other hand does not start I/O (in fact, it does nothing at all, following the reasoning that dirty page writeback works fine anyway), fsync always blocks, and sync_file_range may block if you exceed some obscure limit and is considered "extremely dangerous" by the documentation, whatever that means.
Either way, you must msync(MS_SYNC), or fsync (both blocking), or sync_file_range (possibly blocking) followed by fsync, or you will lose data with MADV_DONTNEED. If you cannot afford to possibly block, you have no choice, sadly, but to do this in another thread.
For recent Linux kernels (just tested on Linux 5.4), MADV_DONTNEED works as expected when the mapping is NOT private, e.g. mmap without MAP_PRIVATE flag. I'm not sure what's the behavior on previous versions of Linux kernel.
From release 4.15 of the Linux man-pages project's madvise manpage:
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
Linux added a new flag MADV_FREE with the same behavior in BSD systems in Linux 4.5
which just mark pages as available to free if needed, but it doesn't free them immediately, making possible to reuse the memory range without incurring in the costs of faulting the pages again.
For why MADV_DONTNEED for private mapping may result zero filled pages upon future access, watch Bryan Cantrill's rant as mentioned in comments of #Damon's answer. Spoiler: it comes from Tru64 UNIX.
As already mentioned, MADV_DONTNEED is not your friend. Since Linux 5.4, you can use MADV_COLD to tell the kernel it should page out that memory when there is memory pressure. This seems to be exactly what is wanted in this situation.
Read more here:
https://lwn.net/Articles/793462/
first, madv_sequential enables aggressive readahead, so you don't need it.
second, os will lazily write dirty file-baked memory to disk anyway, even if you will do nothing. but madv_dontneed will instruct it to free memory immediately (what you call "various os resources"). third, it is not clear that mmapping files for sequential writing has any advantage. you probably will be better served by just write(2) (but use buffers - either manual or stdio).

Nvidia Information Disclosure / Memory Vulnerability on Linux and General OS Memory Protection

I thought this was expected behavior?
From: http://classic.chem.msu.su/cgi-bin/ceilidh.exe/gran/gamess/forum/?C35e9ea936bHW-7675-1380-00.htm
Paraphrased summary: "Working on the Linux port we found that cudaHostAlloc/cuMemHostAlloc CUDA API calls return un-initialized pinned memory. This hole may potentially allow one to examine regions of memory previously used by other programs and Linux kernel. We recommend everybody to stop running CUDA drivers on any multiuser system."
My understanding was that "Normal" malloc returns un-initialized memory, so I don't see what the difference here is...
The way I understand how memory allocation works would allow the following to happen:
-userA runs a program on a system that crunches a bunch of sensitive information. When the calculations are done, the results are written to disk, the processes exits, and userA logs off.
-userB logs in next. userB runs a program that requests all available memory in the system, and writes the content of his un-initialized memory, which contains some of userA's sensitive information that was left in RAM, to disk.
I have to be missing something here. What is it? Is memory zero'd-out somewhere? Is kernel/pinned memory special in a relevant way?
Memory returned by malloc() may be nonzero, but only after being used and freed by other code in the same process. Never another process. The OS is supposed to rigorously enforce memory protections between processes, even after they have exited.
Kernel/pinned memory is only special in that it apparently gave a kernel mode driver the opportunity to break the OS's process protection guarantees.
So no, this is not expected behavior; yes, this was a bug. Kudos to NVIDIA for acting on it so quickly!
The only part that requires root priviledges to install CUDA is the NVIDIA driver. As a result all operations done using NVIDIA compiler and link can be done using regular system calls, and standard compiling (provided you have the proper information -lol-). If any security holes lies there, it remains, wether or not cudaHostAlloc/cuMemHostAlloc is modified.
I am dubious about the first answer seen on this post. The man page for malloc specifies that
the memory is not cleared. The man page for free does not mention any clearing of the memory.
The clearing of memory seems to be in the responsability of the coder of a sensitive section -lol-, that leave the problem of an unexpected (rare) exit. Apart from VMS (good but not widely used OS), I dont think any OS accept the performance cost of a systematic clearing. I am not clear about the way the system may track in the heap of a newly allocated memory what was previously in the process area, and what was not.
My conclusion is: if you need a strict level of privacy, do not use a multi-user system
(or use VMS).

User-space memory editing programs

How do programs that edit memory of other processes work, such as Cheat Engine and iHaxGamez? My understanding is that a process reading from (let alone writing to) another process' memory is immediate grounds for a segmentation fault.
Gaining access to another processes memory under linux is fairly straightforward (assuming you have sufficient user privileges).
For example the file /dev/mem will provide access to the entire memory space of cpu. Details of the mappings for an individual process can be found in /proc/<pid>/maps.
Another example has been given here.
The operation system's hardware abstraction layer usually offers functions to manipulate the memory of other processes. In Windows, the corresponding functions are ReadProcessMemory and WriteProcessMemory.
It has no reason to segfault; OS (kernel, ...) API is used to write.
Segfault occurs (get signalled) from OS when a process attempts to access it's own memory in a bad way (char[] overflow).
About the games: well, if a value is stored at an address, and gets read sometimes, then it could be modified before next reading occurs.
You can use WinAPI WriteProcessMemory to write to memory space of other process.
Also read some PE/COFF documentation and use VirtualQueryEx and ReadProcessMemory to know what and where to write.

Is fork() copy-on-write a stable exposed behavior that can be used to implement read-only shared memory?

The man page on fork() states that it does not copy data pages, it maps them into the child process and puts a copy-on-write flag. Is that behavior:
consistent between flavors of Linux?
considered an implementation detail and therefore likely to change?
I'm wondering if I can use fork() as a means to get a shared read-only memory block on the cheap. If the memory is physically copied, it would be rather expensive - there's a lot of forking going on, and the data area is big enough - but I'm hoping not...
Linux running on machines without a MMU (memory management unit) will copy all process memory on fork().
However, those systems are usually very small and embedded and you probably don't have to worry about them.
Many services such as Apache's fork model, use the initialize and fork() method to share initialized data structures.
You should be aware that if you are using languages like Perl and Python that use reference-counted variables, or C++ shared_ptr's, this model will not work. It will not work because as the reference counts are adjusted up and down, the memory becomes unshared and gets copied.
This causes huge amounts of memory usage in Perl daemons like SpamAssassin that attempt to use an initialize and fork model.
Yes you can certainly rely on it on MMU-Linux kernels; this is almost everything.
However, the page size isn't the same everywhere.
It is possible to explicitly make a shared memory area for forked process, by using mmap() to create an anonymous map - one which is not backed by a physical file. On fork, this area will always remain shared (provided the child doesn't unmap it, or map something else in at the same address). You can mprotect it to be readonly if you want.
Memory allocated with (for example) malloc can easily end up sharing a page with something that isn't readonly, which means it gets copied anyway when another structure is modified. This includes internal structures used by the malloc implementation. So you might want to mmap a specific area for this purpose and allocate from that.
Can you rely on the fact that all Linux flavors do it this way? No. But you can rely on the fact that those who don't use an even faster method.
Therefore you should use the feature and rely on it and revisit your decision if you get a performance problem.
The success of this approach depends on how well you stick to your self-imposed "read-only" limitation. Both parent and child have to obey this stricture, else the memory gets copied.
This may not be the catastrophe you're envisioning, however. The kernel can copy as little as a single page (typically 4 KB) to implement CoW semantics. A typical Linux server will use something more complex, some sort of slab allocator, so the copied region could be much larger.
The main point is that this is decoupled from your program's conception of its memory use. If you malloc() 1 GB of RAM, fork off a child, and the child changes just the first byte of that memory block, the entire 1 GB block isn't copied. Perhaps as little as one page is copied, up to the slab size containing that first byte.
Yes
All the linux distros use the same kernel, albeit with slightly different versions and releases of it.
It's unlikely that another underlying fork(2) implementation will be faster any time soon, so it's a safe bet that copy-on-write will continue to be the mechanism. Perhaps it won't be forever, but for years, definitely.
Certainly some major software systems (for example, Phusion Passenger) use fork(2) in the same way that you want to, so you would not be the only one taking advantage of CoW.

Resources