mmap File-backed mapping vs Anonymous mapping in Linux

What is the main difference between a file-backed mapping and an anonymous mapping?
How do we choose between a file-backed mapping and an anonymous mapping when we need IPC between processes?
What are the advantages and disadvantages of each?

The mmap() system call lets you create either a file-backed mapping or an anonymous mapping.
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
File-backed mapping - On Linux there is a file, /dev/zero, which is an infinite source of zero bytes. You open this file and pass its descriptor to mmap() with the appropriate flag: MAP_SHARED if you want the memory to be shared with other processes, or MAP_PRIVATE if you don't want sharing.
Example:
.
.
if ((fd = open("/dev/zero", O_RDWR)) < 0) {
    perror("open error");
    exit(1);
}
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {
    perror("Error in memory mapping");
    exit(1);
}
close(fd); /* the descriptor can be closed once the memory is mapped */
/* create child process */
.
.
Quoting the man-page of mmap() :-
The contents of a file mapping (as opposed to an anonymous mapping;
see MAP_ANONYMOUS below), are initialized using length bytes starting
at offset offset in the file (or other object) referred to by the file
descriptor fd. offset must be a multiple of the page size as returned
by sysconf(_SC_PAGE_SIZE).
In our case, the region is initialized with zeros.
Quoting the book Advanced Programming in the UNIX Environment, Second Edition, by W. Richard Stevens and Stephen A. Rago:
The advantage of using /dev/zero in the manner that we've shown is
that an actual file need not exist before we call mmap to create the
mapped region. Mapping /dev/zero automatically creates a mapped region
of the specified size. The disadvantage of this technique is that it
works only between related processes. With related processes, however,
it is probably simpler and more efficient to use threads (Chapters 11
and 12). Note that regardless of which technique is used, we still
need to synchronize access to the shared data
After the call to mmap() succeeds, we create a child process, which will be able to see writes to the mapped region (because we specified the MAP_SHARED flag).
Anonymous mapping - The same thing can be done with an anonymous mapping. For an anonymous mapping, we pass the MAP_ANON flag to mmap() and specify the file descriptor as -1.
The resulting region is anonymous (since it's not associated with a pathname through a file descriptor) and can be shared with descendant processes.
The advantage is that we don't need any file to map the memory, so the overhead of opening and closing a file is also avoided.
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0)) == MAP_FAILED) {
    perror("Error in anonymous memory mapping");
    exit(1);
}
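To see the sharing in action, here is a minimal sketch of a parent and child communicating through a shared anonymous page. The helper name share_value_with_child is hypothetical, MAP_ANONYMOUS is used as the synonym of MAP_ANON, and waitpid() stands in for real synchronization:

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child stores `value` in a shared anonymous
 * page; the parent returns what it reads back, or -1 on error. */
int share_value_with_child(int value)
{
    int *area = mmap(NULL, sizeof *area, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (area == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    *area = 0;

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        munmap(area, sizeof *area);
        return -1;
    }
    if (pid == 0) {              /* child: write into the shared page */
        *area = value;
        _exit(0);
    }

    /* parent: waiting for the child is the only synchronization here */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        munmap(area, sizeof *area);
        return -1;
    }
    int seen = *area;            /* the child's write is visible */
    munmap(area, sizeof *area);
    return seen;
}
```

With MAP_PRIVATE instead of MAP_SHARED the child would write to its own copy-on-write page and the parent would still read 0.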
So both techniques shown here (mapping /dev/zero and anonymous mapping) work only between related processes.
If you need shared memory between unrelated processes, you can map a regular file with MAP_SHARED, or create named shared memory with shm_open() and pass the returned file descriptor to mmap().

Related

How to unmap an mmap'd file by replacing with a mapping to empty pages

Is there a way from Linux userspace to replace the pages of a mapped file (or mmap'd pages within a certain logical address range) with empty pages (mapped from /dev/null, or maybe a single empty page, mapped repeatedly over the top of the pages mapped from the file)?
For context, I want to find a fix for this JDK bug:
https://bugs.openjdk.java.net/browse/JDK-4724038
To summarize the bug: it is not currently possible to unmap files in Java until the JVM can garbage collect the MappedByteBuffer that wraps an mmap'd file, because forcibly unmapping the file could give rise to security issues due to race conditions (e.g. native code could still be trying to access the same address range that the file was mapped to, and the OS may have already mapped a new file into that same logical address range).
I'm looking to replace the mapped pages in the logical address range, and then unmap the file. Is there any way to accomplish this?
(Bonus points if you know a way of doing this in other operating systems too, particularly Windows and Mac OS X.)
Note that this doesn't have to be an atomic operation. The main goal is to separate the unmapping of the memory (or the replacing of the mapped file contents with zero-on-read pages) from the closing of the file, since that will solve a litany of issues on both Linux (which has a low limit on the number of file descriptors per process) and Windows (the fact you can't delete a file while it is mapped).
UPDATE: see also: Memory-mapping a file in Windows with SHARE attribute (so file is not locked against deletion)

On Linux you can use mmap with MAP_FIXED to replace the mapping with any mapping you want. If you replace the entire mapping the reference to the file will be removed.
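A minimal sketch of that idea, assuming Linux and a writable /tmp. The names replace_with_zero_pages and demo_replace_mapping are hypothetical; the demo maps a temporary file, overlays the same address range with anonymous pages, and checks that the old address now reads as zeros:

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS and mkstemp on glibc */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overlay [addr, addr+len) with fresh zero-filled anonymous pages.
 * addr must be page-aligned. Returns 0 on success, -1 on error. */
int replace_with_zero_pages(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}

/* Hypothetical demo: returns 0 if the file contents were visible before
 * the replacement and zeros are visible afterwards. */
int demo_replace_mapping(void)
{
    char path[] = "/tmp/mapdemoXXXXXX";
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) < 0 || write(fd, "hello", 5) != 5) {
        close(fd);
        return -1;
    }
    char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping keeps the file alive */
    if (map == MAP_FAILED)
        return -1;

    int ok = memcmp(map, "hello", 5) == 0;
    if (replace_with_zero_pages(map, len) < 0)
        ok = 0;
    else
        ok = ok && map[0] == 0 && map[4] == 0;
    munmap(map, len);
    return ok ? 0 : -1;
}
```

The address range stays reserved throughout, so no unrelated mapping can land there between the two states.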

The reason the bug has remained in the JDK for so long is fundamentally the race condition between unmapping the memory and mapping the dummy memory: in between, some other memory could end up mapped there (potentially by native code). I have been over the OS APIs and found no memory operation that is atomic at the syscall level and that unmaps a file and maps something else at the same address. However, there are solutions that block the whole process while swapping out the mapping from underneath it.
The unmap works correctly in finalize without a guard because the GC has already proven the object is unreachable, so there is no race.
Highly Linux specific solution:
1) vfork()
2) send parent a STOP signal
3) unmap the memory
4) map the zeros in its place
5) send parent a CONT signal
6) _exit (which unblocks parent thread)
On Linux, memory mapping changes made by the vfork() child propagate to the parent, because the two share an address space.
The code actually looks more like this (vfork() is bonkers man):
#include <pthread.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int unmap(void *addr, size_t length)
{
    int wstatus;
    pid_t child;
    pid_t parent;
    int thread_cancel_state;
    sigset_t signal_set;
    sigset_t old_signal_set;

    parent = getpid();
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &thread_cancel_state);
    sigfillset(&signal_set);
    pthread_sigmask(SIG_SETMASK, &signal_set, &old_signal_set);
    if (0 == (child = vfork())) {
        int err = 0;
        kill(parent, SIGSTOP);
        if (-1 == munmap(addr, length))
            err = 1;
        /* MAP_ANONYMOUS needs MAP_PRIVATE or MAP_SHARED; MAP_FIXED pins the address */
        else if (MAP_FAILED == mmap(addr, length, PROT_NONE,
                                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0))
            err = 1;
        kill(parent, SIGCONT);
        _exit(err);
    }
    if (child > 0)
        waitpid(child, &wstatus, 0);
    else
        wstatus = 255;
    pthread_sigmask(SIG_SETMASK, &old_signal_set, NULL);
    pthread_setcancelstate(thread_cancel_state, &thread_cancel_state);
    /* nonzero on failure: vfork failed, abnormal exit, or nonzero exit status */
    return !(child > 0 && WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0);
}
Under Windows you can stop all threads but the current one using SuspendThread, which feels tailor-made for this. However, enumerating threads is going to be hard because you're racing against CreateThread. You have to use the thread-enumeration APIs in ntdll.dll (you cannot use ToolHelp here, trust me) and call SuspendThread on each thread but your own, carefully using only VirtualAlloc to allocate memory because SuspendThread just broke all the heap allocation routines, and you have to do all that in a loop until you find no more threads.
There's a writeup here that I don't feel I can distill accurately:
http://forums.codeguru.com/showthread.php?200588-How-to-enumerate-threads-in-currently-running-process
I did not find any solutions for Mac OSX.

Direct disk I/O to/from a buffer allocated by mmap

I need help with direct disk I/O. I open a file with the O_DIRECT flag and get a file descriptor (fd). In my user-space application, I want to read a large amount of data from the file, and this data is used only once. An uncached memory buffer is allocated in my kernel module through "set_memory_uc" (on x86) and "remap_pfn_range" with vm_page_prot set to noncached (pgprot_noncached). This buffer is intended for DMA transfers over PCIe.
I tried
read(fd, buffer, len)
and
lseek(fd, 0x1000, SEEK_SET)
The virtual address of 'buffer' is aligned to a 4k boundary, and so is 'len' (n*4k).
lseek seems to work, because it returns 0x1000 after the call,
but read returns -1.
Is there any restriction on reading disk data directly into an mmap'd buffer?

Instead of O_DIRECT, consider posix_fadvise() with the POSIX_FADV_NOREUSE flag to indicate "the data will be used only once."
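A minimal sketch of that suggestion, using a hypothetical helper read_once. The advice is only a hint and the kernel may ignore it; note that posix_fadvise() returns an error number directly rather than setting errno:

```c
#define _DEFAULT_SOURCE   /* exposes posix_fadvise on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read a whole file once through the page cache, hinting that the data
 * will not be reused. Returns the number of bytes read, or -1 on error. */
long read_once(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 with len 0 means "the whole file" */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed: %d\n", rc);

    char buf[4096];
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;              /* a real program would consume buf here */
    close(fd);
    return n < 0 ? -1 : total;
}
```

Unlike O_DIRECT, this path places no alignment requirements on the buffer, length, or offset.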

Write-only mapping a O_WRONLY opened file supposed to work?

Is mmap() supposed to be able to create a write-only mapping of a file opened with O_WRONLY?
I am asking because the following fails on a Linux 4.0.4 x86-64 system (strace log):
mkdir("test", 0700) = 0
open("test/foo", O_WRONLY|O_CREAT, 0666) = 3
ftruncate(3, 11) = 0
mmap(NULL, 11, PROT_WRITE, MAP_SHARED, 3, 0) = -1 EACCES (Permission denied)
The errno equals EACCES.
Replacing the open-flag O_WRONLY with O_RDWR yields a successful mapping.
The Linux mmap man page documents the errno as:
EACCES A file descriptor refers to a non-regular file. Or a file map‐
ping was requested, but fd is not open for reading. Or
MAP_SHARED was requested and PROT_WRITE is set, but fd is not
open in read/write (O_RDWR) mode. Or PROT_WRITE is set, but the
file is append-only.
Thus, that behaviour is documented with the second sentence.
But what is the reason behind it?
Is it allowed by POSIX?
Is it a kernel or a library limitation? (On a quick glance, I couldn't find anything obvious in Linux/mm/mmap.c)

The IEEE Std 1003.1, 2004 Edition (POSIX.1 2004) appears to forbid it.
An implementation may permit accesses other than those specified by prot; however, if the Memory Protection option is supported, the implementation shall not permit a write to succeed where PROT_WRITE has not been set or shall not permit any access where PROT_NONE alone has been set. The implementation shall support at least the following values of prot: PROT_NONE, PROT_READ, PROT_WRITE, and the bitwise-inclusive OR of PROT_READ and PROT_WRITE. If the Memory Protection option is not supported, the result of any access that conflicts with the specified protection is undefined. The file descriptor fildes shall have been opened with read permission, regardless of the protection options specified. If PROT_WRITE is specified, the application shall ensure that it has opened the file descriptor fildes with write permission unless MAP_PRIVATE is specified in the flags parameter as described below.
(emphasis added)
Also, on x86, it is not possible to have write-only memory, and this is a limitation of the page table entries. Pages may be marked read-only or read-write and independently may be executable or non-executable, but cannot be write-only. Moreover the man-page for mprotect() says:
Whether PROT_EXEC has any effect different from PROT_READ is architecture- and kernel version-dependent. On some hardware architectures (e.g., i386), PROT_WRITE implies PROT_READ.
This being the case, you've opened a file descriptor without read access, but mmap() would be bypassing the O_WRONLY by giving you PROT_READ rights. Instead, it refuses outright with EACCES.

I don't think the x86 hardware supports write-only pages, so write access implies read. But it seems to be a more general requirement than just x86 - mm/mmap.c contains this code in do_mmap_pgoff():
case MAP_SHARED:
    if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
        return -EACCES;
    ....
    /* fall through */
case MAP_PRIVATE:
    if (!(file->f_mode & FMODE_READ))
        return -EACCES;
I think that explains what you're seeing.
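The behaviour can be reproduced from user space in a few lines. A minimal sketch, assuming Linux and a writable /tmp; try_write_mapping is a hypothetical helper that repeats the calls from the strace log with a chosen open mode:

```c
#define _DEFAULT_SOURCE   /* exposes mkstemp on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try a shared PROT_WRITE mapping of a temporary file opened with
 * `open_flags` (O_WRONLY or O_RDWR). Returns 0 on success, otherwise
 * the errno value of the failing call. */
int try_write_mapping(int open_flags)
{
    char path[] = "/tmp/wronlyXXXXXX";
    int tmp = mkstemp(path);         /* only used to create the file */
    if (tmp < 0)
        return errno;
    close(tmp);

    int fd = open(path, open_flags);
    int open_err = errno;
    unlink(path);
    if (fd < 0)
        return open_err;

    int err = 0;
    if (ftruncate(fd, 11) < 0) {
        err = errno;
    } else {
        void *p = mmap(NULL, 11, PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err = errno;             /* EACCES when fd lacks read access */
        else
            munmap(p, 11);
    }
    close(fd);
    return err;
}
```

Calling it with O_WRONLY returns EACCES, while O_RDWR returns 0, matching the FMODE_READ check in the MAP_PRIVATE branch that MAP_SHARED falls through to.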

Writing out DMA buffers into memory mapped file

I need to write incoming DMA buffers as fast as possible, on embedded Linux (2.6.37), to a hard-disk partition accessed as the raw device /dev/sda1. The buffers are aligned as required and all have the same length of 512KB. The process may run for a very long time and write as much as, for example, 256GB of data.
I need to use the memory-mapped file technique (O_DIRECT is not applicable), but I can't work out exactly how to do this.
So, in pseudo code, "normal" writing is:
fd=open("/dev/sda1",O_WRONLY);
while(1) {
    p = GetVirtualPointerToNewBuffer();
    if (InputStopped())
        break;
    write(fd, p, BLOCK512KB);
}
Now, I would be very thankful for a similar pseudo/real code example of how to use the memory-mapping technique for this writing.
UPDATE2:
Thanks to kestasx, the latest working test code looks like the following:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define KB 1024
#define TSIZE (64*KB)
void* TBuf;
int main(int argc, char **argv) {
    int fdi=open("input.dat", O_RDONLY);
    //int fdo=open("/dev/sdb2", O_RDWR);
    int fdo=open("output.dat", O_RDWR);
    int i;
    off_t offs=0;
    void* addr;
    i = posix_memalign(&TBuf, TSIZE, TSIZE);
    if ((fdo < 0) || (fdi < 0)) {   /* open() returns -1 on failure */
        printf("Error in files\n");
        return -1; }
    while(1) {
        /* without MAP_FIXED, TBuf is only a placement hint */
        addr = mmap((void*)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
        if (addr == MAP_FAILED) {
            printf("Error MMAP=%d, %s\n", errno, strerror(errno));
            return -1; }
        i = read(fdi, TBuf, TSIZE);
        if (i != TSIZE) {
            printf("End of data\n");
            return 0; }
        i = munmap(addr, TSIZE);
        offs += TSIZE;
        sleep(1);
    }
}
UPDATE3:
1. To imitate the DMA precisely, I need to move the read() call before mmap(), because when the DMA finishes it gives me the address where it has put the data. So, in pseudo code:
while(1) {
    read(fdi, TBuf, TSIZE);
    addr = mmap((void*)TBuf, TSIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, fdo, offs);
    munmap(addr, TSIZE);
    offs += TSIZE; }
This variant fails after(!) the first iteration: read() reports "Bad address" on TBuf.
Without understanding exactly what I was doing, I replaced munmap() with msync(). This worked perfectly.
So, the question here: why did unmapping addr affect TBuf?
2. With the previous example working, I moved to the real system with the DMA. The loop is the same, except that instead of read() there is a call that waits for a DMA buffer to be ready and provides its virtual address.
There are no errors and the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync anything.
To test this, I removed the read() call from the working example, and indeed nothing was recorded either.
So, the question here: how can I tell Linux that the mapped region contains new data and that it should flush it?
Thanks a lot!!!

If I understand correctly, it makes sense if you mmap() a file (I'm not sure whether you can mmap() a raw partition/block device) and the data is written via DMA directly into this memory region.
For this to work you need to be able to control p (where the new buffer is placed) or the address where the file is mapped. If you can't, you'll have to copy the memory contents (and will lose some of the benefits of mmap).
So the pseudo code would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is only the first step, and I'm not completely sure it will work with DMA (I have no such experience).
I assume it is a 32-bit system, but even then a 1GB mapped region may be too big (if your RAM is smaller, you'll be swapping).
If this setup works, the next step would be a loop that maps regions of the file at different offsets and unmaps the ones already filled.
Most likely you'll need to align addr to a 4KB boundary.
When you unmap a region, its data will be synced to disk. So you'll need some testing to select an appropriate mapped region size (while the next region is being filled by DMA, there must be enough time to unmap/write the previous one).
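The map/fill/unmap loop could look roughly like the sketch below. It is an illustration under assumptions, not tested with DMA: memcpy() stands in for the DMA fill, the names write_blocks_via_mmap and demo_write_blocks are hypothetical, the descriptor must be opened O_RDWR, and msync(MS_SYNC) is called explicitly rather than relying on munmap() alone:

```c
#define _DEFAULT_SOURCE   /* exposes mkstemp and pread on glibc */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy `nblocks` blocks of `block` bytes from `src` into file `fd`, one
 * mapped window at a time. `block` must be a multiple of the page size.
 * Returns 0 on success, -1 on error. */
int write_blocks_via_mmap(int fd, const char *src, size_t block, size_t nblocks)
{
    /* a mapping cannot extend a file, so size it up front */
    if (ftruncate(fd, (off_t)(block * nblocks)) < 0)
        return -1;

    for (size_t i = 0; i < nblocks; i++) {
        off_t offs = (off_t)(i * block);
        char *win = mmap(NULL, block, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, offs);
        if (win == MAP_FAILED)
            return -1;
        memcpy(win, src + i * block, block);   /* stand-in for the DMA fill */
        int rc = msync(win, block, MS_SYNC);   /* request write-back now */
        munmap(win, block);
        if (rc < 0)
            return -1;
    }
    return 0;
}

/* Hypothetical demo: write four page-sized blocks to a temporary file
 * and read them back with pread(). Returns 0 if the contents match. */
int demo_write_blocks(void)
{
    size_t block = (size_t)sysconf(_SC_PAGESIZE), n = 4;
    char path[] = "/tmp/blocksXXXXXX";
    int fd = mkstemp(path);                    /* opened O_RDWR */
    if (fd < 0)
        return -1;
    unlink(path);

    char *src = malloc(block * n), *back = malloc(block * n);
    int ok = src && back;
    if (ok) {
        for (size_t i = 0; i < block * n; i++)
            src[i] = (char)('A' + i / block);
        ok = write_blocks_via_mmap(fd, src, block, n) == 0
          && pread(fd, back, block * n, 0) == (ssize_t)(block * n)
          && memcmp(src, back, block * n) == 0;
    }
    free(src);
    free(back);
    close(fd);
    return ok ? 0 : -1;
}
```

The window size is the tuning knob mentioned above: larger windows mean fewer mmap()/munmap() calls but more dirty data to flush per iteration.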
UPDATE:
I simply don't know what exactly happens when you fill an mmap'ed region via DMA (I'm not sure how exactly dirty pages are detected: what is done by hardware, and what must be done by software).
UPDATE2: To the best of my knowledge:
DMA works the following way:
the CPU arranges the DMA transfer (the address in RAM where the transferred data is to be written);
the DMA controller does the actual work, while the CPU can do its own work in parallel;
once the DMA transfer is complete, the DMA controller signals the CPU via an IRQ line (interrupt), so the CPU can handle the result.
This seems simple as long as virtual memory is not involved: DMA should work independently of the running process (the actual VM table in use by the CPU). Yet there should be some mechanism to invalidate the CPU cache for physical RAM pages modified by DMA (I don't know whether the CPU needs to do something or whether it is done automatically by hardware).
mmap() works the following way:
after a successful call to mmap(), the file on disk is attached to a range of process memory (most likely some data structure in the OS kernel is filled in to hold this info);
I/O (reading or writing) on the mmapped range triggers a page fault, which the kernel handles by loading the appropriate blocks from the attached file;
writes to the mmapped range are handled by hardware (I don't know exactly how: maybe writes to previously unmodified pages trigger some fault, which the kernel handles by marking these pages dirty; or maybe this marking is done entirely in hardware and the info is available to the kernel when it needs to flush modified pages to disk);
modified (dirty) pages are written to disk by the OS (as it sees fit), or this can be forced via msync() or munmap().
In theory it should be possible to do DMA transfers to an mmapped range, but you need to find out exactly how pages are marked dirty (whether you need to do something to inform the kernel which pages need to be written to disk).
UPDATE3:
Even if pages modified by DMA are not marked dirty, you should be able to trigger the marking by rewriting (reading and then writing the same value) at least one value in each page (most likely every 4KB) transferred. Just make sure this rewrite is not removed (optimized out) by the compiler.
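A minimal sketch of such a rewrite, using a hypothetical helper touch_pages. The volatile qualifier is what keeps the compiler from dropping the no-op store; whether this is sufficient after a real DMA transfer is an assumption that has to be tested on the target system:

```c
#include <stddef.h>
#include <unistd.h>

/* Read one byte from every page in [base, base+len) and write the same
 * value back, so the CPU performs a real store to each page. The data
 * is left unchanged. Returns the number of pages touched. */
size_t touch_pages(void *base, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = base;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page) {
        p[off] = p[off];     /* volatile read followed by volatile write */
        touched++;
    }
    return touched;
}
```

It would be called on the mapped region after the transfer completes and before msync() or munmap().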
UPDATE4:
It seems a file opened O_WRONLY can't be mmap'ed (see the question comments; my experiments confirm this too). It is a logical consequence of the mmap() workings described above. The same is confirmed here (with reference to the POSIX standard's requirement that the file be readable regardless of the mapping protection flags).
Unless there is some way around this, it means that by using mmap() you can't avoid reading the results file (an unnecessary step in your case).
Regarding DMA transfers to a mapped range, I think it will be a requirement to ensure the mapped pages are preallocated before the DMA starts (so there is real memory assigned to both the DMA and the mapped region). On Linux there is the MAP_POPULATE mmap flag, but from the manual it seems to work with MAP_PRIVATE mappings only (changes are not written to disk), so most likely it is unsuitable. You'll likely have to trigger page faults manually by accessing each mapped page. This should trigger reading of the results file.
If you still wish to use mmap and DMA together but avoid reading the results file, you'll have to modify kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling the faulted pages instead of reading them from disk).

extendable mremap on anonymously mmaped memory

I believed mremap would behave like realloc until I debugged code like the following lines of C.
#define PAGESIZE 0x1000
void *p = mmap(0, PAGESIZE, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
void *p2 = mremap(p, PAGESIZE, PAGESIZE * 2, MREMAP_MAYMOVE);
// then any future access to the 2nd page in p2 would generate a nice SIGBUS
After reading several old threads on some mailing lists, I learned that mmap was originally designed for 'pure' file mapping, and the people who designed mremap don't seem to have cared about code like the above.
I know a shared memory object would work for this. But shm_open/shm_unlink require filenames, and I don't want to deal with strings in this project. Also, I am not sure, but shared memory objects might reduce the performance of my application somewhat.
I'm just wondering whether it's possible to make mremap work fine (meaning no SIGBUS when expanding) with anonymously mapped memory, or whether there is some similar method that is simple and fast.
Thanks to all in advance :-)
