Is a write-only mapping of an O_WRONLY-opened file supposed to work? - linux

Is mmap() supposed to be able to create a write-only mapping of a O_WRONLY opened file?
I am asking because the following fails on a Linux 4.0.4 x86-64 system (strace log):
mkdir("test", 0700) = 0
open("test/foo", O_WRONLY|O_CREAT, 0666) = 3
ftruncate(3, 11) = 0
mmap(NULL, 11, PROT_WRITE, MAP_SHARED, 3, 0) = -1 EACCES (Permission denied)
The errno is EACCES.
Replacing the open-flag O_WRONLY with O_RDWR yields a successful mapping.
The Linux mmap man page documents the errno as:
EACCES A file descriptor refers to a non-regular file. Or a file mapping
was requested, but fd is not open for reading. Or
MAP_SHARED was requested and PROT_WRITE is set, but fd is not
open in read/write (O_RDWR) mode. Or PROT_WRITE is set, but the
file is append-only.
Thus, that behaviour is documented with the second sentence.
But what is the reason behind it?
Is it allowed by POSIX?
Is it a kernel or a library limitation? (On a quick glance, I couldn't find anything obvious in Linux/mm/mmap.c)

The IEEE Std 1003.1, 2004 Edition (POSIX.1 2004) appears to forbid it.
An implementation may permit accesses other than those specified by prot; however, if the Memory Protection option is supported, the implementation shall not permit a write to succeed where PROT_WRITE has not been set or shall not permit any access where PROT_NONE alone has been set. The implementation shall support at least the following values of prot: PROT_NONE, PROT_READ, PROT_WRITE, and the bitwise-inclusive OR of PROT_READ and PROT_WRITE. If the Memory Protection option is not supported, the result of any access that conflicts with the specified protection is undefined. The file descriptor fildes shall have been opened with read permission, regardless of the protection options specified. If PROT_WRITE is specified, the application shall ensure that it has opened the file descriptor fildes with write permission unless MAP_PRIVATE is specified in the flags parameter as described below.
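(emphasis added: the key sentence is the one requiring that fildes be opened with read permission, regardless of the protection options specified)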
Also, on x86, it is not possible to have write-only memory, and this is a limitation of the page table entries. Pages may be marked read-only or read-write and independently may be executable or non-executable, but cannot be write-only. Moreover the man-page for mprotect() says:
Whether PROT_EXEC has any effect different from PROT_READ is architecture- and kernel version-dependent. On some hardware architectures (e.g., i386), PROT_WRITE implies PROT_READ.
This being the case, you've opened a file descriptor without read access, but mmap() would be bypassing the O_WRONLY by giving you PROT_READ rights. Instead, it refuses outright with EACCES.

I don't think the x86 hardware supports write-only pages, so write access implies read. But it seems to be a more general requirement than just x86 - mm/mmap.c contains this code in do_mmap_pgoff():
case MAP_SHARED:
    if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
        return -EACCES;
    ....
    /* fall through */
case MAP_PRIVATE:
    if (!(file->f_mode & FMODE_READ))
        return -EACCES;
I think that explains what you're seeing.
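As a rough way to see this from user space, here is a minimal sketch (not from the answers above). try_shared_write_mapping is a hypothetical helper, and the scratch file under /tmp is an assumption. It reopens a scratch file with a given open flag and attempts the same MAP_SHARED, PROT_WRITE mapping as in the question:

```c
#define _DEFAULT_SOURCE   /* mkstemp(), ftruncate() under strict -std= modes */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file, reopen it with open_flags
 * (O_WRONLY or O_RDWR), and attempt a shared PROT_WRITE mapping.
 * Returns 0 if mmap() succeeds, otherwise the errno of the failing call. */
int try_shared_write_mapping(int open_flags)
{
    char path[] = "/tmp/mmap-mode-XXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd < 0)
        return errno;
    close(fd);

    fd = open(path, open_flags);             /* reopen with the mode under test */
    int err = (fd < 0) ? errno : 0;
    unlink(path);                            /* an open descriptor keeps the file alive */
    if (fd < 0)
        return err;

    if (ftruncate(fd, 11) != 0) {            /* same length as in the question */
        err = errno;
        close(fd);
        return err;
    }

    void *p = mmap(NULL, 11, PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        err = errno;
    else
        munmap(p, 11);
    close(fd);
    return err;
}
```

On Linux I would expect try_shared_write_mapping(O_WRONLY) to return EACCES and try_shared_write_mapping(O_RDWR) to return 0, matching the strace log in the question.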

Related

How to increase the size of memory region allocated with mmap()

I'm allocating memory using mmap Linux syscall.
mov x5, 0 ; offset is zero
mov x4, -1 ; no file descriptor
mov x3, 0x22 ; MAP_PRIVATE + MAP_ANONYMOUS
mov x2, 3 ; PROT_READ + PROT_WRITE
mov x1, 4096 ; initial region size in bytes
mov x0, 0 ; let Linux choose region address
mov x8, 222 ; mmap
svc 0
Is it possible to increase the size of allocated memory region preserving its start address and contents? How to do it properly?
On Linux, use the mremap(2) Linux-specific system call without MREMAP_MAYMOVE to extend the existing mapping, without considering the option of remapping those physical pages to a different virtual address where there's enough room for the larger mapping.
It will return an error if some other mapping already exists for the pages you want to grow into. (Unlike mmap(MAP_FIXED) which will silently replace those mappings.)
If you're writing in asm, portability to non-Linux is barely relevant; other OSes will have different call numbers and maybe ABIs, so just look up __NR_mremap in asm/unistd.h, and get the flags patterns from sys/mman.h.
With just portable POSIX calls, use mmap() with a non-NULL hint address right after your existing mapping, but without MAP_FIXED; it will pick that address if the pages are free (and as @datenwolf says, merge with the earlier mapping into one long extent). Otherwise it will pick somewhere else. (Then you have to munmap the mapping that ended up not where you wanted it.)
There is a Linux-specific mmap option: MAP_FIXED_NOREPLACE will return an error instead of mapping at an address different from the hint. Kernels older than 4.17 don't know about that flag and will typically treat it as if you used no other flags besides MAP_ANONYMOUS, so you should check the return value against the hint.
Do not use MAP_FIXED_NOREPLACE | MAP_FIXED; that would act as MAP_FIXED on old kernels, and maybe also on new kernels that do know about MAP_FIXED_NOREPLACE.
Assuming you know the start of the mapping you want to extend, and the desired new total size, mremap is a better choice than mmap(MAP_FIXED_NOREPLACE). It's been supported since at least Linux 2.4, i.e. for decades, and it keeps the existing mapping flags and permissions automatically (e.g. MAP_PRIVATE, PROT_READ|PROT_WRITE).
If you only knew the end address of the existing mapping, mmap(MAP_FIXED_NOREPLACE) might be a good choice.
If there's free virtual address space behind your original region, just create an additional mmap-ed region right behind the original one, using the MAP_FIXED | MAP_FIXED_NOREPLACE flags and identical permissions. If the page size of both regions is identical, they'll be coalesced into a single mapping.

Do I need to close a file before calling syncfs()

On my embedded system, I want to make sure that the data is safely written when I close a file - if the system reports that the data was saved, the user should be able to remove power immediately.
I know that the proper way to do this is fsync(), fclose(), and fsync() on the directory (cfr. this blog entry). However, it's a bit tricky to get a file descriptor for the directory in my case (I'd have to go through /proc/self/fd to find back the filename and derive the directory from there). It would be much simpler for me to just do syncfs() on the entire filesystem - I know that this is the only file that is open on the filesystem anyway.
Now my question is:
Is it sufficient to do syncfs()?
Do I need to fclose() the FILE * first (for the directory entry to be up-to-date)? Or is fflush() sufficient?
If it needs to be closed, is it useful to dup() the file descriptor before closing so I can use it directly for syncfs()?
First of all, don't mix standard library <stdio.h> calls (like fprintf(3) or fopen(3)) with system calls (like open(2), close(2), or sync(2)). The former are library routines that keep temporary data in in-process buffers the system is unaware of; the latter are operating system interfaces that make the system responsible for the data from then on. You can tell them apart easily: the former operate on FILE * streams, while the latter operate on int file descriptors.
So if you use a system call to ensure your data is properly synced to disk, it is absolutely necessary to first fflush(3) your process's buffered data before you make the sync(2) or fsync(2) call.
No sync(2) is guaranteed to happen at fclose(3) or even at close(2) time, or in the atexit() callbacks your process runs before exit().
The operating system delays buffer writes for performance reasons, and close(2) is not an event that triggers a flush. Consider that many processes can be reading and writing the same file at the same time; having each close(2) trigger a filesystem flush would be painful. The operating system flushes at regular intervals, on umount(2) system calls, on system shutdown, and on explicit calls to sync(2) and fsync(2).
If you need to keep the FILE *fd stream open, just call fflush(fd) on it to ensure that the operating system has received all the data you wrote with fwrite(3) or fprintf(3).
So finally, if you are using <stdio.h> functions, first call fflush() on every FILE * stream you have written to, or call fflush(NULL); to have stdio flush all streams in one call. Then make the sync(2) or fsync(2) call to ensure all your data is physically on disk. No need to close anything.
FILE *fd;
...
fflush(fd);
fsync(fileno(fd));
/* here you know that up to the last write(2) or fwrite(3)...
* data is synced to disk */
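For completeness, here is a fuller sketch of the sequence the question mentions (fsync the file, then fsync the directory), assuming the directory path is known up front. save_durably and its parameters are hypothetical, not part of any library:

```c
#define _DEFAULT_SOURCE   /* openat(), fdopen() under strict -std= modes */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write text to dir/name and return 0 only after both
 * the file data and the directory entry have been fsync()ed; -1 on error. */
int save_durably(const char *dir, const char *name, const char *text)
{
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    FILE *fp = (fd >= 0) ? fdopen(fd, "w") : NULL;
    if (fp == NULL) {
        if (fd >= 0)
            close(fd);
        close(dfd);
        return -1;
    }

    int ok = fputs(text, fp) != EOF   /* data goes into the stdio buffer    */
          && fflush(fp) == 0          /* stdio buffer -> kernel buffers     */
          && fsync(fd) == 0;          /* kernel buffers -> storage (file)   */
    if (fclose(fp) != 0)              /* fclose() also closes fd            */
        ok = 0;
    if (fsync(dfd) != 0)              /* make the directory entry durable   */
        ok = 0;
    close(dfd);
    return ok ? 0 : -1;
}
```

Keeping the directory descriptor from the start sidesteps the problem of recovering the directory from an already open FILE *.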
By the way, your approach of going to /dev/fd/<number> to get the descriptor (that you had previously) is faulty for two reasons:
Once you close your descriptor, /dev/fd/<number> no longer refers to the file you want. Normally, it doesn't even exist. Just try this:
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
int main()
{
    int fd;
    char fn[] = "/dev/fd/1";
    close(1); /* close standard output */
    fd = open(fn, O_RDONLY); /* try to reopen from /dev/fd */
    if (fd < 0) {
        fprintf(stderr,
            "%s: %s(errno=%d)\n",
            fn,
            strerror(errno),
            errno);
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
} /* main */
You cannot get the directory an open file belongs to from the file descriptor alone. A file with multiple hard links can be referenced from thousands of directories. There's nothing in the inode (or in the open file structure) that lets you recover the path used to open that file. A common way to use temporary files is to create them and immediately unlink(2) them, so nobody can open them again. As long as you keep the file open you have access to it, but no path points to it anymore.
Enable the "sync" flag for your filesystem (in /etc/fstab); the default is "async" (disabled). When this flag is enabled, all changes to that filesystem are immediately flushed to disk. This makes your entire filesystem slow, but depending on your embedded system's requirements, it can be a good option to consider.

mmap File-backed mapping vs Anonymous mapping in Linux

What is the main difference between a file-backed mapping and an anonymous mapping?
How do we choose between a file-backed mapping and an anonymous mapping when we need IPC between processes?
What are the advantages and disadvantages of each?
The mmap() system call lets you choose either a file-backed mapping or an anonymous mapping.
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
File-backed mapping: on Linux there is a file, /dev/zero, which is an infinite source of zero bytes. You open this file and pass its descriptor to mmap() with the appropriate flag: MAP_SHARED if you want the memory to be shared with other processes, or MAP_PRIVATE if you don't want sharing.
Example:
...
if ((fd = open("/dev/zero", O_RDWR)) < 0)
    printf("open error");
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED)
{
    printf("Error in memory mapping");
    exit(1);
}
close(fd); // close the file because memory is mapped
// create child process
...
Quoting the mmap() man page:
The contents of a file mapping (as opposed to an anonymous mapping;
see MAP_ANONYMOUS below), are initialized using length bytes starting
at offset offset in the file (or other object) referred to by the file
descriptor fd. offset must be a multiple of the page size as returned
by sysconf(_SC_PAGE_SIZE).
In our case, it has been initialized with zeros.
Quoting from the book Advanced Programming in the UNIX Environment, Second Edition, by W. Richard Stevens and Stephen A. Rago:
The advantage of using /dev/zero in the manner that we've shown is
that an actual file need not exist before we call mmap to create the
mapped region. Mapping /dev/zero automatically creates a mapped region
of the specified size. The disadvantage of this technique is that it
works only between related processes. With related processes, however,
it is probably simpler and more efficient to use threads (Chapters 11
and 12). Note that regardless of which technique is used, we still
need to synchronize access to the shared data.
After the call to mmap() succeeds, we create a child process, which will be able to see writes to the mapped region (because we specified the MAP_SHARED flag).
Anonymous mapping: the same thing can be done with an anonymous mapping. For an anonymous mapping, we pass the MAP_ANON flag to mmap() and specify the file descriptor as -1.
The resulting region is anonymous (since it's not associated with a pathname through a file descriptor) and can be shared with descendant processes.
The advantage is that we don't need any file to map the memory, and the overhead of opening and closing the file is avoided.
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0)) == MAP_FAILED)
printf("Error in anonymous memory mapping");
So both techniques shown here, mapping /dev/zero and anonymous mapping, work only between related processes.
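To make the parent/child sharing concrete, here is a small sketch under the same assumptions as the snippets above. value_seen_by_parent is a hypothetical name, and waiting for the child to exit stands in for real synchronization:

```c
#define _DEFAULT_SOURCE   /* MAP_ANON under strict -std= modes */
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the child stores value into an anonymous mapping
 * created with the given sharing flag (MAP_SHARED or MAP_PRIVATE); the
 * parent returns what it reads after the child has exited, or -1 on error. */
int value_seen_by_parent(int sharing, int value)
{
    int *area = mmap(NULL, sizeof *area, PROT_READ | PROT_WRITE,
                     MAP_ANON | sharing, -1, 0);
    if (area == MAP_FAILED)
        return -1;
    *area = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(area, sizeof *area);
        return -1;
    }
    if (pid == 0) {            /* child: write and exit without cleanup */
        *area = value;
        _exit(0);
    }
    waitpid(pid, NULL, 0);     /* child exit is the only synchronization here */
    int seen = *area;
    munmap(area, sizeof *area);
    return seen;
}
```

With MAP_SHARED the parent should read back the child's value; with MAP_PRIVATE the child writes to its own copy, so the parent should still read 0.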
If you need this between unrelated processes, then you probably need to create named shared memory by using shm_open() and then you can pass the returned file descriptor to mmap().

using O_TMPFILE to clean up huge pages... or other methods?

My program is using huge pages. To do so, it opens files as follows:
oflags = O_RDWR | O_CREAT | O_TRUNC;
fd = open(filename, oflags, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
Where filename is in the hugetlb file system.
That works. My program can then mmap() the created file descriptors. But if my program gets killed, the files remain, and in the huge page filesystem leftover files mean blocked memory, as shown by the following command (876 != 1024):
cat /proc/meminfo | grep Huge
AnonHugePages: 741376 kB
HugePages_Total: 1024
HugePages_Free: 876
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
So, as my program is not sharing the file to anyone else, it made sense to me to create temporary files using the O_TMPFILE flag.
So I tried:
oflags = O_RDWR | O_TMPFILE;
fd = open(pathname, oflags, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
where pathname is the hugetlbfs mount point.
That fails (for a reason I cannot explain) with the following error:
open failed for /dev/hugepages: Operation not supported
Why? and more to the point: How can I guarantee that all huge pages my program is using get freed?
Yes: I could catch some signals (e.g. SIGTERM), but not all (SIGKILL).
Yes: I could unlink() the file as soon as possible using the first approach, but what if SIGKILL is received between open() and unlink()?
Kernels like guarantees. So do I. What is the proper method to guarantee 100% cleanup regardless of when or how my program terminates?
Looks like O_TMPFILE is not implemented yet for hugetlbfs; indeed, this option requires support of the underlying file-system:
O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and shmem filesystems. XFS support was added in Linux 3.15.
This is confirmed by looking at the kernel source code where there's no inode_ops->tmpfile() implementation in hugetlbfs.
I believe that the right answer here is to work on this implementation...
I noticed your comment about the unlink() option, however, maybe the following approach is not that risky:
open the file (by name) with O_TRUNC (so you can assume its size is 0)
unlink it
mmap() it with your target size
If your program gets killed in the middle, worst case is to leave an empty file.
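A minimal sketch of those three steps in C. map_unlinked_file is a hypothetical helper. I added an ftruncate() call so the same code also works on an ordinary filesystem; I have not verified it on hugetlbfs, where size would have to be a multiple of the huge page size:

```c
#define _DEFAULT_SOURCE   /* ftruncate() under strict -std= modes */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) path, unlink it right away,
 * and return a shared mapping of size bytes backed by the now nameless
 * file. Returns MAP_FAILED on error. */
void *map_unlinked_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return MAP_FAILED;
    unlink(path);     /* killed before this line: only an empty file is left */

    void *p = MAP_FAILED;
    if (ftruncate(fd, (off_t)size) == 0)
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);        /* the mapping keeps the file alive until munmap()/exit */
    return p;
}
```

Once unlink() has run, the file's pages are released as soon as the last mapping and descriptor go away, however the process terminates.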

extendable mremap on anonymously mmaped memory

I believed mremap would behave like realloc until I debugged code like the following lines of C.
#define PAGESIZE 0x1000
void *p = mmap(0, PAGESIZE, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
void *p2 = mremap(p, PAGESIZE, PAGESIZE * 2, MREMAP_MAYMOVE);
// then any future access to the 2nd page in p2 would generate a nice SIGBUS
After reading several old mailing-list threads, I learned that mmap was originally designed for 'pure' file mapping, and the people who designed mremap don't seem to have cared about code like the above.
I know a shared memory object would do for this. But shm_open/shm_unlink require filenames and I don't want to deal with strings in this very project. And, I am not sure, maybe shared memory objects would more or less reduce performance of my application.
I'm just wondering whether it's possible to make mremap work fine (meaning no SIGBUS when expanding) with anonymously mapped memory, or whether there is a similar method that is simple and fast?
thx to all in advance :-)
