Using O_TMPFILE to clean up huge pages... or other methods?

My program uses huge pages. To do so, it opens files as follows:
oflags = O_RDWR | O_CREAT | O_TRUNC;
fd = open(filename, oflags, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
where filename is on a hugetlbfs file system.
That works. My program can then mmap() the created file descriptors. But if my program gets killed, the files remain, and in the huge page filesystem leftover files mean blocked memory, as shown by the following command (876 != 1024):
cat /proc/meminfo | grep Huge
AnonHugePages: 741376 kB
HugePages_Total: 1024
HugePages_Free: 876
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
So, as my program does not share these files with anyone else, it made sense to me to create temporary files using the O_TMPFILE flag.
So I tried:
oflags = O_RDWR | O_TMPFILE;
fd = open(pathname, oflags, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
where pathname is the hugetlbfs mount point.
That fails (for a reason I cannot explain) with the following error:
open failed for /dev/hugepages: Operation not supported
Why? And, more to the point: how can I guarantee that all huge pages my program uses get freed?
Yes: I could catch some signals (e.g. SIGTERM), but not all (SIGKILL).
Yes: I could unlink() the file as soon as possible with the first approach, but what if SIGKILL arrives between open() and unlink()?
Kernels like guarantees. So do I. What is the proper method to guarantee 100% cleanup, regardless of when or how my program terminates?

Looks like O_TMPFILE is not implemented yet for hugetlbfs; indeed, this option requires support from the underlying file system:
O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and shmem filesystems. XFS support was added in Linux 3.15.
This is confirmed by looking at the kernel source code where there's no inode_ops->tmpfile() implementation in hugetlbfs.
I believe that the right answer here is to work on this implementation...
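For comparison, here is roughly what the O_TMPFILE pattern looks like on a filesystem that does implement it, such as tmpfs (a sketch; it assumes /tmp is tmpfs, which depends on the distribution):
#define _GNU_SOURCE           /* O_TMPFILE needs this feature test macro */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Open an unnamed inode in the given directory; on hugetlbfs this
       currently fails with "Operation not supported". */
    int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
    if (fd < 0) { perror("open O_TMPFILE"); return 1; }

    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* The anonymous inode disappears automatically when fd is closed,
       no matter how the process dies. */
    munmap(p, 4096);
    close(fd);
    return 0;
}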
I noticed your comment about the unlink() option; still, maybe the following approach is not that risky (a sketch follows below):
1. open the file (by name) with O_TRUNC (so you can assume its size is 0)
2. unlink() it
3. mmap() it with your target size
If your program gets killed in the middle, the worst case is an empty file left behind.
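A minimal sketch of that sequence, assuming a hugetlbfs mount at /dev/hugepages and a 2 MB huge page size (both hypothetical; adjust for your system):
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_FILE "/dev/hugepages/myapp_scratch" /* hypothetical path */
#define MAP_SIZE (2UL * 1024 * 1024)                 /* one 2 MB huge page */

int main(void)
{
    /* Create (or truncate) the backing file in hugetlbfs. */
    int fd = open(HUGEPAGE_FILE, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Unlink immediately: the inode lives on until fd is closed, so a
       later SIGKILL cannot leave a named file behind. */
    if (unlink(HUGEPAGE_FILE) < 0) perror("unlink");

    /* Size the file, then map it. */
    if (ftruncate(fd, MAP_SIZE) < 0) { perror("ftruncate"); return 1; }
    void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... use the huge-page memory ... */

    munmap(p, MAP_SIZE);
    close(fd);
    return 0;
}
The window between open() and unlink() is now a few microseconds; a SIGKILL landing exactly there leaves, at worst, an empty file.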

Related

Is write-only mapping of an O_WRONLY-opened file supposed to work?

Is mmap() supposed to be able to create a write-only mapping of an O_WRONLY-opened file?
I am asking because the following fails on a Linux 4.0.4 x86-64 system (strace log):
mkdir("test", 0700) = 0
open("test/foo", O_WRONLY|O_CREAT, 0666) = 3
ftruncate(3, 11) = 0
mmap(NULL, 11, PROT_WRITE, MAP_SHARED, 3, 0) = -1 EACCES (Permission denied)
The errno equals EACCES.
Replacing the open-flag O_WRONLY with O_RDWR yields a successful mapping.
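A minimal C program equivalent to the strace log above:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    mkdir("test", 0700);
    int fd = open("test/foo", O_WRONLY | O_CREAT, 0666);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 11) < 0) { perror("ftruncate"); return 1; }

    /* Fails with EACCES; with O_RDWR in the open() above, it succeeds. */
    void *p = mmap(NULL, 11, PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        perror("mmap");
    else
        munmap(p, 11);

    close(fd);
    return 0;
}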
The Linux mmap man page documents the errno as:
EACCES A file descriptor refers to a non-regular file. Or a file mapping was requested, but fd is not open for reading. Or MAP_SHARED was requested and PROT_WRITE is set, but fd is not open in read/write (O_RDWR) mode. Or PROT_WRITE is set, but the file is append-only.
Thus, that behaviour is documented with the second sentence.
But what is the reason behind it?
Is it allowed by POSIX?
Is it a kernel or a library limitation? (On a quick glance, I couldn't find anything obvious in Linux/mm/mmap.c)
The IEEE Std 1003.1, 2004 Edition (POSIX.1 2004) appears to forbid it.
An implementation may permit accesses other than those specified by prot; however, if the Memory Protection option is supported, the implementation shall not permit a write to succeed where PROT_WRITE has not been set or shall not permit any access where PROT_NONE alone has been set. The implementation shall support at least the following values of prot: PROT_NONE, PROT_READ, PROT_WRITE, and the bitwise-inclusive OR of PROT_READ and PROT_WRITE. If the Memory Protection option is not supported, the result of any access that conflicts with the specified protection is undefined. The file descriptor fildes shall have been opened with read permission, regardless of the protection options specified. If PROT_WRITE is specified, the application shall ensure that it has opened the file descriptor fildes with write permission unless MAP_PRIVATE is specified in the flags parameter as described below.
(emphasis added; the key requirement is that fildes shall have been opened with read permission, regardless of the protection options specified)
Also, on x86 it is not possible to have write-only memory; this is a limitation of the page-table entries. Pages may be marked read-only or read-write, and independently executable or non-executable, but they cannot be write-only. Moreover, the man page for mprotect() says:
Whether PROT_EXEC has any effect different from PROT_READ is architecture- and kernel version-dependent. On some hardware architectures (e.g., i386), PROT_WRITE implies PROT_READ.
This being the case, you've opened a file descriptor without read access, but mmap() would be bypassing the O_WRONLY restriction by effectively giving you PROT_READ rights as well. Instead, it refuses outright with EACCES.
I don't think the x86 hardware supports write-only pages, so write access implies read. But it seems to be a more general requirement than just x86 - mm/mmap.c contains this code in do_mmap_pgoff():
case MAP_SHARED:
        if ((prot & PROT_WRITE) && !(file->f_mode & FMODE_WRITE))
                return -EACCES;
        ...
        /* fall through */
case MAP_PRIVATE:
        if (!(file->f_mode & FMODE_READ))
                return -EACCES;
I think that explains what you're seeing.

Linux read operations requesting duplicate bytes?

This is a bit of a strange question. I'm writing a FUSE module using the go-fuse library, and at the moment I have a "fake" file with a size of 6000 bytes which outputs some generated data for all read requests. My read function looks like this:
func (f *MyFile) Read(buf []byte, off int64) (fuse.ReadResult, fuse.Status) {
    log.Printf("Reading into buffer of len %d from %d\n", len(buf), off)
    FillBuffer(buf, uint64(off), f.secret)
    return fuse.ReadResultData(buf), fuse.OK
}
As you can see I'm outputting a log on every read containing the range of the read request. The weird thing is that when I cat the file I get the following:
2013/09/13 21:09:03 Reading into buffer of len 4096 from 0
2013/09/13 21:09:03 Reading into buffer of len 8192 from 0
So cat is apparently reading the first 4096 bytes of data, discarding them, then reading 8192 bytes, which encompasses all the data and so succeeds. I've tried other programs too, including hexdump and vim, and they all do the same thing. Interestingly, if I do head -c 3000 dir/fakefile it still performs both reads, even though the latter is completely unnecessary. Does anyone have any insight into why this might be happening?
I suggest you strace your cat process to see for yourself. On my system, cat reads in 64K chunks and does a final read() to make sure it has read the whole file. That last read() is necessary to distinguish between reading a "chunk-sized" file and a bigger one, i.e. it makes sure there is nothing left to read, as the file size could have changed between the fstat() and read() system calls.
Is your "fake file" size being returned correctly to FUSE by stat/fstat() system calls?
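To illustrate, a sketch of the read loop cat effectively performs; a read() returning 0 is the only reliable end-of-file indication:
#include <unistd.h>

/* Copy in_fd to out_fd the way cat does: keep issuing read()s until one
   returns 0. That final zero-byte read is what distinguishes "file ends
   exactly at a chunk boundary" from "there is more data to come". */
int copy_fd(int in_fd, int out_fd)
{
    char buf[65536];               /* 64K chunks, as seen in strace */
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        if (n < 0)
            return -1;             /* read error */
        if (n == 0)
            return 0;              /* EOF: only now is the end certain */
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;             /* short write or write error */
    }
}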

How can I show the size of files in /proc? It should not be size zero

From the following output we know that there are two characters in the file /proc/sys/net/ipv4/ip_forward, but why does ls show this file as size zero?
I know this is not a file on disk but a file in memory; is there any command with which I can see the real size of files in /proc?
root@OpenWrt:/proc/sys/net/ipv4# cat ip_forward | wc -c
2
root@OpenWrt:/proc/sys/net/ipv4# ls -l ip_forward
-rw-r--r-- 1 root root 0 Sep 3 00:20 ip_forward
root@OpenWrt:/proc/sys/net/ipv4# pwd
/proc/sys/net/ipv4
Those are not really files on disk (as you mention) but they are also not files in memory - the names in /proc correspond to calls into the running kernel in the operating system, and the contents are generated on the fly.
The system doesn't know how large the files would be without generating them, but if you read the "file" twice there's no guarantee you get the same data because the system may have changed.
You might be looking for the program
sysctl -a
instead.
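For the entry in question, for instance (the value shown is illustrative):
sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0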
Things in /proc are not really files. In most cases, they're not even files in memory. When you access these files, the proc filesystem driver performs a system call that gets data appropriate for the file, and then formats it for output. This is usually dynamic data that's constructed on the fly. An example of this is /proc/net/arp, which contains the current ARP cache.
Getting the size of these things can only be done by formatting the entire output, so it's not done just when listing the file. If you want the sizes, use wc -c as you did.
The /proc/ filesystem is an "illusion" maintained by the kernel, which does not bother giving the size of (most of) its pseudo-files (since computing that "real" size would usually involve having built the entire textual pseudo-file's content), and expects most [pseudo-] textual files from /proc/ to be read in sequence from first to last byte (i.e. till EOF), in reasonably sized (e.g. 1K) blocks. See proc(5) man page for details.
So there is no way to get the true size of a file like /proc/self/maps or /proc/sys/net/ipv4/ip_forward in a single syscall like stat(2): it would report a size of 0, as shown by the stat(1) or ls(1) commands. A typical way of reading these textual files might be:
#include <stdio.h>
#include <string.h>

FILE *f = fopen("/proc/self/maps", "r");
// or some other textual /proc file,
// e.g. /proc/sys/net/ipv4/ip_forward
if (f)
{
    do {
        // you could use getline(3) instead of fgets
        char line[256];
        memset(line, 0, sizeof(line));
        if (NULL == fgets(line, sizeof(line), f))
            break;
        // do something with line, for example:
        fputs(line, stdout);
    } while (!feof(f));
    fclose(f);
}
Of course, some files (e.g. /proc/self/cmdline) are documented as possibly containing NUL bytes; you'll need fread(3) for those.
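For instance, a sketch for /proc/self/cmdline, whose arguments are separated by NUL bytes:
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/cmdline", "r");
    if (!f)
        return 1;

    char buf[4096];
    /* fread, not fgets: the arguments are separated by NUL bytes. */
    size_t n = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    /* Ensure termination in the (unlikely) truncated case; the file
       normally ends with a NUL anyway. */
    if (n > 0)
        buf[n - 1] = '\0';

    /* Print each NUL-terminated argument on its own line. */
    for (size_t i = 0; i < n; ) {
        printf("%s\n", buf + i);
        while (i < n && buf[i] != '\0')
            i++;
        i++; /* skip the separating NUL */
    }
    return 0;
}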
It's not really a file in the memory, it's an interface between the user and the kernel.

POSIX shared memory and semaphores permissions set incorrectly by open calls

I'm trying to create a shared memory segment that will be used by several processes, which will not necessarily be started by the same user, so I create the segment with the following line:
fd = shm_open(SHARE_MEM_NAME, O_RDWR | O_CREAT, 0606);
However, when I check the permissions of the file created in /dev/shm, they are:
-rw----r-- 1 lmccauslin lmccauslin 1784 2012-08-10 17:11 /dev/shm/CubeConfigShare
not -rw----rw- as I'd expected.
the permissions for /dev/shm are lrwxrwxrwx.
The exact same thing happens with the semaphore created similarly.
kernel version: 3.0.0-23-generic
glibc version: EGLIBC 2.13-20ubuntu5.1
Anyone got any ideas?
It's probably umask.
Citing the manpage of shm_open:
O_CREAT Create the shared memory object if it does not exist. The user and group ownership of the object are taken from the corresponding effective IDs of the calling process, and the object's permission bits are set according to the low-order 9 bits of mode, except that those bits set in the process file mode creation mask (see umask(2)) are cleared for the new object. A set of macro constants which can be used to define mode is listed in open(2). (Symbolic definitions of these constants can be obtained by including <sys/stat.h>.)
So, in order to allow creating files which are world-writable, you'd need to set an umask permitting it, for example:
umask(0);
Set like this, umask won't affect the permissions of created files anymore. However, note that if you then create another file without specifying permissions explicitly, it will be world-writable as well.
Thus, you may want to clear the umask only temporarily, and then restore it:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
...
void yourfunc()
{
    // store the old mask and clear it
    mode_t old_umask = umask(0);
    int fd = shm_open(SHARE_MEM_NAME, O_RDWR | O_CREAT, 0606);
    // restore the old mask
    umask(old_umask);
}
From what I understand, POSIX semaphores are created in shared memory, so you need to make sure that users have rw permissions to /dev/shm for the semaphores to be created.
Then, as a handy option, put the following line in your /etc/fstab file to mount tmpfs:
none /dev/shm tmpfs defaults 0 0
So that when your machine is rebooted, the permissions are set right from the start.
Of the three machines I compared, two had /dev/shm set to drwxrwxrwx, and the machine that would not allow creation of semaphores had it set to drwxr-xr-x.
You can also look at the shared memory limits, as reported by ipcs -lm:
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

use aio_write() but still see data going through cache?

I'm playing with this code on Linux 2.6.16.46:
io.aio_fildes = open(name, O_CREAT | O_TRUNC | O_WRONLY | O_SYNC, 00300);
io.aio_buf = buffer;
io.aio_nbytes = size;
io.aio_sigevent = sigev;
io.aio_lio_opcode = LIO_WRITE;
aio_write( &io );
This should use the memory pointed to by buffer for the I/O operation. Still, I see the number of dirty pages go up, as if I were writing through the page cache. Why is that?
On the build machine, there's no O_DIRECT support in open(). But since I'm not using write(), should that still be a problem?
I'm pretty sure there's direct IO support on the target.
Figured this out: direct/buffered I/O is one thing, sync/async is another. To have asynchronous writes bypass the page cache, one still needs to pass O_DIRECT to the open() call, even if write() is not used.
There will likely be compiler errors at first (O_DIRECT requires the _GNU_SOURCE feature test macro); read man 2 open carefully.
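A sketch of the corrected call, assuming a 4096-byte logical block size (alignment requirements for O_DIRECT vary by filesystem and kernel; the buffer, length, and file offset must all be suitably aligned). Link with -lrt on older glibc:
#define _GNU_SOURCE            /* required for O_DIRECT */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 4096              /* assumed block-aligned length */

int main(void)
{
    /* O_DIRECT demands an aligned buffer; posix_memalign provides one. */
    void *buffer;
    if (posix_memalign(&buffer, 4096, SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buffer, 'x', SIZE);

    struct aiocb io;
    memset(&io, 0, sizeof(io));
    io.aio_fildes = open("testfile", O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0600);
    if (io.aio_fildes < 0) { perror("open"); return 1; }
    io.aio_buf = buffer;
    io.aio_nbytes = SIZE;

    if (aio_write(&io) < 0) { perror("aio_write"); return 1; }

    /* Wait for completion and collect the result. */
    const struct aiocb *list[1] = { &io };
    aio_suspend(list, 1, NULL);
    printf("aio_write wrote %zd bytes\n", aio_return(&io));
    return 0;
}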
