Mapped region still valid when size of underlying file changes? - linux

Let's have a look at a few scenarios:
a)
file size : |---------|
mapped region: |---------|
region access: |XXXXXXXXX|
--> file grows
file size : |----------------|
mapped region: |---------|
region access: |XXXXXXXXX|
Is it still well-defined/portable/safe to access (read/write) the complete mapped region?
(assuming that the file grew via normal writes to it or via truncating it; file was just mapped once, no extra remapping after the file size changed)
b)
file size : |---------|
mapped region: |-----------------------|
access : |XXXXXXXXX|
--> file grows
file size : |-----------------------|
mapped region: |-----------------------|
access : |XXXXXXXXXXXXXXXXXXXXXXX|
Say, before the file was extended the program just accessed the intersection of the file size and the mapped region. This should be fine.
After the file grew - such that the sizes of the mapping and file match - is it now well defined to access every part of the region/file?
If this is the case, creating larger mapped regions in the beginning could be an optimization to avoid some mremap (or munmap/mmap) calls - at least for some use-cases.
c)
file size : |---------|
mapped region: |---------|
access : |XXXXXXXXX|
--> file is truncated
file size : |---|
mapped region: |---------|
access : |XXX|
As long as the program accesses the still overlapping part of the region - is that well-defined behaviour?

Generally, if the size of a mapped file changes, it is safe to access the pages not affected by the size change, and it is unspecified what happens with the pages that are affected.
From mmap(2):
1.
If the size of the mapped file changes after the call to mmap() as a result of some other operation on the mapped file, the effect of references to portions of the mapped region that correspond to added or removed portions of the file is unspecified.
2.
The mmap() function can be used to map a region of memory that is larger than the current size of the object. Memory access within the mapping but beyond the current end of the underlying objects may result in SIGBUS signals being sent to the process.
So, in all three cases, it seems that it's safe to access all originally mapped pages below current file size and it's not safe to access pages above current file size.
I'm not totally sure about case (b), but it seems to be a valid case, and it works at least on Linux.
Note that SIGBUS generation is not guaranteed and it's not specified what really happens when you access data above mapping size or above file size. Implementation may allow you to read valid data from the end of the page, for example.
There are also two optimization tricks related to the question (to avoid mremap()):
You can reserve a large region of address space up front and then subsequently mmap() the pages you need into it using the MAP_FIXED flag.
You can use the Linux-specific remap_file_pages(2) call (note that it has been deprecated since Linux 3.16).
Test program
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

int main(int argc, char** argv) {
    int mmap_size = atoi(argv[1]);
    int file_size1 = atoi(argv[2]);
    int file_size2 = atoi(argv[3]);

    char* data = malloc(max(file_size1, file_size2));
    memset(data, 7, max(file_size1, file_size2));

    int fd = open("foo", O_RDWR | O_TRUNC | O_CREAT, 0777);
    write(fd, data, file_size1);

    char* addr = mmap(NULL, mmap_size, PROT_READ, MAP_SHARED, fd, 0);

    if (file_size2 <= file_size1)
        ftruncate(fd, file_size2);
    else
        write(fd, data, file_size2 - file_size1);

    printf("%d\n", addr[0]);
    printf("%d\n", addr[file_size1 - 1]);
    printf("%d\n", addr[file_size2 - 1]);
    return 0;
}
Example output on Linux:
$ ./a.out 4096 4096 $(( 4096 * 2))
7
7
0
$ ./a.out $(( 4096 * 2 )) 4096 $(( 4096 * 2))
7
7
7
$ ./a.out $(( 4096 * 2 )) $(( 4096 * 2)) 4096
7
Bus error

1) The file grows after having been mapped
If you know that the file will grow, you would map it with the matching flag so that the mapping grows with the file.
If you do not know that the file will grow, you would also not access behind the mapped area.
2) The file shrinks after having been mapped
If you know that the file has shrunk, there is no reason to access the area past the new end of the file, as you would get a signal.
If you do not know that the file has shrunk, see my other answer to this question.

Related

Accessing large memory (32 GB) using /dev/zero

I want to use /dev/zero for storing lots of temporary data (32 GB or around that). I am doing this:
fd = open("/dev/zero", O_RDWR );
// <Exit on error>
vbase = (uint64_t*) mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
// <Exit on error>
ftruncate(fd, (off_t) MEMSIZE);
I am changing MEMSIZE from 1GB to 32 GB (performing a memtest) to see if I can really access all that range. I am running out of memory at 1 GB.
Is there something I am missing ? Am I mmap'ing correctly ?
Or am I running into some system limit ? How can I check if this is happening ?
P.S.: I run many programs that generate many gigs of data within a single file, so I don't know if there is an artificial upper limit, just that I seem to be running into something.
I have to admit I'm confused about what you're actually trying to do. Anyway, a couple of reasons why what you do might not work:
From the mmap(2) manpage: "MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero. The fd and offset arguments are ignored;"
From the null(4) manpage: "Data written to a null or zero special file is discarded."
So anyway, before MAP_ANONYMOUS, mmap'ing /dev/zero was sometimes used to get anonymous (i.e. not backed by any file) memory. No need to do both. In either case, actually writing to all that memory implies that you need some kind of backing store for it, either physical memory or swap space. If you cannot guarantee that, maybe it's better to mmap() a real file on a filesystem with enough space?
Looking into the Linux kernel mmap implementation, the call chain is:
vm_mmap -> vm_mmap_pgoff -> do_mmap_pgoff -> mmap_region -> file->f_op->mmap(file, vma)
In the function do_mmap_pgoff, it checks max_map_count:
if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;
root> sysctl -a | grep map_count
vm.max_map_count = 65530
In the function mmap_region, it checks the process virtual address limit (whether it is unlimited).
int may_expand_vm(struct mm_struct *mm, unsigned long npages)
{
        unsigned long cur = mm->total_vm; /* pages */
        unsigned long lim;

        lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;

        if (cur + npages > lim)
                return 0;
        return 1;
}
root> ulimit -a | grep virtual
virtual memory (kbytes, -v) unlimited
In the Linux kernel, the init task has this rlimit setting by default:
[RLIMIT_AS] = { RLIM_INFINITY, RLIM_INFINITY }, \
#ifndef RLIM_INFINITY
# define RLIM_INFINITY (~0UL)
#endif
To verify this, use the test_mem program:
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
The relevant code:
struct rlimit rl;
int ret;

ret = getrlimit(RLIMIT_AS, &rl);
if (ret == 0) {
        printf("RLIMIT_AS limit got successfully:\n");
        printf("soft_limit=%lld, hard_limit=%lld\n",
               (long long)rl.rlim_cur, (long long)rl.rlim_max);
}
That is, "unlimited" is reported as 0xFFFFFFFF for a 32-bit app on a 64-bit OS. If you change the shell's virtual address limit, the new value is reflected correctly:
root> ulimit -v 1024000
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=1048576000, hard_limit=1048576000
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
In mmap_region, there is also an accounting check:
accountable_mapping -> security_vm_enough_memory_mm -> cap_vm_enough_memory -> __vm_enough_memory -> overcommit/swap/admin and user reserve handling
Check these three limits (map count, address-space rlimit, and memory accounting) to see which one you are hitting.

Linux file descriptor table and vmalloc

I see that the Linux kernel uses vmalloc to allocate memory for fdtable when it's bigger than a certain threshold. I would like to know when this happens and have some more clear information.
static void *alloc_fdmem(size_t size)
{
        /*
         * Very large allocations can stress page reclaim, so fall back to
         * vmalloc() if the allocation size will be considered "large" by the VM.
         */
        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
                if (data != NULL)
                        return data;
        }
        return vmalloc(size);
}
alloc_fdmem is called from alloc_fdtable, which in turn is called from expand_fdtable.
I wrote this code to print the size.
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define PAGE_SIZE 4096

int main() {
    printf("\t%d\n", PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
}
Output
./printo
32768
So, how many files does it take for the kernel to switch to using vmalloc to allocate fdtable?
So PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER is 32768
This is called like:
data = alloc_fdmem(nr * sizeof(struct file *));
i.e. it's used to store struct file pointers.
If your pointers are 4 bytes, it happens when you have 32768/4 = 8192 open files; if your pointers are 8 bytes, it happens at 32768/8 = 4096 open files.

Can I use a single call to munmap() to unmap two memory mappings located in a contiguous range of virtual memory?

For example, if I do this:
char *pMap1;            /* First mapping */
char *pReq;             /* Address we would like the second mapping at */
char *pMap2;            /* Second mapping */

/* Map the first 1 MB of the file. */
pMap1 = (char *)mmap(0, 1024*1024, PROT_READ, MAP_SHARED, fd, 0);
assert( pMap1!=MAP_FAILED );

/* Now map the second MB of the file. Request that the OS positions the
** second mapping immediately after the first in virtual memory. */
pReq = pMap1 + 1024*1024;
pMap2 = (char *)mmap(pReq, 1024*1024, PROT_READ, MAP_SHARED, fd, 1024*1024);
assert( pMap2!=MAP_FAILED );

/* Unmap the mappings created above. */
if( pMap2==pReq ){
    munmap(pMap1, 2 * 1024*1024);
}else{
    munmap(pMap1, 1 * 1024*1024);
    munmap(pMap2, 1 * 1024*1024);
}
And the OS does place my second mapping in the requested position (so
that the (pMap2==pReq) condition is true), will the single call to
munmap() serve to release all resources allocated?
The Linux man page says "The munmap() system call deletes the mappings
for the specified address range...", which suggests that this will work,
but I'm still a little nervous about it. Even if it does work on Linux,
does anybody know how portable this is likely to be?
Thanks very much in advance.
The glibc manual says it's fine:
munmap removes any memory maps from (addr) to (addr + length). length
should be the length of the mapping.
It is safe to unmap multiple
mappings in one command, or include unmapped space in the range. It is
also possible to unmap only part of an existing mapping. However, only
entire pages can be removed.
The POSIX spec says:
int munmap(void *addr, size_t len);
The munmap() function shall remove any mappings for those entire pages containing any part of the address space of the process starting at addr and continuing for len bytes.
To me, the wording clearly reads as if removing multiple mappings with a single munmap() is fine, and should be supported by any compliant implementation.
I think it should work. The POSIX spec says that it removes
any mappings for those entire pages containing any part of the address space of the process starting at addr and continuing for len bytes
The only unspecified behavior it describes is:
The behavior of this function is unspecified if the mapping was not established by a call to mmap().

Is there any API for determining the physical address from virtual address in Linux?

Is there any API for determining the physical address from virtual address in Linux operating system?
Kernel and user space work with virtual addresses (also called linear addresses) that are mapped to physical addresses by the memory management hardware. This mapping is defined by page tables, set up by the operating system.
DMA devices use bus addresses. On an i386 PC, bus addresses are the same as physical addresses, but other architectures may have special address mapping hardware to convert bus addresses to physical addresses.
In Linux, you can use these functions from asm/io.h:
virt_to_phys(virt_addr);
phys_to_virt(phys_addr);
virt_to_bus(virt_addr);
bus_to_virt(bus_addr);
All this is about accessing ordinary memory. There is also "shared memory" on the PCI or ISA bus. It can be mapped inside a 32-bit address space using ioremap(), and then used via the readb(), writeb() (etc.) functions.
Life is complicated by the fact that there are various caches around, so that different ways to access the same physical address need not give the same result.
Also, the real physical address behind a virtual address can change. Even more than that: there may be no physical address associated with a virtual address at all until you access that memory.
As for the user-land API, there are none that I am aware of.
/proc/<pid>/pagemap userland minimal runnable example
virt_to_phys_user.c
#define _XOPEN_SOURCE 700
#include <fcntl.h>  /* open */
#include <stdint.h> /* uint64_t */
#include <stdio.h>  /* printf, snprintf */
#include <stdlib.h> /* strtoull */
#include <unistd.h> /* pread, sysconf, close */

typedef struct {
    uint64_t pfn : 55;
    unsigned int soft_dirty : 1;
    unsigned int file_page : 1;
    unsigned int swapped : 1;
    unsigned int present : 1;
} PagemapEntry;

/* Parse the pagemap entry for the given virtual address.
 *
 * @param[out] entry      the parsed entry
 * @param[in]  pagemap_fd file descriptor to an open /proc/pid/pagemap file
 * @param[in]  vaddr      virtual address to get entry for
 * @return 0 for success, 1 for failure
 */
int pagemap_get_entry(PagemapEntry *entry, int pagemap_fd, uintptr_t vaddr)
{
    size_t nread;
    ssize_t ret;
    uint64_t data;
    uintptr_t vpn;

    vpn = vaddr / sysconf(_SC_PAGE_SIZE);
    nread = 0;
    while (nread < sizeof(data)) {
        ret = pread(pagemap_fd, ((uint8_t*)&data) + nread, sizeof(data) - nread,
                    vpn * sizeof(data) + nread);
        if (ret <= 0) {
            return 1;
        }
        nread += ret;
    }
    entry->pfn = data & (((uint64_t)1 << 55) - 1);
    entry->soft_dirty = (data >> 55) & 1;
    entry->file_page = (data >> 61) & 1;
    entry->swapped = (data >> 62) & 1;
    entry->present = (data >> 63) & 1;
    return 0;
}

/* Convert the given virtual address to physical using /proc/PID/pagemap.
 *
 * @param[out] paddr physical address
 * @param[in]  pid   process to convert for
 * @param[in]  vaddr virtual address to get entry for
 * @return 0 for success, 1 for failure
 */
int virt_to_phys_user(uintptr_t *paddr, pid_t pid, uintptr_t vaddr)
{
    char pagemap_file[BUFSIZ];
    int pagemap_fd;

    snprintf(pagemap_file, sizeof(pagemap_file), "/proc/%ju/pagemap", (uintmax_t)pid);
    pagemap_fd = open(pagemap_file, O_RDONLY);
    if (pagemap_fd < 0) {
        return 1;
    }
    PagemapEntry entry;
    if (pagemap_get_entry(&entry, pagemap_fd, vaddr)) {
        close(pagemap_fd);
        return 1;
    }
    close(pagemap_fd);
    *paddr = (entry.pfn * sysconf(_SC_PAGE_SIZE)) + (vaddr % sysconf(_SC_PAGE_SIZE));
    return 0;
}

int main(int argc, char **argv)
{
    pid_t pid;
    uintptr_t vaddr, paddr = 0;

    if (argc < 3) {
        printf("Usage: %s pid vaddr\n", argv[0]);
        return EXIT_FAILURE;
    }
    pid = strtoull(argv[1], NULL, 0);
    vaddr = strtoull(argv[2], NULL, 0);
    if (virt_to_phys_user(&paddr, pid, vaddr)) {
        fprintf(stderr, "error: virt_to_phys_user\n");
        return EXIT_FAILURE;
    }
    printf("0x%jx\n", (uintmax_t)paddr);
    return EXIT_SUCCESS;
}
GitHub upstream.
Usage:
sudo ./virt_to_phys_user.out <pid> <virtual-address>
sudo is required to read /proc/<pid>/pagemap even if you have file permissions as explained at: https://unix.stackexchange.com/questions/345915/how-to-change-permission-of-proc-self-pagemap-file/383838#383838
As mentioned at: https://stackoverflow.com/a/46247716/895245 Linux allocates page tables lazily, so make sure that you read and write a byte to that address from the test program before using virt_to_phys_user.
How to test it out
Test program:
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum { I0 = 0x12345678 };
static volatile uint32_t i = I0;

int main(void) {
    printf("vaddr %p\n", (void *)&i);
    printf("pid %ju\n", (uintmax_t)getpid());
    while (i == I0) {
        sleep(1);
    }
    printf("i %jx\n", (uintmax_t)i);
    return EXIT_SUCCESS;
}
The test program outputs the address of a variable it owns, and its PID, e.g.:
vaddr 0x600800
pid 110
and then you can convert the virtual address with:
sudo ./virt_to_phys_user.out 110 0x600800
Finally, the conversion can be tested by using /dev/mem to observe / modify the memory, but you can't do this on Ubuntu 17.04 without recompiling the kernel as it requires: CONFIG_STRICT_DEVMEM=n, see also: How to access physical addresses from user space in Linux? Buildroot is an easy way to overcome that however.
Alternatively, you can use a Virtual machine like QEMU monitor's xp command: How to decode /proc/pid/pagemap entries in Linux?
See this to dump all pages: How to decode /proc/pid/pagemap entries in Linux?
Userland subset of this question: How to find the physical address of a variable from user-space in Linux?
Dump all process pages with /proc/<pid>/maps
/proc/<pid>/maps lists all the addresses ranges of the process, so we can walk that to translate all pages: /proc/[pid]/pagemaps and /proc/[pid]/maps | linux
Kerneland virt_to_phys() only works for kmalloc() addresses
From a kernel module, virt_to_phys() has been mentioned.
However, it is important to highlight that it has this limitation: it fails, for example, for module variables. From the arch/x86/include/asm/io.h documentation:
The returned physical address is the physical (CPU) mapping for
the memory address given. It is only valid to use this function on
addresses directly mapped or allocated via kmalloc().
Here is a kernel module that illustrates that, together with a userland test.
So this is not a very general possibility. See: How to get the physical address from the logical one in a Linux kernel module? for kernel module methods exclusively.
As answered before, normal programs should not need to worry about physical addresses as they run in a virtual address space with all its conveniences. Furthermore, not every virtual address has a physical address; it may belong to a mapped file or a swapped-out page. However, sometimes it may be interesting to see this mapping, even in userland.
For this purpose, the Linux kernel exposes its mapping to userland through a set of files in the /proc. The documentation can be found here. Short summary:
/proc/$pid/maps provides a list of mappings of virtual addresses together with additional information, such as the corresponding file for mapped files.
/proc/$pid/pagemap provides more information about each mapped page, including the physical address if it exists.
This website provides a C program that dumps the mappings of all running processes using this interface and an explanation of what it does.
The suggested C program above usually works, but it can return misleading results in (at least) two ways:
The page is not present (but the virtual address is mapped to a page!). This happens due to lazy mapping by the OS: it maps addresses only when they are actually accessed.
The returned PFN points to some possibly temporary physical page which could be changed soon after due to copy-on-write. For example: for memory mapped files, the PFN can point to the read-only copy. For anonymous mappings, the PFN of all pages in the mapping could be one specific read-only page full of 0s (from which all anonymous pages spawn when written to).
Bottom line is, to ensure a more reliable result: for read-only mappings, read from every page at least once before querying its PFN. For write-enabled pages, write into every page at least once before querying its PFN.
Of course, theoretically, even after obtaining a "stable" PFN, the mappings could always change arbitrarily at runtime (for example when moving pages into and out of swap) and should not be relied upon.
I wonder why there is no user-land API.
Because user land memory's physical address is unknown.
Linux uses demand paging for user land memory. Your user land object will not have physical memory until it is accessed. When the system is short of memory, your user land object may be swapped out and lose physical memory unless the page is locked for the process. When you access the object again, it is swapped in and given physical memory, but it is likely different physical memory from the previous one. You may take a snapshot of page mapping, but it is not guaranteed to be the same in the next moment.
So, looking for the physical address of a user land object is usually meaningless.

Detect block size for quota in Linux

The limit placed on disk quota in Linux is counted in blocks. However, I found no reliable way to determine the block size. Tutorials I found refer to block size as 512 bytes, and sometimes as 1024 bytes.
I got confused reading a post on LinuxForum.org for what a block size really means. So I tried to find that meaning in the context of quota.
I found a "Determine the block size on hard disk filesystem for disk quota" tip on NixCraft, that suggested the command:
dumpe2fs /dev/sdXN | grep -i 'Block size'
or
blockdev --getbsz /dev/sdXN
But on my system those commands returned 4096, and when I checked the real quota block size on the same system, I got a block size of 1024 bytes.
Is there a scriptable way to determine the quota block size on a device, short of creating a file of known size and checking its quota usage?
The filesystem blocksize and the quota blocksize are potentially different. The quota blocksize is given by the BLOCK_SIZE macro defined in <sys/mount.h> (/usr/include/sys/mount.h):
#ifndef _SYS_MOUNT_H
#define _SYS_MOUNT_H 1
#include <features.h>
#include <sys/ioctl.h>
#define BLOCK_SIZE 1024
#define BLOCK_SIZE_BITS 10
...
The filesystem blocksize for a given filesystem is returned by the statvfs call:
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
    char *fn;
    struct statvfs vfs;

    if (argc > 1)
        fn = argv[1];
    else
        fn = argv[0];

    if (statvfs(fn, &vfs))
    {
        perror("statvfs");
        return 1;
    }
    printf("(%s) bsize: %lu\n", fn, vfs.f_bsize);
    return 0;
}
The <sys/quota.h> header includes a convenience macro to convert filesystem blocks to disk quota blocks:
/*
* Convert count of filesystem blocks to diskquota blocks, meant
* for filesystems where i_blksize != BLOCK_SIZE
*/
#define fs_to_dq_blocks(num, blksize) (((num) * (blksize)) / BLOCK_SIZE)
