Why can't I mmap(MAP_FIXED) the highest virtual page in a 32-bit Linux process on a 64-bit kernel? - linux

While attempting to test "Is it allowed to access memory that spans the zero boundary in x86?" in user-space on Linux, I wrote a 32-bit test program that tries to map the low and high pages of the 32-bit virtual address space.
After echo 0 | sudo tee /proc/sys/vm/mmap_min_addr, I can map the zero page, but I don't know why I can't map -4096, i.e. (void*)0xfffff000, the highest page. Why does mmap2((void*)-4096) return -ENOMEM?
strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffe08827c10 /* 65 vars */) = 0
strace: [ Process PID=1407 runs in 32 bit mode. ]
....
mmap2(0xfffff000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0
Also, what check is rejecting it in linux/mm/mmap.c, and why is it designed that way? Is this part of making sure that creating a pointer to one-past-the-end of an object can't wrap around and break pointer comparisons? (ISO C and C++ allow creating a pointer to one past the end of an object, but otherwise not outside of objects.)
I'm running under a 64-bit kernel (4.12.8-2-ARCH on Arch Linux), so 32-bit user space has the entire 4GiB available. (Unlike 64-bit code on a 64-bit kernel, or with a 32-bit kernel where the 2:2 or 3:1 user/kernel split would make the high page a kernel address.)
I haven't tried from a minimal static executable (no CRT startup or libc, just asm) because I don't think that would make a difference. None of the CRT startup system calls look suspicious.
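For reference, such a minimal test might look like the sketch below (not part of the original experiment): it calls mmap2 through the syscall() wrapper so the glibc mmap wrapper is out of the picture. SYS_mmap2 exists only in the 32-bit ABI and takes its file offset in 4096-byte units. Build with gcc -m32 -static.
#define _GNU_SOURCE          /* for syscall() */
#include <sys/syscall.h>     /* SYS_mmap2 (32-bit x86 only) */
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* Note: a successful return of 0xfffff000 (== -4096) falls just
     * outside the -4095..-1 errno range, so the syscall() wrapper
     * would not misreport it as an error. */
    long ret = syscall(SYS_mmap2, (void *)0xfffff000, 4096,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
    if (ret == -1)
        perror("mmap2");     /* expected: ENOMEM, same as via glibc */
    else
        printf("mapped at %#lx\n", (unsigned long)ret);
    return 0;
}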
While stopped at a breakpoint, I checked /proc/PID/maps. The top page isn't already in use. The stack includes the 2nd highest page, but the top page is unmapped.
00000000-00001000 rw-p 00000000 00:00 0 ### the mmap(0) result
08048000-08049000 r-xp 00000000 00:15 3120510 /home/peter/src/SO/a.out
08049000-0804a000 r--p 00000000 00:15 3120510 /home/peter/src/SO/a.out
0804a000-0804b000 rw-p 00001000 00:15 3120510 /home/peter/src/SO/a.out
f7d81000-f7f3a000 r-xp 00000000 00:15 1511498 /usr/lib32/libc-2.25.so
f7f3a000-f7f3c000 r--p 001b8000 00:15 1511498 /usr/lib32/libc-2.25.so
f7f3c000-f7f3d000 rw-p 001ba000 00:15 1511498 /usr/lib32/libc-2.25.so
f7f3d000-f7f40000 rw-p 00000000 00:00 0
f7f7c000-f7f7e000 rw-p 00000000 00:00 0
f7f7e000-f7f81000 r--p 00000000 00:00 0 [vvar]
f7f81000-f7f83000 r-xp 00000000 00:00 0 [vdso]
f7f83000-f7fa6000 r-xp 00000000 00:15 1511499 /usr/lib32/ld-2.25.so
f7fa6000-f7fa7000 r--p 00022000 00:15 1511499 /usr/lib32/ld-2.25.so
f7fa7000-f7fa8000 rw-p 00023000 00:15 1511499 /usr/lib32/ld-2.25.so
fffdd000-ffffe000 rw-p 00000000 00:00 0 [stack]
Are there VMA regions that don't show up in maps but still convince the kernel to reject the address? I looked at the occurrences of ENOMEM in linux/mm/mmap.c, but it's a lot of code to read, so maybe I missed something: something that reserves a range of high addresses, or rejects the page because it's next to the stack?
Making the system calls in the other order doesn't help (but PAGE_ALIGN and similar macros are written carefully to avoid wrapping around before masking, so that wasn't likely anyway.)
Full source, compiled with gcc -O3 -fno-pie -no-pie -m32 address-wrap.c:
#include <sys/mman.h>
//void *mmap(void *addr, size_t len, int prot, int flags,
// int fildes, off_t off);
int main(void) {
    volatile unsigned *high =
        mmap((void*)-4096L, 4096, PROT_READ | PROT_WRITE,
             MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    volatile unsigned *zeropage =
        mmap((void*)0, 4096, PROT_READ | PROT_WRITE,
             MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    return (high == MAP_FAILED) ? 2 : *high;
}
(I left out the part that tried to deref (int*)-2 because it just segfaults when mmap fails.)

The mmap function eventually calls either do_mmap or do_brk_flags, which do the actual work of satisfying the memory allocation request. These functions in turn call get_unmapped_area, and it is in that function that the checks are made to ensure that memory cannot be allocated beyond the user address space limit, which is defined by TASK_SIZE. I quote from the comment in the code:
* There are a few constraints that determine this:
*
* On Intel CPUs, if a SYSCALL instruction is at the highest canonical
* address, then that syscall will enter the kernel with a
* non-canonical return address, and SYSRET will explode dangerously.
* We avoid this particular problem by preventing anything executable
* from being mapped at the maximum canonical address.
*
* On AMD CPUs in the Ryzen family, there's a nasty bug in which the
* CPUs malfunction if they execute code from the highest canonical page.
* They'll speculate right off the end of the canonical space, and
* bad things happen. This is worked around in the same way as the
* Intel problem.
#define TASK_SIZE_MAX    ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \
                          0xc0000000 : 0xFFFFe000)
#define TASK_SIZE        (test_thread_flag(TIF_ADDR32) ? \
                          IA32_PAGE_OFFSET : TASK_SIZE_MAX)
On processors with 48-bit virtual address spaces, __VIRTUAL_MASK_SHIFT is 47.
Note that which TASK_SIZE applies depends on whether the current process is 32-bit on a 32-bit kernel, 32-bit on a 64-bit kernel, or 64-bit on a 64-bit kernel. For 32-bit processes, two pages are reserved: one for the vsyscall page and the other used as a guard page. Essentially, the vsyscall page cannot be unmapped, so the highest address of the user address space is effectively 0xFFFFe000. For 64-bit processes, one guard page is reserved. These pages are only reserved on 64-bit Intel and AMD processors, because only on those processors is the SYSCALL mechanism used.
Here is the check that is performed in get_unmapped_area:
if (addr > TASK_SIZE - len)
        return -ENOMEM;
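Plugging in the numbers from the question: with len = 4096 and the 32-bit-on-64-bit TASK_SIZE of 0xFFFFe000, TASK_SIZE - len = 0xFFFFd000, so both 0xFFFFf000 and 0xFFFFe000 draw ENOMEM while 0xFFFFd000 passes the check. A quick probe confirms this (a sketch, not from the original answer; MAP_FIXED_NOREPLACE needs Linux 4.17+ and is silently ignored by older kernels, which then treat the address as a mere hint):
#define _GNU_SOURCE          /* for MAP_FIXED_NOREPLACE (Linux 4.17+) */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

static void probe(unsigned long addr) {
    void *p = mmap((void *)addr, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_FIXED_NOREPLACE | MAP_ANONYMOUS,
                   -1, 0);
    printf("%#010lx: %s\n", addr,
           p == MAP_FAILED ? strerror(errno) : "mapped");
    if (p != MAP_FAILED)
        munmap(p, 4096);
}

int main(void) {
    probe(0xfffff000UL);   /* addr > TASK_SIZE - len -> ENOMEM */
    probe(0xffffe000UL);   /* addr > TASK_SIZE - len -> ENOMEM */
    probe(0xffffd000UL);   /* passes the check: maps, or EEXIST if the
                              stack already reaches this page */
    return 0;
}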

Related

why and how do the addresses returned by 'cat /proc/self/maps' change when it's executed again

I'm trying to understand Linux memory management. Why and how do the addresses returned by cat /proc/self/maps change when it's executed again?
user@notebook:/$ cat /proc/self/maps | grep heap
55dc94a7c000-55dc94a9d000 rw-p 00000000 00:00 0 [heap]
user@notebook:/$ cat /proc/self/maps | grep heap
562609879000-56260989a000 rw-p 00000000 00:00 0 [heap]
This is due to Address Space Layout Randomization, aka ASLR. Linux loads code and libraries at different locations each time to make it harder to exploit buffer overflows and similar vulnerabilities.
You can disable it with
echo 0 > /proc/sys/kernel/randomize_va_space
which will make the addresses the same each time. You can then re-enable it with:
echo 2 > /proc/sys/kernel/randomize_va_space
and the addresses will be randomized again.
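If you only want a single program to run un-randomized, you can also clear ASLR per process with personality(2) (this is what setarch -R does). A sketch:
#include <sys/personality.h>   /* personality(), ADDR_NO_RANDOMIZE */
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    /* personality(0xffffffff) queries the current persona without
     * changing it; add ADDR_NO_RANDOMIZE and exec the target. */
    if (personality(personality(0xffffffff) | ADDR_NO_RANDOMIZE) == -1) {
        perror("personality");
        return 1;
    }
    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}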

Is there a way to have a.out loaded in linux x86_64 "high memory"?

If I look at the memory mapping for a 64-bit process on Linux (x86_64) I see that the a.out is mapped in fairly low memory:
$ cat /proc/1160/maps
00400000-004dd000 r-xp 00000000 103:03 536876177 /usr/bin/bash
006dc000-006dd000 r--p 000dc000 103:03 536876177 /usr/bin/bash
006dd000-006e6000 rw-p 000dd000 103:03 536876177 /usr/bin/bash
006e6000-006ec000 rw-p 00000000 00:00 0
00e07000-00e6a000 rw-p 00000000 00:00 0 [heap]
7fbeac11c000-7fbeac128000 r-xp 00000000 103:03 1074688839 /usr/lib64/libnss_files-2.17.so
7fbeac128000-7fbeac327000 ---p 0000c000 103:03 1074688839 /usr/lib64/libnss_files-2.17.so
I'd like to map a 2G memory region in the very lowest portion of memory, but I would have to put it in the region after these a.out mappings, crossing into the second 2G region.
Is the a.out being mapped here part of the x86_64 ABI, or can this load address be moved to a different region, using one of the following?
runtime loader flags
linker flags when the executable is created
Yes. Building a Linux x86-64 application as a position-independent executable will cause both it and its heap to be mapped into high memory, right along with libc and other libraries. This should leave the space under 2GB free for your use. (However, note that the kernel will probably protect the first 64KB or so of memory from being mapped to protect it from certain exploits; look up vm.mmap_min_addr for information.)
To build your application as a position-independent executable, pass -pie -fPIE to the compiler.
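Once the executable is PIE, the reservation itself is an ordinary hinted mmap. A sketch (the 64 KiB floor is an assumption standing in for your system's vm.mmap_min_addr, and without MAP_FIXED the kernel is free to move the hint):
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    unsigned long start = 0x10000;          /* assumed mmap_min_addr */
    size_t len = (1UL << 31) - start;       /* up to the 2 GiB line */

    /* MAP_NORESERVE: claim address space without committing swap */
    void *p = mmap((void *)start, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    if ((unsigned long)p != start)
        fprintf(stderr, "kernel moved the hint: got %p\n", p);
    printf("reserved %zu bytes at %p\n", len, p);
    return 0;
}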

Why is vdso appearing during execution of static binaries? [duplicate]

This question already has answers here: What are vdso and vsyscall? (2 answers)
Closed 7 years ago.
Here is a quick sample program. (It basically prints the proc map associated with the process.)
> cat sample.c
#include <stdio.h>
#include <stdlib.h>   /* system() */
#include <unistd.h>   /* getpid() */

int main()
{
    char buffer[1000];
    sprintf(buffer, "cat /proc/%d/maps\n", (int)getpid());
    int status = system(buffer);
    (void)status;     /* exit status of the shell command, unused */
    return 1;
}
Preparing it statically
> gcc -static -o sample sample.c
> file sample
sample: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.24, BuildID[sha1]=9bb9f33e867df8f2d56ffb4bfb5d348c544b1050, not stripped
Executing the binary
> ./sample
00400000-004c0000 r-xp 00000000 08:01 12337398 /home/admin/sample
006bf000-006c2000 rw-p 000bf000 08:01 12337398 /home/admin/sample
006c2000-006c5000 rw-p 00000000 00:00 0
0107c000-0109f000 rw-p 00000000 00:00 0 [heap]
7ffdb3d78000-7ffdb3d99000 rw-p 00000000 00:00 0 [stack]
7ffdb3de7000-7ffdb3de9000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
I googled vDSO but did not properly understand it. Wikipedia says "these are ways in which kernel routines can be accessed from user space". My question is: why are these shared objects appearing in the execution of static binaries?
They appear because your kernel "injects" them into every process.
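You can watch the injection from inside the process: the kernel records the vdso's load address in the auxiliary vector under AT_SYSINFO_EHDR, and glibc 2.16+ exposes that via getauxval(). A sketch:
#include <sys/auxv.h>   /* getauxval(), glibc 2.16+ */
#include <stdio.h>

int main(void) {
    /* base address of the vdso's ELF header, as set by the kernel */
    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
    printf("[vdso] injected at %#lx\n", vdso);
    return 0;
}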

virtual memory in linux

I am debugging an issue where the same program behaves differently on different Linux boxes (all 2.6 kernels). Basically, on some Linux boxes an mmap() of 16MB always succeeds, but on other boxes the same mmap() fails with ENOMEM.
I checked /proc/<pid>/maps and saw that the virtual memory maps on the different boxes are quite different. One difference is the address range that mmap() returns:
linux box1: mmap() returns addresses around 31162000-31164000
linux box2: mmap() returns addresses around a9a67000-a9a69000
My question is: for a particular process, how is the Linux virtual memory arranged? What decides the actual address ranges? Why would mmap() fail even though I can still see some large unused virtual address ranges?
UPDATE: here is an example of an address space where an mmap() of 16MB fails:
10024000-10f6b000 rwxp 10024000 00:00 0 [heap]
30000000-3000c000 rw-p 30000000 00:00 0
3000c000-3000d000 r--s 00000000 00:0c 5461 /dev/shm/mymap1
3000d000-3000e000 rw-s 00001000 00:0c 5461 /dev/shm/mymap2
3000e000-30014000 r--s 00000000 00:0c 5463 /dev/shm/mymap3
30014000-300e0000 r--s 00000000 00:0c 5465 /dev/shm/mymap4
300e0000-310e0000 r--s 00000000 00:0b 2554 /dev/mymap5
310e0000-31162000 rw-p 310e0000 00:00 0
31162000-31164000 rw-s 00000000 00:0c 3554306 /dev/shm/mymap6
31164000-312e4000 rw-s 00000000 00:0c 3554307 /dev/shm/mymap7
312e4000-32019000 rw-s 00000000 00:0c 3554309 /dev/shm/mymap8
7f837000-7f84c000 rw-p 7ffeb000 00:00 0 [stack]
In the above example, there is still a big gap between the last mymap8 and [stack], but a further mmap() of 16MB fails. My question is: how does Linux decide the mmap() base address and the allowed range?
Thanks.
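(Not an answer from the thread, but a diagnostic sketch that may help: compare a hinted 16MB request against a kernel-chosen one and print RLIMIT_AS, since a low ulimit -v or strict overcommit settings can produce exactly this ENOMEM despite visible gaps. The 0x40000000 hint is illustrative, picked from the gap in the dump above.)
#include <sys/mman.h>
#include <sys/resource.h>
#include <stdio.h>

int main(void) {
    /* RLIMIT_AS (ulimit -v) caps total address space; a low cap makes
     * mmap() fail with ENOMEM even when the layout shows free gaps. */
    struct rlimit as;
    getrlimit(RLIMIT_AS, &as);
    printf("RLIMIT_AS: cur=%lu max=%lu\n",
           (unsigned long)as.rlim_cur, (unsigned long)as.rlim_max);

    size_t len = 16UL << 20;   /* 16 MiB */
    void *hinted = mmap((void *)0x40000000, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *chosen = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("hinted: %p, kernel-chosen: %p\n", hinted, chosen);
    return 0;
}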

identifying glibc mmap areas (VMA's) from a Linux kernel module

I understand that when allocating a block of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(), and this mmap-allocated area doesn't appear as part of [heap] in the Linux VMA list.
So is there any method available to identify all the glibc mmap areas from a Linux kernel module?
Example: a test program that calls malloc() with sizes greater than MMAP_THRESHOLD many times shows this cat /proc/<pid>/maps output:
00013000-00085000 rw-p 00000000 00:00 0 [heap]
40000000-40016000 r-xp 00000000 00:0c 14107305 /lib/arm-linux-gnueabi/ld-2.13.so
4025e000-4025f000 r--p 00001000 00:0c 14107276 /lib/arm-linux-gnueabi/libdl-2.13.so
4025f000-40260000 rw-p 00002000 00:0c 14107276 /lib/arm-linux-gnueabi/libdl-2.13.so
.....
.....
40260000-40261000 ---p 00000000 00:00 0
40261000-40a60000 rw-p 00000000 00:00 0
40a60000-40a61000 ---p 00000000 00:00 0
40a61000-42247000 rw-p 00000000 00:00 0
beed8000-beef9000 rw-p 00000000 00:00 0 [stack]
In this output a few regions (40a61000-42247000 and 40261000-40a60000) are actually glibc mmap areas. So, from a Linux kernel module, is there any way to identify these areas, something like the code below, which identifies the stack and heap?
if (vma->vm_start <= mm->start_brk &&
    vma->vm_end >= mm->brk) {
        name = "[heap]";
} else if (vma->vm_start <= mm->start_stack &&
           vma->vm_end >= mm->start_stack) {
        name = "[stack]";
}
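Not from the thread, but a sketch of how such a walk might look in a module. It assumes an older kernel where mm_struct still carries the vm_next list and mmap_sem (newer kernels replaced these with the maple tree, for_each_vma(), and mmap_lock). Note this only narrows down candidates: any anonymous mapping that is neither heap nor stack matches, not just glibc's.
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched.h>

static void scan_anon_vmas(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        down_read(&mm->mmap_sem);       /* renamed mmap_lock in 5.8 */
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                int is_heap = vma->vm_start <= mm->start_brk &&
                              vma->vm_end >= mm->brk;
                int is_stack = vma->vm_start <= mm->start_stack &&
                               vma->vm_end >= mm->start_stack;

                /* file-less, neither heap nor stack: candidate for a
                 * glibc mmap'd chunk (or any other anon mapping) */
                if (!vma->vm_file && !is_heap && !is_stack)
                        pr_info("anon vma: %lx-%lx\n",
                                vma->vm_start, vma->vm_end);
        }
        up_read(&mm->mmap_sem);
}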
I believe you should not dump the memory of your application from a kernel module. Consider using application checkpointing instead; see the Berkeley Lab Checkpoint/Restart (BLCR) library.
You could also consider using the core dumping facilities inside the kernel.
