Why is mprotect a distinct syscall from mmap - linux

I have been working with syscalls related to virtual memory lately. From the mmap manual I know that it can be very powerful when the MAP_FIXED flag is set, since it can create new mappings anywhere in memory.
MAP_FIXED
Don't interpret addr as a hint: place the mapping at exactly
that address. addr must be suitably aligned: for most
architectures a multiple of the page size is sufficient;
however, some architectures may impose additional
restrictions. If the memory region specified by addr and len
overlaps pages of any existing mapping(s), then the overlapped
part of the existing mapping(s) will be discarded. If the
specified address cannot be used, mmap() will fail.
Software that aspires to be portable should use the MAP_FIXED
flag with care, keeping in mind that the exact layout of a
process's memory mappings is allowed to change significantly
between kernel versions, C library versions, and operating
system releases. Carefully read the discussion of this flag
in NOTES!
My question is: why is there a distinct syscall mprotect separate from mmap, given that mmap can seemingly do the exact same job by creating a new mapping at the same address with the same fd and offset, and with the new prot you want?
In my opinion, all VM operations could ultimately be done with mmap and munmap alone, since those operations are basically just manipulating the page tables. Can someone tell me if this is a bad idea?

You need mprotect if you want to change the permissions on an existing region of memory, while keeping its contents intact.
mmap can't do this. If you use mmap with MAP_FIXED to create a new mapping at the same address, then the region's previous contents will be replaced by the contents of the new file you mapped, or zeros if using MAP_ANONYMOUS.
Using the same fd and offset does not solve this. If the mapping was originally created with MAP_ANONYMOUS (as is the case for most dynamically allocated memory), then there is no fd at all. And if the region was mapped from a file but with MAP_PRIVATE, the contents may have been modified in your process's memory without being written back to the file; mapping the file again would discard those modifications and replace them with the file's original contents.
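To make this concrete, here is a minimal sketch (assuming a private anonymous mapping, so there is no file to "remap" from) in which mprotect changes the permissions while the data stays exactly where it is:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    /* Private anonymous mapping: no fd, no file backing. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "data we want to keep");

    /* Drop write permission; the contents are untouched. */
    if (mprotect(p, len, PROT_READ) == -1) { perror("mprotect"); return 1; }
    printf("still there: %s\n", p);

    /* Re-mapping the same range with MAP_FIXED | MAP_ANONYMOUS would
       instead hand back a fresh zero-filled page and lose the data. */
    munmap(p, len);
    return 0;
}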

Related

Where do FS and GS registers get mapped to in the linear address space?

I understand that in 32 bit you have segments where each segment would map to a base and limit. Therefore, a segment wouldn't be able to access another segment's data.
With 64 bit, we throw away most of the segments and have a base of 0 with no limit, thus accessing the entire 64 bit address space. But I get confused when they state we have FS and GS registers for thread local storage and additional data.
If the default segment can access anything in the linear address space, then what is stopping the program from corrupting or accessing the FS/GS segments? The OS would have to keep track of FS/GS and make sure nothing else gets allocated there, right? How does this work?
Also, if the default area can access anything, then why do we even have FS/GS? I guess FS makes sense because we can just switch the register during a thread switch. But why even use GS? Why not just malloc memory instead? Sorry, I am new to OS development.
In 64-bit mode, the FS and GS "segment registers" aren't really used, per se. But using an FS or GS segment override for a memory access causes the address to be offset by the value contained in a hidden FSBASE/GSBASE register which can be set by a special instruction (possibly privileged, in which case a system call can ask the kernel to do it). So for instance, if FSBASE = 0x12340000 and RDI = 0x56789, then MOV AL, FS:[RDI] will load from linear address 0x12396789.
This is still just a linear address - it's not separate from the rest of the process's address space in any way, and it's subject to all the same paging and memory protection as any other memory access. The program could get the exact same effect with MOV AL, [0x12396789] (since DS has a base of 0). It is up to the program's usual memory allocation mechanisms to allocate an appropriate block of memory and set FSBASE to point to that block, if it intends to use FS in this way. There are no special protections needed to avoid "corrupting" this memory, any more than they are needed to prevent a program from corrupting any other part of its own memory.
So it doesn't really provide new functionality - it's more a convenience for programmers and compilers. As you say, it's nice for something like a pointer to thread-local storage, but if we didn't have FS/GS, there are plenty of other ways we could keep such a pointer in the thread's state (say, reserving R15). It's true that there's not an obvious general need for two such registers; for the most part, the other one is just there in case someone comes up with a good way to use it in a particular program.
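As a rough illustration of the "just a base register" point, here is a sketch (x86-64 Linux assumed) that sets GSBASE through the raw arch_prctl syscall and then loads through a GS override. GS is used rather than FS only because glibc already points FS at its own thread-local storage:

#define _GNU_SOURCE
#include <asm/prctl.h>      /* ARCH_SET_GS */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static long tls_block[4] = { 42, 0, 0, 0 };

int main(void) {
    /* Point GSBASE at an ordinary static buffer. */
    if (syscall(SYS_arch_prctl, ARCH_SET_GS, tls_block) != 0) {
        perror("arch_prctl");
        return 1;
    }

    /* A GS-relative load is now just tls_block[0] under another name. */
    long v;
    __asm__ volatile("movq %%gs:0, %0" : "=r"(v));
    printf("%ld\n", v);   /* prints 42 */
    return 0;
}

The loaded value is the same one an ordinary MOV from tls_block would produce; the override only adds GSBASE to the address before the usual paging checks apply.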
See also:
How are the fs/gs registers used in Linux AMD64?
What is the "FS"/"GS" register intended for?

Is it possible to overwrite kernel code by replacing the page tables?

So it seems that you cannot modify kernel code because the PTE that points to it is marked as executable as opposed to writeable. I was wondering if you could overwrite kernel code using the following method? (this only applies to x86 and assumes we have root access so we run the following steps as a kernel module)
Read in the contents of the CR3 register
Use kmalloc to allocate memory big enough to replicate all the PTE and the PDE
Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
Mark the relevant pages as executable and writeable
Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code, whereas before doing this we would have been stopped by the paging mechanism's protections?
3. Copy all the paging data into the newly allocated memory using the value obtained from the CR3 register
5. Overwrite the CR3 register with a pointer to the memory we kmalloc'ed in step 2
These two steps might not work:
CR3 gives you a physical address; however, to read the paging data you need a virtual address. It is not even guaranteed that the page directory (PTD) is currently mapped (and therefore accessible).
And to overwrite the CR3 register you need to know the physical address of the memory you have allocated using kmalloc; however, you only know the virtual address.
However, you can use virt_to_phys and phys_to_virt to translate between virtual and physical addresses.
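As a rough, version-dependent sketch of that point (an x86-64 kernel module; helpers such as phys_to_virt are from the mainline kernel, but the details vary between kernel versions, so treat this as illustrative only):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/io.h>        /* phys_to_virt() */

static int __init cr3_demo_init(void)
{
    unsigned long cr3;
    unsigned long *pgd;

    /* CR3 holds a *physical* address plus some low flag/PCID bits. */
    asm volatile("mov %%cr3, %0" : "=r"(cr3));
    cr3 &= ~0xfffUL;

    /* The kernel's direct map turns it back into a usable virtual address. */
    pgd = phys_to_virt(cr3);
    pr_info("cr3_demo: CR3 phys=%lx, first PGD entry=%lx\n", cr3, pgd[0]);
    return 0;
}

static void __exit cr3_demo_exit(void) { }

module_init(cr3_demo_init);
module_exit(cr3_demo_exit);
MODULE_LICENSE("GPL");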
Is it possible to overwrite kernel code ...?
I'm not sure, but the following attempt should work:
The page tables themselves should be read-write - at least the ones used by kmalloc.
Instead of copying the PTD and the page tables, you could allocate a block with kmalloc that is two pages long (8 KiB if the "traditional" 4 KiB pages are used). This guarantees that "your" memory block contains at least one complete memory page.
Once you have the virtual addresses of the PTD and the page tables, you can re-map "your" memory page so it no longer points to your "kmalloc memory" but to the kernel code you want to modify...
At this point, assuming this all worked, wouldn't you be able to overwrite return addresses and other parts of the kernel code?
I'm not sure if I understand your question correctly.
But a kernel module is part of the kernel - so nothing stops a kernel module from doing something completely stupid (intentionally or because of a bug).
For this reason you have to be very careful when programming kernel modules.
And because "root" has the ability to load kernel modules, it is important that hackers or malware never get "root" access. Otherwise malware could be injected into the kernel using insmod.

write access for kernel into a readonly segment

Until now I thought the kernel has permission to write to read-only segments, but this code has raised a lot of questions:
#include <stdio.h>
#include <unistd.h>

int main() {
    char *x = "Hello World";        /* string literal, placed in a read-only segment */
    int status = pipe((int *)x);    /* pipe() must write two fds through this pointer */
    perror("Error");
    return status;
}
The output of the code is
Error : Bad Address
My argument was: "Since the pipe function executes in kernel mode, the read-only segment must be writable by the kernel." That doesn't seem to be the case here. Now my questions are:
How does the kernel protect memory segments which are read-only?
Or am I wrong in my assumptions about the kernel's capabilities?
Much like user space, the kernel's address space is subject to whether a particular virtual address (also called a logical address) is mapped as readable, writable and executable. Unlike user space, though, the kernel has free rein to map a group of virtual addresses to a page and to change the page's permission attributes. However, just because the kernel has the ability to map a page as writable does not mean the address stored in char *x was paged into the kernel's address space as writable, or even paged at all, at the time of the pipe call.
The way the kernel protects regions of memory is with a piece of hardware called a memory management unit (MMU). The MMU performs the mapping of virtual to physical addresses and enforces permissions on those regions. The kernel is more or less given free rein to configure the MMU. User space code, on the other hand, cannot access the MMU, so it cannot change the page table's mappings or the permission attributes of a page. This effectively means that user space has to live with the address space mapping and the permissions set by the kernel.
I don't understand where the "kernel can write to ro pages" assertion comes from. If the kernel wants to, it can of course remap memory however it sees fit, but why would it do that in this case?
I presume you are running on x86. On this arch the kernel splits the address space into two parts (user/kernel). When you switch to the kernel, userspace is still mapped. So when the kernel tries to write to the provided address, it hits the same mapping your userspace process would. Since the mapping does not allow write access, the operation fails.
For the sake of argument, let's say this did not hold true. That is, whatever read-only mapping is in userspace, the kernel will write to it anyway and that will work. Well, that would be an instant security problem. Consider a file you can only read/exec, like glibc: it is mapped read-only/exec. Now you make the kernel write to that area, effectively changing the file for everyone. So why not just do read(evilfd, address_of_libc, size_of_libc); and bam, you have overwritten the entire library with data of your choice.
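A small userspace demonstration of this (no special setup assumed): the very same pipe call succeeds with a writable buffer and fails with EFAULT on a read-only page, because the kernel's store goes through exactly the same mapping your own store would:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int ok[2];
    char *ro = mmap(NULL, 4096, PROT_READ,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ro == MAP_FAILED) { perror("mmap"); return 1; }

    /* Writable buffer: the kernel can store the two descriptors here. */
    if (pipe(ok) == -1)
        perror("pipe(writable)");
    else
        printf("pipe(writable) ok: fds %d and %d\n", ok[0], ok[1]);

    /* Read-only page: the kernel's write faults just as ours would,
       so pipe() reports EFAULT ("Bad address"). */
    if (pipe((int *)ro) == -1)
        perror("pipe(read-only)");

    return 0;
}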

Replacing `sbrk` with `mmap`

I've read that sbrk is a deprecated call and that one should prefer mmap with the MAP_ANONYMOUS flag. I need one contiguous (logical) memory block that can grow. However, mmap treats its first parameter as a hint, so it can leave gaps, which is unacceptable in my case. I tried the MAP_FIXED flag (which, as the documentation states, is not recommended) and I can get contiguous memory, but after mapping several pages my program behaves strangely: system functions like printf and clock_gettime begin to fail. I guess the first mmap, which I call without MAP_FIXED, returns a page that has other mappings right after it, containing system data. So what is the right way to use mmap instead of sbrk?
With Linux you can use mmap with MAP_NORESERVE (and possibly PROT_NONE) to claim a large chunk of address space without actually allocating any memory. You map the largest area you could possibly want (and can get), and then remap bits of it with MAP_FIXED to actually allocate memory as needed.
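A minimal sketch of that reserve-then-commit pattern (the 1 GiB reservation and 1 MiB commit size are arbitrary choices for illustration):

#include <stdio.h>
#include <sys/mman.h>

#define RESERVE (1UL << 30)   /* reserve 1 GiB of address space */
#define CHUNK   (1UL << 20)   /* commit 1 MiB at a time */

int main(void) {
    /* Step 1: claim a large contiguous range without committing memory. */
    char *base = mmap(NULL, RESERVE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap reserve"); return 1; }

    /* Step 2: "grow" the usable block by remapping the next chunk in place.
       MAP_FIXED is safe here because the target range is already ours. */
    size_t used = 0;
    for (int i = 0; i < 4; i++) {
        void *p = mmap(base + used, CHUNK, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED) { perror("mmap commit"); return 1; }
        used += CHUNK;
    }
    printf("contiguous block of %zu bytes at %p\n", used, (void *)base);
    return 0;
}

Calling mprotect(base + used, CHUNK, PROT_READ | PROT_WRITE) on the reserved range would also work for growing the block, since the reservation is already an anonymous mapping.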
I've read that sbrk is a deprecated call
Don't believe everything you read, especially if the source is not authoritative.
I need one contiguous (logical) memory block that can grow.
In that case, mmap is not for you, unless you are willing to declare the maximum size to which that block can grow.
I tried the MAP_FIXED flag (which, as the documentation states, is not recommended) and I can get contiguous memory, but after mapping several pages my program behaves strangely
With MAP_FIXED you have to be very careful: the system will happily map over whatever (if anything) was there before, including libc data and code.

Linux MMAP internals

I have several questions regarding the mmap implementation in Linux systems which don't seem to be very much documented:
When mapping a file to memory using mmap, how would you handle prefetching the data in such a file?
I.e. what happens when you read data from the mmap'ed region? Is that data moved to the L1/L2 caches? Is it read directly from the disk cache? Do prefetchnta and similar ASM instructions work on mmap'ed regions?
What's the overhead of the actual mmap call? Is it proportional to the amount of mapped data, or constant?
Hope somebody has some insight into this. Thanks in advance.
mmap is basically programmatic access to the virtual memory subsystem.
When you have, say, a 1 GB file and you mmap it, you get a pointer to "the entire" file as if it were in memory.
However, at this stage nothing has happened apart from the actual mapping operation of reserving pages for the file in the VM. (The larger the file, the longer the mapping operation, of course.)
In order to start reading data from the file, you simply access it through the pointer you were returned in the mmap call.
If you wish to "preload" parts of the file, just visit the area you'd like to preload. Make sure you visit ALL of the pages you want to load, since the VM will only load the pages you access. For example, say within your 1G file, you have a 10MB "index" area that you'd like to map in. The simplest way would be to just "walk your index", or whatever data structure you have, letting the VM page in data as necessary. Or, if you "know" that it's the "first 10MB" of the file, and that your page size for your VM is, say, 4K, then you can just cast the mmap pointer to a char pointer, and just iterate through the pages.
void load_mmap(char *mmapPtr) {
    // Touch one byte in every 4 KiB page of the first 10 MB so the VM
    // faults those pages in; 'volatile' keeps the compiler from
    // optimizing the otherwise unused read away.
    for (int offset = 0; offset < 10 * 1024 * 1024; offset += 4 * 1024) {
        volatile char c = *(mmapPtr + offset);
        (void)c;
    }
}
As for the L1 and L2 caches, mmap has nothing to do with those; that's all about how you access the data.
Since you're using the underlying VM system, anything that addresses data within the mmap'd block will work (even from assembly).
If you don't change any of the mmap'd data, the VM will automatically flush out old pages as new pages are needed. If you do change them, the VM will write those pages back for you.
It has nothing to do with the CPU caches; mmap maps the file into the virtual address space, and if that memory is subsequently accessed, or locked with mlock(), the data is brought physically into memory. Which CPU caches it ends up in is not something you really have control over (at least, not via mmap).
Normally touching the pages is necessary to cause them to be mapped in, but an mlock or mlockall call has the same effect (these are usually privileged).
As far as the overhead is concerned, I don't really know; you'd have to measure it. My guess is that an mmap() which doesn't load pages in is more or less a constant-time operation, but bringing the pages in will take longer with more pages.
Recent versions of Linux also support a flag, MAP_POPULATE, which instructs mmap to load the pages in immediately (presumably only if possible).
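For completeness, a small sketch combining both approaches (the file name big.dat is just a placeholder, and the file is assumed to be at least 10 MiB long): MAP_POPULATE pre-faults the whole mapping, while madvise(MADV_WILLNEED) can be used to prefetch just a part of it:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("big.dat", O_RDONLY);   /* placeholder file name */
    if (fd == -1) { perror("open"); return 1; }
    off_t len = lseek(fd, 0, SEEK_END);

    /* MAP_POPULATE asks the kernel to fault the whole mapping in up front. */
    char *p = mmap(NULL, (size_t)len, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Alternatively, prefetch only a 10 MiB "index" at the start:
       MADV_WILLNEED hints the kernel to read that range ahead. */
    madvise(p, 10 * 1024 * 1024, MADV_WILLNEED);

    /* ... use p ... */
    munmap(p, (size_t)len);
    close(fd);
    return 0;
}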
Answering Mr. Ravi Phulsundar's question:
Multiple processes can map the same file as long as the permissions are set correctly. Looking at the mmap man page, you just pass the MAP_SHARED flag (if you need to map a really large file, use mmap2 instead):
mmap
MAP_SHARED
Share this mapping with all other processes that map this object.
Storing to the region is equivalent to
writing to the file. The file may not
actually be updated until msync(2) or
munmap(2) are called.
You use MAP_SHARED.
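A minimal sketch of such a shared mapping (shared.dat is a placeholder name; ftruncate sizes the file so the mapped page is actually backed):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("shared.dat", O_RDWR | O_CREAT, 0600);  /* placeholder file */
    if (fd == -1) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) == -1) { perror("ftruncate"); return 1; }

    /* MAP_SHARED: stores go to the page cache and are visible to every
       other process mapping the same file, and eventually to the file. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from one process");
    msync(p, 4096, MS_SYNC);   /* force write-back now rather than later */

    munmap(p, 4096);
    close(fd);
    return 0;
}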
