Does mprotect flush the instruction cache on ARM Linux? - linux

I am writing a JIT on ARM Linux that executes an instruction set that contains self-modifying code. The instruction set does not have any cache flush instructions (similar to x86 in that respect).
If I write out some code to a page and then call mprotect on that page, is that sufficient to invalidate the instruction cache? Or do I also need to use the cacheflush syscall on those pages?

You'd expect that the mmap/mprotect syscalls would establish mappings that are updated immediately, and need no further interaction to use the memory ranges as specified. I see that the kernel does indeed flush caches on mprotect. In that case, no cache flush would be required.
However, I also see that some versions of libc do call cacheflush after mprotect, which would imply that some environments would need the caches flushed (or have previously). I'd take a guess that this is a workaround to a bug.
You could always add the call to cacheflush; although it's extra code, it shouldn't be too harmful: at worst, the caches will already have been flushed. You could also write a quick test and see what happens...

In Linux specifically, mprotect DOES cacheflush all caches since at least version 2.6.39 (and even before that for sure). You can see that in the code:
https://elixir.bootlin.com/linux/v2.6.39.4/source/mm/mprotect.c#L122 .
If you are writing portable POSIX code, I would call cacheflush anyway, as the standard does not demand this behavior from the kernel or from the C library implementation.
Edit: You should also be careful and check what flush_cache_range does on the specific architecture you are targeting, as on some architectures (like ARM64) this function does nothing...
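To make the "flush explicitly anyway" advice concrete, here is a minimal sketch of the write, mprotect, flush sequence for a JIT buffer. The helper name publish_code is hypothetical, and using the GCC/Clang builtin __builtin___clear_cache instead of calling cacheflush directly is my assumption; as far as I know, on ARM Linux the builtin ends up in the cacheflush syscall, while on x86 it compiles to nothing.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes of generated code into a fresh
 * anonymous mapping, switch it to read+execute, then flush explicitly.
 * Returns the mapping, or NULL on failure. */
void *publish_code(const void *code, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return NULL;

    /* Round the length up to a whole number of pages. */
    size_t map_len = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

    /* Write while the pages are read+write only. */
    void *buf = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    memcpy(buf, code, len);

    /* Drop write permission and add execute permission. */
    if (mprotect(buf, map_len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, map_len);
        return NULL;
    }

    /* Explicit flush, independent of whatever mprotect did in the kernel. */
    __builtin___clear_cache((char *)buf, (char *)buf + len);
    return buf;
}
```

A JIT would call publish_code(bytes, n) and only then jump into the returned buffer. If mprotect already flushed, the extra call is redundant but harmless.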

I believe you do not have to explicitly flush the cache.
Which processor is this? ARMv5? ARMv7?

Related

Disable Linux vsyscall vdso vvar

I am implementing a Linux security sandbox for a custom bytecode interpreter through seccomp mode. To minimize as much as possible the attack surface, I want to run it in a completely clean virtual address space. I only need code and data segments plus stack available, but I do not need vsyscall, vdso nor vvar.
Is there any way to disable allocation of these pages for a given process?
Basically, no, you will have to disable vsyscall/vDSO globally if you want the mapping itself to be unavailable. If you only want the program to be unable to call vsyscall/vDSO syscalls, then seccomp will be able to do it. Some caveats though:
See https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities:
A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
the vsyscall entry for the given call and not the address after the
'syscall' instruction. Any code which wants to restart the call
should be aware that (a) a ret instruction has been emulated and (b)
trying to resume the syscall will again trigger the standard vsyscall
emulation security checks, making resuming the syscall mostly
pointless.
A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
but the syscall may not be changed to another system call using the
orig_rax register. It may only be changed to -1 in order to skip the
currently emulated call. Any other change MAY terminate the process.
The rip value seen by the tracer will be the syscall entry address;
this is different from normal behavior. The tracer MUST NOT modify
rip or rsp. (Do not rely on other changes terminating the process.
They might work. For example, on some kernels, choosing a syscall
that only exists in future kernels will be correctly emulated (by
returning -ENOSYS).)
To detect this quirky behavior, check for addr & ~0x0C00 ==
0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
cases.
Note that modern systems are unlikely to use vsyscalls at all -- they
are a legacy feature and they are considerably slower than standard
syscalls. New code will use the vDSO, and vDSO-issued system calls
are indistinguishable from normal system calls.
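One practical note on the detection check quoted above: written literally in C, addr & ~0x0C00 == 0xFFFFFFFFFF600000 does not do what the text intends, because == binds more tightly than &. A minimal sketch with explicit parentheses (the function name is hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is one of the four vsyscall entry
 * addresses 0xFFFFFFFFFF600{0,4,8,C}00 mentioned in the quoted text.
 * The mask clears bits 10 and 11 before comparing. */
bool is_vsyscall_entry(uint64_t addr)
{
    return (addr & ~UINT64_C(0x0C00)) == UINT64_C(0xFFFFFFFFFF600000);
}
```

For SECCOMP_RET_TRACE you would pass the tracee's rip, and for SECCOMP_RET_TRAP the si_call_addr from siginfo, cast to uint64_t.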
So emulated vsyscalls can be confined by seccomp, and vDSOs are likewise confined by seccomp. If you disable gettimeofday(), the confined program will not be able to call that syscall through emulated vsyscall, vDSO, or regular syscall. If you confine them this way with seccomp, you shouldn't have to worry about the attack surface they create.
If you are worried about an attacker exploiting the vDSO mapping itself (which doesn't require calling a syscall), then I don't believe there's a way to disable it on a per-process basis reliably. You can prevent it from being linked in, but it would be hard to prevent a compromised bytecode interpreter from allocating memory and putting it back. You can boot with the vdso=0 kernel parameter which will disable it globally, though, so linking it in would do nothing.
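If you want to verify what a given setup actually leaves mapped (for example after booting with vdso=0), one option is to scan /proc/self/maps for the bracketed names the kernel uses for these regions. This is only a sketch; both function names are hypothetical, and it assumes /proc is mounted.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: does one line of /proc/<pid>/maps describe a
 * vdso, vvar or vsyscall mapping? */
bool is_kernel_provided_mapping(const char *maps_line)
{
    return strstr(maps_line, "[vdso]") != NULL ||
           strstr(maps_line, "[vvar]") != NULL ||
           strstr(maps_line, "[vsyscall]") != NULL;
}

/* Count such mappings in the calling process.
 * Returns -1 if /proc/self/maps cannot be opened. */
int count_kernel_provided_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    char line[512];
    int n = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (is_kernel_provided_mapping(line))
            n++;
    }
    fclose(f);
    return n;
}
```

Calling count_kernel_provided_mappings() before installing the seccomp filter tells you what is mapped; it says nothing about whether those pages are exploitable.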

Who zeroes pages while calling calloc() in Linux?

I am aware that an implementer can choose either to zero a malloc'ed page itself or to let the OS hand over an already zeroed page (as an optimization).
My question is simple - in Ubuntu 14.04 LTS which comes with linux kernel 3.16 and gcc 4.8.4, who will zero my pages? Is it in user land or kernel land?
It can depend on where the memory came from. The calloc code is userland, and will zero a memory page that gets re-used by a process. This happens when the memory was previously used and then freed, but not returned to the OS. However, if the page is newly allocated to the process, it will come already cleared to 0 by the OS (for security purposes), and so does not need to be cleared by calloc. This means calloc can potentially be faster than calling malloc followed by memset, since it can skip the memset if it knows the memory will already be zeroed.
That depends on the implementer of your standard library, not on the host system. It is not possible to give a specific answer for a particular OS, since it may be the build target of multiple compilers and their libraries - including on other systems, if you consider the possibility of cross-compiling (building on one type of system to target another).
Most implementations I've seen of calloc() use a call of malloc() followed by either a call of memset() or (with some implementations that target unix) a legacy function called bzero() - which is, itself, sometimes replaced by a macro call that expands to a call of memset() in a number of recent versions of libraries.
memset() is often hand-optimised. But, again, it is up to the implementer of the library.
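For illustration, the plain "malloc followed by memset" variant described above can be sketched as follows. The name sketch_calloc is hypothetical, and this is not how any particular libc does it; a real implementation can skip the memset for memory it knows came fresh (and therefore zeroed) from the OS.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical calloc-like helper: allocate nmemb * size bytes and
 * zero them in userland. Returns NULL on overflow or allocation failure. */
void *sketch_calloc(size_t nmemb, size_t size)
{
    /* Reject requests whose total size would overflow size_t. */
    if (size != 0 && nmemb > SIZE_MAX / size)
        return NULL;

    size_t total = nmemb * size;
    void *p = malloc(total);
    if (p != NULL)
        memset(p, 0, total);   /* userland zeroing, always performed here */
    return p;
}
```

The overflow check is the part people tend to forget when they replace calloc with malloc plus memset by hand.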

safe unloading of kernel module

I have to write an LKM that intercepts some syscalls.
The approach is to:
Find the address of the sys_call_table symbol and check that it is correct (for example, by checking that sys_call_table[__NR_close] points to the address of sys_close)
Disable interrupts
Disable WP bit in CR0
Change sys_call_table[__NR_close] to my own function
Enable WP bit
Enable interrupts.
Loading of module works fine.
But, what about safe unloading of module?
Consider the situation where I restore sys_call_table to its original state and the module is unloaded: what if the kernel is still executing code from my module in the context of a syscall of another process on another CPU? I will get a page fault in kernel mode (because the pages holding the module's code segment are no longer available once the module is unloaded).
The shared resource is the entry in sys_call_table. If I could protect access to this entry with locks, then I could safely unload my module.
But since the kernel's system call handler does not take any such locks (e.g. arch/x86/kernel/entry_32.S), does that mean there is no safe way of unloading my module? Is that true?
UPDATE1
I need to get information about file accesses on old kernels (where fanotify(2) is not available), starting from kernel version 2.4. I need this information to perform on-access scanning through an antivirus engine.
You're correct that there is no safe way to unload your module once you've done this. This is one reason why replacing/wrapping system call table entries this way is frowned upon.
In most recent versions, sys_call_table is not an exported symbol -- at least in part to discourage this very thing.
It would be possible in theory to support a more robust system call replacement mechanism but the kernel maintainers believe that the whole concept is so fraught with the potential for errors and confusion that they have declined to support it. (A web search will show several long-ago debates about this subject on the linux kernel mailing list.)
(Speaking here as one who used exactly the same technique several years ago.)
You can of course do it anyway. Then, you can either "just risk" unloading your module - and hence potentially causing a kernel panic (but of course it will likely work 99% of the time). Or you can not allow your module to be unloaded at all (requiring a reboot in order to upgrade or uninstall).
At the end of the uninit function in your kernel module, you can wait till all your custom hooks end.
This can be achieved using counters.
Increment the counter when your custom hook is hit, decrement it right before it returns.
When the counter hits zero, only then return from the uninit function.
You will also need locking on the counter variable.
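The counter idea can be modelled in userspace like this. It is only a sketch using C11 atomics in place of a lock; all names are hypothetical, and in a kernel module the counterpart would presumably be an atomic_t with atomic_inc/atomic_dec. Note that it does not close the window in which another CPU has already fetched your hook's address from the table but has not yet executed the increment, so it narrows the race rather than removing it.

```c
#include <sched.h>
#include <stdatomic.h>

/* Number of hook invocations currently running (zero-initialized). */
static atomic_int hooks_in_flight;

/* Call first thing inside every custom hook. */
void hook_enter(void) { atomic_fetch_add(&hooks_in_flight, 1); }

/* Call right before every return from a custom hook. */
void hook_exit(void) { atomic_fetch_sub(&hooks_in_flight, 1); }

/* Current number of hooks in flight. */
int hooks_active(void) { return atomic_load(&hooks_in_flight); }

/* Call at the end of the unload path, after the original table entries
 * have been restored: spin until no hook is running any more. */
void wait_for_hooks(void)
{
    while (atomic_load(&hooks_in_flight) != 0)
        sched_yield();
}
```

The order matters: restore the table entries first so that no new callers arrive, then wait for the counter to drain.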

Linux System Calls & Kernel Mode

I understand that system calls exist to provide access to capabilities that are disallowed in user space, such as accessing a HDD using the read() system call. I also understand that these are abstracted by a user-mode layer in the form of library calls such as fread(), to provide compatibility across hardware.
So from the application developer's point of view, we have something like:
//library //syscall //k_driver //device_driver
fread() -> read() -> k_read() -> d_read()
My question is: what is stopping me from inlining all the instructions in the fread() and read() functions directly into my program? The instructions are the same, so the CPU should behave in the same way? I have not tried it, but I assume that this does not work for some reason I am missing. Otherwise any application could get arbitrary kernel mode operation.
TL;DR: What allows system calls to 'enter' kernel mode that is not copy-able by an application?
System calls do not enter the kernel by themselves. More precisely, the read function you call is still, as far as your application is concerned, a library call. What read(2) does internally is trigger the actual system call using a software interrupt or a dedicated instruction such as syscall, depending on the CPU architecture and OS.
This is the only way for userland code to get privileged code executed, but it is an indirect way. The userland and kernel code execute in different contexts.
That means you cannot add the kernel source code to your userland code and expect it to do anything useful but crash. In particular, the kernel code has access to the physical memory addresses required to interact with the hardware. Userland code is limited to a virtual memory space that does not have this capability. Also, the instructions userland code is allowed to execute are a subset of the ones the CPU supports. Several I/O, interrupt and virtualization related instructions are examples of prohibited code. They are known as privileged instructions and require the CPU to be in a lower ring or in supervisor mode, depending on the CPU architecture.
You could inline them. You can issue system calls directly through syscall(2), but that soon gets messy. Note that the system call overhead (context switches back and forth, in-kernel checks, ...), not to mention the time the system call itself takes, makes any gain from inlining disappear in the noise (if there is any gain at all: more code means the cache is less effective, and performance suffers). Trust the libc/kernel folks to have studied the matter and done the inlining for you behind your back (in the relevant *.h file) if it really is a measurable gain.
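To show what "directly through syscall(2)" looks like, here is a minimal sketch that bypasses the usual libc wrappers. The raw_ function names are hypothetical; syscall() and the SYS_ constants come from glibc's unistd.h and sys/syscall.h.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* getpid without the getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* write without the write() wrapper; returns -1 and sets errno on error. */
ssize_t raw_write(int fd, const void *buf, size_t len)
{
    return (ssize_t)syscall(SYS_write, fd, buf, len);
}
```

Even here the switch into kernel mode is done by the trap instruction inside syscall(); the code that actually performs the write still lives in the kernel and runs with kernel privileges, which is exactly the part you cannot copy into your program.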

Execute code in process's stack, on recent Linux

I want to use ptrace to write a piece of binary code in a running process's stack.
However, this causes segmentation fault (signal 11).
I can make sure the %eip register holds the pointer to the first instruction that I want to execute on the stack. I guess Linux has some mechanism that prevents data on the stack from being executed.
So, does anyone know how to disable this protection for the stack? Specifically, I'm trying this on Fedora 15.
Thanks a lot!
After reading all replies, I tried execstack, which really makes code in stack executable. Thank you all!
This is probably due to the NX bit on modern processors. You may be able to disable this for your program using execstack.
http://advosys.ca/viewpoints/2009/07/disabling-the-nx-bit-for-specific-apps/
http://linux.die.net/man/8/execstack
As already mentioned, it is due to the NX bit. But it is possible. I know for sure that gcc itself uses it for trampolines (which are a workaround to make e.g. function pointers to nested functions work). I haven't looked at the details, but I would recommend a look at the gcc code. Search the sources for the architecture-specific macro TARGET_ASM_TRAMPOLINE_TEMPLATE; there you should see how they do it.
EDIT: A quick google for that macro gave me the hint: mprotect is used to change the permissions of the memory page. Also be careful when you generate data and execute it: you may additionally have to flush the instruction cache.
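One detail with the mprotect route: the address passed to mprotect must be page-aligned, and an address on the stack usually is not. A minimal sketch of the rounding, with hypothetical helper names; it changes permissions in the calling process only, so for a ptrace'd target the equivalent call would have to be made inside that process, and hardened systems may refuse writable+executable mappings altogether.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr down to the start of its page (page must be a power of two). */
uintptr_t page_floor(uintptr_t addr, uintptr_t page)
{
    return addr & ~(page - 1);
}

/* Hypothetical helper: make the pages covering [addr, addr + len)
 * readable, writable and executable. Returns 0 on success, -1 on error. */
int make_executable(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    uintptr_t start = page_floor((uintptr_t)addr, (uintptr_t)page);
    uintptr_t end = (uintptr_t)addr + len;

    /* The kernel rounds the length up to a page boundary itself. */
    return mprotect((void *)start, end - start,
                    PROT_READ | PROT_WRITE | PROT_EXEC);
}
```

After a successful make_executable(buf, n) you would still flush the instruction cache on architectures that need it before jumping to buf.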

Resources