Safe unloading of a kernel module - Linux

I have to write an LKM that intercepts some syscalls.
The solution is to:
1. Find the address of the sys_call_table symbol and check that it is correct (for example, by checking that sys_call_table[__NR_close] points to the address of sys_close)
2. Disable interrupts
3. Clear the WP bit in CR0
4. Change sys_call_table[__NR_close] to point to my own function
5. Set the WP bit again
6. Enable interrupts
Loading the module works fine.
But what about safely unloading it?
Consider the situation where I restore sys_call_table to its original state and the module is unloaded. What if the kernel is still executing code from my module, in the context of a syscall made by another process on another CPU? I will get a page fault in kernel mode, because the pages holding the module's code are no longer available once the module is unloaded.
The shared resource is the entry in sys_call_table. If I could protect access to this entry with locks, I could safely unload my module.
But the kernel's system call handler (e.g. arch/x86/kernel/entry_32.S) takes no such locks. Does that mean there is no safe way to unload my module?
UPDATE1
I need to get information about file accesses on old kernels (where fanotify(2) is not available), starting from kernel version 2.4. I need this information to perform on-access scanning with an antivirus engine.

You're correct that there is no safe way to unload your module once you've done this. This is one reason why replacing/wrapping system call table entries this way is frowned upon.
In recent kernel versions, sys_call_table is not an exported symbol -- at least in part to discourage this very thing.
It would be possible in theory to support a more robust system call replacement mechanism but the kernel maintainers believe that the whole concept is so fraught with the potential for errors and confusion that they have declined to support it. (A web search will show several long-ago debates about this subject on the linux kernel mailing list.)
(Speaking here as one who used exactly the same technique several years ago.)
You can of course do it anyway. Then, you can either "just risk" unloading your module - and hence potentially causing a kernel panic (but of course it will likely work 99% of the time). Or you can not allow your module to be unloaded at all (requiring a reboot in order to upgrade or uninstall).

At the end of your module's exit function, you can wait until all running instances of your custom hooks have finished.
This can be achieved with a counter.
Increment the counter when your custom hook is entered, and decrement it right before the hook returns.
Return from the exit function only once the counter has reached zero.
The counter must be updated atomically or be protected by a lock.
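A minimal user-space sketch of the counter idea, using C11 atomics in place of a lock. The names (hooks_in_flight, hooked_handler, wait_for_hooks) are made up for illustration and are not kernel APIs; a real module would use the kernel's own atomic helpers and would sleep instead of spinning.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counter: number of callers currently inside the hook. */
static atomic_int hooks_in_flight = 0;

/* Stand-in for the original handler that the hook wraps. */
static long original_handler(long arg)
{
    return arg + 1;
}

/* The hook: count ourselves in, do the work, count ourselves out. */
long hooked_handler(long arg)
{
    long ret;
    int prev;

    atomic_fetch_add(&hooks_in_flight, 1);
    ret = original_handler(arg);
    prev = atomic_fetch_sub(&hooks_in_flight, 1);
    assert(prev > 0);          /* the counter must never underflow */
    return ret;                /* this return is still module code */
}

int in_flight(void)
{
    return atomic_load(&hooks_in_flight);
}

/* Exit path: call only after the table entry has been restored. */
void wait_for_hooks(void)
{
    while (atomic_load(&hooks_in_flight) != 0)
        ;                      /* a real module would sleep between checks */
}
```

Keep in mind that this narrows the race rather than closing it. The instructions between the decrement and the actual return still live in module memory, and a caller may already have read the patched table entry without having incremented the counter yet.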

Related

Can kernel code make things read-only in a way that other kernel code can't undo?

I'm under the impression that the Linux kernel's attempts to protect itself revolve around not letting malicious code run in kernelspace. In particular, if a malicious kernel module were to get loaded, that it would be "game over" from a security standpoint. However, I recently came across a post that contradicts this belief and says that there is some way that the kernel can protect parts of itself from other parts of itself:
There is plenty of mechanisms to protect you against malicious modules. I write kernel code for fun so I have some experience in the field; it's basically a flag in the pagetable.
What's there to stop any kernel module from changing that flag in the pagetable back? The only protection against malicious modules is keeping them from loading at all. Once one loads, it's game over.
Make the pagetable readonly. Done.
Kernel modules can just make it read-write again, the same way your code made it read-only, then carry on with their changes.
You can actually lock this down so that kernel mode cannot modify the page table until an interrupt occurs. If your IDT is read-only as well there is no way for the module to do anything about it.
That doesn't make any sense to me. Am I missing something big about how kernel memory works? Can kernelspace code restrict itself from modifying the page table? Would that really prevent kernel rootkits? If so, then why doesn't the Linux kernel do that today to put an end to all kernel rootkits?
If the malicious kernel code is loaded in the trusted way (e.g. loading a kernel module and not exploiting a vulnerability) then no: kernel code is kernel code.
Intel CPUs do have a series of mechanisms to disable read/write access to kernel memory:
CR0.WP, if set, disallows write accesses to read-only pages, both user and kernel. Used to detect bugs in the kernel code.
CR4.PKE, if set (4-level paging must be enabled, which is mandatory in 64-bit mode), disallows the kernel from accessing (instruction fetches excluded) user-mode pages unless they are tagged with the right key (which determines their read/write permissions). Used to allow the kernel to write to structures like the vDSO and KUSER_SHARED_DATA but not to other user-mode structures. The key permissions are held in a register (PKRU), not in memory; the keys themselves are in the page table entries.
CR4.SMEP, if set, disallows kernel instruction fetches from user-mode pages. Used to prevent attacks where a kernel function pointer is redirected to a page allocated in user mode (e.g. the nelson.c privilege escalation exploit).
CR4.SMAP, if set, disallows kernel access to user-mode pages: always for implicit accesses, and for explicit accesses as well when EFLAGS.AC = 0 (thus overriding the protection keys). Used to enforce a stricter no-user-mode-access policy.
Of course the R/W and U/S bits in the paging structures control if the item is read-only/read-write and assigned to user or kernel.
You can read how permissions are applied for supervisor-mode accesses in the Intel manual:
Data writes to supervisor-mode addresses.
Access rights depend on the value of CR0.WP:
- If CR0.WP = 0, data may be written to any supervisor-mode address.
- If CR0.WP = 1, data may be written to any supervisor-mode address with a translation for which the
R/W flag (bit 1) is 1 in every paging-structure entry controlling the translation; data may not be written
to any supervisor-mode address with a translation for which the R/W flag is 0 in any paging-structure
entry controlling the translation.
So even if the kernel protected a page X as read-only and then protected the page structures themselves as read-only, a malicious module could simply clear CR0.WP.
It could also change CR3 and use its own paging structures.
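To make the quoted rule concrete, here is a small model of it as a C predicate. This is only a sketch of the one rule quoted above (data writes to supervisor-mode addresses); it ignores every other check the CPU performs, and the function name and parameters are my own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the quoted rule for a data write to a supervisor-mode address.
 * rw_flags[i] is the R/W flag (bit 1) of the i-th paging-structure entry
 * controlling the translation, from the top level down to the leaf.
 */
bool supervisor_write_allowed(bool cr0_wp, const bool *rw_flags, size_t levels)
{
    assert(rw_flags != NULL || levels == 0);

    if (!cr0_wp)
        return true;           /* WP = 0: any supervisor address is writable */

    for (size_t i = 0; i < levels; i++)
        if (!rw_flags[i])
            return false;      /* a single read-only entry blocks the write */

    return true;               /* R/W = 1 at every level */
}
```

With a read-only leaf entry, the write is refused when cr0_wp is true and allowed when it is false, which is exactly why clearing CR0.WP defeats read-only page tables.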
Note that Intel developed SGX to address the threat model where the kernel itself is evil.
However, running kernel components inside enclaves in a secure way (i.e. with no single point of failure) may not be trivial.
Another approach is virtualizing the kernel with the VMX extension, though this is by no means trivial to implement.
Finally, the CPU has four protection levels at the segmentation layer, but paging has only two: supervisor (CPL < 3) and user (CPL = 3).
It is theoretically possible to run a kernel component in "Ring 1" but then you'd need to make an interface (e.g. something like a call gate or syscall) for it to access the other kernel functions.
It's easier to run it in user mode altogether (since you don't trust the module in the first place).
I have no idea what this is supposed to mean:
You can actually lock this down so that kernel mode cannot modify the page table until an interrupt occurs.
I don't recall any mechanism by which the interrupt handling will lock/unlock anything.
I'm curious though, if anybody can shed some light they are welcome.
Security in x86 CPUs (but this may be generalized) has always been hierarchical: whoever comes first sets up the constraints for whoever comes later.
There is usually little to no protection between nonisolated components at the same hierarchical level.

Disable Linux vsyscall vdso vvar

I am implementing a Linux security sandbox for a custom bytecode interpreter using seccomp mode. To minimize the attack surface as much as possible, I want to run it in a completely clean virtual address space. I only need the code and data segments plus the stack, but I do not need vsyscall, vdso, or vvar.
Is there any way to disable the allocation of these pages for a given process?
Basically, no, you will have to disable vsyscall/vDSO globally if you want the mapping itself to be unavailable. If you only want the program to be unable to call vsyscall/vDSO syscalls, then seccomp will be able to do it. Some caveats though:
See https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities:
A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
the vsyscall entry for the given call and not the address after the
'syscall' instruction. Any code which wants to restart the call
should be aware that (a) a ret instruction has been emulated and (b)
trying to resume the syscall will again trigger the standard vsyscall
emulation security checks, making resuming the syscall mostly
pointless.
A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
but the syscall may not be changed to another system call using the
orig_rax register. It may only be changed to -1 order to skip the
currently emulated call. Any other change MAY terminate the process.
The rip value seen by the tracer will be the syscall entry address;
this is different from normal behavior. The tracer MUST NOT modify
rip or rsp. (Do not rely on other changes terminating the process.
They might work. For example, on some kernels, choosing a syscall
that only exists in future kernels will be correctly emulated (by
returning -ENOSYS).
To detect this quirky behavior, check for addr & ~0x0C00 ==
0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
cases.
Note that modern systems are unlikely to use vsyscalls at all -- they
are a legacy feature and they are considerably slower than standard
syscalls. New code will use the vDSO, and vDSO-issued system calls
are indistinguishable from normal system calls.
So emulated vsyscalls can be confined by seccomp, and vDSOs are likewise confined by seccomp. If you disable gettimeofday(), the confined program will not be able to call that syscall through emulated vsyscall, vDSO, or regular syscall. If you confine them this way with seccomp, you shouldn't have to worry about the attack surface they create.
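The detection check from the quoted documentation can be written in C roughly as follows. Treat it as a sketch: the function name is mine, and the address would come from rip (SECCOMP_RET_TRACE) or siginfo->si_call_addr (SECCOMP_RET_TRAP) as the documentation describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VSYSCALL_BASE UINT64_C(0xFFFFFFFFFF600000)

/* The base address must not have the masked-out bits set. */
static_assert((VSYSCALL_BASE & UINT64_C(0x0C00)) == 0,
              "vsyscall base must have bits 10 and 11 clear");

/* True for the four entry addresses 0xF...F600{0,4,8,C}00. */
bool is_emulated_vsyscall_addr(uint64_t addr)
{
    /* The parentheses are required: in C, == binds tighter than &. */
    return (addr & ~UINT64_C(0x0C00)) == VSYSCALL_BASE;
}
```

Written literally as addr & ~0x0C00 == 0xFFFFFFFFFF600000, the expression would compare first and mask afterwards, so the explicit parentheses are not optional.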
If you are worried about an attacker exploiting the vDSO mapping itself (which doesn't require calling a syscall), then I don't believe there's a way to disable it on a per-process basis reliably. You can prevent it from being linked in, but it would be hard to prevent a compromised bytecode interpreter from allocating memory and putting it back. You can boot with the vdso=0 kernel parameter which will disable it globally, though, so linking it in would do nothing.
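If you want to verify what actually ends up mapped (for example after booting with vdso=0), one option is to scan /proc/self/maps for the kernel-provided regions. A minimal sketch of the per-line check, with a helper name of my own choosing:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if a line from /proc/<pid>/maps describes one of the
 * kernel-provided mappings discussed here.
 */
bool is_kernel_provided_mapping(const char *maps_line)
{
    static const char *const names[] = { "[vdso]", "[vvar]", "[vsyscall]" };

    assert(maps_line != NULL);

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (strstr(maps_line, names[i]) != NULL)
            return true;

    return false;
}
```

Feed it each line read with fgets() from /proc/self/maps; if no line matches, none of the three regions is present in that process.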

How does the Adore-Ng rootkit get loaded into the kernel?

I am working on detecting kernel level rootkits and have chosen Adore-Ng as my first test rootkit. After having known how this rootkit hides itself and other processes in the Linux kernel (2.4, 2.6 versions), I now want to know how it gets loaded into the kernel.
Specifically, I want to know whether
1. It calls any already existing APIs in the Linux kernel OR
2. Does it have any hard-coded assembly instructions such that as soon as it's compiled, it gets loaded into the kernel?
I went through the source code of Adore-Ng but couldn't find anything in this direction and could only see how it achieves its goal of hiding.
Can anyone tell/suggest how I can find its loading behaviour?
Thanks.
The core of the Adore rootkit is a malicious module, so for it to be loaded into the kernel you first need root access, then run insmod (or modprobe).
Another way to load Adore is to infect a trusted kernel module as described in Phrack volume 0x0b, issue 0x3d. It wouldn't be as stealthy as the former way because the cleaner module wouldn't be invoked. Since you went through the source code, I believe you have the aptitude to modify the source to invoke the cleaner module once the Adore module has infected a legit module. (make sure you also leave no trace of the module infection; the cleaner will not pick that up).
Any admin worth his salt would use signed modules, so don't expect these methods to work on production kernels after the 2.6 mainline.

Kernel module to monitor syscalls?

I would like to create a kernel module from scratch that latches to a user session and monitors each system call made by processes belonging to that user.
I know what everyone is thinking - "use strace" - but I'd like to have some of my own logging and analysis with the data I collect, and strace has some issues - an application could use "mmap" to write to a file without the file contents ever appearing as the arguments of an "open" system call, or an application without any write permission may create coredumps to copy sensitive data.
I want to be able to handle these special cases and do some of my own logging. I wonder though - how can I route all syscalls through my module? Is there any way to do that without touching the kernel code?
Thanks
I don't have the exact answer to your question, but I read a paper a couple of days ago that may be useful to you:
http://www.cse.iitk.ac.in/users/moona/students/Y2157230.pdf/
I have done something similar in the past by using a kernel module to patch the system call table. Each patched function did something like the following:
long patchFunction(/* params */)
{
    long ret;

    // pre checks
    ret = origFunction(/* params */);
    // post checks
    return ret;
}
Note that when you start mucking around in the kernel data structures, your module becomes version dependent. The kernel module will probably have to be compiled for the specific kernel version you are installing on.
Also note, this is a technique employed by many rootkits so if you have security software installed it may try to prevent you from doing something like this.
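To illustrate the wrapper pattern without touching a real kernel, here is a user-space sketch in which an ordinary array of function pointers stands in for the system call table. Everything in it (demo_table, demo_handler, install_patch, remove_patch, the counters) is hypothetical and only mirrors the shape of the pseudocode above.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*handler_fn)(long);

enum { NR_DEMO = 0, TABLE_SIZE = 1 };

/* Stand-in for an original system call implementation. */
static long demo_handler(long arg)
{
    return arg * 2;
}

/* Stand-in for the system call table: an array of function pointers. */
handler_fn demo_table[TABLE_SIZE] = { demo_handler };

static handler_fn orig_function;   /* saved original entry, NULL if unpatched */
int pre_checks, post_checks;       /* count how often each check ran */

long patch_function(long arg)
{
    long ret;

    pre_checks++;                  /* pre checks */
    ret = orig_function(arg);      /* forward to the original */
    post_checks++;                 /* post checks */
    return ret;
}

void install_patch(void)
{
    assert(orig_function == NULL); /* refuse to patch twice */
    orig_function = demo_table[NR_DEMO];
    demo_table[NR_DEMO] = patch_function;
}

void remove_patch(void)
{
    assert(orig_function != NULL); /* nothing to restore otherwise */
    demo_table[NR_DEMO] = orig_function;
    orig_function = NULL;
}
```

Callers that go through demo_table[NR_DEMO] get the same result before and after install_patch(); only the side effects of the pre and post checks differ.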

pinning a pthread to a single core

I am trying to measure the performance of some library calls. My primary measurement tool is the rdtsc call. After doing some reading I realize that I need to disable preemption and interrupts in order to get the most accurate readings. Can someone help me figure out how to do these? I know that pthreads have a 'set affinity' mechanism. Is that enough to get the job done?
I also read somewhere that I can make calls into the kernel of the sort
preempt_disable()
raw_local_irq_save(...)
Is there any benefit to using one approach over the other? I tried the latter approach and got this error.
error: 'preempt_disable' was not declared in this scope
which can be fixed by including linux/preempt.h but the compiler still complains.
linux/preempt.h: No such file or directory
Obviously I have not done any kernel hacking, and I could not find this file anywhere on my system. I am really hoping I won't have to install a new Linux kernel. :)
Thanks for your input.
Pinning a pthread to a single CPU can be done using pthread_setaffinity_np.
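A minimal sketch of such a call, assuming glibc on Linux (pthread_setaffinity_np is a GNU extension, hence _GNU_SOURCE). The helper name is mine; it pins the calling thread to the lowest-numbered CPU it is currently allowed to run on, rather than assuming that CPU 0 is available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Pin the calling thread to the lowest-numbered CPU in its current
 * affinity mask. Returns that CPU number, or -1 on failure.
 */
int pin_self_to_one_cpu(void)
{
    cpu_set_t allowed, target;
    pthread_t self = pthread_self();

    if (pthread_getaffinity_np(self, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);
        if (pthread_setaffinity_np(self, sizeof target, &target) != 0)
            return -1;
        return cpu;            /* now restricted to this single CPU */
    }
    return -1;                 /* empty mask: should not happen */
}
```

Build with -pthread. Pinning keeps the thread on one core, but it does not stop the kernel from preempting it or from delivering interrupts on that core.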
But what you ultimately want to achieve is not so simple. I'll explain why.
preempt.h is part of the Linux kernel source. It is located here. You need to have the kernel sources available. In any case, you need to write a kernel module to use it; you cannot use it from user space. Learn how to write a kernel module here. The same applies to preempt_disable and the other kernel functions that disable interrupts.
Now the point is that pthreads live in user space and your preemption-disabling function lives in kernel space. How do they interact?
Either you write a new system call of your own that does the preemption and interrupt disabling and call it from user space, or you resort to other kernel/user-space interfaces such as procfs, sysfs, or ioctl.
But I am really skeptical as to how all of this will help you benchmark library functions. You may want to have a look at how performance is typically measured using rdtsc.
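For what it's worth, one common user-space approach needs no kernel code at all: repeat the measurement many times and keep the smallest cycle count, so that runs disturbed by an interrupt or a context switch are simply discarded. A rough sketch, assuming GCC or Clang on x86 (for __rdtsc() from x86intrin.h); sample_work is a made-up stand-in for the library call under test.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 only */

/* Hypothetical workload standing in for the library call under test. */
void sample_work(void)
{
    volatile unsigned sink = 0;

    for (unsigned i = 0; i < 1000; i++)
        sink += i;
}

/*
 * Time fn() `runs` times and return the smallest cycle count observed.
 * Taking the minimum discards runs inflated by interrupts or preemption.
 */
uint64_t min_cycles(void (*fn)(void), unsigned runs)
{
    uint64_t best = UINT64_MAX;

    assert(fn != NULL && runs > 0);

    for (unsigned i = 0; i < runs; i++) {
        uint64_t start = __rdtsc();
        fn();
        uint64_t elapsed = __rdtsc() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Call it as min_cycles(sample_work, 10000) after pinning the thread to one core. Note that rdtsc is not a serializing instruction, so for very short code sequences the numbers include some out-of-order noise.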

Resources