How to skip a system call with ptrace? - linux

I am trying to write a program with ptrace that tracks all system calls made by a child.
Now I have a list of system calls which are forbidden for the child. I am able to track all system calls using ptrace but I just don't know how to skip a particular system call.
Currently my tracking (parent) process gets a signal every time the child enters or exits a system call (PTRACE_SYSCALL). But if the child is trying to enter a prohibited system call, I want to make the child skip that call and move on to the next step. When I do this I also want the child to know that there was a permission denied error, so I will be setting errno = 13. Will that be enough?
Update:
gdb provides this feature of skipping one line. What mechanism does gdb use?
How to achieve that?
UPDATE:
The best way to achieve this with ptrace is to redirect the original system call to some other system call, for example nanosleep(). That call will fail because it receives illegal arguments. Then you just have to change the return code in EAX to -EACCES to pretend that the call failed with a Permission denied error.
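A minimal sketch of that idea for x86-64 follows. It is not the exact nanosleep() redirection described above: as a swapped-in variant it rewrites the syscall number to -1 (no such syscall) at the entry stop, then overwrites the result with -EACCES at the exit stop. The helper name run_child_denying is hypothetical, the register names (orig_rax, rax) are specific to x86-64, and signals sent to the child are simply swallowed to keep the loop short.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __x86_64__
#error "this sketch reads x86-64 registers (orig_rax, rax)"
#endif

/* Hypothetical helper: trace a forked child and deny syscall `denied_nr`.
 * The child issues that syscall once and exits with 0 if it saw EACCES.
 * Returns the child's exit status, or -1 if tracing could not be set up. */
int run_child_denying(long denied_nr)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(2);                        /* tracing not permitted here */
        raise(SIGSTOP);                      /* hand control to the tracer */
        errno = 0;
        long r = syscall(denied_nr);
        _exit((r == -1 && errno == EACCES) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    /* With TRACESYSGOOD, syscall stops report SIGTRAP | 0x80. */
    ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    int at_entry = 1;       /* syscall stops alternate: entry, exit, ... */
    int deny_pending = 0;
    for (;;) {
        /* Resuming with signal 0 swallows any pending signal (sketch only). */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            return -1;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status))
            return WEXITSTATUS(status);
        if (WIFSIGNALED(status))
            return -1;
        if (WSTOPSIG(status) != (SIGTRAP | 0x80))
            continue;                        /* not a syscall stop */

        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) < 0)
            return -1;
        if (at_entry) {
            if ((long)regs.orig_rax == denied_nr) {
                regs.orig_rax = (unsigned long long)-1;  /* kernel skips it */
                ptrace(PTRACE_SETREGS, pid, NULL, &regs);
                deny_pending = 1;
            }
        } else if (deny_pending) {
            regs.rax = (unsigned long long)-EACCES;      /* faked result */
            ptrace(PTRACE_SETREGS, pid, NULL, &regs);
            deny_pending = 0;
        }
        at_entry = !at_entry;
    }
}
```

Calling run_child_denying(SYS_getppid) should return 0 when the child observed errno == EACCES. Note that the tracer writes the negative value into the return register; the libc wrapper in the child is what turns it into errno, so setting errno directly is neither possible nor needed.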

I found two college lectures that mention the inability to abort an initiated system call as a disadvantage of ptrace (the manpage mentions a PTRACE_SYSEMU macro that looks like it could do it, but the newer headers don't have it). Theoretically, you could make use of the ptrace entry and exit stops to counteract the calls you don't want, either by swapping in bogus arguments that'll cause the system call to fail or do nothing, or by injecting code that'll counter a previous system call, but that seems extremely hacky.
On Linux, you should be able to achieve your goal with seccomp:
#include <fcntl.h>
#include <seccomp.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* Install a whitelist filter: any syscall not allowed below fails with ENOSYS.
 * Returns 0 on success, a negative value on failure. */
static int set_security(void)
{
    int rc = -1;
    scmp_filter_ctx ctx;
    struct scmp_arg_cmp arg_cmp[] = { SCMP_A0(SCMP_CMP_EQ, 2) };

    ctx = seccomp_init(SCMP_ACT_ERRNO(ENOSYS));
    /*ctx = seccomp_init(SCMP_ACT_ALLOW);*/
    if (ctx == NULL)
        goto out;
    rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    if (rc < 0)
        goto out;
    rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    if (rc < 0)
        goto out;
    /* Allow write() only when the first argument (the fd) is 1 ... */
    rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                          SCMP_CMP(0, SCMP_CMP_EQ, 1));
    if (rc < 0)
        goto out;
    /* ... or 2 (same kind of rule, passed as an array). */
    rc = seccomp_rule_add_array(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                                arg_cmp);
    if (rc < 0)
        goto out;
    rc = seccomp_load(ctx);
    if (rc < 0)
        goto out;
    /* ... */
out:
    seccomp_release(ctx);
    /* Keep the sign so that the caller's "< 0" check detects failure. */
    return rc;
}

int main(int argc, char *argv[])
{
    int fd;
    const char out_msg[] = "stdout test\n";
    const char err_msg[] = "stderr test\n";

    if (0 > set_security())
        return 1;
    if (0 > write(1, out_msg, sizeof(out_msg)))
        perror("Write stdout");
    if (0 > write(2, err_msg, sizeof(err_msg)))
        perror("Write stderr");
    //This should fail with ENOSYS
    if (0 > (fd = open("/dev/zero", O_RDONLY)))
        perror("open");
    exit(0);
}

If you want to disable a system call, it's probably easiest to use symbol interposition, instead of ptrace. (Assuming you're not aiming for security against malicious binaries. If this is for security reasons, PSKocik's answer shows how to use seccomp).
Make a shared library that provides a gettimeofday function which just sets errno and returns without making a system call.
Use LD_PRELOAD=./my_library.so ./a.out to get it loaded before libc.
This won't work on binaries that statically link libc, or that use inline system calls instead of the libc wrappers (e.g. mov eax, SYS_gettimeofday / syscall). You can disassemble a binary and look for syscall (x86-64) or int 0x80 (i386 ABI) to check for that.
Note that glibc's gettimeofday and clock_gettime implementations usually don't make a real system call; instead they use RDTSC and the vDSO page exported by the kernel to find out how to scale the timestamp counter into a real time, falling back to a real system call only when the kernel's clock source can't be read from user space. (So intercepting the library function is your only reliable option; a strace-style method usually wouldn't catch them anyway.)
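As a rough sketch, the interposition library described above could be as small as the following. The choice of EPERM is arbitrary, and the build command in the comment is an assumption about a typical gcc setup rather than anything the answer specifies.

```c
/* my_library.c
 * Assumed build: gcc -shared -fPIC -o my_library.so my_library.c */
#include <errno.h>
#include <sys/time.h>

/* Replacement for libc's gettimeofday: it never enters the kernel or the
 * vDSO, it only reports failure to the caller. */
int gettimeofday(struct timeval *restrict tv, void *restrict tz)
{
    (void)tv;
    (void)tz;
    errno = EPERM;   /* arbitrary error code chosen for this sketch */
    return -1;
}
```

Because the dynamic linker resolves gettimeofday to the first definition it finds, running LD_PRELOAD=./my_library.so ./a.out makes every libc-routed call in a.out see -1 with errno set to EPERM.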
BTW, failed system calls return negative error values. e.g. on x86-64, rax = -EPERM. The glibc syscall wrappers take care of detecting negative values and setting the errno global variable. So if you are intercepting syscall instructions with ptrace, that's what you need to do.
re:edit: gdb skip line
gdb can skip a line by using ptrace to resume execution in a different place. That only works if you're already stopped there, though. So to use this to "skip" system calls, you'd have to set breakpoints at every system call site you want to block in the whole process.
It doesn't sound like a useful approach. If someone's actively trying to defeat it, they can just JIT-compile some code that makes a system call directly. You could prevent processes from mapping memory that's both writable and executable, and scan it for system calls every time you detect a fault from the process jumping into memory that it requested as executable but that your mechanism had left writable only. (So behind the scenes you catch the hardware-generated exception and flip the page from writable to executable and scan it, or back to writable but not executable.)
This sounds like a lot of kernel hacking to implement correctly, when you could just use seccomp (see the other answer) if you need something that's resistant to workarounds and static binaries.

Related

Is there a linux command (along the lines of cat or dd) that will allow me to specify the offset of the read syscall?

I am working on a homework assignment for an operating systems class, and we are implementing basic versions of certain file system operations using FUSE.
The only operation that we are implementing that I couldn't test to a point I was happy with was the read() syscall. I am having trouble finding a way to get the read() syscall to be called with an offset other than 0.
I tried some of the commands (like dd, head, and tail) mentioned in answers to this question, but by the time that they reached my implementation of the read() syscall the offset was 0. To clarify, when I called these commands I received (at the calling terminal) the bytes in the file that were specified in the calls, but in another terminal that was displaying the syscalls that were being handled by FUSE, and hence my implementations, it displayed that my implementation of the read() syscall was always being called with offset 0 (and usually size of 4096, which I presume is the block size of the real linux file system I am using). I assume that these commands are making read() syscalls in blocks of 4096 bytes, then internally (i.e., within the dd, head, or tail command's code rather than through syscalls) modifying the output to what is seen on the calling terminal.
Is there any command (or script) I can run (or write and then run in the case of the script) that will allow me to test this syscall with varying offset values?
I figured out the issue I was having. For posterity, I will record the answer rather than just delete my question, because the answer wasn't necessarily easy to find.
Essentially, the issue occurred within FUSE. FUSE defaults to not using direct I/O (which is definitely the correct default to have, don't get me wrong), which is what resulted in the reads in size chunks of 4096 (these are the result of FUSE using a page cache of file contents [AKA a file content cache] in the kernel). For what I wanted to test (as explained in the question), I needed to enable direct I/O. There are a few ways of doing this, but the simplest way for me to do this was to pass -o direct_io as a command line argument. This worked for me because I was using the fuse_main call in the main function of my program.
So my main function looked like this:
int main(int argc, char *argv[])
{
return fuse_main(argc, argv, &my_operations_structure, NULL);
}
and I was able to call my program like this (I used the -d option in addition to the -o direct_io option in order to display the syscalls that FUSE was processing and the output/debug info from my program):
./prog_name -d -o direct_io test_directory
Then, I tested my program with the following simple test program (I know I don't do very much error checking, but this program is only for some quick and dirty tests):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
    FILE *file;
    char buf[4096];
    int fd;
    size_t size;
    ssize_t n;

    memset(buf, 0, sizeof(buf));
    if (argc != 4)
    {
        printf("usage: ./readTest [size] [offset] [filename]\n");
        return 0;
    }
    file = fopen(argv[3], "r");
    if (file == NULL)
    {
        printf("Couldn't open file\n");
        return -1;
    }
    fd = fileno(file);
    /* Leave room for the terminating NUL so printf("%s") stays in bounds. */
    size = (size_t) atoi(argv[1]);
    if (size > sizeof(buf) - 1)
        size = sizeof(buf) - 1;
    n = pread(fd, buf, size, (off_t) atoi(argv[2]));
    if (n < 0)
    {
        perror("pread");
        fclose(file);
        return -1;
    }
    printf("%s\n", buf);
    fclose(file);
    return 0;
}

Why a segfault instead of privilege instruction error?

I am trying to execute the privileged instruction rdmsr in user mode, and I expect to get some kind of privilege error, but I get a segfault instead. I have checked the asm and I am loading 0x186 into ecx, which is supposed to be PERFEVTSEL0, based on the manual, page 1171.
What is the cause of the segfault, and how can I modify the code below to fix it?
I want to resolve this before hacking a kernel module, because I don't want this segfault to blow up my kernel.
Update: I am running on Intel(R) Xeon(R) CPU X3470.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <sched.h>
#include <assert.h>
uint64_t
read_msr(int ecx)
{
unsigned int a, d;
__asm __volatile("rdmsr" : "=a"(a), "=d"(d) : "c"(ecx));
return ((uint64_t)a) | (((uint64_t)d) << 32);
}
int main(int ac, char **av)
{
uint64_t start, end;
cpu_set_t cpuset;
unsigned int c = 0x186;
int i = 0;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
assert(sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0);
printf("%" PRIu64 "\n", read_msr(c));
return 0;
}
The question I will try to answer: Why does the above code cause SIGSEGV instead of SIGILL, even though the code contains no memory error but rather an illegal instruction (a privileged instruction called from non-privileged user space)?
I would expect to get a SIGILL with si_code ILL_PRVOPC instead of a segfault, too. Your question is currently 3 years old and today, I stumbled upon the same behavior. I am disappointed too :-(
What is the cause of the segfault
The cause seems to be that the Linux kernel code decides to send SIGSEGV. Here is the responsible function:
http://elixir.free-electrons.com/linux/v4.9/source/arch/x86/kernel/traps.c#L487
Have a look at the last line of the function.
In your follow up question, you got a list of other assembly instructions which get propagated as SIGSEGV to userspace though they are actually general protection faults. I found your question because I triggered the behavior with cli.
and how can I modify the code below to fix it?
As of Linux kernel 4.9, I'm not aware of any reliable way to distinguish between a memory error (what I would expect to be a SIGSEGV) and a privileged instruction error from userspace.
There may be a very hacky and unportable way to distinguish these cases. When a privileged instruction causes a SIGSEGV, the siginfo_t si_code is set to a value which is not directly listed in the SIGSEGV section of man 2 sigaction. The documented values are SEGV_MAPERR, SEGV_ACCERR, SEGV_PKUERR, but I get SI_KERNEL (0x80) on my system. According to the man page, SI_KERNEL is a code "which can be placed in si_code for any signal". In strace, you see SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0}. The responsible kernel code is here.
It would also be possible to grep dmesg for this string.
Please, never ever use those two methods to distinguish between GPF and memory error on a production system.
Specific solution for your code: Just don't run rdmsr from user space. But this answer is really unsatisfying if you are looking for a generic way to figure out why a program received a SIGSEGV.

How to access errno after clone (or: How to set errno location)

Per traditional POSIX, errno is simply an integer lvalue, which works perfectly well with fork, but obviously doesn't work nearly as well with threads. As per pthreads, errno is a thread-local integer lvalue. Under Linux/NPTL, as an implementation detail, errno is some "macro that expands to a function returning an integer lvalue".
On my Debian system, this seems to be *__errno_location (), on some other systems I've seen things like &(gettib()->errnum).
TL;DR
Assuming I've used clone to create a thread, can I just call errno and expect that it will work, or do I have to do some special rain dance? For example, do I need to read some special field in the thread information block, or some special TLS value, or, do I get to set the address of the thread-local variable where the glibc stores the error values somehow? Something like __set_errno_location() maybe?
Or, will it "just work" as it is?
Inevitably, someone will be tempted to reply "simply use pthreads" -- please don't. I do not want to use pthreads. I want clone. I do not want any of the ill-advised functionality of pthreads, and I do not want to deal with any of its quirks, nor do I want the overhead to implement those quirks. I recognize that much of the crud in pthreads comes from the fact that it has to work (and, surprisingly, it successfully works) amongst others for some completely broken systems that are nearly three decades old, but that doesn't mean that it is necessarily a good thing for everyone and every situation. Portability is not of any concern in this case.
All I want in this particular situation is fire up another process running in the same address space as the parent, synchronization via a simple lock (say, a futex), and write working properly (which means I also have to be able to read errno correctly). As little overhead as possible, no other functionality or special behavior needed or even desired.
According to the glibc source code, errno is defined as a thread-local variable. Unfortunately, this requires significant C library support. Any threads created using pthread_create() will be made aware of thread-local variables. I would not even bother trying to get glibc to accept your foreign threads.
An alternative would be to use a different libc implementation that may allow you to extract some of its internal structures and manually set the thread control block if errno is part of it. This would be incredibly hacky and unreliable. I doubt you'll find anything like __set_errno_location(), but rather something like __set_tcb().
#include <bits/some_hidden_file.h>
void init_errno(void)
{
struct __tcb* tcb;
/* allocate a dummy thread control block (malloc may set errno
* so might have to store the tcb on stack or allocate it in the
* parent) */
tcb = malloc(sizeof(struct __tcb));
/* initialize errno */
tcb->errno = 0;
/* set pointer to thread control block (x86) */
arch_prctl(ARCH_SET_FS, tcb);
}
This assumes that the errno macro expands to something like: ((struct __tcb*)__read_fs())->errno.
Of course, there's always the option of implementing an extremely small subset of libc yourself. Or you could write your own implementation of the write() system call with a custom stub to handle errno and have it co-exist with the chosen libc implementation.
#define my_errno /* errno variable stored at some known location */
ssize_t my_write(int fd, const void* buf, size_t len)
{
ssize_t ret;
__asm__ (
/* set system call number */
/* set up parameters */
/* make the call */
/* retrieve return value in c variable */
);
if (ret >= -4095 && ret < 0) {
my_errno = -ret;
return -1;
}
return ret;
}
I don't remember the exact details of GCC inline assembly and the system call invocation details vary depending on platform.
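For reference, one way to fill in the assembly on x86-64 Linux is sketched below. The register constraints follow the x86-64 syscall convention (number in rax, arguments in rdi, rsi, rdx, with rcx and r11 clobbered); other platforms need different code. The helper raw_syscall3 and the plain global my_errno are illustrative stand-ins, not part of any libc, and a single global is shared by every thread, so a real clone-based program would need one errno slot per thread.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>

#ifndef __x86_64__
#error "the register constraints below are for the x86-64 syscall ABI"
#endif

/* Stand-in for "errno stored at some known location". */
int my_errno;

/* Issue a three-argument system call without going through libc. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
}

ssize_t my_write(int fd, const void *buf, size_t len)
{
    long ret = raw_syscall3(SYS_write, fd, (long)buf, (long)len);

    /* The kernel reports errors as values in the range -4095..-1. */
    if (ret < 0 && ret >= -4095) {
        my_errno = (int)-ret;
        return -1;
    }
    return ret;
}
```

For example, my_write(-1, "x", 1) returns -1 and leaves EBADF in my_errno, without touching the libc errno at all.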
Personally, I'd just implement a very small subset of libc, which would just consist of a little assembler and a few constants. This is remarkably simple with so much reference code available out there, although it may be overambitious.
If errno is a thread-local variable, will clone() copy it into the new process's address space? I overrode the errno_location() function around 2001 to use an errno based on the pid.
http://tamtrajnana.blogspot.com/2012/03/thread-safety-of-errno-variable.html
since errno is now defined as "__thread int errno;" (see above comment) this explains how __thread types are handled: Linux's thread local storage implementation

How to get current program counter inside mprotect handler and update it

I want to get the current program counter(PC) value inside mprotect handler. From there I want to increase the value of PC by 'n' number of instruction so that the program will skip some instructions. I want to do all that for linux kernel version 3.0.1. Any help about the data structures where I can get the value of PC and how to update that value? Sample code will be appreciated. Thanks in advance.
My idea is to run some task whenever a particular memory address is written. To do that, I use mprotect to make the address write-protected. When some code tries to write to that address, I will use the mprotect handler to perform some operation. After the handler has done its work, I want the write to succeed. So my plan was to unprotect the memory address inside the handler and then perform the write again. When the code returns from the handler function, the PC will point to the original write instruction, whereas I want it to point to the next instruction. So I want to advance the PC by one instruction, irrespective of instruction length.
Check the following flow
MprotectHandler(){
unprotect the memory address on which the protection fault arose
write it again
set PC to the next instruction of original write instruction
}
inside main function:
main(){
mprotect a memory address
try to write the mprotected address // original write instruction
Other instruction // after mprotect handler execution, PC should point here
}
Since it is tedious to compute the instruction length on several CISC processors, I recommend a somewhat different procedure: Fork using clone(..., CLONE_VM, ...) into a tracer and a tracee thread, and in the tracer instead of
write it again
set PC to the next instruction of original write instruction
do a
ptrace(PTRACE_SINGLESTEP, ...)
- after the trace trap you may want to protect the memory again.
Here is sample code demonstrating the basic principle:
#define _GNU_SOURCE /* for REG_RIP in <sys/ucontext.h> */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ucontext.h>

static void
handler(int signal, siginfo_t* siginfo, void* uap) {
    /* printf is not async-signal-safe; tolerable only in a demo like this. */
    printf("Attempt to access memory at address %p\n", siginfo->si_addr);
    mcontext_t *mctx = &((ucontext_t *)uap)->uc_mcontext;
    greg_t *rip = &mctx->gregs[REG_RIP]; /* x86-64 only (index 16) */
    // Jump past the bad memory write.
    *rip = *rip + 7;
}

static void
dobad(uintptr_t *addr) {
    *addr = 0x998877;
    printf("I'm a survivor!\n");
}

int
main(int argc, char *argv[]) {
    struct sigaction act;
    memset(&act, 0, sizeof(struct sigaction));
    sigemptyset(&act.sa_mask);
    act.sa_sigaction = handler;
    act.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigaction(SIGSEGV, &act, NULL);
    // Write to an address we don't have access to.
    dobad((uintptr_t*)0x1234);
    return 0;
}
It shows you how to update the PC in response to a page fault. It lacks the following which you have to implement yourself:
Instruction length decoding. As you can see I have hardcoded + 7 which happens to work on my 64bit Linux since the instruction causing the page fault is a 7 byte MOV. As Armali said in his answer, it is a tedious problem and you probably have to use an external library like libudis86 or something.
mprotect() handling. You have the address that caused the page fault in siginfo->si_addr and using that it should be trivial to find the address of the mprotected page and unprotect it.
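If the goal is only to let the original write succeed, a simpler variant avoids instruction-length decoding altogether: unprotect the page in the handler and return, so that the CPU re-executes the faulting write. The sketch below assumes Linux and uses made-up names (unprotect_on_fault, write_through_protection); it does not re-protect the page afterwards, which would still need something like the single-step approach mentioned earlier.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t fault_count;
static uintptr_t page_mask;   /* clears the offset-within-page bits */
static size_t page_len;

/* On a write fault, make the containing page writable and return.
 * Returning from the handler restarts the faulting instruction. */
static void unprotect_on_fault(int sig, siginfo_t *info, void *uctx)
{
    (void)sig;
    (void)uctx;
    void *page = (void *)((uintptr_t)info->si_addr & page_mask);
    fault_count++;
    /* mprotect is not on the POSIX async-signal-safe list; this sketch
     * relies on it being a thin syscall wrapper on Linux. */
    if (mprotect(page, page_len, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Hypothetical demo: write `value` into a read-only page.
 * Returns the number of faults handled, or -1 on error. */
int write_through_protection(int value)
{
    page_len = (size_t)sysconf(_SC_PAGESIZE);
    page_mask = ~((uintptr_t)page_len - 1);

    int *p = mmap(NULL, page_len, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = unprotect_on_fault;
    sa.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(p, page_len);
        return -1;
    }

    fault_count = 0;
    *(volatile int *)p = value;              /* faults once, then is retried */
    int ok = (*(volatile int *)p == value);

    sigaction(SIGSEGV, &old, NULL);          /* restore previous handler */
    munmap(p, page_len);
    return ok ? (int)fault_count : -1;
}
```

A call such as write_through_protection(0x998877) is expected to return 1: one fault, after which the retried store lands in the now-writable page.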

What happens if a program makes an OABI style syscall in an EABI-only kernel?

Or more generally, what happens if an swi instruction with an opcode !=0 is executed on such a kernel? Does it produce a signal? I ask because I'd like to trap it.
The code that fields swi instructions is here: http://lxr.linux.no/linux+*/arch/arm/kernel/entry-common.S#L335. I am not an ARM expert, but it appears that the CPU does not stash the swi argument anywhere the kernel can get at it; if the kernel wants to know, it has to fetch the instruction from the calling program's runtime image. This makes every system call more expensive, so (if I'm reading things correctly) the kernel only bothers to find out what the swi argument is if it's compiled with CONFIG_OABI_COMPAT.
EDIT: The ARM ARM confirms that SWI does not do anything useful with its argument. (Physical page 634 / logical page A7-118.)
So I tried to see what would happen. I compiled the following program and ran it:
#include <stdio.h>
#include <signal.h>
void traphandler(int signum, siginfo_t *info, void *context)
{
puts("trap");
}
int main()
{
struct sigaction sa;
sa.sa_sigaction = traphandler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_RESTART | SA_SIGINFO;
sigaction(SIGTRAP, &sa, NULL);
puts("begin");
asm("swi 1");
puts("after swi 1");
asm("swi 255");
puts("after swi 255");
}
and the output was:
begin
after swi 1
after swi 255
The signal handler was not called, nor was the program killed. Quite disappointing.
