Is it possible, somehow, to change the memory map of another process in Linux? As opposed, that is, to only being able to control it by way of code running in the process itself calling mmap.
The reason I'm asking is because I'd like to be able to build a process with a very custom memory map, and without being able to use shared libraries or even the vDSO, I don't see any way to do that inside the process itself that does not involve basically writing my own libc to handle syscalls and such. (Even if I were to link libc statically, wouldn't it attempt to use the vDSO?)
Memory mapped with mmap is preserved across fork but discarded by system calls from the exec family, so this can't be achieved with the classic sequence of fork, set things up, then exec.
A simple solution is to use an LD_PRELOAD hook. Let's place this code in add_mmap.c.
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
void add_mmap(void) __attribute__((constructor));
void add_mmap(void)
{
int fd;
void *addr;
printf("calling mmap() before main...\n");
fd = open("/etc/passwd", O_RDONLY);
printf("fd=%d\n", fd);
/* map the first 100 bytes of the file to an address chosen by the kernel */
addr = mmap(0, 100, PROT_READ, MAP_SHARED, fd, 0);
printf("addr=%llx\n", (long long unsigned)addr);
}
Then, build it as a dynamic library:
gcc -Wall -g -fPIC -shared -o add_mmap.so add_mmap.c
And finally run some existing program with it:
$ LD_PRELOAD=./add_mmap.so /bin/cat
calling mmap() before main...
fd=3
addr=7fe4916f8000
While cat is still running, we can check from another shell that the mapping is in place (it was created before main started; 27967 is cat's PID here):
$ cat /proc/27967/maps
...
7f2f7f2d0000-7f2f7f2d1000 r--s 00000000 09:00 1056387 /etc/passwd
...
EDIT
I only show here how to add a memory mapping before the program starts, but my example can easily be extended to transparently inject a "memory manager" thread into the program. This thread would receive orders through an IPC mechanism such as a socket and manipulate the mappings accordingly; a rough sketch of that idea follows.
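This is not from the original answer, just a minimal sketch of the idea under a few assumptions: the library is preloaded the same way as add_mmap.so, the socket path /tmp/mm.sock and the "map <length>" text protocol are invented for illustration, and error handling is omitted.
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

static void *mm_thread(void *arg)
{
    (void)arg;
    /* listen for commands on a UNIX datagram socket (path is arbitrary) */
    int s = socket(AF_UNIX, SOCK_DGRAM, 0);
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    strncpy(sa.sun_path, "/tmp/mm.sock", sizeof(sa.sun_path) - 1);
    unlink(sa.sun_path);
    bind(s, (struct sockaddr *)&sa, sizeof(sa));

    char buf[128];
    ssize_t n;
    while ((n = recv(s, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[n] = '\0';
        size_t len;
        /* toy protocol: "map <length>" creates an anonymous mapping */
        if (sscanf(buf, "map %zu", &len) == 1) {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            fprintf(stderr, "mapped %zu bytes at %p\n", len, p);
        }
    }
    return NULL;
}

__attribute__((constructor))
static void start_mm_thread(void)
{
    pthread_t t;
    pthread_create(&t, NULL, mm_thread, NULL);
}
Built like the example above but with -lpthread (e.g. gcc -Wall -g -fPIC -shared -o mm_preload.so mm_preload.c -lpthread, file name assumed), any tool that can send datagrams to /tmp/mm.sock can then ask the target process to create mappings; a real implementation would also support munmap/mprotect and authenticate the peer.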
Related
Under what conditions can a user application, or more precisely a process running as root on UNIX or as SYSTEM on Windows, become the target of a buffer overflow attack that runs shellcode? I have seen on the net simple C code such as:
#include <stdio.h>
#include <string.h>
void func(char *name)
{
char buf[100];
strcpy(buf, name); /* no bounds check: overflows buf if name is longer than 99 bytes */
printf("Welcome %s\n", buf);
}
int main(int argc, char *argv[])
{
func(argv[1]);
return 0;
}
It can become the basis for a buffer overflow attack and for running shellcode on Unix. I am focusing my question on the program's or process's permissions.
There are quite a few attack vectors when looking at privileged processes. The first is the setuid bit, where a regular user invokes a setuid application and acquires the effective user ID of the owner of the file. The next way that comes to mind is via Linux capabilities; capabilities are like a more granular version of setuid (see man getcap for more information). One additional way is via network or IPC interfaces (UNIX sockets, named pipes, TCP or UDP sockets, etc.) exposed by long-lived processes (daemons). A small illustration of the setuid case follows.
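Not from the original answer, but a tiny sketch of why the setuid bit matters: if the binary is owned by root and has the setuid bit set (chown root prog; chmod u+s prog), the real and effective user IDs differ, and any code reached through an overflow runs with the effective (privileged) ID.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* with the setuid bit set, these two IDs will differ */
    printf("real uid:      %d\n", (int)getuid());
    printf("effective uid: %d\n", (int)geteuid());
    return 0;
}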
I am trying to write a program with ptrace that tracks all system calls made by a child.
Now I have a list of system calls which are forbidden for the child. I am able to track all system calls using ptrace but I just don't know how to skip a particular system call.
Currently my tracing (parent) process gets a signal every time the child enters or exits a system call (PTRACE_SYSCALL). But if the child is trying to enter a prohibited system call, I want to make the child skip that call and move on to the next step. Also, when I do this I want the child to know that there was a permission-denied error, so I will be setting errno = 13 (EACCES); will that be enough?
Update:
gdb provides this feature of skipping one line. What mechanism does gdb use?
How to achieve that?
UPDATE:
The best way to achieve this with ptrace is to redirect the original system call to some other system call, for example to nanosleep(), which will fail because it receives illegal arguments. Then you just have to change the return code in EAX to -EACCES to pretend the call failed with a permission-denied error. A rough sketch of this idea is shown below.
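Not part of the original post, just a hedged x86-64 sketch of the general technique. Instead of redirecting to nanosleep(), it overwrites the syscall number with the invalid value -1 at the syscall-entry stop and rewrites the return register to -EACCES at the exit stop; SYS_openat is used as the "forbidden" call purely for illustration, and signal-delivery stops and error handling are not dealt with.
#include <errno.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl("/bin/cat", "cat", "/etc/passwd", (char *)NULL);
        _exit(1);
    }

    int status, entering = 1;
    waitpid(child, &status, 0);               /* initial stop after execl */
    for (;;) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        if (entering && regs.orig_rax == SYS_openat) {
            /* entry stop: make the kernel run an invalid syscall number */
            regs.orig_rax = -1;
            ptrace(PTRACE_SETREGS, child, NULL, &regs);
            fprintf(stderr, "blocked an openat call\n");
        } else if (!entering && (long long)regs.orig_rax == -1) {
            /* exit stop: report "Permission denied" to the child */
            regs.rax = -EACCES;
            ptrace(PTRACE_SETREGS, child, NULL, &regs);
        }
        entering = !entering;
    }
    return 0;
}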
I found two college lectures that mention the inability to abort an initiated system call as a disadvantage of ptrace (the man page mentions a PTRACE_SYSEMU request that looks like it could do it, but the newer headers don't have it). Theoretically, you could make use of the ptrace entry and exit stops to counteract the calls you don't want -- by swapping in bogus arguments that will cause the system call to fail or do nothing, or by injecting code that will counter a previous system call -- but that seems extremely hacky.
On Linux, you should be able to achieve your goal with seccomp:
#include <fcntl.h>
#include <seccomp.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
static int set_security(){
int rc = -1;
scmp_filter_ctx ctx;
struct scmp_arg_cmp arg_cmp[] = { SCMP_A0(SCMP_CMP_EQ, 2) };
ctx = seccomp_init(SCMP_ACT_ERRNO(ENOSYS));
/*ctx = seccomp_init(SCMP_ACT_ALLOW);*/
if (ctx == NULL)
goto out;
rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
if (rc < 0)
goto out;
rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
if (rc < 0)
goto out;
rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
SCMP_CMP(0, SCMP_CMP_EQ, 1));
if (rc < 0)
goto out;
rc = seccomp_rule_add_array(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
arg_cmp);
if (rc < 0)
goto out;
rc = seccomp_load(ctx);
if (rc < 0)
goto out;
/* ... */
out:
seccomp_release(ctx);
return -rc;
}
int main(int argc, char *argv[])
{
int fd;
const char out_msg[] = "stdout test\n";
const char err_msg[] = "stderr test\n";
if(0>set_security())
return 1;
if(0>write(1, out_msg, sizeof(out_msg)))
perror("Write stdout");
if(0>write(2, err_msg, sizeof(err_msg)))
perror("Write stderr");
//This should fail with ENOSYS
if(0>(fd=open("/dev/zero", O_RDONLY)))
perror("open");
exit(0);
}
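To build and try it (the file name seccomp_test.c is just an assumption), link against libseccomp:
gcc -Wall seccomp_test.c -o seccomp_test -lseccomp
The two write() tests should succeed, and the open() test should then fail with "Function not implemented" (ENOSYS).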
If you want to disable a system call, it's probably easiest to use symbol interposition, instead of ptrace. (Assuming you're not aiming for security against malicious binaries. If this is for security reasons, PSKocik's answer shows how to use seccomp).
Make a shared library that provides a gettimeofday function which just sets errno and returns without making a system call.
Use LD_PRELOAD=./my_library.so ./a.out to get it loaded before libc.
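A minimal sketch of such an interposer, with the file name fake_gettimeofday.c assumed; it fails every gettimeofday() call with EPERM and never enters the kernel:
#include <errno.h>

/* <sys/time.h> is deliberately not included, so this loose declaration
   does not conflict with the glibc prototype */
struct timeval;

int gettimeofday(struct timeval *tv, void *tz)
{
    (void)tv;
    (void)tz;
    errno = EPERM;   /* pretend the call is not permitted */
    return -1;
}
Build and run:
gcc -Wall -fPIC -shared -o my_library.so fake_gettimeofday.c
LD_PRELOAD=./my_library.so ./a.out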
This won't work on binaries that statically link libc, or that use inline system calls instead of the libc wrappers (e.g. mov eax, SYS_gettimeofday / syscall). You can disassemble a binary and look for syscall (x86-64) or int 0x80 (i386 ABI) to check for that.
Note that glibc's gettimeofday and clock_gettime implementations actually never make a real system call; instead they use RDTSC and the vDSO page exported by the kernel to find out how to scale the timestamp counter into real time. (So intercepting the library function is your only hope; a strace-style method wouldn't catch them anyway.)
BTW, failed system calls return negative error values: e.g. on x86-64, rax = -EPERM. The glibc syscall wrappers take care of detecting negative values and setting the errno global variable. So if you are intercepting syscall instructions with ptrace, that's the convention you need to emulate (e.g. put -EACCES in rax yourself); a quick demonstration follows.
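Not from the original answer, but a small x86-64 illustration (inline assembly assumed): a raw write(2) on an invalid descriptor returns -EBADF in rax, and it is normally the libc wrapper that turns that into a -1 return with errno = EBADF.
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>

int main(void)
{
    long ret;
    char c = 'x';
    /* write(-1, &c, 1) as a raw syscall; syscall clobbers rcx and r11 */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "0"((long)SYS_write), "D"((long)-1), "S"(&c), "d"((long)1)
                     : "rcx", "r11", "memory");
    printf("raw return value: %ld (-EBADF is %d)\n", ret, -EBADF);
    return 0;
}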
re:edit: gdb skip line
gdb can skip a line by using ptrace to resume execution in a different place. That only works if you're already stopped there, though. So to use this to "skip" system calls, you'd have to set breakpoints at every system call site you want to block in the whole process.
It doesn't sound like a useful approach. If someone's actively trying to defeat it, they can just JIT-compile some code that makes a system call directly. You could prevent processes from mapping memory that's both writable and executable, and scan pages for system calls every time you detect a fault from the process jumping into memory that was requested as executable but that your mechanism had left writable. (Behind the scenes you'd catch the hardware-generated exception, scan the page, and flip it from writable to executable, or back to writable but not executable.)
This sounds like a lot of kernel hacking to implement correctly, when you could just use seccomp (see the other answer) if you need something that's resistant to workarounds and static binaries.
I am working on a homework assignment for an operating systems class, and we are implementing basic versions of certain file system operations using FUSE.
The only operation that we are implementing that I couldn't test to a point I was happy with was the read() syscall. I am having trouble finding a way to get the read() syscall to be called with an offset other than 0.
I tried some of the commands (like dd, head, and tail) mentioned in answers to this question, but by the time they reached my implementation of the read() syscall the offset was 0. To clarify: when I ran these commands I received (at the calling terminal) the bytes of the file that the calls asked for, but another terminal displaying the syscalls handled by FUSE, and hence by my implementations, showed that my read() implementation was always being called with offset 0 (and usually a size of 4096, which I presume is the block size of the real Linux file system I am using). I assume these commands make read() syscalls in 4096-byte blocks and then internally (i.e., within the dd, head, or tail command's code rather than through syscalls) trim the output to what is seen on the calling terminal.
Is there any command (or script) I can run (or write and then run in the case of the script) that will allow me to test this syscall with varying offset values?
I figured out the issue I was having. For posterity, I will record the answer rather than just delete my question, because the answer wasn't necessarily easy to find.
Essentially, the issue occurred within FUSE. FUSE defaults to not using direct I/O (which is definitely the correct default, don't get me wrong), and that is what caused the reads to arrive in 4096-byte chunks (they are the result of FUSE using a page cache of file contents, i.e. a file content cache, in the kernel). For what I wanted to test (as explained in the question), I needed to enable direct I/O. There are a few ways of doing this, but the simplest for me was to pass -o direct_io as a command line argument. This worked because I was using the fuse_main call in the main function of my program.
So my main function looked like this:
int main(int argc, char *argv[])
{
return fuse_main(argc, argv, &my_operations_structure, NULL);
}
and I was able to call my program like this (I used the -d option in addition to the -o direct_io option in order to display the syscalls that FUSE was processing along with the output/debug info from my program):
./prog_name -d -o direct_io test_directory
Then, I tested my program with the following simple test program (I know I don't do very much error checking, but this program is only for some quick and dirty tests):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
int main(int argc, char *argv[])
{
FILE * file;
char buf[4096];
int fd;
memset(&buf[0], 0, sizeof(buf));
if (argc != 4)
{
printf("usage: ./readTest [size] [offset] [filename]\n");
return 0;
}
file = fopen(argv[3], "r");
if (file == NULL)
{
printf("Couldn't open file\n");
return -1;
}
fd = fileno(file);
pread(fd, (void *) buf, atoi(argv[1]), (off_t) atoi(argv[2])); /* size should stay below sizeof(buf) so the printf below sees a NUL-terminated string */
printf("%s\n", buf);
return 0;
}
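With the file system mounted with -o direct_io, a call such as the following (the file name is just a placeholder) should reach the FUSE read() implementation with a non-zero offset:
./readTest 100 4096 test_directory/some_file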
I am trying to execute the privileged instruction rdmsr in user mode, and I expect to get some kind of privilege error, but I get a segfault instead. I have checked the asm and I am loading 0x186 into ecx, which is supposed to be PERFEVTSEL0, based on the manual, page 1171.
What is the cause of the segfault, and how can I modify the code below to fix it?
I want to resolve this before hacking a kernel module, because I don't want this segfault to blow up my kernel.
Update: I am running on Intel(R) Xeon(R) CPU X3470.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <sched.h>
#include <assert.h>
uint64_t
read_msr(int ecx)
{
unsigned int a, d;
__asm __volatile("rdmsr" : "=a"(a), "=d"(d) : "c"(ecx));
return ((uint64_t)a) | (((uint64_t)d) << 32);
}
int main(int ac, char **av)
{
uint64_t start, end;
cpu_set_t cpuset;
unsigned int c = 0x186;
int i = 0;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
assert(sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0);
printf("%lu\n", read_msr(c));
return 0;
}
The question I will try to answer: why does the above code cause SIGSEGV instead of SIGILL, even though the code has no memory error, just an illegal instruction (a privileged instruction executed from unprivileged user space)?
I would expect to get a SIGILL with si_code ILL_PRVOPC instead of a segfault, too. Your question is currently 3 years old, and today I stumbled upon the same behavior. I am disappointed too :-(
What is the cause of the segfault
The cause seems to be that the Linux kernel code decides to send SIGSEGV. Here is the responsible function:
http://elixir.free-electrons.com/linux/v4.9/source/arch/x86/kernel/traps.c#L487
Have a look at the last line of the function.
In your follow-up question, you got a list of other assembly instructions which get propagated as SIGSEGV to userspace even though they actually cause general protection faults. I found your question because I triggered the same behavior with cli.
and how can I modify the code below to fix it?
As of Linux kernel 4.9, I'm not aware of any reliable way to distinguish between a memory error (what I would expect to be a SIGSEGV) and a privileged instruction error from userspace.
There may be a very hacky and unportable way to distinguish these cases. When a privileged instruction causes a SIGSEGV, the siginfo_t si_code is set to a value that is not directly listed in the SIGSEGV section of man 2 sigaction. The documented values are SEGV_MAPERR, SEGV_ACCERR, and SEGV_PKUERR, but I get SI_KERNEL (0x80) on my system. According to the man page, SI_KERNEL is a code "which can be placed in si_code for any signal". In strace, you see SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0}. The responsible kernel code is here. A small sketch of the si_code check is shown below.
It would also be possible to grep dmesg for this string.
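Not from the original answer, only a hedged illustration of the si_code check on x86-64 Linux: install a SIGSEGV handler with SA_SIGINFO and look for SI_KERNEL when a privileged instruction such as rdmsr faults.
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    static const char gpf[] = "SIGSEGV with SI_KERNEL (likely a general protection fault)\n";
    static const char mem[] = "SIGSEGV with a memory-error si_code\n";
    if (si->si_code == SI_KERNEL)
        write(2, gpf, sizeof(gpf) - 1);
    else
        write(2, mem, sizeof(mem) - 1);
    _exit(1);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    unsigned int a, d, c = 0x186;
    __asm__ volatile("rdmsr" : "=a"(a), "=d"(d) : "c"(c));  /* faults here in user space */
    printf("0x%x%08x\n", d, a);   /* not reached without ring 0 privileges */
    return 0;
}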
Please, never ever use those two methods to distinguish between GPF and memory error on a production system.
Specific solution for your code: Just don't run rdmsr from user space. But this answer is really unsatisfying if you are looking for a generic way to figure out why a program received a SIGSEGV.
I want to play with the kernel linked list before I use it in some part of the kernel code. But if I just include list.h, it doesn't work due to dependencies.
How can I write code using the list in a single .c file, e.g. test.c, so that I can test my code just by compiling test.c? Looking forward to hearing from you soon.
Also, how can I use a nested linked list?
You can get a userspace port from http://www.mcs.anl.gov/~kazutomo/list/list.h.
It says:
Here is a recipe to cook list.h for a user space program:
1. copy list.h from linux/include/list.h
2. remove:
- #ifdef __KERNEL__ and its #endif
- all #include lines
- prefetch() and rcu related functions
3. add the offsetof() and container_of() macros
The list is not meant to be used in userspace, since it is made for in-kernel use and has several dependencies on kernel types and headers. You can see this by compiling your code with the correct include paths:
gcc -I path-to-kernel-src/include/ test.c
When test.c contains this code:
#include <stdio.h>
#include <stdlib.h>
#include <linux/list.h>
int main(int argc, char **argv) { }
It fails to compile since there are includes in list.h which conflict with the userspace includes (stdlib.h).
Nevertheless, the dependencies of a data structure like the list are pretty small. You need to sort them out in order to decouple list.h from the other kernel headers. In a short test, I removed the #include lines from list.h and added the data types struct list_head/hlist_head and hlist_node myself; a minimal self-contained sketch of the result is shown below.
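Not part of the original answer, just a hedged, stripped-down sketch of what the "cooked" list.h boils down to: the offsetof()/container_of() trick plus a doubly linked struct list_head embedded in the user's structure. The real header provides many more operations.
#include <stdio.h>
#include <stddef.h>   /* offsetof */

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

struct item {
    int value;
    struct list_head node;   /* embedded link, as in kernel code */
};

int main(void)
{
    struct list_head items = LIST_HEAD_INIT(items);
    struct item a = { .value = 1 }, b = { .value = 2 };
    struct list_head *pos;

    list_add_tail(&a.node, &items);
    list_add_tail(&b.node, &items);

    list_for_each(pos, &items)
        printf("%d\n", container_of(pos, struct item, node)->value);
    return 0;
}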