How to prohibit system calls, GNU/Linux

I'm currently working on the back-end of an ACM-like public programming contest system. In such a system, any user can submit source code, which will be compiled and run automatically (meaning no human-eye pre-moderation is performed) in an attempt to solve some computational problem.
The back-end is a dedicated GNU/Linux machine, where a user will be created for each contestant, all such users belonging to a users group. Sources sent by any particular user will be stored in that user's home directory, then compiled and executed to be verified against various test cases.
What I want is to prohibit the use of Linux system calls in these sources. That's because the problems require platform-independent solutions, while allowing system calls in untrusted source code is a potential security breach. Such sources may still be placed in the filesystem and even compiled, but they must never be run. I also want to be notified whenever a source containing system calls is submitted.
So far, I see the following places where such a checker could be placed:
Front-end/pre-compilation analysis - the source has been checked into the system, but not yet compiled. A simple text scan against system call names. A platform-dependent, compiler-independent, language-dependent solution.
Compiler patch - make GCC (or any other compiler included in the tool-chain) fail whenever a system call is encountered. A platform-dependent, compiler-dependent, language-independent solution (if we place the checker "far enough" back in the pipeline). Compatibility may also be lost. In fact, I dislike this alternative the most.
Run-time checker - whenever a system call is invoked by the process, terminate the process and report it. This solution is compiler- and language-independent, but platform-dependent - I'm OK with that, since I will deploy the back-end on similar platforms in the short and mid term.
So the question is: does GNU/Linux give an administrator a way to prohibit system call usage for a user group, a user, or a particular process? It could be a security policy or a lightweight GNU utility.
I tried to Google, but Google disliked me today.

mode 1 seccomp allows a process to limit itself to exactly four syscalls: read, write, sigreturn, and _exit. This can be used to severely sandbox code, as seccomp-nurse does.
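For illustration, here is a minimal sketch of entering strict mode (the message string is made up; note that glibc's _exit() wrapper actually invokes exit_group(2), which strict mode forbids, hence the raw exit(2) syscall at the end):

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

int main(void)
{
    /* from here on, only read(2), write(2), _exit(2) and sigreturn(2)
     * are permitted; any other syscall kills the process with SIGKILL */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0) {
        perror("prctl");
        return 1;
    }

    const char msg[] = "still alive inside the sandbox\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);

    /* glibc's _exit() calls exit_group(2), so issue the raw syscall */
    syscall(SYS_exit, 0);
}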
mode 2 seccomp (at the time of writing, found in Ubuntu 12.04 or patch your own kernel) provides more flexibility in filtering syscalls. You can, for example, first set up filters, then exec the program under test. Appropriate use of chroot or unshare can be used to prevent it from re-execing anything else "interesting".
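As a rough sketch of mode 2, the following installs a filter that kills the process if it calls fork(2) and allows everything else; a real contest sandbox would use a whitelist instead, and __NR_fork is only an example (it does not even exist on every architecture):

#include <stddef.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
    struct sock_filter filter[] = {
        /* load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* kill the process if the syscall is fork(2) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_fork, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* allow everything else */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof filter / sizeof filter[0],
        .filter = filter,
    };

    /* required so an unprivileged process may install the filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("PR_SET_NO_NEW_PRIVS");
        return 1;
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("PR_SET_SECCOMP");
        return 1;
    }

    /* the exec of the program under test would go here */
    return 0;
}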

I think you need to define "system call" better. I mean,
cat <<EOF > hello.c
#include <stdio.h>
int main(int argc, char **argv) {
    fprintf(stdout, "Hello world!\n");
    return 0;
}
EOF
gcc hello.c
strace -q ./a.out
demonstrates that even an apparently trivial program makes ~27 system calls.
You (I assume) want to allow calls to the "standard C library", but those in turn will be implemented in terms of system calls. I guess what I'm trying to say is that run-time checking is less feasible than you might think (using strace or similar anyway).


Why do many Linux distros use setuid instead of capabilities?

capabilities(7) are a great way to avoid giving full root privileges to a process, and AFAIK they can be used instead of setuid(2). According to this and many other sources,
"Unfortunately, still many binaries have the setuid bit set, while they should be replaced with capabilities instead."
As a simple example, on Ubuntu,
$ ls -l `which ping`
-rwsr-xr-x 1 root root 44168 May 8 2014 /bin/ping
As you know, setting the suid/sgid bit on a (root-owned) file changes the effective user ID to root. So if the suid-enabled program contains a flaw, a non-privileged user can break out and become the equivalent of the root user.
My question is: why do many Linux distributions still use the setuid method when setting capabilities could be used instead, with fewer security concerns?
This may not give the reason why some dudes somewhere decided one way or another, but some auditing tools and interfaces may not yet know about capabilities.
An example is the proc_connector netlink interface and the programs based on it (like forkstat): there are events for a process changing its credentials, but not for it changing its capabilities.
FWIW, the reason why you may not get, e.g., a net_raw+ep ping(8) instead of a setuid one on a Debian-like distro is that this depends on the setcap(8) utility from the libcap2-bin package already being present before you install ping. From iputils-ping.postinst:
if command -v setcap > /dev/null; then
    if setcap cap_net_raw+ep /bin/ping; then
        chmod u-s /bin/ping
    else
        echo "Setcap failed on /bin/ping, falling back to setuid" >&2
        chmod u+s /bin/ping
    fi
else
    echo "Setcap is not installed, falling back to setuid" >&2
    chmod u+s /bin/ping
fi
Also notice that ping itself will drop any setuid privileges and switch to using capabilities on Linux upon starting, so your concerns about it may be a bit exaggerated. From ping.c:
int
main(int argc, char **argv)
{
    struct addrinfo hints = { .ai_family = AF_UNSPEC,
                              .ai_protocol = IPPROTO_UDP,
                              .ai_socktype = SOCK_DGRAM,
                              .ai_flags = getaddrinfo_flags };
    struct addrinfo *result, *ai;
    int status;
    int ch;
    socket_st sock4 = { .fd = -1 };
    socket_st sock6 = { .fd = -1 };
    char *target;

    limit_capabilities();
From ping_common.c:
void limit_capabilities(void)
{
    ...
    if (setuid(getuid()) < 0) {
        perror("setuid");
        exit(-1);
    }
Consider a different planet where capabilities patches were rejected, the submitter had to try harder to make a neighborly improvement, and came up with this:
filesystems can be mounted suid, nosuid, or capsuid.
if mounted capsuid, setuid bits on ${file} don't count unless .${file}.suid_capability also exists, is setuid, and is either 0-length or parseable in some standardized format:
-r-sr-xr-x. 1 carton carton 3812 Dec 27 15:39 t1
-r-Sr--r--. 1 carton carton 0 Dec 27 15:39 .t1.suid_capability
if the file is empty, the setuid bit works as normal. If it has parseable contents, it works as partial-root capabilities.
if desired the .t1.suid_capability files can be annotated with fields for:
binding to a hash of their matching 't1' file
binding to an absolute path, so for example a setuid file within a chroot does not become "hot" until you chroot into it.
binding to a public key or hash nonce loaded at boot identifying the installed system like a uuid. If the private key or nonce is hidden from a backup server, traditional password resets would still work but some forms of rootkit persistence would become more difficult. It also lets you work with images of other systems in subdirectories, automatically making them effectively "nosuid"-mounted even if you are undisciplined about mount options and elect not to use the absolute path feature.
Now answer some questions about this planet relative to our own:
any missing functionality compared to our native planet?
which system works better with 'ls', 'tar', 'rsync', 'cpio', 'find', 'pax', 'mtree'? with gentoo's ebuild sandbox? with LXC?
which system works better with diskless or guest systems running on NFS or 9p roots?
which system works better with tripwire?
Now repeat those questions considering three systems instead of two:
our planet: mysterious capabilities metadata in filesystems
the alternate planet using the civilized Unix approach
mosvy's 'ping' example, where the binary drops privileges right after it starts
With the third system we are finally starting to lose features relative to the status quo: someone could write a hypothetical capabilities-aware tripwire that works less well in the 'ping' example. Has anyone done that, though? . . . anyone outside the NSA? Have the people forcing this complexity on us done the work to deliver (relatively small) value from the complexity? And if that's the feature-hill we want to die on (as opposed to reducing attack surface), is there maybe a better way to get it, like dm-verity, that once again moots the complexity?
Now let's make a slight refinement to mosvy's 'ping':
crt.o or some early runtime startup code reads the sidecar file and /etc/suid_capability_hash_nonce. It drops suid if /etc/suid_capability_hash_nonce exists but the sidecar file does not (in "alternate planet" terms, the existence of /etc/suid_capability_hash_nonce makes the whole system '-o capsuid'-mounted). The early runtime startup code handles parsing the sidecars and dropping to the enumerated capabilities so that logic doesn't have to be coded into main.c one-by-one.
Honestly, I think the 'ping' example is superior because programs know what capabilities they need, and the need changes only when the program's source code changes. By setting them in a drop_privileges()-style function the need will never get out of sync. Fanned-out distributions won't have to scramble to update capability sets.
If you disagree and want the sidecar/capabilities-style system, something equivalent could have been implemented without tampering with the kernel, mount options, bootup, or any of it: old dracut-nfsroot initrds would keep working. It is just a matter of style.
This is all a Socratic way of saying that capabilities offer nothing beyond dm-verity + 'ping'-style privilege-dropping. There is nothing "unfortunate" about eschewing complexity that's not pulling its weight.
Every few years CADT developers invent some new form of metadata to cram into filesystem directories: Finder metadata, POSIX ACLs, SELinux contexts, NFS ACLs. What will they think of next? How long will it take to update all the filesystems, all the network protocols, all the fileutils tools, all the non-Linux storage operating systems, all the fancy licensed "enterprise" tape backup suites? Will anyone ever get around to updating all that stuff, or will we just limp along with substandard tools? Will even the core feature be documented usefully, or will it be the undocumented plaything of a few annoying "distributions"? Will the feature work most of the time, but stop systems from booting unexpectedly and create chicken-and-egg recovery problems? Will the feature actually stop any attacks?
Is it fair to all the people who have to clean up after them and tolerate complexity in their own systems that made different, simpler choices, just to unbreak interop that their new feature broke?
The value delivered is particularly low in this case, but we have enough experience to set the value bar high on this type of proposal. I think it was a mistake to accept the feature into Linux and applaud any distribution that avoids it.
Capability support was really enabled in 2008 when file capabilities were added to the kernel. So, that's 13 years and counting to replace setuid. Your question is very valid.
Had capability support not been implemented in a backward-compatible way in Linux, it would either have been rejected at birth, or things would surely have changed by now! I suspect it all comes down to the fact that there is no economic incentive to adopt capabilities when their benefit is only evident temporarily, when a bug in some code is found to be exploitable.
I think that people are resigned to all the ways that setuid binaries seem to be exploitable and the point fixes that people quickly roll out when another vulnerability surfaces. Buffer overflow -> launch shell -> root exploit -> code fix -> start over. They are comfortable with the idea that user identity = privilege (i.e., root). This spills into how people want to describe the futility of breaking down all-powerful root into independent capabilities, when a chain of exploits can yield all privilege. Clearly, the pervasive idea that privilege and identity should be equivalent is why Ambient capabilities even exist.
Capabilities, however, when used as they were intended to be used are not identity = privilege. They are capable binaries = privilege - where the privilege is reduced relative to a user identity executing arbitrary code, by the combination of those capability bits and the code in the actual program that has them.
If you write code that edits files, it is clear it shouldn't need to worry about being directly abused for raw ethernet packet formation, or loading kernel modules. Exploiting a bug in that code may well allow a malicious edit of a file but, unlike setuid, it won't allow strange packets to be sent on the network. At least not without tricking the code of some independently capable executable into doing it by proxy.
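As a hedged illustration of that model (not from the original post): a file-editing tool could shed every privilege except CAP_DAC_OVERRIDE using libcap (link with -lcap), assuming that capability is already in its permitted set:

#include <stdio.h>
#include <sys/capability.h>   /* libcap; link with -lcap */

int main(void)
{
    cap_value_t keep[] = { CAP_DAC_OVERRIDE };
    cap_t caps = cap_init();   /* an empty capability set */

    cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET);
    cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET);
    if (cap_set_proc(caps) != 0)   /* drop everything else */
        perror("cap_set_proc");
    cap_free(caps);

    /* this process can now bypass file permission checks, but it can
     * no longer form raw packets, load kernel modules, and so on */
    return 0;
}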
However, no one intentionally writes buggy code, so I think the answer to your question comes down to another question: "where exactly is the benefit of limiting how exploitable an exploit is?".
I suspect that if some distribution were to figure out how to restructure their code base to eliminate setuid-root binaries, in favor of file capabilities, it would fare better (certainly no worse) over time as code exploits are found vs. other distributions clinging to setuid-root. But, until such a distribution comes into being, I can't fault the idea that that is just an opinion.

Quick questions about Linux kernel modules

I'm very familiar with Linux (I've been using it for 2 years, no Windows for 1 1/2 years), and I'm finally digging deeper into kernel programming and working on a project. So my questions are:
Will a kernel module run faster than a traditional C program?
How can I communicate with a module (is that even possible), for example, call a function in it?
1. Will a kernel module run faster than a traditional C program?
It Depends™
Running as a kernel module means you get to play by different rules, you potentially get to avoid some context switches depending on what you are doing. You get access to some powerful tools that can be leveraged to optimize your code, but don't expect your code to run magically faster just by throwing everything in kernelspace.
2. How can I communicate with a module (is that even possible), for example, call a function in it?
There are various ways:
You can use the various file system interfaces: procfs, sysfs, debugfs, sysctl, ...
You could register a char device
You can make use of the Netlink interface
You could create new syscalls, although that's heavily discouraged
And you can always come up with your own scheme, or use some less common APIs
Will a kernel module run faster than a traditional C program?
The kernel is already a C program, most likely compiled with the same compiler you use. So generic algorithms or processor-intensive computations will execute at almost the same speed.
But most userspace programs (like bash) have to ask the kernel to perform operations on system resources, e.g. printing the prompt to the monitor. That requires entering the kernel with a system call, sending data over the tty interfaces, and passing it to a video driver, which may introduce some latency. If you implemented bash in-kernel, you could call the video driver directly, which is definitely faster.
That approach, however, has drawbacks. First of all, bash should be able to print its prompt on an ssh session or a serial console, and that will complicate the logic. Also, if your in-kernel bash hangs, you cannot just kill it; you have to reboot the system.
How can I communicate with a module (is that even possible), for example, call a function in it?
In addition to the excellent list provided by tux3, I would suggest starting with char devices.
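For instance, here is a minimal sketch of the char-device route using the kernel's misc-device framework (the name hello_chardev and the message are made up):

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>

static ssize_t hello_read(struct file *f, char __user *buf,
                          size_t len, loff_t *off)
{
    static const char msg[] = "hello from kernelspace\n";
    return simple_read_from_buffer(buf, len, off, msg, sizeof msg - 1);
}

static const struct file_operations hello_fops = {
    .owner = THIS_MODULE,
    .read  = hello_read,
};

static struct miscdevice hello_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "hello_chardev",   /* appears as /dev/hello_chardev */
    .fops  = &hello_fops,
};

static int __init hello_init(void)
{
    return misc_register(&hello_dev);
}

static void __exit hello_exit(void)
{
    misc_deregister(&hello_dev);
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");

After insmod, reading /dev/hello_chardev (e.g. with cat) returns the message; adding a .write handler the same way gives userspace a simple path for sending data to the module.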

How do I ensure my Linux program doesn't produce core dumps?

I've got a program that keeps security-sensitive information (such as private keys) in memory, since it uses them over the lifetime of the program. Production versions of this program set RLIMIT_CORE to 0 to try to ensure that a core dump that might contain this sensitive information is never produced.
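The in-process setup just described looks something like this (a minimal sketch; the comment marks where the sensitive work would go):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* forbid core files of any size; the hard limit of 0 also
     * prevents child processes from raising it again */
    struct rlimit rl = { .rlim_cur = 0, .rlim_max = 0 };
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* ... load private keys and do the real work here ... */
    return 0;
}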
However, while this isn't mentioned in the core(8) manpage, the apport documentation on the Ubuntu wiki claims,
Note that even if ulimit is set to disable core files (by specifying a core file size of zero using ulimit -c 0), apport will still capture the crash.
Is there a way within my process (i.e., without relying on configuration of the system external to it) that I can ensure that a core dump of my process is never generated?
Note: I'm aware that there are plenty of methods (such as those mentioned in the comments below) where a user with root or process owner privileges could still access the sensitive data. What I'm aiming at here is preventing unintentional exposure of the sensitive data through it being saved to disk, being sent to the Ubuntu bug tracking system, or things like that. (Thanks to Basile Starynkevitch for making this explicit.)
According to the POSIX spec, core dumps only happen in response to signals whose action is the default action and whose default action is to "terminate the process abnormally with additional actions".
So, if you scroll down to the list in the description of signal.h, everything with an "A" in the "Default Action" column is a signal you need to worry about. Use sigaction to catch all of them and just call exit (or _exit) in the signal handler.
I believe these are the only ways POSIX lets you generate a core dump. Conceivably, Linux might have other "back doors" for this purpose; unfortunately, I am not enough of a kernel expert to be sure...
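Here is a minimal sketch of the sigaction approach described above (exiting with 128+signal is just a convention, not a requirement):

#include <signal.h>
#include <string.h>
#include <unistd.h>

static void no_core_handler(int sig)
{
    _exit(128 + sig);   /* terminate without a core dump */
}

int main(void)
{
    /* the signals POSIX marks "A": abnormal termination with
     * additional actions, i.e. a possible core dump */
    static const int core_sigs[] = {
        SIGABRT, SIGBUS, SIGFPE, SIGILL, SIGQUIT,
        SIGSEGV, SIGSYS, SIGTRAP, SIGXCPU, SIGXFSZ,
    };
    struct sigaction sa;
    unsigned i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = no_core_handler;
    sigfillset(&sa.sa_mask);   /* block other signals in the handler */

    for (i = 0; i < sizeof core_sigs / sizeof core_sigs[0]; i++)
        sigaction(core_sigs[i], &sa, NULL);

    /* ... rest of the program ... */
    return 0;
}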

Kernel module to monitor syscalls?

I would like to create a kernel module from scratch that latches to a user session and monitors each system call made by processes belonging to that user.
I know what everyone is thinking - "use strace" - but I'd like to have some of my own logging and analysis with the data I collect, and strace has some issues - an application could use "mmap" to write to a file without the file contents ever appearing as the arguments of an "open" system call, or an application without any write permission may create coredumps to copy sensitive data.
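(For concreteness, the mmap trick looks something like this; target.txt is a made-up name, and the file is assumed to already exist and be non-empty:)

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("target.txt", O_RDWR);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0 || st.st_size < 7)
        return 1;

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* the file's contents change, yet strace shows no write(2) */
    memcpy(p, "patched", 7);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}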
I want to be able to handle these special cases and do some of my own logging. I wonder though - how can I route all syscalls through my module? Is there any way to do that without touching the kernel code?
Thanks
I don't have the exact answer to your question, but I read a paper a couple of days ago that may be useful for you:
http://www.cse.iitk.ac.in/users/moona/students/Y2157230.pdf/
I have done something similar in the past by using a kernel module to patch the system call table. Each patched function did something like the following:
static asmlinkage long (*origFunction)(/*params*/);

static asmlinkage long patchFunction(/*params*/)
{
    long ret;

    /* pre checks */
    ret = origFunction(/*params*/);
    /* post checks */

    return ret;
}
Note that when you start mucking around in the kernel data structures, your module becomes version dependent. The kernel module will probably have to be compiled for the specific kernel version you are installing on.
Also note, this is a technique employed by many rootkits so if you have security software installed it may try to prevent you from doing something like this.

How to test the kernel for kernel panics?

I am testing the Linux Kernel on an embedded device and would like to find situations / scenarios in which Linux Kernel would issue panics.
Can you suggest some test steps (manual or code automated) to create Kernel panics?
There's a variety of tools that you can use to try to crash your machine:
crashme tries to execute random code; this is good for testing process lifecycle code.
fsx is a tool to try to exercise the filesystem code extensively; it's good for testing drivers, block io and filesystem code.
The Linux Test Project aims to create a large repository of kernel test cases; it might not be designed to crash systems in particular, but it may go a long way towards helping you and your team keep everything working as planned. (Note that the LTP isn't proscriptive -- the kernel community doesn't treat their tests as anything important -- but the LTP team tries very hard to be descriptive about what the kernel does and doesn't do.)
If your device is network-connected, you can run nmap against it, using a variety of scanning options: -sV --version-all will try to find versions of all services running (this can be stressful), and -O --osscan-guess will try to determine the operating system by throwing strange network packets at the machine and guessing from the responses what the OS is.
The nessus scanning tool also does version identification of running services; it may or may not offer any improvements over nmap, though.
You can also hand your device to users; they figure out the craziest things to do with software, they'll spot bugs you'd never even think to look for. :)
You can try the following key combination:
SysRq + c
or
echo c >/proc/sysrq-trigger
Crashme has been known to find unknown kernel panic situations, but it must be run in a potent way that creates a variety of signal exceptions handled within the process and a variety of process exit conditions.
The main purpose of the messages generated by Crashme is to determine if sufficiently interesting things are happening to indicate possible potency. For example, if the mprotect call is needed to allow memory allocated with malloc to be executed as instructions, and if you don't have the mprotect enabled in the source code crashme.c for your platform, then Crashme is impotent.
It seems that operating systems on x64 architectures tend to have execution turned off for data segments. Recently I updated the crashme.c on http://crashme.codeplex.com/ to use mprotect in the case of __APPLE__ and tested it on a MacBook Pro running Mac OS X Lion. This is the first serious update to Crashme since 1994. Expect to see updated CentOS and FreeBSD support soon.
