Why do many Linux distros use setuid instead of capabilities? - linux

capabilities(7) are a great way to avoid granting a process all root privileges, and AFAIK they can be used instead of setuid(2). According to this and many other sources,
"Unfortunately, still many binaries have the setuid bit set, while they should be replaced with capabilities instead."
As a simple example, on Ubuntu,
$ ls -l `which ping`
-rwsr-xr-x 1 root root 44168 May 8 2014 /bin/ping
As you know, setting the suid/sgid bit on a root-owned file changes the effective user ID of the process to root. So if a suid-enabled program contains a flaw, a non-privileged user can break out and become the equivalent of the root user.
My question is: why do many Linux distributions still use the setuid method when capabilities could be used instead, with fewer security concerns?

This may not give the reason why some dudes somewhere decided one way or another, but some auditing tools and interfaces may not yet know about capabilities.
An example is the proc_connector netlink interface and the programs based on it (like forkstat): there are events for a process changing its credentials, but not for it changing its capabilities.
FWIW, the reason why you may get a setuid ping(8) instead of a net_raw+ep one on a Debian-like distro is that it depends on the setcap(8) utility from the libcap2-bin package already being installed before you install ping. From iputils-ping.postinst:
if command -v setcap > /dev/null; then
    if setcap cap_net_raw+ep /bin/ping; then
        chmod u-s /bin/ping
    else
        echo "Setcap failed on /bin/ping, falling back to setuid" >&2
        chmod u+s /bin/ping
    fi
else
    echo "Setcap is not installed, falling back to setuid" >&2
    chmod u+s /bin/ping
fi
Also notice that ping itself will drop any setuid privileges and switch to using capabilities on Linux upon starting, so your concerns about it may be a bit exaggerated. From ping.c:
int
main(int argc, char **argv)
{
    struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_protocol = IPPROTO_UDP,
                              .ai_socktype = SOCK_DGRAM, .ai_flags = getaddrinfo_flags };
    struct addrinfo *result, *ai;
    int status;
    int ch;
    socket_st sock4 = { .fd = -1 };
    socket_st sock6 = { .fd = -1 };
    char *target;

    limit_capabilities();
From ping_common.c:
void limit_capabilities(void)
{
    ...
    if (setuid(getuid()) < 0) {
        perror("setuid");
        exit(-1);
    }
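The complementary path, keeping CAP_NET_RAW while shedding everything else, comes down to a few libcap calls. Here is a minimal sketch of that logic (illustrative only, not the actual iputils code; link with -lcap):

#include <stdio.h>
#include <stdlib.h>
#include <sys/capability.h>

/* Illustrative only: keep CAP_NET_RAW in the permitted and effective
 * sets and drop every other capability of the current process. */
static void drop_to_net_raw(void)
{
    cap_value_t keep[] = { CAP_NET_RAW };
    cap_t caps = cap_init();            /* starts with all flags cleared */

    if (caps == NULL ||
        cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET) != 0 ||
        cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET) != 0 ||
        cap_set_proc(caps) != 0) {      /* apply to the current process */
        perror("cap_set_proc");
        exit(1);
    }
    cap_free(caps);
}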

Consider a different planet where the capabilities patches were rejected, the submitter had to try harder to make a neighborly improvement, and came up with this:
filesystems can be mounted suid, nosuid, or capsuid.
if mounted capsuid, setuid bits on ${file} don't count unless .${file}.suid_capability also exists, is setuid, and is either 0-length or parseable in some standardized format:
-r-sr-xr-x. 1 carton carton 3812 Dec 27 15:39 t1
-r-Sr--r--. 1 carton carton 0 Dec 27 15:39 .t1.suid_capability
if the sidecar file is empty, the setuid bit works as normal. If it has parseable contents, it grants partial-root capabilities. (A sketch of this decision rule follows the list below.)
if desired the .t1.suid_capability files can be annotated with fields for:
binding to a hash of their matching 't1' file
binding to an absolute path, so for example a setuid file within a chroot does not become "hot" until you chroot into it.
binding to a public key or hash nonce loaded at boot identifying the installed system like a uuid. If the private key or nonce is hidden from a backup server, traditional password resets would still work but some forms of rootkit persistence would become more difficult. It also lets you work with images of other systems in subdirectories, automatically making them effectively "nosuid"-mounted even if you are undisciplined about mount options and elect not to use the absolute path feature.
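A minimal C sketch of that sidecar decision rule, purely hypothetical since no kernel on our planet implements it (all names invented):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical: should ${file}'s setuid bit "count" on a capsuid mount? */
bool suid_counts(const char *file, bool *partial_caps)
{
    char sidecar[4096];
    struct stat st;
    const char *slash = strrchr(file, '/');

    /* .${file}.suid_capability sits next to ${file} */
    snprintf(sidecar, sizeof sidecar, "%.*s.%s.suid_capability",
             (int)(slash ? slash - file + 1 : 0), file,
             slash ? slash + 1 : file);
    if (stat(sidecar, &st) != 0 || !(st.st_mode & S_ISUID))
        return false;               /* no valid sidecar: suid bit is ignored */
    *partial_caps = st.st_size > 0; /* non-empty: parse as a capability set */
    return true;                    /* empty: setuid works as normal */
}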
Now answer some questions about this planet relative to our own:
any missing functionality compared to our native planet?
which system works better with 'ls', 'tar', 'rsync', 'cpio', 'find', 'pax', 'mtree'? with gentoo's ebuild sandbox? with LXC?
which system works better with diskless or guest systems running on NFS or 9p roots?
which system works better with tripwire?
Now repeat those questions considering three systems instead of two:
our planet: mysterious capabilities metadata in filesystems
alternate planet using civilized Unix approach
mosvy's 'ping' example, where the binary drops privileges right after it starts
With the third system we are finally starting to lose features relative to the status quo: someone could write a hypothetical capabilities-aware tripwire that works less well in the 'ping' example. Has anyone actually done that, though? Anyone outside the NSA? Have the people forcing this complexity on us done the work to deliver the (relatively small) value it buys? And if that's the feature-hill we want to die on (as opposed to reducing attack surface), is there maybe a better way to get it, like dm-verity, that once again moots the complexity?
Now let's make a slight refinement to mosvy's 'ping':
crt.o or some early runtime startup code reads the sidecar file and /etc/suid_capability_hash_nonce. It drops suid if /etc/suid_capability_hash_nonce exists but the sidecar file does not (in "alternate planet" terms, the existence of /etc/suid_capability_hash_nonce makes the whole system '-o capsuid'-mounted). The early runtime startup code handles parsing the sidecars and dropping to the enumerated capabilities so that logic doesn't have to be coded into main.c one-by-one.
Honestly, I think the 'ping' example is superior because programs know what capabilities they need, and the need changes only when the program's source code changes. By setting them in a drop_privileges()-style function the need will never get out of sync. Fanned-out distributions won't have to scramble to update capability sets.
If you disagree and want the sidecar/capabilities-style system, something equivalent could have been implemented without tampering with the kernel, mount options, bootup, or any of it: old dracut-nfsroot initrds would keep working. It is just a matter of style.
This is all a Socratic way of saying that capabilities offer nothing beyond dm-verity + 'ping'-style privilege-dropping. There is nothing "unfortunate" about eschewing complexity that's not pulling its weight.
Every few years CADT developers invent some new form of metadata to cram into filesystem directories: Finder metadata, POSIX ACLs, SELinux contexts, NFS ACLs. What will they think of next? How long will it take to update all the filesystems, all the network protocols, all the fileutils tools, all the non-Linux storage operating systems, all the fancy licensed "enterprise" tape backup suites? Will anyone ever get around to updating all that stuff, or will we just limp along with substandard tools? Will even the core feature be documented usefully, or will it be the undocumented plaything of a few annoying "distributions"? Will the feature work most of the time, but stop systems from booting unexpectedly and create chicken-and-egg recovery problems? Will the feature actually stop any attacks?
Is it fair to all the people who made different, simpler choices in their own systems, and who now have to clean up after them and tolerate the complexity, just to unbreak the interop that the new feature broke?
The value delivered is particularly low in this case, but we have enough experience to set the value bar high on this type of proposal. I think it was a mistake to accept the feature into Linux and applaud any distribution that avoids it.

Capability support was really enabled in 2008 when file capabilities were added to the kernel. So, that's 13 years and counting to replace setuid. Your question is very valid.
Had capability support not been implemented in a backwards-compatible way in Linux, it would either have been rejected at birth, or things would surely have changed by now! I suspect it all comes down to the fact that there is no economic incentive to adopt capabilities when their benefit only becomes evident once a bug in some code is found to be exploitable.
I think that people are resigned to all the ways that setuid binaries seem to be exploitable and the point fixes that people quickly roll out when another vulnerability surfaces. Buffer overflow -> launch shell -> root exploit -> code fix -> start over. They are comfortable with the idea that user identity = privilege (i.e., root). This spills into how people want to describe the futility of breaking down all-powerful root into independent capabilities, when a chain of exploits can yield all privilege. Clearly, the pervasive idea that privilege and identity should be equivalent is why Ambient capabilities even exist.
Capabilities, however, when used as they were intended to be used, are not identity = privilege. They are capable binaries = privilege, where the privilege is reduced, relative to a user identity executing arbitrary code, by the combination of those capability bits and the code in the actual program that has them.
If you write code that edits files, it is clear it shouldn't need to worry about being directly abused for raw ethernet packet formation, or loading kernel modules. Exploiting a bug in that code may well allow a malicious edit of a file but, unlike setuid, it won't allow strange packets to be sent on the network. At least not without tricking the code of some independently capable executable into doing it by proxy.
However, no one intentionally writes buggy code, so I think the answer to your question comes down to another question: "where exactly is the benefit of limiting how exploitable an exploit is?".
I suspect that if some distribution were to figure out how to restructure their code base to eliminate setuid-root binaries in favor of file capabilities, it would fare better (certainly no worse) over time, as code exploits are found, than other distributions clinging to setuid-root. But until such a distribution comes into being, I can't fault the idea that that is just an opinion.

Related

Meaning of the read permission for binary executable?

I am interested in the full impact of the read permission for binary executables. Indeed, I have encountered some behaviors that I wish to understand.
Let's say I have a C program that just calls sleep(300). When the binary has the read permission, I am able to inspect the /proc/$PID folder associated with the running process. But when I remove this permission, I cannot access said folder: it does not exist.
Similarly, if I have a more clever program that copies a string from one pointer to another, calling strace on this executable will yield better results if the binary is "readable" (for example, strace will show what every pointer points to).
Since strace relies on ptrace to analyze the running program's internals, I don't understand the impact of the read permission. Indeed, I believed the read permission would only be relevant for static analysis, which relies on reading the binary.
Given the observed impact of the read permission, is it good practice to remove the read permission from all binaries on servers where security is critical?
It's certainly possible on Linux to have a binary with only execute permissions, as you've discovered. Doing this has the potential to cause problems with troubleshooting, as you've also discovered, because it makes the process harder to instrument.
I've certainly seen installations where the administrators have systematically removed read permissions from all their own binaries. I've sometimes felt that doing this has caused problems, although the installations where this kind of thing was done were so complex that it was difficult to be certain.
I guess you have to weigh the benefit of a small increase in security against a small decrease in serviceability. My experience is that, whatever the merits of removing read permissions, it doesn't seem to be a common practice in the Linux world.

How to change system date and time (embedded Linux) using QML?

How do I change the system time and date in QML? I thought the following, taken from an example, might work, though I had doubts about where the object is being sent. It didn't work. Can someone let me know how to do that? Here is my code:
var time = new Date()
time.setHours(hour)
time.setMinutes(minute)
time.setSeconds(secs)
You generally cannot change the system time without root permission, and setting the system time is a whole-system operation for sysadmins. See settimeofday(2) and adjtimex(2). See also credentials(7) & capabilities(7).
(So it is not a matter of using Qt or something else; any library using settimeofday or adjtimex needs root permission)
Read also time(7).
And you generally should not change the system time, but have some daemon using NTP to adjust it continuously.
Your code is just changing the value of some variable holding some time. It is not changing the system time.
The sysadmin could change the time e.g. with date(1) used with -s which requires root permission and uses settimeofday internally. See also hwclock(8).
Notice that GUI applications should generally not be run as root (and Qt or GTK don't want to be run as root).
If you develop some embedded system and you want the user to set the time, you could consider writing some small specialized setuid executable which uses settimeofday(2) and have your GUI application run it (e.g. with QProcess). Be very careful when coding a setuid program (so read ALP or some good book on Linux system programming), since you could easily introduce vulnerabilities. Be aware that setuid is the basic mechanism (used by /bin/login, /bin/su, /usr/bin/sudo, etc.; it is also used in Android systems and any Unix-derived system) to acquire or change permissions. It is tricky to use, so do spend time understanding it.
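For illustration, such a helper could be as small as the following sketch (hypothetical code, not from any real project; a real helper must validate its input and its caller far more carefully, and would be installed root-owned with mode 4755). Your GUI application would then run it, e.g. through QProcess:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Hypothetical setuid helper: set the clock to the epoch seconds in argv[1]. */
int main(int argc, char **argv)
{
    struct timeval tv = { 0, 0 };
    char *end;

    if (argc != 2) {
        fprintf(stderr, "usage: settime epoch-seconds\n");
        return 1;
    }
    tv.tv_sec = strtol(argv[1], &end, 10);
    if (*end != '\0' || tv.tv_sec < 0) {
        fprintf(stderr, "settime: bad time value\n");
        return 1;
    }
    if (settimeofday(&tv, NULL) != 0) {  /* needs root or CAP_SYS_TIME */
        perror("settimeofday");
        return 1;
    }
    return 0;
}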
Perhaps your init or systemd might be configured to ease such a task....
(you need to describe your entire system much more to get more help)

How to prohibit system calls, GNU/Linux

I'm currently working on the back-end of an ACM-like public programming contest system. In such a system, any user can submit source code, which will be compiled and run automatically (meaning no human-eye pre-moderation is performed) in an attempt to solve some computational problem.
The back-end is a dedicated GNU/Linux machine, where a user will be created for each contestant, all such users being members of a users group. Sources sent by any particular user will be stored in that user's home directory, then compiled and executed to be verified against various test cases.
What I want is to prohibit the use of Linux system calls in the submitted sources. That's because the problems require platform-independent solutions, while allowing system calls in insecure sources is a potential security breach. Such sources may be successfully placed in the FS, even compiled, but never run. I also want to be notified whenever a source containing system calls is submitted.
By now, I see the following places where such a checker may be placed:
Front-end/pre-compilation analysis - the source is already in the system, but not yet compiled. A simple text check against system call names. Platform-dependent, compiler-independent, language-dependent solution.
Compiler patch - have GCC (or any other compiler included in the tool-chain) fail whenever a system call is encountered. Platform-dependent, compiler-dependent, language-independent solution (if we place the checker "far enough" back). Compatibility may also be lost. In fact, I dislike this alternative the most.
Run-time checker - whenever a system call is invoked from the process, terminate the process and report it. This solution is compiler- and language-independent, but depends on the platform - I'm OK with that, since I will deploy the back-end on similar platforms in the short and mid term.
So the question is: does GNU/Linux provide a way for an administrator to prohibit system call usage for a user group, a user, or a particular process? It may be a security policy or a lightweight GNU utility.
I tried to Google, but Google disliked me today.
mode 1 seccomp allows a process to limit itself to exactly four syscalls: read, write, sigreturn, and _exit. This can be used to severely sandbox code, as seccomp-nurse does.
mode 2 seccomp (at the time of writing, found in Ubuntu 12.04 or patch your own kernel) provides more flexibility in filtering syscalls. You can, for example, first set up filters, then exec the program under test. Appropriate use of chroot or unshare can be used to prevent it from re-execing anything else "interesting".
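To give a feel for the mechanics, here is a minimal mode 2 sketch that permits only read, write and exit_group and kills the process on any other syscall (illustrative only; a production filter should also validate seccomp_data->arch):

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Inspect the syscall number: allow the whitelist, kill otherwise. */
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read,       3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write,      2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl");
        return 1;
    }
    write(1, "still alive\n", 12);  /* write is on the whitelist */
    return 0;                       /* exits via the allowed exit_group */
}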
I think you need to define "system call" better. I mean,
cat <<EOF > hello.c
#include <stdio.h>
int main(int argc, char **argv) {
    fprintf(stdout, "Hello world!\n");
    return 0;
}
EOF
gcc hello.c
strace -q ./a.out
demonstrates that even an apparently trivial program makes ~27 system calls.
You (I assume) want to allow calls to the "standard C library", but those in turn will be implemented in terms of system calls. I guess what I'm trying to say is that run-time checking is less feasible than you might think (using strace or similar anyway).

Accessing /proc

I'm currently developing an application which needs a lot of system and process information, some of which is only available through /proc, and I have some general questions about accessing the structures.
The application will be run on Linux (kernel >= 2.6), not on any other Unix-flavored OS. It should have access to any data in /proc, I can't say what is necessary now as the specifications are not clear yet, but the whole /proc directory is relevant to the application.
First of all: is there good documentation covering the features added/removed from kernel version to kernel version? One thing I'm curious about in particular is the format of the individual files. Can I take it for granted? Does it change between kernel versions?
Hooking up the parsing process based on the kernel version wouldn't be a problem at all; it's just that I couldn't find any good docs on what has changed from version to version, which could help me catch parsing errors beforehand.
In addition: is there a definitive list of features that can be activated/deactivated by kernel options (except, of course, the /proc feature itself)? I'm looking for a list of files/directories that only exist with the appropriate options set in the kernel.
As an example of what I'm thinking of, here is a link to the proc manpage (http://linux.die.net/man/5/proc), which includes a lot of good information: e.g. some entries note the earliest kernel version in which they were available, and some note whether a module must be loaded. It does not, however, describe the output format of all the information, which is something I need if I want to parse it (e.g. whether it is consistent across all kernel versions or changed at some point).
The second thing I'm wondering about is what happens if the process being queried dies while being queried. What is my time interval? For example, if I'm going to fetch a list of processes, reading all the structures and parsing them one after another, what happens if process x dies before I get to read it? Even if I check that the directory exists, it could still be gone one call later.
Last but not least: Is there any major distribution out there that is not mounting proc?
From what I understand, a lot of common tools, such as lsmod and free, are based on the /proc interface, so I'm guessing that I can expect /proc to exist almost always.
The /proc interfaces are pretty stable (unlike the /sys interfaces), even if nothing is guaranteed. Almost all changes are backwards compatible, at least if they've been around for a few versions. You should stick to the documented interfaces to be safe. If a file exists, its format may be extended in later versions, but normally in a backwards-compatible way, e.g. adding columns to a table. The parts most at risk of disappearing are those concerning hardware subsystems such as ACPI or SCSI, which are migrating to /sys (with a long transition period during which both exist).
Most of the information is architecture-independent, except for hardware information (e.g. /proc/cpuinfo has very different fields on different architectures).
The main documentation is Documentation/filesystems/proc.txt in the kernel source. Consider proc(5) to be the overview and proc.txt to be the fine details. The kernel documentation is often incomplete, so don't be surprised if you need to resort to reading the source sometimes.
Most optional parts of /proc are activated by default if the driver whose data it exposes is included in the kernel. The exceptions are mostly related to hardware features that rarely need to be accessed from outside the kernel; if you need to access these features, you're probably already expecting to need to dig deeper. Look through Kconfig files in the kernel source for detailed information.
Process data (or hardware data related to removable hardware or provided by unloadable modules) can disappear under your nose. Most files under /proc can be read atomically, with a single read call with a reasonably-sized buffer; if you perform multiple read calls in sequence, drivers are supposed to guarantee that you get well-formed data. There is no way to guarantee atomicity between reads of separate files; if you're reading information about a process, this process can die at any time, and in principle could even be replaced by another process with the same PID before you're finished.
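As an illustration, here is a sketch of snapshotting /proc/<pid>/stat with a single read while tolerating the process disappearing mid-query (a hypothetical helper; error handling kept minimal):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Read /proc/<pid>/stat in one gulp; returns the byte count,
 * or -1 if the process vanished (or another error occurred). */
ssize_t read_proc_stat(pid_t pid, char *buf, size_t len)
{
    char path[64];
    int fd;
    ssize_t n;

    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;              /* the process may already be gone */
    n = read(fd, buf, len - 1); /* a single read: a consistent snapshot */
    close(fd);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}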
As it says in the description of /proc, “everyone should say Y here”. All desktop/server Linux systems and most embedded Linux systems must have /proc; a lot of things, including ps and other process management commands, many filesystem and device-related tools, and module loading, require it. The only systems that might be able to dispense with /proc are very small single-purpose embedded systems that support a single hardware configuration and run a fixed set of programs. You can count on its being here.

Run an untrusted C program in a sandbox in Linux that prevents it from opening files, forking, etc.?

I was wondering if there exists a way to run an untrusted C program under a sandbox in Linux. Something that would prevent the program from opening files, or network connections, or forking, exec, etc?
It would be a small program, a homework assignment, that gets uploaded to a server and has unit tests executed on it. So the program would be short lived.
I have used Systrace to sandbox untrusted programs both interactively and in automatic mode. It has a ptrace()-based backend which allows its use on a Linux system without special privileges, as well as a far faster and more powerful backend which requires patching the kernel.
It is also possible to create a sandbox on Unix-like systems using chroot(1), although that is not quite as easy or secure. Linux Containers and FreeBSD jails are a better alternative to chroot. Another alternative on Linux is to use a security framework like SELinux or AppArmor, which is what I would propose for production systems.
We would be able to help you more if you told us what exactly it is that you want to do.
EDIT:
Systrace would work for your case, but I think that something based on the Linux Security Model like AppArmor or SELinux is a more standard, and thus preferred, alternative, depending on your distribution.
EDIT 2:
While chroot(1) is available on most (all?) Unix-like systems, it has quite a few issues:
It can be broken out of. If you are going to actually compile or run untrusted C programs on your system, you are especially vulnerable to this issue. And if your students are anything like mine, someone WILL try to break out of the jail.
You have to create a full independent filesystem hierarchy with everything that is necessary for your task. You do not have to have a compiler in the chroot, but anything that is required to run the compiled programs should be included. While there are utilities that help with this, it's still not trivial.
You have to maintain the chroot. Since it is independent, the chroot files will not be updated along with your distribution. You will have to either recreate the chroot regularly or include the necessary update tools in it, which would essentially require that it be a full-blown Linux distribution. You will also have to keep system and user data (passwords, input files, etc.) synchronized with the host system.
chroot() only protects the filesystem. It does not prevent a malicious program from opening network sockets or a badly-written one from sucking up every available resource.
The resource usage problem is common to all the alternatives. Filesystem quotas will prevent programs from filling the disk. Proper ulimit (setrlimit() in C) settings can protect against memory overuse and fork bombs, as well as put a stop to CPU hogs. nice(1) can lower the priority of those programs so that the computer can still be used for tasks that are deemed more important.
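As a sketch, a small launcher could apply such limits just before handing control to the untrusted program (illustrative code; the limits and the helper are invented for this example):

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Set a resource's soft and hard limit to the same value, or die. */
static void set_limit(int resource, rlim_t lim)
{
    struct rlimit rl = { lim, lim };
    if (setrlimit(resource, &rl) != 0) {
        perror("setrlimit");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    set_limit(RLIMIT_CPU, 10);            /* 10 seconds of CPU time */
    set_limit(RLIMIT_AS, 256UL << 20);    /* 256 MiB of address space */
    set_limit(RLIMIT_NPROC, 16);          /* per-user process count: stops fork bombs */
    set_limit(RLIMIT_FSIZE, 8UL << 20);   /* 8 MiB maximum file size */
    nice(10);                             /* lower priority; failure is harmless here */
    execvp(argv[1], argv + 1);            /* limits survive the exec */
    perror("execvp");
    return 1;
}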
I wrote an overview of sandboxing techniques in Linux recently. I think your easiest approach would be to use Linux containers (lxc), if you don't mind forking and so on, which don't really matter in this environment. You can give the process a read-only root file system and an isolated loopback network connection, and you can still kill it easily, set memory limits, etc.
Seccomp is going to be a bit difficult, as the code cannot even allocate memory.
SELinux is the other option, but I think it might be more work than a container.
Firejail is one of the most comprehensive tools for doing that - it supports seccomp, filesystem containers, capabilities and more:
https://firejail.wordpress.com/features-3/
You can use QEMU to test assignments quickly. The procedure below takes less than 5 seconds on my 5-year-old laptop.
Let's assume the student has to develop a program that takes unsigned ints, each on its own line, until a line with "-1" arrives. The program should then average all the ints and output "Average: %f". Here's how you could test the program completely isolated:
First, get root.bin from JSLinux; we'll use that as the userland (it has the tcc C compiler):
wget https://github.com/levskaya/jslinux-deobfuscated/raw/master/root.bin
We want to put the student's submission in root.bin, so set up the loop device:
sudo losetup /dev/loop0 root.bin
(you could use fuseext2 for this too, but it's not very stable. If it stabilizes, you won't need root for any of this)
Make an empty directory:
mkdir mountpoint
Mount root.bin:
sudo mount /dev/loop0 mountpoint
Enter the mounted filesystem:
cd mountpoint
Fix rights:
sudo chown -R `whoami` .
mkdir -p etc/init.d
vi etc/init.d/rcS:
#!/bin/sh
cd /root
echo READY > /dev/ttyS0 2>&1
tcc assignment.c > /dev/ttyS0 2>&1
./a.out > /dev/ttyS0 2>&1
chmod +x etc/init.d/rcS
Copy the submission to the VM:
cp ~/student_assignment.c root/assignment.c
Exit the VM's root FS:
cd ..
sudo umount mountpoint
Now the image is ready; we just need to run it. It will compile and run the submission after booting.
mkfifo /tmp/guest_output
Open a separate terminal and start listening for guest output:
dd if=/tmp/guest_output bs=1
In another terminal:
qemu-system-i386 -kernel vmlinuz-3.5.0-27-generic -initrd root.bin -monitor stdio -nographic -serial pipe:/tmp/guest_output
(I just used the Ubuntu kernel here, but many kernels will work)
When the guest output shows "READY", you can send keys to the VM from the qemu prompt.
For example, to test this assignment, you could do
(qemu) sendkey 1
(qemu) sendkey 4
(qemu) sendkey ret
(qemu) sendkey 1
(qemu) sendkey 0
(qemu) sendkey ret
(qemu) sendkey minus
(qemu) sendkey 1
(qemu) sendkey ret
Now "Average: 12.000000" should appear on the guest output pipe. If it doesn't, the student failed.
Quit qemu: quit
A program passing the test is here: https://stackoverflow.com/a/14424295/309483. Just use tcclib.h instead of stdio.h.
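For reference, going by the assignment spec above, a passing submission could look like this sketch (the version behind the link uses tcclib.h instead of stdio.h):

#include <stdio.h>

int main(void)
{
    long n, sum = 0, count = 0;

    /* read ints, one per line, until a line with -1 */
    while (scanf("%ld", &n) == 1 && n != -1) {
        sum += n;
        count++;
    }
    printf("Average: %f\n", count ? (double)sum / count : 0.0);
    return 0;
}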
Try User-mode Linux. It has about 1% performance overhead for CPU-intensive jobs, but it may be 6 times slower for I/O-intensive jobs.
Running it inside a virtual machine should offer you all the security and restrictions you want.
QEMU would be a good fit for that, and all the work (downloading the application, updating the disk image, starting QEMU, running the application inside it, and saving the output for later retrieval) could be scripted for automated test runs.
When it comes to sandboxing based on ptrace (strace), check out:
the "sydbox" sandbox and the "pinktrace" programming library (it's C99, but there are bindings for Python and Ruby as far as I know).
Collected links related to topic:
http://www.diigo.com/user/wierzowiecki/sydbox
(sorry these are not direct links, but I don't have enough reputation points yet)
seccomp and seccomp-bpf accomplish this with the least effort: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
OK, thanks for all the answers; they helped me a lot. But I would suggest none of them as a solution for the person who asked the original question. All the mentioned tools require too much work for the purpose of testing students' code as a teacher, tutor, or prof. The best option in this case would, in my opinion, be VirtualBox. OK, it emulates a complete x86 system and has nothing to do with sandboxing in the strict sense, but if I imagine my programming teacher, it would be the best fit for him. So "apt-get install virtualbox" on Debian-based systems; everyone else, head over to http://virtualbox.org/, create a VM, add an ISO, click install, wait some time, and be happy. It will be much easier than setting up User-mode Linux or doing some heavy strace stuff...
And if you fear your students hacking you, I guess you have an authority problem, and a solution for that would be to threaten to sue the living daylights out of them if you can prove even one byte of malware in the work they hand in...
Also, if there is a class and even 1% of it is good enough to pull off such things, don't bore them with such simple tasks; give them some big ones where they have to code some more. Integrative learning is best for everyone, so don't rely on old deadlocked structures...
And of course, never use the same computer for important things (like writing attestations and exams) that you use for things like browsing the web and testing software.
Use an offline computer for important things and an online computer for everything else.
However, to everyone else who isn't a paranoid teacher (no offense intended; I'm just of the opinion that you should learn the basics about security and our society before you start being a programming teacher...)
... where was I ... for everyone else:
happy hacking !!
