I was wondering - are there any known techniques to prevent access to a shared memory object by anything other than an authorized program?
For instance, let's say I create a shared memory segment for use in a program P, to be accessed by Q, and I make it read-write. I can access it using Q because I've given Q the required permissions to do so (running as a particular user with the appropriate groups, etc.).
However, I'm guessing there are instances where someone could potentially access this shared memory from a program R - simply attaching to it and modifying it. To stop this, you could make the memory segment read-only - but now program R could still read what was in it.
My question is in parts -
Is there a way to:
a) allow only Q to access the shared memory?
b) figure out whether a read was done by someone other than Q - and who did it? [Is this even possible?] For bonus points, could this be done cross-platform? [Probably not, but no harm in trying :)]
Under what circumstances could a rogue program attach to the shared memory? I presume one way is if a user is able to exploit OS holes and become the user that started the program. Any others?
POSIX shared memory has the same permissions system as files - if you run ipcs you'll see the permissions of the shared memory segments on your system:
$ ipcs -m
IPC status from <running system> as of Tue Jul 14 23:21:25 BST 2009
T ID KEY MODE OWNER GROUP
Shared Memory:
m 65536 0x07021999 --rw-r--r-- root wheel
m 65537 0x60022006 --rw-r--r-- root wheel
In answer to question 1a), you can use the normal UNIX permissions system to allow access only from a certain user and/or group. This can be controlled using shmctl. Note that IPC_SET expects a struct shmid_ds, whose shm_perm member holds the permissions:

#include <sys/ipc.h>
#include <sys/shm.h>

struct shmid_ds ds;
shmctl(shmid, IPC_STAT, &ds);   /* fetch the current settings */
ds.shm_perm.uid  = 100;
ds.shm_perm.gid  = 200;
ds.shm_perm.mode = 0660;        /* allow read/write only by uid 100
                                   or members of group 200 */
shmctl(shmid, IPC_SET, &ds);
For 1b), I don't think any auditing interfaces exist for shared memory access.
With regard to your second question, any process running as the shm owner/group, or running as root, will be able to access your memory - this is no different from accessing any other resource. Root can always access anything on a *nix system, so any exploit that escalates a user to root would allow access to any shared memory region.
I'm under the impression that the Linux kernel's attempts to protect itself revolve around not letting malicious code run in kernelspace. In particular, if a malicious kernel module were to get loaded, it would be "game over" from a security standpoint. However, I recently came across a post that contradicts this belief and says there is some way the kernel can protect parts of itself from other parts of itself:
There are plenty of mechanisms to protect you against malicious modules. I write kernel code for fun, so I have some experience in the field; it's basically a flag in the pagetable.
What's there to stop any kernel module from changing that flag in the pagetable back? The only protection against malicious modules is keeping them from loading at all. Once one loads, it's game over.
Make the pagetable readonly. Done.
Kernel modules can just make it read-write again, the same way your code made it read-only, then carry on with their changes.
You can actually lock this down so that kernel mode cannot modify the page table until an interrupt occurs. If your IDT is read-only as well there is no way for the module to do anything about it.
That doesn't make any sense to me. Am I missing something big about how kernel memory works? Can kernelspace code restrict itself from modifying the page table? Would that really prevent kernel rootkits? If so, then why doesn't the Linux kernel do that today to put an end to all kernel rootkits?
If the malicious kernel code is loaded in the trusted way (e.g. loading a kernel module and not exploiting a vulnerability) then no: kernel code is kernel code.
Intel CPUs do have a series of mechanisms to disable read/write access to kernel memory:
CR0.WP if set disallows supervisor-mode writes to read-only pages, both user and kernel ones. Used to detect bugs in kernel code.
CR4.PKE if set (4-level paging must be enabled, which is mandatory in 64-bit mode) disallows the kernel from accessing (instruction fetches excluded) user-mode pages unless they are tagged with the right protection key (which marks their read/write permissions). Used to allow the kernel to write to structures like the vDSO and KUSER_SHARED_DATA but not to other user-mode structures. The per-key permissions live in the PKRU register, not in memory; the keys themselves are in the page-table entries.
CR4.SMEP if set disallows kernel instruction fetches from user-mode pages. Used to prevent attacks where a kernel function pointer is redirected to a user-mode allocated page (e.g. the nelson.c privilege-escalation exploit).
CR4.SMAP if set disallows kernel data accesses to user-mode pages: implicit accesses are always blocked, and explicit accesses are blocked too while EFLAGS.AC = 0 (which also overrides the protection keys). Used to enforce a stricter no-user-mode-access policy.
Of course, the R/W and U/S bits in the paging structures control whether an entry is read-only or read-write and whether it belongs to user or supervisor mode.
You can read how permissions are applied for supervisor-mode accesses in the Intel manual:
Data writes to supervisor-mode addresses. Access rights depend on the value of CR0.WP:
- If CR0.WP = 0, data may be written to any supervisor-mode address.
- If CR0.WP = 1, data may be written to any supervisor-mode address with a translation for which the R/W flag (bit 1) is 1 in every paging-structure entry controlling the translation; data may not be written to any supervisor-mode address with a translation for which the R/W flag is 0 in any paging-structure entry controlling the translation.
So even if the kernel protected a page X as read-only and then protected the page structures themselves as read-only, a malicious module could simply clear CR0.WP.
It could also change CR3 and use its own paging structures.
Note that Intel developed SGX to address the threat model where the kernel itself is evil.
However, running kernel components in enclaves in a secure way (i.e. with no single point of failure) may not be trivial.
Another approach is virtualizing the kernel with the VMX extensions, though this is by no means trivial to implement.
Finally, the CPU has four protection levels at the segmentation layer, but paging has only two: supervisor (CPL < 3) and user (CPL = 3).
It is theoretically possible to run a kernel component in "Ring 1" but then you'd need to make an interface (e.g. something like a call gate or syscall) for it to access the other kernel functions.
It's easier to run it in user mode altogether (since you don't trust the module in the first place).
I have no idea what this is supposed to mean:
You can actually lock this down so that kernel mode cannot modify the page table until an interrupt occurs.
I don't recall any mechanism by which the interrupt handling will lock/unlock anything.
I'm curious though, if anybody can shed some light they are welcome.
Security in the x86 CPUs (though this may be generalized) has always been hierarchical: whoever comes first sets up the constraints for whoever comes later.
There is usually little to no protection between nonisolated components at the same hierarchical level.
I have a program (https://github.com/raboof/connbeat) that relies on /proc/[pid]/fd/* to find processes given a (networking) inode.
/proc/[pid]/fd of other users' processes can only be read by root, but I'd like to drop privileges as much as possible for security.
Is there some way I could (efficiently) get to the relationship between processes and inodes without requiring full root rights? Perhaps some syscall that I can selectively give access to using capabilities?
To be able to read the fds of all processes you need:
CAP_DAC_READ_SEARCH - for access to /proc/[pid]/fd
CAP_SYS_PTRACE - to read symlinks under /proc/[pid]/fd/*
You can restrict your program to just these two capabilities. Then you can access the information in question using ordinary API calls like readdir() or readlink() or whatever else you prefer.
For a broader description of these two capabilities, please refer to capabilities(7).
I want to execute code for an online-judge project.
I couldn't find documentation for the Python wrapper of libsandbox
I've found sample2.py and some test cases but without explanations.
What are the defaults when creating a sandbox?
Is it secure by default?
I want to execute untrusted code and
- Limit CPU
- Limit Memory
- Limit execution time
- Allow read/write access only to a specific folder and limit the size of this folder.
- Block network IO.
- Block executing other programs.
This code combines two examples I found:
cookbook = {
    'args': args[1:],        # targeted program
    'stdin': sys.stdin,      # input to targeted program
    'stdout': sys.stdout,    # output from targeted program
    'stderr': sys.stderr,    # error from targeted program
    'jail': './foo',
    'owner': 'nobody',
    'quota': dict(wallclock=30000,  # 30 sec
                  cpu=2000,         # 2 sec
                  memory=8388608,   # 8 MB
                  disk=1048576)}    # 1 MB

# create a sandbox instance and execute till end
s = Sandbox(**cookbook)
s.run()
assert s.result == S_RESULT_OK
What does disk in quota limit? Does it limit the total bytes the script can write in this run or the size of the folder?
What does setting owner to nobody do?
Will the code in my example block executing arbitrary code, block network IO and block access to files outside of the jailed folder?
Thanks
What are the defaults when creating a sandbox? Is it secure by default?
A Sandbox instance is permissive by default. It grants unlimited quota unless you specify quotas, and it allows all system calls except those that lead to multi-processing (e.g. fork(), vfork(), clone(), ...) and inter-process communication (e.g. waitid(), ptrace(), ...), unless you filter system-call events with a custom policy.
The sample code (sample2.py) distributed with libsandbox is a minimal working example of restrictive, white-list sandboxing. Use that as the framework of your watchdog program.
What does disk in quota limit? Does it limit the total bytes the script can write in this run or the size of the folder?
disk quota limits the total bytes that the target program can write to all eligible file systems (i.e. ones that support quota limitation and are capable of generating SIGXFSZ signals).
If the program writes to a regular file on an ext3 or ext4 file system, that usually counts; but writing to the standard output stream or to /dev/null does not count against the quota. Nevertheless, you can implement folder-based quotas within your custom policy.
What does setting owner to nobody do?
It executes the targeted program on behalf of the user nobody. The owner argument wraps the OS-level service setuid(). After setuid() to nobody, the targeted program has all the permissions the OS grants to the user nobody, and nothing beyond.
Please note that you have to be a superuser to be able to specify an owner other than yourself.
Will the code in my example block executing arbitrary code, block network IO and block access to files outside of the jailed folder?
Not exactly. All system calls made by the program are reported to the Sandbox, but you have to plug in a policy module that explicitly blocks system calls relating to network IO. Or you can filter all system calls against a white list, as is done by the sample code sample2.py.
Also note that you have to be a superuser to be able to specify a jail other than the root directory /.
DISCLAIMER: I am the author of libsandbox.
If a user doesn't have root privileges, could that user still write a user-space program with inline assembly that turns off protected mode on the computer, in order to overwrite memory in other segments, assuming the OS is Linux?
Not unless the user knows of a security vulnerability to get root permissions. Mechanisms like /dev/mem allow root to read and write all userspace memory, and kernel module loading allows root access to kernel memory and the rest of the system's IO space.
Assuming the system is working as intended, no.
In reality, there are undoubtedly a few holes somewhere that would allow it -- given a code base that size, bugs are inevitable, and a few could probably be exploited to get into ring 0.
That said, I'm only guessing based on statistics -- I can't point to a specific vulnerability.
I'm trying to understand what logic determines whether a user can log in with a particular MLS sensitivity level. At first I suspected that pam_selinux.so reads the /etc/selinux/.../seusers file to understand which user is bound to which seuser and then restricts the user to sensitivities equal to or lower than the high component of the MLS range.
However, after scratching through its source code I found that, after asking the user whether they would like to change their security context from the default context, pam_selinux checks that the new MLS labels are appropriate by calling into the kernel policy.
The following code is in modules/pam_selinux/pam_selinux.c from the Ubuntu libpam-modules 1.1.1-4ubuntu2 package.
static int mls_range_allowed(pam_handle_t *pamh, security_context_t src,
                             security_context_t dst, int debug)
{
    struct av_decision avd;
    int retval;
    unsigned int bit = CONTEXT__CONTAINS;
    context_t src_context = context_new(src);
    context_t dst_context = context_new(dst);

    context_range_set(dst_context, context_range_get(src_context));
    if (debug)
        pam_syslog(pamh, LOG_NOTICE, "Checking if %s mls range valid for %s",
                   dst, context_str(dst_context));

    retval = security_compute_av(context_str(dst_context), dst,
                                 SECCLASS_CONTEXT, bit, &avd);
    context_free(src_context);
    context_free(dst_context);
    if (retval || ((bit & avd.allowed) != bit))
        return 0;

    return 1;
}
It seems to me that this check is actually performed against the kernel policy, as seen in the security_compute_av() call. This turned my understanding of SELinux login on its head.
So, could someone please explain:
How is the validity of a user-chosen login security level determined?
How exactly is that logic implemented in the policy, in pam_selinux, and in the kernel?
Currently, I'm not too interested in type enforcement, multi-category security, or role-based access control, so there is no need to explain how those components are validated if they don't affect MLS sensitivities.
Given that I also share the "SELinux folds my brain in half" problem, I think I can help. First and foremost, you need to remember the difference between discretionary access control and mandatory access control. You also need to remember that user space defines a lot of things, but the kernel gets to enforce them.
First, here is a partial list of user space versus kernel space issues:
User space defines a valid user ID, the kernel creates processes owned by that user ID (the number, not the name)
User space puts permissions and ownership on a file on an ext3/4 file system, the kernel enforces access to that file based upon the file inode and every subsequent parent directory inode
If two users share the same user ID in /etc/passwd, the kernel will grant them both the same privileges because enforcement is done by the numeric identifier, not the textual one
User space requests a network socket to another host, the kernel isolates that conversation from others on the same system
With SELinux, user space defines roles, logins, and users via semanage and the kernel compiles those down into a large Access Vector Cache (AVC) so that it can enforce role-based access control and mandatory access control
Also under SELinux, a security administrator can use semanage to define a minimum and maximum security context. In a multi-level security (MLS) configuration, when the user picks some context during login, the kernel checks it against the AVC to determine whether it is allowed.
What would probably help this make sense is to be in a multi-level security configuration. I took the class on SELinux and we touched it for about two hours. Most people don't want to go there. Ever. I've been in an MLS configuration quite a bit, so I understand the reasoning behind the coding decision you were chasing, but I agree that tinkering with MLS is a pretty painful way to understand how and why PAM works like it does.
Discretionary Access Control (DAC) is where user space, especially non-root users, can define who can access data that they control and in what fashion. Think file permissions. Because users control it, there is a trivial amount of effort needed to allow one user to see processes and/or files owned by another user. Normally, we don't care that much because a good administrator assumes that any one user could compromise the whole box and so all users are trusted equally. This might be very little trust, but there is still some level of trust.
Mandatory Access Control (MAC) is where user space is not to be trusted. Not all users are created equal. From a non-MLS perspective, consider the case where you have a web server and a database server on the same hardware (it will never survive the Slashdot effect). The only time the two processes communicate is over a dedicated connection channel over TCP. Otherwise, they must not even know that the other exists. We would operate them under two different contexts and the kernel will enforce the separation. Even looking at the process table or wandering around the hard drive as root will not get you any closer unless you change contexts.
In an MLS configuration, I can't tell you how many times I've tried random combinations of sensitivity and context only to be rebuffed for picking an invalid combination. It can be very frustrating, because it takes a lot of exploring of your existing policy (/etc/selinux/policy/src/policy/policy under Red Hat 5) to know what is or is not allowed.
Under a strict configuration, I can clearly see why all of this seems like overkill - SELinux simply is overkill for simple situations. It has other features that partially redeem it, though, chief among them fine-grained access control that enforces administrator-set permissions. The area where this is used most is in restricting service daemons to just the access they need. It is painful to set up a new daemon, but it keeps trivial exploits, like a shared-library exploit, from going any further, because the process in question may be assigned to a role that won't let it run non-daemon commands like /bin/ls or a shell. Exploits don't do you much good in those situations.