What is the poisoned NUL byte, in its 1998 and 2014 editions? - linux

I have to give a 10-minute presentation about the "poisoned null byte (glibc)".
I have searched a lot but found nothing useful. I need help, please, because Linux and memory and process management aren't my area.
Here is the original article, and here is an older article about an earlier version of the same problem.
What I want is a short and simple explanation of the old and new versions of the problem, and/or sufficient references where I can read more about this security threat.

To even begin to understand how this attack works, you will need at least a basic understanding of how a CPU works, how memory works, what the "heap" and "stack" of a process are, what pointers are, what libc is, what linked lists are, how function calls are implemented at the machine level (including calls to function pointers), what the malloc and free functions from the C library do, and so on. Hopefully you at least have some basic knowledge of C programming? (If not, you will probably not be able to complete this assignment in time.)
If you have a couple "gaps" in your knowledge of the basic topics mentioned above, hit the books and fill them in as quickly as you can. Talk to others if you need to, to make sure you understand them. Then read the following very carefully. This will not explain everything in the article you linked to, but will give you a good start. OK, ready? Let's start...
C strings are "null-terminated". That means the end of a string is marked by a zero byte. So for example, the string "abc" is represented in memory as (hex): 0x61 0x62 0x63 0x00. Notice, that 3-character string actually takes 4 bytes, due to the terminating null.
Now if you do something like this:
char *buffer = malloc(3); // not checking for error, this is just an example
strcpy(buffer, "abc");
...then that terminating null (zero byte) will go past the end of the buffer and overwrite something. We allocated a 3-byte buffer, but copied 4 bytes into it. So whatever was stored in the byte right after the end of the buffer will be replaced by a zero byte.
That was what happened in __gconv_translit_find. They had a buffer, which had been allocated with enough space to append ".so", including the terminating null byte, onto the end of a string. But they copied ".so" in starting from the wrong position. They started the copy operation one byte too far to the "right", so the terminating null byte went past the end of the buffer and overwrote something.
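To make the off-by-one concrete, here is a minimal sketch of appending a suffix such as ".so" with correct sizing. The function name append_suffix is hypothetical and is not taken from glibc; it only illustrates where the terminating null byte has to land.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated "base + suffix" string, or NULL on failure.
 * The "+ 1" reserves room for the terminating null byte. */
char *append_suffix(const char *base, const char *suffix)
{
    size_t base_len = strlen(base);
    size_t suffix_len = strlen(suffix);
    char *out = malloc(base_len + suffix_len + 1);

    if (out == NULL)
        return NULL;
    memcpy(out, base, base_len);
    /* Copy the suffix together with its null byte, starting exactly at
     * base_len. Starting at base_len + 1 instead would push the null
     * byte one position past the end of the buffer. */
    memcpy(out + base_len, suffix, suffix_len + 1);
    return out;
}
```

Calling append_suffix("abc", ".so") allocates 7 bytes: 6 for the characters and 1 for the terminator. The caller releases the result with free.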
Now, when you call malloc to get back a dynamically allocated buffer, most implementations of malloc actually store some housekeeping data right before the buffer. For example, they might store the size of the buffer. Later, when you pass that buffer to free to release the memory, so it can be reused for something else, it will find that "hidden" data right before the beginning of the buffer, and will know how many bytes of memory you are actually freeing. malloc may also "hide" other housekeeping data in the same location. (In the 2014 article you referred to, the implementation of malloc used also stored some "flag" bits there.)
The attack described in the article passed carefully crafted arguments to a command-line program, designed to trigger the buffer overflow error in __gconv_translit_find, in such a way that the terminating null byte would wipe out the "flag" bits stored by malloc -- not the flag bits for the buffer which overflowed, but those for another buffer which was allocated right after the one which overflowed. (Since malloc stores that extra housekeeping data before the beginning of an allocated buffer, and we are overrunning the previous buffer. You follow?)
The article shows a diagram, where 0x00000201 is stored right after the buffer which overflows. The overflowing null byte wipes out the bottom 1 and changes that into 0x00000200. That might not make sense at first, until you remember that x86 CPUs are little-endian -- if you don't understand what "little-endian" and "big-endian" CPUs are, look it up.
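The byte-order point can be checked with a few lines of C. This sketch lays the value out in little-endian order by hand (so it behaves the same on any host), zeroes the first byte in memory the way the overflowing null byte would, and decodes the result. The function names are made up for the illustration.

```c
#include <stdint.h>

/* Decode four bytes stored in little-endian order (lowest byte first). */
uint32_t load_le32(const unsigned char bytes[4])
{
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}

/* Simulate a one-byte null overflow landing on the first byte in memory
 * of a 32-bit little-endian value. */
uint32_t after_nul_overflow(uint32_t value)
{
    unsigned char bytes[4];

    bytes[0] = (unsigned char)(value & 0xff);         /* lowest address */
    bytes[1] = (unsigned char)((value >> 8) & 0xff);
    bytes[2] = (unsigned char)((value >> 16) & 0xff);
    bytes[3] = (unsigned char)((value >> 24) & 0xff); /* highest address */

    bytes[0] = 0x00; /* the stray terminator overwrites this byte */
    return load_le32(bytes);
}
```

In memory, 0x00000201 is laid out as 01 02 00 00, so the first byte after the overflowing buffer is the one holding the low bits, and after_nul_overflow(0x00000201) gives 0x00000200.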
Later, the buffer whose flag bit was wiped out is passed to free. As it turns out, wiping out that one flag bit "confuses" free and makes it, in turn, also overwrite some other memory. (You will have to understand the implementation of malloc and free which are used by GNU libc, in order to understand why this is so.)
By carefully choosing the input arguments to the original program, you can set things up so that the memory overwritten by the "confused" free is that used for something called tls_dtor_list. This is a linked list maintained by GNU libc, which holds pointers to certain functions which it must call when the main program is exiting.
So tls_dtor_list is overwritten. The attacker has set things up just right, so that the function pointers in the overwritten tls_dtor_list will point to some code which they want to run. When the main program is exiting, some code in libc iterates over that list and calls each of the function pointers. Result: the attacker's code is executed!
Now, in this case, the attacker already has access to the target system. If all they can do is run some code with the privilege level of their own account, that doesn't get them anywhere. They want to run code with root (administrator) privileges. How is that possible? It is possible because the buggy program is a setuid program, owned by root. If you don't know what "setuid" programs in Unix are, look it up and make sure you understand it, because that is also a key to the whole exploit.
This is all about the 2014 article -- I didn't look at the one from 1998. Good luck!

Related

Can a single byte instruction be executed while being only partially overwritten?

I have made an experiment in which a new thread executes a shellcode with this simple infinite loop:
NOP
JMP REL8 0xFE (-0x2)
This generate the following shellcode:
0x90, 0xEB, 0xFE
After this infinite loop there are other instructions ending by the overwriting of the destination byte back to -0x2 to make it an infinite loop again, and an absolute jump to send the thread back to this infinite loop.
Now I was asking myself if it was possible that the instruction of the jump was executed while the single byte of the destination is only partially overwritten by the other thread.
For example, let's say that the other thread overwrites the destination of the jump (0xFE, or 11111110 in binary) to 0x0 (00000000) to release the thread of this infinite loop.
Could it happen that the jump goes to let's say 0x1E (00011110) because the destination byte wasn't completely overwritten at that nanosecond?
Before asking this question here, I ran the experiment myself in a C++ program and let it run for some hours without it ever missing a single jump.
If you want to have a look at the code I made for this experiment I have uploaded it to GitHub
According to this experiment, it seems to be impossible for an instruction to be executed while it is only partially overwritten.
However, I have very little knowledge of assembly and processors, which is why I am asking the question here:
Can anyone confirm my observation, please? Is it indeed impossible for an instruction to be executed while being partially overwritten by another thread? Does anyone know why for sure?
Thank you very much for your help and knowledge on that, I did not know where to look for such an information.
No, that cannot happen: byte stores are always atomic on x86, even for cross-modifying code.
See Observing stale instruction fetching on x86 with self-modifying code for some links to Intel's manuals for cross modifying code. And maybe Reproducing Unexpected Behavior w/Cross-Modifying Code on x86-64 CPUs
Of course, all the recommendations for writing efficient cross-modifying code (and running code that you just JIT-compiled) involve avoiding stores into pages that other threads are currently executing.
Why are you doing this with "shellcode", anyway? Is this supposed to be part of an exploit? If not, why not just write code in asm like a normal person, with a label on the jmp instruction so you can store to it from C by assigning to extern char jmp_bytes[2]?
And if this is supposed to be an efficient cross-thread notification mechanism... it isn't. Spinning on a data load and a conditional branch with a pause loop would allow a lower latency exit from the loop than a self-modifying code machine nuke that flushes the whole pipeline right when you want it to finally be doing something useful instead of wasting CPU time. At least several times the delay of a simple branch miss.
Even better, use an OS-supported condition variable so the thread can sleep instead of heating up your CPU (which reduces the thermal headroom for the CPU to turbo above its rated clock speed when there is work to do).
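For comparison, here is a minimal sketch of the "spin on a data load with a pause" approach, using C11 atomics. The names cpu_relax, wait_for_release and signal_release are hypothetical. The pause hint uses a GCC/Clang builtin on x86 and compiles to nothing elsewhere.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Tell the CPU we are in a spin-wait loop (x86 "pause"); no-op elsewhere. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();
#endif
}

/* Spin until another thread stores true into *release.
 * Returns how many times the flag was polled and found still false. */
unsigned long wait_for_release(atomic_bool *release)
{
    unsigned long polls = 0;

    while (!atomic_load_explicit(release, memory_order_acquire)) {
        cpu_relax();
        polls++;
    }
    return polls;
}

/* Called by the releasing thread: an ordinary data store, no code is modified. */
void signal_release(atomic_bool *release)
{
    atomic_store_explicit(release, true, memory_order_release);
}
```

The waiting thread calls wait_for_release on a shared flag and the other thread calls signal_release. The loop exits through a normal branch rather than a self-modifying-code machine clear.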
The mechanism used by current CPUs is that if a store near the EIP/RIP or any instruction in flight in the pipeline is detected, it does a machine clear. (perf counter machine_clears.smc, aka machine nuke.) It doesn't even try to handle it "efficiently", but if you did a non-atomic store (e.g. actually two separate stores, or a store split across a cache line boundary) the target CPU core could see it in different parts and potentially decode it with some bytes updated and other bytes not. But a single byte is always updated atomically, so tearing within a byte is not possible.
However, x86 on paper doesn't guarantee that, but as Andy Glew (one of the architects of Intel's P6 microarchitecture family) says, implementing stronger behaviour than the paper spec can actually be the most efficient way to meet all the required guarantees and run fast. (And / or avoid breaking existing code in widely-used software!)

Explicitly removing sensitive data from memory?

The recent leak from Wikileaks has the CIA doing the following:
DO explicitly remove sensitive data (encryption keys, raw collection
data, shellcode, uploaded modules, etc) from memory as soon as the
data is no longer needed in plain-text form.
DO NOT RELY ON THE OPERATING SYSTEM TO DO THIS UPON TERMINATION OF
EXECUTION.
Me being a developer in the *nix world; I'm seeing this as merely changing the value of a variable (ensuring I do not pass by value; and instead by reference); so if it's a string that's 100 characters; writing 0's that's 101 characters. Is it really this simple? If not, why and what should be done instead?
Note: There are similar questions that asked this, but they are in the C# and Windows world. So, I do not consider this question a duplicate.
Me being a developer in the *nix world; I'm seeing this as merely
changing the value of a variable (ensuring I do not pass by value; and
instead by reference); so if it's a string that's 100 characters;
writing 0's that's 101 characters. Is it really this simple? If not,
why and what should be done instead?
It should be this simple. The devil is in the details.
memory allocation functions, such as realloc, are not guaranteed to leave memory alone (you should not rely on their doing it one way or the other - see also this question). If you allocate 1K of memory, then realloc it to 10K, your original K might still be there somewhere else, containing its sensitive payload. It might then be allocated by another insensitive variable or buffer, or not, and through the new variable, it might be possible to access a part or all of the old content, much as it happened with slack space on some filesystems.
manually zeroing memory (and, with most compilers, bzero and memset count as manual loops) might be blithely optimized out, especially if you're zeroing a local variable ("bug" - actually a feature, with workaround).
some functions might leave "traces" in local buffers or in memory they allocate and deallocate.
in some languages and frameworks, whole portions of data could end up being moved around (e.g. during so-called "garbage collection", as noticed by @gene). You may be able to tell the GC not to process your sensitive area or otherwise "pin" it to that effect, and if so, must do so. Otherwise, data might end up in multiple, partial copies.
information might have come through and left traces you're not aware of (trivial example: a password sent through the network might linger in the network library read buffer).
live memory might be swapped out to disk.
Example of realloc doing its thing. Memory gets partly rewritten, and with some libraries this will only "work" if "a" is not the only allocated area (so you need to also declare c and allocate something immediately after a, so that a is not the last object and left free to grow):
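#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *a;
    char *b;
    a = malloc(1024);
    strcpy(a, "Hello");
    strcpy(a + 200, "world");
    printf("a at %08ld is %s...%s\n", (long)a, a, a + 200);
    b = realloc(a, 10240);
    strcpy(b, "Hey!");
    /* Deliberately reads the stale pointer a after realloc. This is undefined
       behavior, done here only to show that the old bytes may linger. */
    printf("a at %08ld is %s...%s, b at %08ld is %s\n", (long)a, a, a + 200, (long)b, b);
    return 0;
}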
Output:
a at 19828752 is Hello...world
a at 19828752 is 8????...world, b at 19830832 is Hey!
So the memory at address a was partly rewritten - "Hello" is lost, "world" is still there (as well as at b + 200).
So you need to handle reallocations of sensitive areas yourself; better yet, pre-allocate it all at program startup. Then, tell the OS that a sensitive area of memory must never be swapped to disk. Then you need to zero it in such a way that the compiler can't interfere. And you need to use a low-level enough language that you're sure doesn't do things by itself: a simple string concatenation could spawn two or three copies of the data - I'm fairly certain it happened in PHP 5.2.
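As a rough sketch of the "pre-allocate and tell the OS not to swap it" step on Linux and other POSIX-like systems: the helper names below are hypothetical, and mlock can fail (for example when RLIMIT_MEMLOCK is low), so the result is reported rather than assumed.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of private anonymous memory and try to pin it in RAM.
 * Returns NULL if the mapping fails. *locked is set to 1 if mlock()
 * succeeded and 0 if the pages may still be swapped out. */
void *alloc_sensitive(size_t len, int *locked)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    *locked = (mlock(p, len) == 0);
    return p;
}

/* Release the region; unmapping also drops any lock on it.
 * The caller is expected to have wiped the contents first. */
int release_sensitive(void *p, size_t len)
{
    return munmap(p, len);
}
```

Because the region comes from its own mapping and is never passed to realloc, the allocator has no opportunity to leave stray copies of it elsewhere on the heap.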
Ages ago I wrote myself a small library - there wasn't valgrind yet - inspired by Steve Maguire's Writing Solid Code, and apart from overriding the various memory and string functions, I ended up overwriting memory and then calculating the checksum of the overwritten buffer. This was not for security; I used it to track buffer overflows and underflows, double frees, use of freed memory, that kind of thing.
And then you need to ensure your failsafes work - for example, what happens if the program aborts? Might it be possible to make it abort on purpose?
You need to implement defense in depth, and always look at ways to keep as little information around as possible - for example clearing the intermediate buffers during a calculation rather than waiting and freeing the whole lot in one fell swoop at the very end, or just when exiting the program; keeping hashes instead of passwords when at all possible; and so on.
Of course all this depends on how sensitive the information is and what the attack surface is likely to be (mandatory xkcd reference: here). Rebooting the PC with a memtest86 image could be a viable alternative. Think of a dual-boot computer with memtest86 set to test memory and power down the PC as default boot option. When you want to turn off the system... you reboot it instead. The PC will reboot, enter memtest86 by default, and before powering off for good, it'll start filling all available RAM with marching troops of zeros and ones. Good luck freeze-booting information from that.
Zeroing out secrets (passwords, keys, etc) immediately after you are done with them is fairly standard practice. The difficulty is in dealing with language and platform features that can get in your way.
For example, C++ compilers can optimize out calls to memset if they determine that the data is not read after the write. Or operating systems may have paged the memory out to disk, potentially leaving the data available that way.
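One common workaround for the optimized-out memset is to write through a volatile pointer, which the compiler is not allowed to elide. This is a minimal sketch with a hypothetical function name. Where available, library routines such as explicit_bzero (glibc and the BSDs) or the optional C11 memset_s serve the same purpose.

```c
#include <stddef.h>

/* Overwrite len bytes at buf with zeros. Each store goes through a
 * volatile lvalue, so the compiler must actually perform it even if
 * the buffer is never read again. */
void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    while (len--)
        *p++ = 0;
}
```

A typical use is to call secure_zero(key, sizeof key) immediately before the key buffer goes out of scope or is freed.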

MOVDQU instruction + page boundary

I have a simple test program that loads an xmm register with the
movdqu instruction accessing data across a page boundary (OS = Linux).
If the following page is mapped, this works just fine. If it's not
mapped then I get a SIGSEGV, which is probably expected.
However this diminishes the usefulness of the unaligned loads quite
a bit. Additionally SSE4.2 instructions (like pcmpistri) which
allow for unaligned memory references appear to exhibit this behavior
as well.
That's all fine -- except there's many an implementation of strcmp
using pcmpistri that I've found that don't seem to address this issue
at all -- and I've been able to contrive trivial testcases that will
cause these implementations to fail, while the byte-at-a-time trivial
strcmp implementation will work just fine with the same data layout.
One more note -- it appears that the GNU C library implementation for
64-bit Linux has a __strcmp_sse42 variant that appears to use the
pcmpistri instruction in a more safe manner. The implementation of
this strcmp is fairly complex, but it appears to be carefully trying
to avoid the page boundary issue. I'm not sure if that's due to the
issue I describe above, or whether it's just a side-effect of trying to
get better performance by aligning the data.
Anyway the question I have is primarily -- where can I find out more
about this issue? I've typed in "movdqu crossing page boundary" and
every variant of that I can think of to Google, but haven't come across
anything particularly useful. If anyone can point me to further info
on this it would be greatly appreciated.
First, any algorithm which tries to access an unmapped address will cause a SegFault. If a non-AVX code flow used a 4 byte load to access the last byte of a page and the first 3 bytes of "the next page" which happened to not be mapped then it would also cause a SegFault. No? I believe that the "issue" is that the AVX(1/2/3) registers are so much bigger than "typical" that algorithms which were unsafe (but got away with it) get caught if they are trivially extended to the larger registers.
Aligned loads (MOVDQA) can never have this problem since they don't cross any boundaries of their own size or greater. Unaligned loads CAN have this problem (as you've noted) and "often" do. The reason for this is that the instruction is defined to load the full size of the target register. You need to look at the operand types in the instruction definitions quite carefully. It doesn't matter how much of the data you are interested in. It matters what the instruction is defined to do.
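The condition under which a fixed-width load touches a second page is simple arithmetic on the address. A small sketch, with a hypothetical helper name and assuming the page size is a power of two:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if reading width bytes starting at addr touches two pages,
 * 0 otherwise. page_size must be a power of two (for example 4096). */
int load_crosses_page(uintptr_t addr, size_t width, size_t page_size)
{
    size_t offset = (size_t)(addr & (uintptr_t)(page_size - 1));

    return offset + width > page_size;
}
```

With 4096-byte pages, a 16-byte load at an address ending in 0xff8 crosses into the next page, while any 16-byte-aligned address never does. A string routine can use a check like this to fall back to a slower path near the end of a page.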
However...
AVX1 (Sandybridge) added a "masked move" capability which is slower than a movdqa or movdqu but will not (architecturally) access the unmapped page so long as the mask is not enabled for the portion of the access which would have fallen in that page. This is meant to address the issue. In general, moving forward, it appears that masked portions (See AVX512) of loads/stores will not cause access violations on IA either.
(It is a bummer about PCMPxSTRx behavior. Perhaps you could add 15 bytes of padding to your "string" objects?)
Facing a similar problem with a library I was writing, I got some information from a very helpful contributor.
The core of the idea is to align the 16-byte reads to the end of the string, then handle the leftover bytes at the beginning. This works because the end of the string must live in an accessible page, and you are guaranteed that the 16-byte truncated starting address must also live in an accessible page.
Since we never read past the string we cannot potentially stray into a protected page.
To handle the initial set of bytes, I chose to use the PCMPxSTRM functions, which return the bitmask of matching bytes. Then it's simply a matter of shifting the result to ignore any mask bits that occur before the true beginning of the string.

Kernel Panic -- Failed copy_from_user, kmalloc?

I am writing a rootkit for my OS class (the teacher is okay with me asking for help here). My rootkit hooks the sys_read system call to hide "magic" ports from the user. When I copy the user buffer *buf (one of the arguments of sys_read) to kernel space (into a buffer called kbuf) I get kernel panic/core dump error. It is possible that this is just because breaking read brings the system to a halt, but I wonder if anyone has any perspective on this.
The code is available online. Look at line 207: https://github.com/joshimhoff/toykit/blob/master/toykit.c
I hooked getdents and used copy_from_user to bring the getdents structs into kernel space, and this worked well! I am not sure what is different about read.
Thanks for the help!
I figured it out. I called the actual sys_read function and didn't check the return value. Sometimes it is negative to indicate an error. Instead of failing early, I asked kmalloc for a negative number of bytes.
Imagine that. Allocating negative memory. That would be a crazy world.
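The same mistake can be shown outside the kernel. This is a user-space analogue, not the toykit code: malloc stands in for kmalloc, memcpy for copy_from_user, and the function name is hypothetical. The point is only that the return value is checked before it is used as a size.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Copy the ret bytes that a read-like call reported into a fresh buffer.
 * ret is the call's return value: negative on error, 0 at end of file.
 * Returns NULL when there is nothing to copy or allocation fails. */
char *copy_read_result(const char *src, ssize_t ret)
{
    char *copy;

    if (ret <= 0)
        return NULL; /* fail early; never turn -1 into a huge size_t */
    copy = malloc((size_t)ret);
    if (copy == NULL)
        return NULL;
    memcpy(copy, src, (size_t)ret);
    return copy;
}
```

Without the check, a return value of -1 converted to an unsigned size becomes the largest representable value, which is what the allocator ends up being asked for.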

Can I write-protect every page in the address space of a Linux process?

I'm wondering if there's a way to write-protect every page in a Linux
process' address space (from inside of the process itself, by way of
mprotect()). By "every page", I really mean every page of the
process's address space that might be written to by an ordinary
program running in user mode -- so, the program text, the constants,
the globals, and the heap -- but I would be happy with just constants,
globals, and heap. I don't want to write-protect the stack -- that
seems like a bad idea.
One problem is that I don't know where to start write-protecting
memory. Looking at /proc/pid/maps, which shows the sections of memory
in use for a given pid, they always seem to start with the address
0x08048000, with the program text. (In Linux, as far as I can tell,
the memory of a process is laid out with the program text at the
bottom, then constants above that, then globals, then the heap, then
an empty space of varying size depending on the size of the heap or
stack, and then the stack growing down from the top of memory at
virtual address 0xffffffff.) There's a way to tell where the top of
the heap is (by calling sbrk(0), which simply returns a pointer to the
current "break", i.e., the top of the heap), but not really a way to
tell where the heap begins.
If I try to protect all pages from 0x08048000 up to the break, I
eventually get an mprotect: Cannot allocate memory error. I don't know why mprotect would be
allocating memory anyway -- and Google is not very helpful. Any ideas?
By the way, the reason I want to do this is because I want to create a
list of all pages that are written to during a run of the program, and
the way that I can think of to do this is to write-protect all pages,
let any attempted writes cause a write fault, then implement a write
fault handler that will add the page to the list and then remove the write
protection. I think I know how to implement the handler, if only I could
figure out which pages to protect and how to do it.
Thanks!
You receive ENOMEM from mprotect() if you try to call it on pages that aren't mapped.
Your best bet is to open /proc/self/maps, and read it a line at a time with fgets() to find all the mappings in your process. For each writeable mapping (indicated in the second field) that isn't the stack (indicated in the last field), call mprotect() with the right base address and length (calculated from the start and end addresses in the first field).
Note that you'll need to have your fault handler already set up at this point, because the act of reading the maps file itself will likely cause writes within your address space.
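The parsing step described above might look roughly like this. The struct and function names are hypothetical, and this sketch only counts the candidate mappings. A real tracker would call mprotect(start, end - start, PROT_READ) where the counter is incremented.

```c
#include <stdio.h>
#include <string.h>

struct mapping {
    unsigned long start, end; /* from the first field, "start-end" in hex */
    char perms[5];            /* second field, e.g. "rw-p" */
};

/* Parse the first two fields of one maps line. Returns 1 on success. */
int parse_maps_line(const char *line, struct mapping *m)
{
    return sscanf(line, "%lx-%lx %4s", &m->start, &m->end, m->perms) == 3;
}

/* A mapping is a candidate if it is writable and is not the stack. */
int should_protect(const char *line, const struct mapping *m)
{
    return m->perms[1] == 'w' && strstr(line, "[stack]") == NULL;
}

/* Walk /proc/self/maps and count candidates; returns -1 if it can't be opened. */
int count_protect_candidates(void)
{
    char line[512];
    struct mapping m;
    int count = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
        if (parse_maps_line(line, &m) && should_protect(line, &m))
            count++; /* a real tracker would call mprotect() here */
    fclose(f);
    return count;
}
```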
Start simple. Write-protect a few pages and make sure your signal handler works for these pages. Then worry about expanding the scope of the protection. For example, you probably do not need to write-protect the code section: operating systems can implement write-or-execute protection semantics on memory that will prevent code sections from ever being written to:
http://en.wikipedia.org/wiki/Self-modifying_code#Operating_systems
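A minimal "start simple" experiment along these lines, assuming Linux: map two pages read-only, install a SIGSEGV handler that records the faulting page and makes it writable again, then write to one page. All names are hypothetical. Note that mprotect is not formally listed as async-signal-safe by POSIX, although this pattern is widely used on Linux.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_DIRTY 64

static uintptr_t dirty_pages[MAX_DIRTY];
static volatile sig_atomic_t dirty_count;
static long page_size;

/* Record the faulting page, then unprotect it so the retried write succeeds. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t page = (uintptr_t)info->si_addr & ~((uintptr_t)page_size - 1);

    (void)sig;
    (void)ctx;
    /* If this was not one of our protected pages, mprotect fails and we bail out. */
    if (dirty_count >= MAX_DIRTY ||
        mprotect((void *)page, (size_t)page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
    dirty_pages[dirty_count++] = page;
}

int install_write_tracker(void)
{
    struct sigaction sa;

    page_size = sysconf(_SC_PAGESIZE);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Returns the number of distinct pages recorded, or -1 on setup failure. */
int demo_track_writes(void)
{
    volatile unsigned char *p;
    void *region;

    if (install_write_tracker() != 0)
        return -1;
    region = mmap(NULL, (size_t)(2 * page_size), PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    p = region;
    p[page_size] = 1;     /* first write to the second page: faults once */
    p[page_size + 1] = 2; /* same page is now writable: no further fault */
    munmap(region, (size_t)(2 * page_size));
    return (int)dirty_count;
}
```

Only the second page is written, so one entry ends up in dirty_pages. The first page stays read-only and is never recorded.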
