According to Wikipedia (http://en.wikipedia.org/wiki/Buffer_overflow):
Programming languages commonly associated with buffer overflows include C and C++, which provide no built-in protection against accessing or overwriting data in any part of memory and do not automatically check that data written to an array (the built-in buffer type) is within the boundaries of that array. Bounds checking can prevent buffer overflows.
So, why is bounds checking not implemented in languages like C and C++?
Basically, it's because it means every time you change an index, you have to do an if statement.
Let's consider a simple C for loop:
int ary[X] = {...}; // Purposefully leaving size and initializer unknown
for(int ix=0; ix< 23; ix++){
printf("ary[%d]=%d\n", ix, ary[ix]);
}
If we have bounds checking, the generated code for ary[ix] has to be something like:
LOOP:
INC IX ; add 1 to ix
CMP IX, 23 ; while test
JGE END ; leave the loop once IX >= 23
CMP IX, X ; compare IX and X
JGE ERROR ; if IX >= X jump to ERROR
LD R1, IX ; put the value of IX into register 1
LD R2, ARY+IX ; put the array value in R2
LA R3, Str42 ; STR42 is the format string
JSR PRINTF ; now we call the printf routine
J LOOP ; go back to the top of the loop
;;; somewhere else in the code
ERROR:
HCF ; halt and catch fire
If we don't have that bounds check, then we can write instead:
LD R1, IX
LOOP:
CMP IX, 23
JGE END
LD R2, ARY+R1
JSR PRINTF
INC R1
J LOOP
This saves 3-4 instructions in the loop, which (especially in the old days) meant a lot.
In fact, on the PDP-11 machines, it was even better, because there was something called "auto-increment addressing". On a PDP, all of the register stuff etc. turned into something like
CZ -(IX), END ; compare IX to zero, then decrement; jump to END if zero
(And anyone who happens to remember the PDP better than I do, don't give me trouble about the precise syntax etc; you're an old fart like me, you know how these things slip away.)
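In C terms, the difference between the two listings above is roughly the difference between the two accessors below. This is only a sketch: get_unchecked and get_checked are hypothetical names made up for illustration, not part of any standard library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Unchecked access: a plain indexed load, as in the second listing. */
static int get_unchecked(const int *ary, size_t ix) {
    return ary[ix];
}

/* Checked access: one extra compare-and-branch on every access,
 * as in the first listing. Returns false instead of reading out of bounds. */
static bool get_checked(const int *ary, size_t len, size_t ix, int *out) {
    assert(ary != NULL && out != NULL);
    if (ix >= len) {
        return false;   /* the "JGE ERROR" path */
    }
    *out = ary[ix];
    return true;
}
```

Inside a tight loop the extra branch in get_checked is exactly the per-iteration cost being discussed.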
It's all about the performance. However, the assertion that C and C++ have no bounds checking is not entirely correct. It is quite common to have "debug" and "optimized" versions of each library, and it is not uncommon to find bounds-checking enabled in the debugging versions of various libraries.
This has the advantage of quickly and painlessly finding out-of-bounds errors when developing the application, while at the same time eliminating the performance hit when running the program for realz.
I should also add that the performance hit is non-negligible, and many languages other than C++ will provide various high-level functions operating on buffers that are implemented directly in C and C++ specifically to avoid the bounds checking. For example, in Java, if you compare the speed of copying one array into another using pure Java vs. using System.arraycopy (which does bounds checking once, but then straight-up copies the array without bounds-checking each individual element), you will see a decently large difference in the performance of those two operations.
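The "check once, then copy in bulk" idea can be sketched in C as well. The helper below, copy_range, is a hypothetical function written for illustration (it only mirrors the shape of such a call, it is not how any particular runtime implements it): the whole range is validated up front, and the copy itself then runs with no per-element checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copies count ints from src[src_pos..] to dst[dst_pos..].
 * The bounds are checked once for the whole range; returns false
 * (and copies nothing) if either range would fall outside its array. */
static bool copy_range(int *dst, size_t dst_len, size_t dst_pos,
                       const int *src, size_t src_len, size_t src_pos,
                       size_t count) {
    assert(dst != NULL && src != NULL);
    /* Written as subtractions so the checks themselves cannot overflow. */
    if (src_pos > src_len || count > src_len - src_pos) return false;
    if (dst_pos > dst_len || count > dst_len - dst_pos) return false;
    /* memmove tolerates overlapping source and destination. */
    memmove(dst + dst_pos, src + src_pos, count * sizeof *src);
    return true;
}
```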
It is easier to implement and faster both to compile and at run-time. It also simplifies the language definition (as quite a few things can be left out if this is skipped).
Currently, when you do:
int *p = (int*)malloc(sizeof(int));
*p = 50;
C (and C++) just says, "Okey dokey! I'll put something in that spot in memory".
If bounds checking were required, C would have to say, "Ok, first let's see if I can put something there? Has it been allocated? Yes? Good. I'll insert now." By skipping the test to see whether there is something which can be written there, you are saving a very costly step. On the other hand, (she wore a glove), we now live in an era where "optimization is for those who cannot afford RAM," so the arguments about the speed are getting much weaker.
The primary reason is the performance overhead of adding bounds checking to C or C++. While this overhead can be reduced substantially with state-of-the-art techniques (to 20-100% overhead, depending upon the application), it is still large enough to make many folks hesitate. I'm not sure whether that reaction is rational -- I sometimes suspect that people focus too much on performance, simply because performance is quantifiable and measurable -- but regardless, it is a fact of life. This fact reduces the incentive for major compilers to put effort into integrating the latest work on bounds checking into their compilers.
A secondary reason involves concerns that bounds checking might break your app. Particularly if you do funky stuff with pointer arithmetic and casting that violates the standard, bounds checking might block something your application is currently doing. Large applications sometimes do amazingly crufty and ugly things. If the compiler breaks the application, then there's no point in blaming the crufty code for the problem; people aren't going to keep using a compiler that breaks their application.
Another major reason is that bounds checking competes with ASLR + DEP. ASLR + DEP are perceived as solving, oh, 80% of the problem or so. That reduces the perceived need for full-fledged bounds checking.
Because it would cripple those general purpose languages for HPC requirements. There are plenty of applications where buffer overflows really do not matter one iota, simply because they do not happen. Such features are much better off in a library (where in fact you can already find examples for C/C++).
For domain specific languages it may make sense to bake such features into the language definition and trade the resulting performance hit for increased security.
Related
I have an embedded project in Rust on the STM32F446 MCU. Consider the next line:
leds::set_g(self.next_update_time % 2000 == 0)
This line uses the modulo operator and, from what I've read online, the Cortex-M4 doesn't have a modulo instruction. Instead, a function that does this in software gets added to the binary. It can be found using cargo bloat (based on Google's Bloaty):
File  .text  Size  Crate              Name
...
0.1%   6.9%  990B  compiler_builtins  __udivmoddi4
...
Much to my surprise, it takes just under a kilobyte of memory. I think that's a lot. The code behind it is quite long as well, see this link. I assume this implementation is made to be fast. Luckily I have the memory to spare.
Using opt-level = 'z' doesn't change this.
But what if I couldn't afford this, how could I let it take up less memory?
Of course resorting to a solution like this would work, but then I'd lose the ability to use the % operator.
Not sure how clever the Rust linker is, but in many embedded linker implementations you would be able to swap in your own implementation of __udivmoddi4 which used a smaller (but slower) method in preference to the version provided by the compiler.
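To give an idea of what a "smaller (but slower) method" could look like, here is a sketch of plain shift-and-subtract (bit-by-bit) 64-bit division in C. The name udivmod64_small is made up for illustration; this is not the compiler_builtins or GCC code, and whether it can actually be linked in place of the builtin depends on the toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-by-bit long division: 64 iterations, no lookup tables, no special
 * cases. Returns n / d and, if rem is non-NULL, stores n % d in *rem. */
static uint64_t udivmod64_small(uint64_t n, uint64_t d, uint64_t *rem) {
    assert(d != 0);
    uint64_t q = 0;
    uint64_t r = 0;
    for (int i = 63; i >= 0; i--) {
        uint64_t carry = r >> 63;            /* bit about to be shifted out */
        r = (r << 1) | ((n >> i) & 1u);      /* bring down the next bit of n */
        if (carry || r >= d) {
            r -= d;                          /* wraps correctly when carry is set */
            q |= (uint64_t)1 << i;
        }
    }
    if (rem) {
        *rem = r;
    }
    return q;
}
```

The loop trades speed for size: it always runs 64 iterations, but the body is only a handful of instructions.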
In general generic division and modulo are expensive on embedded platforms, but division by a constant can often be specialized with a "fixed" implementation by a smart compiler (often with special cases for common divisors - 3, 5, 7, 10, etc).
If you can control the application then changing the code to divide or modulo by 2^N is obviously preferable (it collapses to either a "right shift" instruction for divide, or an "and" instruction for modulo). E.g. in this case 2048 might be acceptably close to 2000, and turns 1 KB of code into 4 bytes of code.
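As a sketch of the power-of-two trick (shown in C rather than Rust; PERIOD and on_period_boundary are hypothetical names), the modulo test becomes a single mask. For unsigned values, t % 2048 and t & 2047 give the same result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PERIOD 2048u   /* assumed replacement for 2000; must be a power of two */

static_assert((PERIOD & (PERIOD - 1u)) == 0, "PERIOD must be a power of two");

/* Equivalent to (t % PERIOD == 0) for unsigned t, but needs no division. */
static bool on_period_boundary(uint64_t t) {
    return (t & (PERIOD - 1u)) == 0;
}
```

The catch is that the period changes from 2000 to 2048 ticks, so this only works if the application can tolerate that.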
FWIW the Rust version of this does seem a little on the fat side - the GCC implementation for example is much smaller.
As part of my project, I have to insert small pieces of code called checksum guards into a C program. These guards calculate the checksum value of a portion of code using a function (add, xor, etc.) which operates on the instruction opcodes. So, if somebody has tampered with the instructions (add, modify, delete) in that region of code, the checksum value will change and the intrusion will be detected.
Here is the research paper which talks about this technique:
https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2001-49.pdf
Here is the guard template:
guard:
add ebp, -checksum
mov eax, client_addr
for:
cmp eax, client_end
jg end
mov ebx, dword[eax]
add ebp, ebx
add eax, 4
jmp for
end:
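For readers who find C easier to follow, the additive checksum in the template above can be sketched as follows. The names region_checksum and region_intact are hypothetical, and the sketch takes a byte length rather than an end address, so the handling of the last word differs slightly from the jg test in the template.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sums the region as 32-bit words, wrapping modulo 2^32
 * (the same arithmetic as "add ebp, ebx"). len must be a multiple of 4. */
static uint32_t region_checksum(const unsigned char *start, size_t len) {
    assert(len % 4 == 0);
    uint32_t sum = 0;
    for (size_t off = 0; off < len; off += 4) {
        uint32_t word;
        memcpy(&word, start + off, sizeof word);  /* avoids unaligned reads */
        sum += word;
    }
    return sum;
}

/* Returns nonzero if the region still matches the checksum recorded earlier. */
static int region_intact(const unsigned char *start, size_t len,
                         uint32_t expected) {
    return region_checksum(start, len) == expected;
}
```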
Now, I have two questions.
Would putting the guards in the assembly be better than putting them in the source program?
Assuming I put them in the assembly (at an appropriate place), what kind of obfuscation should I use to prevent the guard template from being easily visible? (When I have more than one guard, the attacker should not be able to easily find all the guard templates and disable all the guards together, as that would leave the code with no security.)
Thank you in advance.
From the point of view of an attacker (without sources), the first question doesn't matter: he's tampering with the final binary machine code, and whether it was produced from .c or .s makes zero difference. So I would worry mainly about how to generate the correct binary with appropriate checksums. I'm not aware of any way to get a proper checksum inside the C source. But I can imagine some external tool running over the assembler files created by the C compiler as a post-processing step, before the .s files are assembled into .o. But... keep in mind that some calls and addresses are just relative offsets, and the binary loaded into memory is patched by the OS loader according to the linker's table, to make those point to the real memory addresses. Thus the data bytes will change (opcodes will stay fixed).
Your guard template doesn't take that into account, and checksums whole instructions, data bytes included. (Some advanced guards have opcode definitions, and checksum/encrypt/decipher only the opcodes themselves without the operand bytes.)
Otherwise it's neat that the result is a damaged ebp value, ruining any surrounding C code (*) that works with stack variables. But it's still an artificial test: you can simply comment out both add ebp,-checksum and add ebp,ebx, making the guard harmless.
(*) Notice that you have to put the guard in between some classic C code to get real runtime problems from an invalid ebp value. If you put it at the end of a subroutine that ends with pop ebp, everything would work well.
So to the second question:
You definitely want more malicious ways of guarding the correct value than only ebp damage. Usually the hardest (to remove) way is to make the checksum value part of some calculation, eventually skewing results just slightly, so that serious usage of the software will be impossible, but it will take time for the user to notice.
You can also add your checksumming to some genuine code loop, so that simply skipping the whole loop also skips valid code (but I can imagine this one only being added by hand to the assembly generated from C, so you will have to redo it after every new compilation of that particular C source).
Then the particular guard template can be obfuscated by any imaginable mutation (different registers used, modified order of instructions, instruction variants); try searching for viruses with mutation encoding to get some ideas.
And I didn't read that paper, but from the figures I would say the main point is to make the guarded areas overlap, so that patching out one of them will affect another one, which sounds to me like the extra sugar that makes it somewhat functional (although this still looks like a normal challenge to 8-bit game crackers ;), not even "hard" level). But that also means you would need either a very cunning external tool to calculate that cyclic tree of dependencies and insert the guard templates in the correct order, or to do it all manually again.
Of course, when doing it manually, you have to do it after each new C compilation, so it's worth the effort only on something very precious and expensive, or rock-solid stable, where you will not produce another revision for the next 10 years or so... :D
For fun I'm implementing an NES emulator. I'm currently reading through documentation for the 6502 CPU and I'm a little confused.
I've seen documentation stating that because the 6502 is little-endian, you need to swap the bytes when using absolute addressing mode. I'm writing this on an x86 machine, which is also little-endian, so I don't understand why I couldn't simply cast to a uint16_t*, dereference that, and let the compiler work out the details.
I've written some simple tests in google test and they seem to agree with me.
// implementation of READ16
#define READ16(addr) (*(uint16_t*)addr)
TEST(MemMacro, READ16) {
uint8_t arr[] = {0xFF,0xCC};
uint8_t *mem = (&arr[0]);
EXPECT_EQ(0xCCFF, READ16(mem));
}
This passes, so it appears my supposition is correct, but I thought I'd ask someone with more experience than I.
Is this correct for pulling out the operand in 6502 absolute addressing mode? Am I possibly missing something?
It will work for simple cases on little-endian systems, but tying your implementation to those feels unnecessary when the corresponding portable implementation is simple. Sticking to the macro, you could do this instead:
#define READ16(addr) (addr[0] + (addr[1] << 8))
(Just to be pedantic, you should also make sure that addr[1] can't be out-of-bounds, and would need to add some more parentheses if addr could be a complex expression.)
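Written as a function instead of a macro, the portable version with the bounds check mentioned above might look like this. It is a minimal sketch; read16_le is a hypothetical name, and it assumes the memory is a flat byte array of known length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 16-bit little-endian value starting at mem[addr].
 * Works the same on little- and big-endian hosts because the bytes
 * are combined arithmetically rather than by reinterpreting memory. */
static uint16_t read16_le(const uint8_t *mem, size_t len, size_t addr) {
    assert(len >= 2 && addr <= len - 2);   /* both bytes must be in bounds */
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}
```

Being a function, it also sidesteps the macro-argument parenthesization issue.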
However, as you keep developing your emulator, you will find that it's most natural to use a pair of general-purpose read_mem() and write_mem() functions that operate on single bytes. Remember that the address space is split up into multiple regions (RAM, ROM, and memory-mapped registers from the PPU and APU), so having e.g. a single array that you index into won't work well. The fact that memory regions can be remapped by mappers also complicates things. (You won't have to worry about that for simple games though -- I recommend starting with Donkey Kong.)
What you need to do is to figure out what region or memory-mapped register the address belongs to inside your read_mem() and write_mem() functions (this is called address decoding), and do the right thing for the address.
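A heavily simplified sketch of such address decoding is shown below. The Bus struct and read_mem16 are hypothetical, the PPU/APU ranges are stubbed out to return 0, and the cartridge is assumed to be a plain 32 KiB PRG ROM with no mapper, so treat it as an outline rather than a working bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t ram[0x0800];       /* 2 KiB of internal RAM */
    uint8_t prg_rom[0x8000];   /* assumption: 32 KiB PRG ROM, no mapper */
} Bus;

/* Decodes a CPU address to the region it belongs to and reads one byte. */
static uint8_t read_mem(const Bus *bus, uint16_t addr) {
    assert(bus != NULL);
    if (addr < 0x2000) {
        return bus->ram[addr & 0x07FF];      /* RAM, mirrored every 2 KiB */
    }
    if (addr < 0x4000) {
        return 0;                            /* stub: PPU register (addr & 7) */
    }
    if (addr < 0x8000) {
        return 0;                            /* stub: APU/I-O and cartridge space */
    }
    return bus->prg_rom[addr - 0x8000];      /* PRG ROM */
}

/* 16-bit little-endian read built from two single-byte reads. */
static uint16_t read_mem16(const Bus *bus, uint16_t addr) {
    uint8_t lo = read_mem(bus, addr);
    uint8_t hi = read_mem(bus, (uint16_t)(addr + 1));
    return (uint16_t)(lo | (hi << 8));
}
```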
Returning to the original question, the fact that you'll end up using read_mem() to read the individual bytes of the address anyway means that the uint16_t casting trickery is even less likely to be useful. This is the simplest and most robust approach w.r.t. handling corner cases, and what every emulator I've seen does in practice (Nestopia, Nintendulator, and FCEUX).
In case you've missed it, the #nesdev channel on EFNet is very active and a good resource by the way. I assume you're already familiar with the NESDev wiki. :)
I've also been working on an emulator which can be found here.
I am writing computationally heavy code for a server (in C/C++). In the inner loops, I need to call some external user functions, millions of times, so they have to run natively fast and their invocation should have no more overhead than a C function call. Each time I receive a user function, in source form, I will automatically compile it into binary and it will be dynamically linked by the main code.
Those functions will only be used as simple math kernels, e.g. in pseudo-C:
Function f(double x) ->double {
return x * x;
}
or with array access:
Function f(double* ar, int length) ->double {
double sum = 0;
for(i = 0 to length) {
sum = sum + ar[i];
}
return sum;
}
or with basic math library calls:
Function f(double x) ->double {
return cos(x);
}
However, they have to be safe for the server. It's OK if they halt (Turing completeness), but not if they access process memory that is not their own, if they do system calls, if they cause stack overflow, or to generalize, it's unwanted for the external code to "be able to hack the server code".
So my question: I'm wondering if there is a safe-by-design language with an LLVM frontend (with no pointers etc., with bounds checking for arrays/stack, isolation of system calls), with no speed penalties (referring to supervisors, garbage collectors), that I can use. LLVM is not necessary, but it's preferred.
I had a look at Mozilla's "Rust" but it doesn't seem to be safe enough [rust-dev].
If there is no such language my fallback option right now is to use a NodeJS Sandboxed VM.
I believe that such a language, if made simple, is feasible but does it exist?
The type of language doesn't matter. A toy language with simplistic design and easy to prove safety would do.
EDIT: Concerning the system calls and harmful dependencies, for any language, it should be easy enough to isolate them with plain bash. Just try to link the produced .bc with no libraries. If it fails, the .bc has dependencies, so drop it. Since LLVM IR is otherwise totally harmless, the only thing that should be guaranteed by the language is memory access.
I would really like to add a comment, however Stack-Overflow is preventing me. So I'll just add it as an answer. Perhaps it will be useful.
You might try looking at https://github.com/andoma/vmir. I have been working with it a bit with the hopes of sandboxing arbitrary c++/swift code. I think, it might be possible to create a "safe" interpreter/JIT.
You can control all functions which are called. You can control how memory is accessed. So... Basically, I think, (and am hoping), that I can modify the JIT and interpreter enough so that I can reject code which is inherently not safe, and put up memory boundaries/function restrictions.
Having distinct processes à la PNaCl is the obvious sandboxing choice, but the overhead is substantial. I believe the sandboxing is done process-wise.
I was always wondering why such a simple and basic operation like swapping the contents of two variables is not built-in for many languages.
It is one of the most basic programming exercises in computer science classes; it is heavily used in many algorithms (e.g. sorting); every now and then one needs it and one must use a temporary variable or use a template/generic function.
It is even a basic machine instruction on many processors, so that the standard scheme with a temporary variable will get optimized.
Many less obvious operators have been created, like the assignment operators (e.g. +=, which was probably created for reflecting the cumulative machine instructions, e.g. add ax,bx), or the ?? operator in C#.
So, what is the reason? Or does it actually exist, and I always missed it?
In my experience, it isn't that commonly needed in actual applications, apart from the already-mentioned sort algorithms and occasionally in low level hardware poking, so in my view it's a bit too special-purpose to have in a general-purpose language.
As has also been mentioned, not all processors support it as an instruction (and many do not support it for objects bigger than a word). So if it were supported with some useful additional semantics (e.g. being an atomic operation) it would be difficult to support on some processors, and if it didn't have the additional semantics then it's just (seldom used) syntactic sugar.
The assignment operators (+= etc.) were supported because these are much more common in real-world programs, so the syntactic sugar they provide was more useful, and also as an optimisation: remember C dates from the late 60s/early 70s, and compiler optimisation wasn't as advanced (and the machines were less capable, so you didn't want lengthy optimisation passes anyway).
Paul
C++ does have swapping.
#include <algorithm>
#include <cassert>
int
main()
{
using std::swap;
int a(3), b(5);
swap(a, b);
assert(a == 5 && b == 3);
}
Furthermore, you can specialise swap for custom types too!
It's a widely used example in computer science courses, but I almost never find myself needing it in real code - whereas I use += very frequently.
Yes, in sorting it would be handy - but you don't tend to need to implement sorting yourself, so the number of actual uses in source code would still be pretty low.
You do have the XOR operator, which can swap two variables of a primitive type without a temporary...
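For reference, the XOR trick looks like this in C. This is only a sketch (xor_swap is a hypothetical helper), and note the guard: if both pointers refer to the same object, the first XOR would zero it.

```c
#include <assert.h>
#include <stddef.h>

/* Swaps *a and *b without a temporary variable. */
static void xor_swap(unsigned *a, unsigned *b) {
    assert(a != NULL && b != NULL);
    if (a == b) {
        return;        /* x ^= x would clear the value */
    }
    *a ^= *b;          /* a = a0 ^ b0 */
    *b ^= *a;          /* b = b0 ^ a0 ^ b0 = a0 */
    *a ^= *b;          /* a = a0 ^ b0 ^ a0 = b0 */
}
```

In practice a plain temporary is usually at least as fast and much clearer, so this is more of a curiosity than a replacement for a swap operator.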
I think they just forgot to add it :-) Yes, not all CPUs have this kind of instruction, so what? We have a bunch of other things that most CPUs don't have instructions for. It would be much easier/clearer and also faster (via an intrinsic) if we had it!