Views on Design by Contract / Code Contracts

I am currently reading up to understand more about Design by Contract / code contracts.
From what I know, the idea is to write contracts (invariants, pre- and post-conditions) so that the code can be maintained in an orderly way, and so that bugs are prevented by a well-defined mechanism of checks and balances.
But wouldn't this affect software performance, since there are additional checks around each method call?
I would really appreciate it if people shared their views on and experience with Design by Contract. Advantages and disadvantages are both welcome.

Typically such frameworks support both runtime checks and static analysis. The latter is performed at compile time (or earlier) and does not slow down your code at all. The former can potentially affect performance.
The Microsoft Research Code Contracts project is a nice example of this. You can configure your system such that:
static analysis applies a subset of the possible contract enforcement at compile time, or even within the development environment;
runtime checks are enabled for all code compiled in debug mode; and
a subset of the runtime checks is enabled for the public API of code compiled in release mode (non-public code has no runtime checks).
This is often a good compromise between performance and robustness.
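To make the trade-off concrete, here is a minimal sketch in plain C (not the Code Contracts API; the REQUIRE/ENSURE macros and the safe_div function are hypothetical names for illustration) of the same idea: full pre- and post-condition checks in debug builds, and only a cheap precondition check at the public entry point in release builds.

#include <assert.h>
#include <stdio.h>

#ifdef NDEBUG
/* Release build: keep only an inexpensive precondition check on the public API.
   (The early return only works in int-returning functions; this is just a sketch.) */
#define REQUIRE(cond) do { if (!(cond)) { fprintf(stderr, "precondition failed: %s\n", #cond); return -1; } } while (0)
#define ENSURE(cond)  ((void)0)
#else
/* Debug build: enforce both pre- and post-conditions. */
#define REQUIRE(cond) assert(cond)
#define ENSURE(cond)  assert(cond)
#endif

/* Public API: integer division with a contract. */
int safe_div(int a, int b, int *out)
{
    REQUIRE(b != 0);          /* precondition */
    REQUIRE(out != NULL);     /* precondition */

    *out = a / b;

    ENSURE(a == (*out) * b + a % b);   /* postcondition: division identity holds */
    return 0;
}

The release configuration still rejects bad calls at the public boundary, while the debug configuration also catches contract violations inside the implementation, mirroring the compromise described above.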

One way to apply the design-by-contract philosophy is purely static.
Consider a contract for a function max_element():
/*#
   requires IsValidRange(a, n);
   assigns \nothing;

   behavior empty:
     assumes n == 0;
     ensures \result == 0;

   behavior not_empty:
     assumes 0 < n;
     ensures 0 <= \result < n;
     ensures \forall integer i; 0 <= i < n ==> a[i] <= a[\result];
     ensures \forall integer i; 0 <= i < \result ==> a[i] < a[\result];

   complete behaviors;
   disjoint behaviors;
*/
size_type max_element(const value_type* a, size_type n);
If you can verify at compile time that an implementation always satisfies the post-conditions in its ensures clauses whenever it is called with arguments satisfying the pre-conditions in its requires clauses, then it is unnecessary to generate runtime checks for the post-conditions.
Similarly, if you can verify that every caller, when its own pre-conditions are satisfied, calls max_element() only with arguments that satisfy its pre-conditions, then checks at the entry of the function are also unnecessary.
The above example is from ACSL by Example. This library provides many function contracts in ACSL. Implementations in C are provided for the contracts. The implementations have been statically formally verified to guarantee that the post-conditions hold for all calls with arguments that satisfy the pre-conditions. Therefore, no run-time check is necessary for the post-conditions. The compiler can treat the annotations as comments (which they are, using the /*# ... */ syntax).
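For reference, a C implementation satisfying this contract is short. The verified version lives in ACSL by Example; a functionally equivalent sketch (with size_type and value_type as placeholder typedefs) looks like this:

#include <stddef.h>

typedef size_t size_type;
typedef double value_type;

/* Returns the index of the first maximum element of a[0..n-1], or 0 if n == 0,
   matching the empty and not_empty behaviors in the contract above. */
size_type max_element(const value_type* a, size_type n)
{
    size_type max = 0;
    for (size_type i = 1; i < n; ++i) {
        if (a[max] < a[i]) {   /* strict comparison keeps the first maximum */
            max = i;
        }
    }
    return max;
}

This is the kind of implementation a static verifier can check against the annotations above, so no runtime checks need to be emitted.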

Design-by-contract is intended to ensure correct implementation: that the code is written correctly with respect to the intended API between callers and callees. Ideally, you verify that the contracts are upheld using static analysis tools plus runtime checks in development and alpha-test environments. By then you have hopefully caught any implementation errors with respect to the API. You probably don't enable the runtime checks in beta or production unless you are trying to track down an error.

Related

Why would more array accesses perform better?

I'm taking a course on Coursera that uses MiniZinc. In one of the assignments, I was spinning my wheels forever because my model was not performing well enough on a hidden test case. I finally solved it by changing the following types of accesses in my model
from
constraint sum(neg1,neg2 in party where neg1 < neg2)(joint[neg1,neg2]) >= m;
to
constraint sum(i,j in 1..u where i < j)(joint[party[i],party[j]]) >= m;
I don't know what I'm missing, but why would these two perform any differently from each other? It seems like they should perform similarly, with the former being maybe slightly faster, but the performance difference was dramatic. I'm guessing there is some sort of optimization that the former misses out on? Or am I really missing something, and do those lines actually result in different behavior? My intention is to sum the joint strength of every pair in party.
Misc. Details:
party is an array of enum vars
party's index set is 1..real_u
every element in party should be unique except for a dummy variable.
solver was Gecode
verification of my model was done on a coursera server so I don't know what optimization level their compiler used.
edit: Since MiniZinc (mz) is a declarative language, I'm realizing that "array accesses" in mz don't necessarily have a direct counterpart in an imperative language. However, to me, these two lines mean the same thing semantically. So I guess my question is more "Why are the above lines semantically different in mz?"
edit2: I had to change the example in question; the original was toeing the line of violating Coursera's honor code.
The difference stems from the way the where-clause "a < b" is evaluated. When "a" and "b" are parameters, the compiler can already exclude the irrelevant parts of the sum during compilation. If "a" or "b" is a decision variable, then this usually cannot be decided at compile time, and the solver will receive a more complex constraint.
In this case the solver would have received a sum over "array[int] of var opt int", meaning that some variables in the array might not actually be present. For most solvers this is rewritten into a sum where every variable is multiplied by a Boolean variable that is true iff the variable is present. You can see how this is less efficient than a normal sum without multiplications.

"getenv... function ... may be unsafe" - really?

I'm using MSVC to compile some C code which uses standard-library functions, such as getenv(), sprintf and others, with /W3 set for warnings. I'm told by MSVC that:
'getenv': This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS
Questions:
Why would this be unsafe, theoretically - as opposed to its use on other platforms?
Is it unsafe on Windows in practice?
Assuming I'm not writing security-oriented code - should I disable this warning or actually start aliasing a bunch of standard library functions?
getenv() is potentially unsafe in that subsequent calls to that same function may invalidate earlier returned pointers. As a result, usage such as
char *a = getenv("A");
char *b = getenv("B");
/* do stuff with both a and b */
may break, because there's no guarantee a is still usable at that point.
getenv_s() - added in C11's (optional) Annex K - avoids this by immediately copying the value into a caller-supplied buffer, over whose lifetime the caller has full control. _dupenv_s() avoids it by making the caller responsible for managing the lifetime of the allocated buffer.
However, the signature of getenv_s is somewhat controversial, and the function may even be removed from the C standard at some point... see this report.
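As an illustration, here is a minimal sketch of using getenv_s as provided by MSVC and C11 Annex K (the 256-byte buffer is an arbitrary choice for the example):

#define __STDC_WANT_LIB_EXT1__ 1   /* request Annex K declarations where supported */
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    char buf[256];
    size_t required = 0;

    /* The value is copied into buf immediately, so later getenv/getenv_s
       calls cannot invalidate it. */
    if (getenv_s(&required, buf, sizeof buf, "PATH") == 0 && required != 0) {
        printf("PATH = %s\n", buf);
    } else {
        printf("PATH is not set, or the buffer is too small (%zu bytes needed)\n", required);
    }
    return 0;
}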
getenv, like much of the classic C Standard Library, suffers from not bounding the string buffer length. This is where security bugs like buffer overruns often originate.
If you look at getenv_s you'll see it provides an explicit bound on the length of the returned string. It's recommended for all coding by the Security Development Lifecycle best practice, which is why Visual C++ emits deprecation warnings for the less secure versions.
See MSDN and this blog post.
There was an effort by Microsoft to get the C/C++ ISO Standard Library to include the Secure CRT here, some of which was approved for C11 Annex K as noted here. That also means that getenv_s should be part of the C++17 Standard Library by reference. That said, Annex K is officially considered optional for conformance. The _s bounds-checking versions of these functions are also still a subject of some debate in the C/C++ community.

Is there an LLVM-based programming language that can guarantee sandbox-safe fast binaries?

I am writing computationally heavy code for a server (in C/C++). In the inner loops, I need to call external user functions millions of times, so they have to run at native speed and invoking them should have no more overhead than a C function call. Each time I receive a user function, in source form, I will automatically compile it to binary, and it will be dynamically linked by the main code.
These functions will only be used as simple math kernels, e.g. in a pseudo-C:
Function f(double x) ->double {
    return x * x;
}
or with array access:
Function f(double* ar, int length) ->double {
    double sum = 0;
    for(i = 0 to length) {
        sum = sum + ar[i];
    }
    return sum;
}
or with basic math library calls:
Function f(double x) ->double {
    return cos(x);
}
However, they have to be safe for the server. It's OK if they fail to halt (Turing completeness is acceptable), but not if they access process memory that is not their own, make system calls, or cause stack overflows; more generally, the external code must not be able to "hack the server code".
So my question: I'm wondering if there is a safe-by-design language with an LLVM frontend (no raw pointers, bounds checking for arrays and the stack, isolation of system calls) and no speed penalties (no supervisors or garbage collectors) that I can use. LLVM is not necessary, but it is preferred.
I had a look at Mozilla's Rust, but it doesn't seem to be safe enough [rust-dev].
If there is no such language, my fallback option right now is to use a Node.js sandboxed VM.
I believe that such a language, if kept simple, is feasible, but does it exist?
The type of language doesn't matter; a toy language with a simplistic design and easily provable safety would do.
EDIT: Concerning system calls and harmful dependencies: for any language, it should be easy enough to isolate them with plain bash. Just try to link the produced .bc with no libraries; if linking fails, the .bc has dependencies, so drop it. Since LLVM IR is otherwise totally harmless, the only thing the language needs to guarantee is safe memory access.
I would really like to add this as a comment, but Stack Overflow is preventing me, so I'll add it as an answer instead. Perhaps it will be useful.
You might try looking at https://github.com/andoma/vmir. I have been working with it a bit, with the hope of sandboxing arbitrary C++/Swift code. I think it might be possible to create a "safe" interpreter/JIT.
You can control all functions that are called, and you can control how memory is accessed. So, basically, I think (and hope) that I can modify the JIT and interpreter enough to reject code that is inherently unsafe and to put up memory boundaries/function restrictions.
Having distinct processes à la PNaCl is the obvious sandboxing choice, but the overhead is substantial; I believe its sandboxing is done per process.

Why is bounds checking not implemented in some languages?

According to Wikipedia (http://en.wikipedia.org/wiki/Buffer_overflow):
Programming languages commonly associated with buffer overflows include C and C++, which provide no built-in protection against accessing or overwriting data in any part of memory and do not automatically check that data written to an array (the built-in buffer type) is within the boundaries of that array. Bounds checking can prevent buffer overflows.
So why is bounds checking not implemented in languages like C and C++?
Basically, it's because it means that every time you use an index, you have to do an if statement.
Let's consider a simple C for loop:
int ary[X] = {...};   // Purposefully leaving size and initializer unknown
for (int ix = 0; ix < 23; ix++) {
    printf("ary[%d]=%d\n", ix, ary[ix]);
}
If we have bounds checking, the generated code for ary[ix] has to be something like
LOOP:
    CMP IX, 23     ; while test
    JGE END        ; exit the loop when IX >= 23
    CMP IX, X      ; bounds check: compare IX and X
    JGE ERROR      ; if IX >= X jump to ERROR
    LD R1, IX      ; put the value of IX into register 1
    LD R2, ARY+IX  ; put the array value in R2
    LA R3, Str42   ; STR42 is the format string
    JSR PRINTF     ; now we call the printf routine
    INC IX         ; add 1 to IX
    J LOOP         ; go back to the top of the loop
;;; somewhere else in the code
ERROR:
    HCF            ; halt and catch fire
If we don't have that bounds check, then we can write instead:
    LD R1, IX      ; put the value of IX into register 1
LOOP:
    CMP R1, 23     ; while test
    JGE END        ; exit the loop when the index reaches 23
    LD R2, ARY+R1  ; put the array value in R2
    JSR PRINTF     ; call the printf routine
    INC R1         ; add 1 to the index
    J LOOP         ; go back to the top of the loop
END:
This saves 3-4 instructions in the loop, which (especially in the old days) meant a lot.
In fact, in the PDP-11 machines, it was even better, because there was something called "auto-increment addressing". On a PDP, all of the register stuff etc turned into something like
CZ -(IX), END ; compare IX to zero, then decrement; jump to END if zero
(And anyone who happens to remember the PDP better than I do, don't give me trouble about the precise syntax etc; you're an old fart like me, you know how these things slip away.)
It's all about the performance. However, the assertion that C and C++ have no bounds checking is not entirely correct. It is quite common to have "debug" and "optimized" versions of each library, and it is not uncommon to find bounds-checking enabled in the debugging versions of various libraries.
This has the advantage of quickly and painlessly finding out-of-bounds errors when developing the application, while at the same time eliminating the performance hit when running the program for realz.
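As a sketch of that convention (the ary_get helper below is hypothetical, not part of any particular library): the bounds check exists only in debug builds and disappears entirely when NDEBUG is defined for the optimized build.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define ARY_LEN 16

/* Checked element access: the assert is compiled out when NDEBUG is defined. */
static int ary_get(const int *a, size_t len, size_t ix)
{
    assert(ix < len);   /* debug builds trap out-of-bounds reads here */
    return a[ix];
}

int main(void)
{
    int ary[ARY_LEN] = {0};
    for (size_t ix = 0; ix < ARY_LEN; ++ix) {
        printf("ary[%zu]=%d\n", ix, ary_get(ary, ARY_LEN, ix));
    }
    return 0;
}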
I should also add that the performance hit is non-negligible, and many languages other than C and C++ provide various high-level functions operating on buffers that are implemented directly in C and C++ specifically to avoid per-element bounds checking. For example, in Java, if you compare the speed of copying one array into another using pure Java versus using System.arraycopy (which does bounds checking once, but then copies the array without bounds-checking each individual element), you will see a decently large difference in the performance of those two operations.
Leaving bounds checking out makes the language easier to implement and faster, both to compile and at run time. It also simplifies the language definition, since quite a few things can be left out.
Currently, when you do:
int *p = (int*)malloc(sizeof(int));
*p = 50;
C (and C++) just says, "Okey dokey! I'll put something in that spot in memory".
If bounds checking were required, C would have to say, "OK, first let's see if I can put something there. Has it been allocated? Yes? Good. I'll insert now." By skipping the test of whether there is something that can be written there, you save a very costly step. On the other hand (she wore a glove), we now live in an era where "optimization is for those who cannot afford RAM," so the arguments about speed are getting much weaker.
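As a hypothetical sketch of what that extra step would involve (checked_block, checked_malloc, and checked_store_int are invented names for illustration), every allocation has to carry its size, and every store has to test against it before writing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* An allocation that remembers its own size. */
typedef struct {
    void  *base;
    size_t size;
} checked_block;

static checked_block checked_malloc(size_t size)
{
    checked_block b = { malloc(size), size };
    return b;
}

/* "Has it been allocated? Does the write fit? Yes? Good. I'll insert now." */
static void checked_store_int(checked_block b, size_t offset, int value)
{
    if (b.base == NULL || b.size < sizeof(int) || offset > b.size - sizeof(int)) {
        fprintf(stderr, "out-of-bounds write rejected\n");
        abort();
    }
    memcpy((char *)b.base + offset, &value, sizeof value);
}

int main(void)
{
    checked_block p = checked_malloc(sizeof(int));
    checked_store_int(p, 0, 50);   /* the checked equivalent of *p = 50; */
    free(p.base);
    return 0;
}

The extra bookkeeping and the branch on every store are exactly the costs being described.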
The primary reason is the performance overhead of adding bounds checking to C or C++. While this overhead can be reduced substantially with state-of-the-art techniques (to 20-100% overhead, depending upon the application), it is still large enough to make many folks hesitate. I'm not sure whether that reaction is rational -- I sometimes suspect that people focus too much on performance, simply because performance is quantifiable and measurable -- but regardless, it is a fact of life. This fact reduces the incentive for major compilers to put effort into integrating the latest work on bounds checking into their compilers.
A secondary reason involves concerns that bounds checking might break your app. Particularly if you do funky stuff with pointer arithmetic and casting that violates the standard, bounds checking might block something your application is currently doing. Large applications sometimes do amazingly crufty and ugly things. If the compiler breaks the application, then there's no point in blaming the crufty code for the problem; people aren't going to keep using a compiler that breaks their application.
Another major reason is that bounds checking competes with ASLR + DEP. ASLR + DEP are perceived as solving, oh, 80% of the problem or so. That reduces the perceived need for full-fledged bounds checking.
Because it would cripple those general-purpose languages for HPC requirements. There are plenty of applications where buffer overflows really do not matter one iota, simply because they do not happen. Such features are much better off in a library (where in fact you can already find examples for C/C++).
For domain specific languages it may make sense to bake such features into the language definition and trade the resulting performance hit for increased security.

Groovy for loop execution time

O Groovy Gurus,
This code snippet runs in around 1 second
for (int i in (1..10000000)) {
    j = i;
}
while this one takes almost 9 seconds
for (int i = 1; i < 10000000; i++) {
    j = i;
}
Why is it so?
OK, here is my take on why.
If you convert both scripts to bytecode, you will notice that:
The for-in loop uses a Range. An Iterator is used to advance on each iteration, and the comparison (<) is made directly against an int (or Integer) to determine whether the exit condition has been met.
The traditional for loop uses the usual increment, condition check, and action. To check the condition i < 10000000 it uses Groovy's ScriptBytecodeAdapter.compareLessThan. If you dig deep into that method's code, you will find that both sides of the comparison are taken in as Object, and there is a lot going on: casting, comparing them as objects, etc.
ScriptBytecodeAdapter.compareLessThan --> ScriptBytecodeAdapter.compareTo --> DefaultTypeTransformation.compareTo
There are other classes in the typehandling package which implement compareTo specifically for math data types; I'm not sure why they are not being used (if indeed they are not being used).
I suspect that is the reason the second loop takes longer.
Again, please correct me if I am wrong or missing something...
In your testing, be sure to "warm up" the JVM before taking measurements, otherwise you may wind up triggering various startup actions in the platform (class loading, JIT compilation). Run your tests many times in a row, too. Also, if the second test ran while a garbage collection was going on, that might have had an impact. Try running each of your tests 100 times, print out the times after each test, and see what that tells you.
If you can eliminate potential artifacts from startup time as Jim suggests, then I'd hazard a guess that the Java-style for loop in Groovy is not as well implemented as the original Groovy-style for loop. It was only added in v1.5 after user requests, so perhaps its implementation was a bit of an afterthought.
Have you taken a look at the bytecode generated for your two examples to see if there are any differences? There was a discussion about Groovy performance here in which one of the comments (from one 'johnchase') says this:
I wonder if the difference you saw related to how Groovy uses numbers (primitives) - since it wraps all primitives in their equivalent Java wrapper classes (int -> Integer), I’d imagine that would slow things down quite a bit. I’d be interested in seeing the performance of Java code that loops 10,000,000 using the wrapper classes instead of ints.
So perhaps the original Groovy for loop does not suffer from this? Just speculation on my part really though.
