How to report failing test cases

How to report failing test cases - haskell

I am using Haskell Test Framework through Stack to evaluate QuickCheck properties. When I run stack test, failing properties are reported in the form of Gave up! Passed only 95 tests. The many examples of property testing I've found report failures in the form of Falsifiable, after 48 tests followed by the arguments that failed. These examples, however, seem to be running QuickCheck directly instead of through Stack and HTF.
How can I configure my environment to report the arguments generated by QuickCheck that failed to satisfy the property under test? As pointed out in Testing with HTF, documentation is already sparse and poor for some of these tools alone, let alone for combining them together.

"Gave up!" means a different kind of failure than "Falsifiable".
QuickCheck has a way of discarding test cases you consider "improper", counting neither towards actual success or failure. A typical source of such discards comes from using the implication operator (==>), where tests cases which do not satisfy the precondition are discarded: "success" is only counted when the precondition is satisfied, to give you a better idea of the extent to which you are testing the postcondition to the right (which is likely the part that actually matters to you as a user). Explicit use of the discard property is also possible, with a different meaning from an actual failure such as returning False.
Discarded tests thus do not falsify the property as a whole (an implication with a false precondition is logically true), but too many discarded tests may result in insufficient coverage, which is signaled through the failure you observed, and there is no counterexample to print. To resolve this failure, find where the discards are coming from, possible outcomes include:
use a better generator (avoiding discards);
raise the discard threshold, #stefanwehr shows how to do this in HTF in the other answer;
these discards should actually be failures.

#Li-yao Xia is right by saying that your generator generates to many discardable test cases. To raise the discard threshold with HTF, you would write your property like this:
prop_somePropertyWithRaisedDiscardThreshold =
withQCArgs (\args -> args { maxDiscardRatioy = 1000 })
somePredicateOrProperty
The args variable has type Args, directly from the quickcheck package.

Related

Handling invalid states in haskell

I'm trying to get a better feel for how to handle error states in Haskell, since there seem to be a lot of ways to do it. Ideally, my data structures would make any invalid inputs unrepresentable, but despite considerable effort to the contrary, I still occasionally end up working with data where the type system can allow invalid states. As an example, let's consider that my program input is the training results for a neural network. In order for math to work, each matrix needs to have the correct bounds, and that's not (really) representable by the type system. If data is invalid, there's really nothing the application can do but halt any further processing and notify someone of the problem (so it's not recoverable). What's the best way to handle this in Haskell? It seems like I could:
1) Use error or other partial functions when processing my data. My understanding is this should only be used to represent a bug in the code. So it would have to be coupled with some sort of validation at the point that I load the data, and any point "after" that check I just assume that the data is in a valid format. This feels imperative to me, and doesn't seem to fit very well with lazy, declarative code.
2) Throw an exception when processing the data using Control.Exception.throw, and then catch it at the top level where I can alert someone. Contrary to error, I believe this doesn't indicate a bug in the program, so perhaps there wouldn't be verification when I load the data beyond what can be represented through the type system? The presence or absence of an exception when processing the data would define the verification.
3) Lift any data processing that could fail into the IO monad and use Control.Exception.throwIO.
4) Lift any data processing that could fail into the IO monad and use fail (I've read that using fail frowned on by the community?)
5) Return an Either or something similar, and let that bubble up through all your logic. I've definitely had some cases where composing Eithers becomes (to me) exceedingly impractical.
6) Use Control.Monad.Exception, which I only marginally understand, but seems to involve lifting any data processing that could fail into some exceptional monad, that I think is supposed to be more easily composeable than Either?
and I'm not even sure that's all the options. Is there an approach to this problem that's generally accepted by the community, or is this really an opinionated topic?

How to interpret the Result of msat LTL commands of NuXMV

I am using NuXMV for checking LTL properties using msat_check_ltlspec_bmc command on a fairly large model. The result shows no counterexample found within the given bounds. Do I interpret it as that property is True. Or it can alternatively mean that analysis is not complete.
This is because, by changing the property proposition to true or false, the result is always no counterexample. Most of The results are counterintuitive.
Started with real variables based properties but since unable to understand the result, shifted to Boolean based properties on the same model, using the same command.

Bounded Model Checking is a bug-oriented technique which checks the validity of a property on execution traces up to a given lenght k.
When an execution trace violates a property, great: a bug was found.
Otherwise, (in the general case) the model checking result provides no useful information and it should be treated as such.
In some cases, knowing additional information about the model can help. In particular, if one knows that every execution trace of length k must loop-back to one of the k-1 states, then it is possible to draw stronger conclusions from the lack of counter-examples of length smaller or equal k.

Is there a way to test termination in Haskell using existing libraries?

There are some libraries in Haskell for testing properties or for unit tests, etc. But I have yet to find one that allows for testing termination/nontermination of a function.
Of course you can't simply test if an expression terminates because of the Halting problem. But I could have a guess on the upper bound of how long the function takes to evaluate and if the function doesn't end in that time, the test would fail as the function was (likely) not going to terminate.
Is there a library that allows for testing termination like this?

Programming style question on how to code functions

So, I was just coding a bit today, and I realized that I don't have much consistency when it comes to a coding style when programming functions. One of my main concerns is whether or not its proper to code it so that you check that the input of the user is valid OUTSIDE of the function, or just throw the values passed by the user into the function and check if the values are valid in there. Let me sketch an example:
I have a function that lists hosts based on an environment, and I want to be able to split the environment into chunks of hosts. So an example of the usage is this:
listhosts -e testenv -s 2 1
This will get all the hosts from the "testenv", split it up into two parts, and it is displaying part one.
In my code, I have a function that you pass it in a list, and it returns a list of lists based on you parameters for splitting. BUT, before I pass it a list, I first verify the parameters in my MAIN during the getops process, so in the main I check to make sure there are no negatives passed by the user, I make sure the user didnt request to split into say, 4 parts, but asking to display part 5 (which would not be valid), etc.
tl;dr: Would you check the validity of a users input the flow of you're MAIN class, or would you do a check in your function itself, and either return a valid response in the case of valid input, or return NULL in the case of invalid input?
Obviously both methods work, I'm just interested to hear from experts as to which approach is better :) Thanks for any comments and suggestions you guys have! FYI, my example is coded in Python, but I'm still more interested in a general programming answer as opposed to a language-specific one!

Good question! My main advice is that you approach the problem systematically. If you are designing a function f, here is how I think about its specification:
What are the absolute requirements that a caller of f must meet? Those requirements are f's precondition.
What does f do for its caller? When f returns, what is the return value and what is the state of the machine? Under what circumstances does f throw an exception, and what exception is thrown? The answers to all these questions constitute f's postcondition.
The precondition and postcondition together constitute f's contract with callers.
Only a caller meeting the precondition gets to rely on the postcondition.
Finally, bearing directly on your question, what happens if f's caller doesn't meet the precondition? You have two choices:
You guarantee to halt the program, one hopes with an informative message. This is a checked run-time error.
Anything goes. Maybe there's a segfault, maybe memory is corrupted, maybe f silently returns a wrong answer. This is an unchecked run-time error.
Notice some items not on this list: raising an exception or returning an error code. If these behaviors are to be relied upon, they become part of f's contract.
Now I can rephrase your question:
What should a function do when its caller violates its contract?
In most kinds of applications, the function should halt the program with a checked run-time error. If the program is part of an application that needs to be reliable, either the application should provide an external mechanism for restarting an application that halts with a checked run-time error (common in Erlang code), or if restarting is difficult, all functions' contracts should be made very permissive so that "bad input" still meets the contract but promises always to raise an exception.
In every program, unchecked run-time errors should be rare. An unchecked run-time error is typically justified only on performance grounds, and even then only when code is performance-critical. Another source of unchecked run-time errors is programming in unsafe languages; for example, in C, there's no way to check whether memory pointed to has actually been initialized.
Another aspect of your question is
What kinds of contracts make the best designs?
The answer to this question varies more depending on the problem domain.
Because none of the work I do has to be high-availability or safety-critical, I use restrictive contracts and lots of checked run-time errors (typically assertion failures). When you are designing the interfaces and contracts of a big system, it is much easier if you keep the contracts simple, you keep the preconditions restrictive (tight), and you rely on checked run-time errors when arguments are "bad".
I have a function that you pass it in a list, and it returns a list of lists based on you parameters for splitting. BUT, before I pass it a list, I first verify the parameters in my MAIN during the getops process, so in the main I check to make sure there are no negatives passed by the user, I make sure the user didnt request to split into say, 4 parts, but asking to display part 5.
I think this is exactly the right way to solve this particular problem:
Your contract with the user is that the user can say anything, and if the user utters a nonsensical request, your program won't fall over— it will issue a sensible error message and then continue.
Your internal contract with your request-processing function is that you will pass it only sensible requests.
You therefore have a third function, outside the second, whose job it is to distinguish sense from nonsense and act accordingly—your request-processing function gets "sense", the user is told about "nonsense", and all contracts are met.
One of my main concerns is whether or not its proper to code it so that you check that the input of the user is valid OUTSIDE of the function.
Yes. Almost always this is the best design. In fact, there's probably a design pattern somewhere with a fancy name. But if not, experienced programmers have seen this over and over again. One of two things happens:
parse / validate / reject with error message
parse / validate / process
This kind of design has one data type (request) and four functions. Since I'm writing tons of Haskell code this week, I'll give an example in Haskell:
data Request -- type of a request
parse :: UserInput -> Request -- has a somewhat permissive precondition
validate :: Request -> Maybe ErrorMessage -- has a very permissive precondition
process :: Request -> Result -- has a very restrictive precondition
Of course there are many other ways to do it. Failures could be detected at the parsing stage as well as the validation stage. "Valid request" could actually be represented by a different type than "unvalidated request". And so on.

I'd do the check inside the function itself to make sure that the parameters I was expecting were indeed what I got.
Call it "defensive programming" or "programming by contract" or "assert checking parameters" or "encapsulation", but the idea is that the function should be responsible for checking its own pre- and post-conditions and making sure that no invariants are violated.
If you do it outside the function, you leave yourself open to the possibility that a client won't perform the checks. A method should not rely on others knowing how to use it properly.
If the contract fails you either throw an exception, if your language supports them, or return an error code of some kind.

Checking within the function adds complexity, so my personal policy is to do sanity checking as far up the stack as possible, and catch exceptions as they arise. I also make sure that my functions are documented so that other programmers know what the function expects of them. They may not always follow such expectations, but to be blunt, it is not my job to make their programs work.

It often makes sense to check the input in both places.
In the function you should validate the inputs and throw an exception if they are incorrect. This prevents invalid inputs causing the function to get halfway through and then throw an unexpected exception like "array index out of bounds" or similar. This will make debugging errors much simpler.
However throwing exceptions shouldn't be used as flow control and you wouldn't want to throw the raw exception straight to the user, so I would also add logic in the user interface to make sure I never call the function with invalid inputs. In your case this would be displaying a message on the console, but in other cases it might be showing a validation error in a GUI, possibly as you are typing.

"Code Complete" suggests an isolation strategy where one could draw a line between classes that validate all input and classes that treat their input as already validated. Anything allowed to pass the validation line is considered safe and can be passed to functions that don't do validation (they use asserts instead, so that errors in the external validation code can manifest themselves).

How to handle errors depends on the programming language; however, when writing a commandline application, the commandline really should validate that the input is reasonable. If the input is not reasonable, the appropriate behavior is to print a "Usage" message with an explanation of the requirements as well as to exit with a non-zero status code so that other programs know it failed (by testing the exit code).
Silent failure is the worst kind of failure, and that is what happens if you simply return incorrect results when given invalid arguments. If the failure is ever caught, then it will most likely be discovered very far away from the true point of failure (passing the invalid argument). Therefore, it is best, IMHO to throw an exception (or, where not possible, to return an error status code) when an argument is invalid, since it flags the error as soon as it occurs, making it much easier to identify and correct the true cause of failure.
I should also add that it is very important to be consistent in how you handle invalid inputs; you should either check and throw an exception on invalid input for all functions or do that for none of them, since if users of your interface discover that some functions throw on invalid input, they will begin to rely on this behavior and will be incredibly surprised when other function simply return invalid results rather than complaining.

Is the valid state domain of a program a regular language?

If you look at the call stack of a program and treat each return pointer as a token, what kind of automata is needed to build a recognizer for the valid states of the program?
As a corollary, what kind of automata is needed to build a recognizer for a specific bug state?
(Note: I'm only looking at the info that could be had from this function.)
My thought is that if these form regular languages than some interesting tools could be built around that. E.g. given a set of crash/failure dumps, automatically group them and generate a recognizer to identify new instances of know bugs.
Note: I'm not suggesting this as a diagnostic tool but as a data management tool for turning a pile of crash reports into something more useful.
"These 54 crashes seem related, as do those 42."
"These new crashes seem unrelated to anything before date X."
etc.
It would seem that I've not been clear about what I'm thinking of accomplishing, so here's an example:
Say you have a program that has three bugs in it.
Two bugs that cause invalid args to be passed to a single function tripping the same sanity check.
A function that if given a (valid) corner case goes into an infinite recursion.
Also as that when the program crashes (failed assert, uncaught exception, seg-V, stack overflow, etc.) it grabs a stack trace, extracts the call sites on it and ships them to a QA reporting server. (I'm assuming that only that information is extracted because 1, it's easy to get with a one time per project cost and 2, it has a simple, definite meaning that can be used without any special knowledge about the program)
What I'm proposing would be a tool that would attempt to classify incoming reports as connected to one of the known bugs (or as a new bug).
The simplest thing would be to assume that one failure site is one bug, but in the first example, two bugs get detected in the same place. The next easiest thing would be to require the entire stack to match, but again, this doesn't work in cases like the second example where you have multiple pieces of (valid) valid code that can trip the same bug.

The return pointer on the stack is just a pointer to memory. In theory if you look at the call stack of a program that just makes one function call, the return pointer (for that one function) can have different value for every execution of the program. How would you analyze that?
In theory you could read through a core dump using a map file. But doing so is extremely platform and compiler specific. You would not be able to create a general tool for doing this with any program. Read your compiler's documentation to see if it includes any tools for doing postmortem analysis.

If your program is decorated with assert statements, then each assert statement defines a valid state. The program statements between the assertions define the valid state changes.
A program that crashes has violated enough assertions that something broken.
A program that's incorrect but "flaky" has violated at least one assertion but hasn't failed.
It's not at all clear what you're looking for. The valid states are -- sometimes -- hard to define but -- usually -- easy to represent as simple assert statements.
Since a crashed program has violated one or more assertions, a program with explicit, executable assertions, doesn't need an crash debugging. It will simply fail an assert statement and die visibly.
If you don't want to put in assert statements then it's essentially impossible to know what state should have been true and which (never-actually-stated) assertion was violated.
Unwinding the call stack to work out the position and the nesting is trivial. But it's not clear what that shows. It tells you what broke, but not what other things lead to the breakage. That would require guessing what assertions where supposed to have been true, which requires deep knowledge of the design.
Edit.
"seem related" and "seem unrelated" are undefinable without recourse to the actual design of the actual application and the actual assertions that should be true in each stack frame.
If you don't know the assertions that should be true, all you have is a random puddle of variables. What can you claim about "related" given a random pile of values?
Crash 1: a = 2, b = 3, c = 4
Crash 2: a = 3, b = 4, c = 5
Related? Unrelated? How can you classify these without knowing everything about the code? If you know everything about the code, you can formulate standard assert-statement conditions that should have been true. And then you know what the actual crash is.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string