Can TDD be a valid alternative to overkill data validation?

Can TDD be a valid alternative to overkill data validation? - security

Consider these two data validation scenarios:
Check everything everywhere
Make sure that every method that takes one or more arguments actually checks them to ensure that they're syntactically valid.
Pros
Very fine check granularity.
If the code that is being written is for some kind of library we make sure to limit the damage that can be done if the developers that will be using it fail to provide valid data.
Cons
It's costly to always perform checks that most of the time shouldn't be needed.
It's still possible to forget to add a check every now and then.
More code is being written and hence in need of maintenance.
Make use of TDD goodness
Validate data only when it enters your code from the external world.
To make sure that internally data will be always syntactically correct, create tests that check every method that returns a value. To make sure that if valid data enters, valid data exits.
The pros and the cons are practically switched with the ones from the former approach.
As of now I'm using the first approach, but since I'm employing test driven development I thought that maybe I could go with the second one.
The advantages are clear, still, I wonder if it's as secure as the first method.

It sounds like the first method is contract driven, and one aspect of that is that you also need to verify that what you return from any public interface meets the contract.
But, I think that both approaches are valid, but very different.
TDD only partially deals with the public interface, in that it should check that every input is properly validated, unfortunately, unless you have all your validation in separate functions, to adequately test, it becomes very difficult to ensure that this function of 3 or 4 parameters is being properly tested for validity. The number of tests you have to write is quite high, in either approach.
If you are using a library, then in every function that can be called directly from the outside (outside being outside the library) then you will need to check that every input is valid, and that invalid input is handled as per the contract, either returning a null or throwing an exception. But, it must be in agreement with the documentation.
Once you have verified it, then there is no reason to force the verification on private functions as those can only be called from within the library, and you should be verifying that you are only dealing with valid data.
Lots of tests will be needed, regardless, unfortunately. All these tests do is to ensure that you don't have any surprise problems, but that should generally help justify the cost of writing and maintaining them.
As to your question, if your tests are really well written, and you ensure that all validity checks are done completely, then it should be as secure, but the risk is that if you believe it is secure and you have poorly written tests then it will actually be worse than no tests, as there is an assumption that your tests are well-written.
I would use both methods, until you know your tests are well-written then just go with TDD.

My opinion is that in the first scenario, two of your Cons outweigh everything else:
It's costly to always perform checks
that most of the time shouldn't be
needed.
More code is being written and hence
in need of maintenance.
Also, technically TDD has no bearing on this question, because it is not a testing technique. More later...
To mitigate the Cons I would strongly advocate (as I think you say) splitting the code into an outside and an inside: The outside is where all the validation occurs. Hopefully this is but a thin wrapper around the inside, to prevent GIGO. Once inside, data never needs to be validated again.
As for TDD, I would strongly advocate (as you are now doing) employing it to develop your code, with the added benefit of leaving a trail of tests that become a regression test suite. Now you will naturally develop your outside code to perform robust validation, with the promise of easily adding any checks that you might initially forget. Your inside code can be developed assuming it will only handle valid data, but TDD will still give you the confidence that it will function to spec.
I'm saying that I would go with the second approach, as I've described, independently of whether I'm developing with TDD, or not (but TDD is always my first choice).

The advantages are clear, still, I wonder if it's as secure as the first method.
This completely depends on how well you test it.
This could be just as secure, if the following two criteria are met:
Every publicly exposed means of adding data to the system are validated completely
Every internal method that translates data is completely and adequately tested
However, I question that this would be easier or that it would require less code. The amount of code required to check every public entry point is going to be very similar to the amount of code required to validate each method. You're going to need more checks in the entry points, since they'll have to check things that might otherwise be checked internally.

For the second method, you need two good sets of tests. You must not only check that
To make sure that if valid data
enters, valid data exits.
You must also check that if Invalid data enters, an exception is thrown. I suppose you still have to validate data and kick out if you have invalid data. This is really the only way if you don't want pesky ArgumentNullException s or other cryptic errors in your production application. However TDD can really toughen up the quality of all that checking (especially with Fuzz Testing).

One item is missing from your list of Pros and Cons and that is something important enough to make unit testing a much more safer method than maniac parameters checking.
You just have to consider the When and the Where.
For unit testing the when and the where are:
when: at design time
where: in a dedicated source file outside of the application code
For overkill data checking they are:
when: at runtime
where: entangled in the application source code, typically using asserts.
That is the point: code covered by unit testing detects errors at design time when you run the tests, if you are the paranoid and schizofrenic kind of tester (the bests) you write tests designed to break whatever can be, checking each data boundary and perverse input. You also use code coverage tools to ensure every branch of every alternative is tested. You have no limit : tests lies in their own files and do not clutter application. Doesn't matter if you get ten times as many test lines than the actual application code, no run time penalty, no readability penalty.
On the other hand integrated overkill testing detects errors at runtime. In the worst-case it will detects errors on the user system, where you can do nothing about it (if even you ever heard of this error happening). Also even if you are the paranoid kind you will have to limit your testing. Assertion just can't be 90 percents of the application code. It raise readability issues, maintenance, often heavy performances penalty. Where will you stop then: only checking parameters for external input ? Checking every possible or impossible inputs of inner functions ? Checking every loop invariant ? Also testing behavior when out of flow data (globals, system files, etc) is changed ? You must also be conscious that assertion code can also contain some bugs. What if the formula of an assertion perform a divide. You must ensure it will not lead by a DIVIDE-BY-ZERO error or such ?
Another problem is that in many cases you just don't know what can be done when an assertion failure. If you are at a real entry point you can return back something understandable for your user or the lib user... when you are checking innner functions

Related

Should Assertions be performed in BDD Given and When

In Behavior Driven Development style of writing automated functional tests, it is generally understood that Givens should be the pre-condition that the system must be in, in order to begin the test, When should be the user action performed and Then should be asserting whether the observed matches the expected and fail or pass the test accordingly.
Off late my team has started performing assertions in Givens and Whens too which lead me to wonder if this is a correct practice.
For e.g -
Given a user with xyz privileges is logged in
When I click on the abc tab
Then records should be displayed
Should the Given in this test actually assert that the logged in user indeed has xyz privileges or assume the user has required privileges and just perform login
Also should the When assert that tab is visible before clicking?

If "logging in" is a behaviour that's interesting* and you need examples to illustrate it** then it should be a "When", with the context in which it happens being the "Given", and the outcome that results being the "Then".
This is the case for any behaviour you need to illustrate with an example.
However, sometimes it can be useful to make assertions in a Given, just to check that the context really is set up properly. Sometimes when people start adopting BDD the environments can be a bit flaky, and it's handy to know if it's your scenario finding a bug that made it fail, or something earlier in the process. So for that reason, you might find assertions there.
The Given doesn't concern itself with how the system got into that state, though. If it has assertions, it should merely be to check that it is.
Another form I've seen is a check that the system is in the correct state for the context, with corrective action if it isn't.
Note that these are largely interim patterns. They can be helpful while teams are adopting BDD and getting their pipelines and automated deployments into shape.
Assertions which check the results of the "When" are part of the outcome, so part of the "Then". I can't imagine a case where you'd need to check the results of a "When" without it being an outcome. If you've got one, please give me an example.
We discourage using clicking and UI details in scenarios. Work out what you're trying to achieve, and do that. Hide the clicking under the covers.
Most of the time automated scenarios aren't actually there to catch bugs; they're living documentation that helps people think about what they're trying to achieve and what the system already does, thus encouraging good design and preventing bugs in the first place.
I'd say something like "navigate to the ABC tab" and just do it; you'll get a relevant exception if it isn't there, and that won't happen as often as people reading the scenario will.
* It's logging in. It probably isn't.
** It's logging in. You probably don't.

Assertions in Givens and Whens are generally an indicator of immaturity of step definitions. So I might put one in a step that I am working on, but I wouldn't keep it there for very long.
I'd implement your step
Given a user with xyz privileges is logged in
as something like
'Given a user with xyz privileges is logged in' do
user = create_user(privileges: xyz)
login_as user: user
end
I would expect that the create_user method would be trusted pretty quickly and would not need an assertion. Same for the login_as method. (if methods like this aren't working properly you'd expect many scenarios to be broken)
Notice how this code makes it clear that there are two things going on and provides/uses an api that you would expect many other stepdefs to use. And how any assertion you might want to keep really belongs in the helper methods not in the step def.

There are no strict rules however in my experience I find it handy to define that there are no assertions there so it is clear what is actually tested and the test itself will run faster.
In the case if your abc tab is not displayed it shouldn't be clickable and then the test will fail either when identifying the object to click or when performing the next step(depends on the tools and method you use).
However you should make sure that the actual implementation of the test is not cheating and working with internals that are able to trigger a click even if the actual component isn't.
There is another point about the Given, there normally it is recommended to set up the environment for the test. This means you make sure that in your system there is a user and that user has been logged in. It makes no meaning to very that as you set it up, however if any failure occur you should fail fast to know what is wrong.

Generally I would be careful about using assertions in the Given or When steps in the automation code.
However, I use login steps like this all over the place in my code as - when used correctly - can turn the step into a piece of context rather than an assertion itself.
For instance:
Is this user logged in?
Yes => Do Nothing.
No => Log out, then Log in as this user
In the example you gave, we know that we need the specific user to be logged in for this scenario to work, however we do not know if there is a different user logged in (other scenarios may have been run before it). If you use this step as a check to make sure that the correct user is logged in (as in my example), it is part of the context and it will speed up run time of the automation as you won't have to log in before and log out after every scenario.

Command Query Separation: commands must return void?

If CQS prevents commands from returning status variables, how does one code for commands that may not succeed? Let's say you can't rely on exceptions.
It seems like anything that is request/response is a violation of CQS.
So it would seem like you would have a set of "mother may I" methods giving the statuses that would have been returned by the command. What happens in a multithreaded / multi computer application?
If I have three clients looking to request that a server's object increase by one (and the object has limits 0-100). All check to see if they can but one gets it - and the other two can't because it just hit a limit. It would seem like a returned status would solve the problem here.

It seems like anything that is request/response is a violation of CQS.
Pretty much yes, hence Command-Query-Separation. As Martin Fowler nicely puts it:
The fundamental idea is that we should divide an object's methods into two sharply separated categories:
Queries: Return a result and do not change the observable state of the system (are free of side effects).
Commands: Change the state of a system but do not return a value [my emphasis].
Requesting that a server's object increase by one is a Command, so it should not return a value - processing a response to that request means that you are doing a Command and Query action at the same time which breaks the fundamental tenet of CQS.
So if you want to know what the server's value is, you issue a separate Query.
If you really need a request-response pattern, you either need to have something like a convoluted callback event process to issue queries for the status of a specific request, or pure CQS isn't appropriate for this part of your system - note the word pure.
Multithreading is a main drawback of CQS and can make it can hard to do. Wikipedia has a basic example and discussion of this and also links to the same Martin Fowler article where he suggests it is OK to break the pattern to get something done without driving yourself crazy:
[Bertrand] Meyer [the inventor of CQS] likes to use command-query separation absolutely, but there are
exceptions. Popping a stack is a good example of a query that modifies
state. Meyer correctly says that you can avoid having this method, but
it is a useful idiom. So I prefer to follow this principle when I can,
but I'm prepared to break it to get my pop.
TL;DR - I would probably just look at returning a response, even tho it isn't correct CQS.

Article "Race Conditions Don’t Exist" may help you to look at the problem with CQS/CQRS mindset.
You may want to step back and ask why counter value is absolutely necessary to know before sending a command? Apparently, you want to make decision on the client side whether you can increase counter more or not.
The approach is to let the server make such decision. Let all the clients send commands (some will succeed and some will fail). Eventually clients will get consistent view of the server object state (where limit has reached) and may finally stop sending such commands.
This time window of inconsistency leads to wrong decisions by the clients, but it never breaks consistency of the object (or domain model) on the server side as long as commands are handled adequately.

Why does the CQS principle require a Command not to return any value?

The CQS principle says every method should either be a command that performs an action, or a query that returns data to the caller, but not both.
It makes sense for a Query not to do anything else, because you don't expect a query to change the state.
But it looks harmless if a Command returns some extra piece of information. You can either use the returned value or ignore it. Why does the CQS principle require a Command not to return any values?

But it looks harmless if a Command returns some extra piece of information?
It often is. It sometimes isn't.
People can start confusing queries for commands, or calling commands more for the information it returns than for its effect (along with "clever" ways of preventing that effect from being a real effect, that can be brittle).
It can lead to gaps in an interface. If the only use-case people can envision for a particular query is hand-in-hand with a particular command, it may seem pointless to add the pure form of the query (e.g. writing a stack with a Pop() but no Peek()) which can restrict the flexibility of the component in the face of future changes.
In a way, "looks harmless" is exactly what CQS is warning you about, in banning such constructs.
Now, that isn't to say that you might not still decide that a particular command-query combination isn't so useful to be worth it, but in weighing up the pros and cons of such a decision, CQS is always a voice arguing against it.

From my understanding, one of the benefits of CQS is how well it works in distributed environments. Commands become their own isolated unit that could be immediately executed, placed in a queue to be executed at a later date, executed by a remote event handler etc.
If the commander interface were to specify a return type then you greatly affect the strength of the CQS pattern in its ability to fit well within a distributed model.
The common approach to solving this problem (see this article for instance by Mark Seemann) is to generate a unique ID such as a guid which is unique to the event executed by the command handler. This is then persisted to allow the data to be identified at a later date.

Should I use Command to implement a domain derivations in CQRS

I'm using CQRS on an air booking application. one use case is help customer cancel their tickets. But before the acutal cancellation, the customer wants to know the penalty.
The penalty is calculated based on air rules. Some of our provider could calculate the penalty through exposing an web service while the others don't. (They publish some paper explaining the algorithm instead). So I define a domain service
public interface AirTicketService {
//ticket demand method
MonetaryAmount penalty(String ticketNumber);
void cancel(String ticketNumber, MonetaryAmount penalty);
}
My question is which side(command/query) is responsible for invoking this domain service and returning result in a CQRS style application?
I want to use a Command: CalculatePenlatyCommand, In this way, it's easy to resuse the domain model, but it's a little odd because this command does not modify state.
Or should I retrieve a readmodel of ticket if this is a query? But the same DomainService is needed on both command and query side, it's odd too.
Is domain derivation a query?

There is no need to shoehorn everything in to the command-query pipeline. You could query this service independently from the UI without issuing a command or asking the read-model.

There is nothing wrong with satisfying a query using an existing model if it "fits" both the terminology and the structure of that model. No need to build up a separate read model for that purpose. It's not without risk, since the semantics and the context of the query should be closely tied to the model that is otherwise used for write purposes only. The risk I allude to is the fact that the write and read concerns could drift apart (and we're back at square one, i.e. the reason why people pick CQRS in the first place). So you must keep paying attention as new requirements come in.
Queries that fit this model really well are what I call "simulators', where you want to run a simulation using current state to e.g. to give feedback to an end user. On more than one occasion I've found that the simulation logic could be reused both as a feedback mechanism and as an execution (of a write operation/command) steering mechanism. The difference is in what we do with the outcome of the simulation. Again, this is not without risk and requires careful judgement.

You may bring arguments that Calculate Penalty Command is not odd at all.
The user asks the system to do something - command enough.
You can even have a Penalty Calculation Requested Event event in your domain, and it would feel right. Because, at some time, you may be interested in, let's say, unsure clients, ones that want to cancel tickets but they change their mind every time etc. The calculation may be performed asynchronously, too - you can provide the result (penalty cost) to the user in various ways afterwards...
Or, in some other way: on your ticket booked event, store cancellation penalty, too. Then, you can make that value accessible any time, without the need to recompute it... But this may be wrong (?) because penalty would largely depend on time, right (the late you cancel your ticket, the more you pay)?
If all this would like over-complications etc., then I guess I agree with rmac's answer, too :)

How to test a program processing large amounts of data stored in an unpredictable format

What I have to do
I'm trying to manipulate some rather large amounts of data stored in Excel files (one of the workbooks has as much as 150 spreadsheets). The result of these manipulations may yield approximately 800.000 rows in a database table.
The problem
Data stored in the spreadsheets has unpredictable format. The company that generated these spreadsheets had no fixed/documented format for exporting these files, and sometimes erroneous data appear. For example most of the years are represented like "2009" but there are cases where a year is represented as "20". Other example, data is not really normalized in these files, so I use separators to split the values of certain cells. Sometimes these separators change.
There are things like these that I couldn't predict and I only discovered them only after running an already evolved version of my program over a pretty large part of the available data.
The question
How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
Shall I take a defensive approach and throw exceptions whenever some kind of unexpected issue arises? Then the main loop of the program may catch and log them and continue with the available data? This would yield some processed data, but that means that on a subsequent iteration of the program I have to have checks for what's already inside the database from previous iterations (which I don't really like).
What's your opinion? How would you tackle this problem?

If there is no specification for what the format of the data is, then anything is acceptable.
If not, then there is either an explicit or implicit specification of the data. I would try and nail this down right now. If you can't get an explicit enough definition of the data to write your program so that it can be expected to run without error, then I would say you are taking a very large risk in causing some serious damage depending on how this data is being used.
You should write your program so that it either throws an exception or logs an error whenever running across data that does not meet the specification. Then, run the program on PART of the available data until it runs without exception. This can be viewed as a training set for the development of your program. Then, use some of the saved data to use as a TEST set. This will give you an estimate of how many exceptions/errors your program will generate in production.
Overfitting is a common machine learning concept, but it is useful to other tasks such as this - program development. It is surprising to me how developers can write a bunch of unit tests, code their application to perform well on it, and then expect similar or bug-free performance in production.
If you're not willing to take all these steps (i.e. run your code on essentially all of the data -- since the test set is also making use of the data) then I would say the task is too large to do.

As an aside, rather than creating a definition of a format that is very strange and peculiar to account for all the "errors" in the current data, you might want to create a new, normalized (in the sense these things are simplified away) specification for the data, and then write a "faulty document patcher" that can be run on faulty documents to fix the data.
If the application generating the data is still in production, then you might need to go to the developers of this application to get a buy in on the new spec. Once you have that, you can then start logging bugs against their application, so hopefully the faulty document patcher can be retired.
More likely, I'm guessing that the software developers are long gone, no one understands the code anymore, if it is even running at all.

How can one test the correctness of a program in such a situation? Or rather, how to achieve a pretty stable version of the product without running it over the whole available data?
For every single data type I would set reasonable constraints on the values that it is allowed to be.
If a cell violates these constraints then throw an exception containing the piece of data it failed on and its data type. If a piece of data violated its constraints you can modify the source to include the additional constraints required for that piece of data, and a conversion method to make it uniform.
To give an example on the date you gave, initially a date would have the constraint that it could be only four digits. When the program came across the "20" it would throw an exception.
Then you could go and allow two digit dates, and a method to convert the two-digit dates into a four digit one to allow further processing.

One question is, will you run your program more than once? From your question it sounds possible you only want to run it once, and then you will then work with the data in the database.
In which case you can be very defensive - throw exceptions whenever unexpected data appears. Run the program repeatedly on ever-larger sets of the data. Initially, solve any exceptions by altering the code, as it's a good rule of thumb that the exceptions you find first are going to be common. You might want to empty the output database between runs.
Later on, you will be finding rare exceptions that might only occur a couple of times in the input. Just solve these by hand and insert the corresponding rows in the database yourself. Or write another small program that reads your exception information and inserts the new rows, rather than running your whole big program again.

Typically for this sort of thing I do these as #MarkJ suggested, and I encode the whole thing in unit tests.
So I compose a small datafile that at first contains only a few rows of normal data. That's unit test number 1.
Then I take a quick visual scan of some of the data to spot any obvious exceptions. Unit tests 2 through n.
Finally, I write parser code until it passes all unit tests, and throws and logs exceptions for all un-managed data.
I then use these oddball bits of data to make new unit tests, and improve the parser until it can pass those too.
Although sometimes accommodating some really strange bit of data adds more parser complexity than it's worth, and I'll just log the exception, dump it, and move on. This is a matter of professional judgment.

How about processing every piece of data (so you don't have to check for dupes). Those that pass go into the database. The exceptions go into an exception file. The user can open the exception file and make corrections/modifications to the data. Then they can run your program on the exception file.
This will isolate unhandled data for the user to correct and prevent you from processing the same data twice (or more).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string