Unused object properties bad? - object

I have a object list with some properties that dont need .I always include the ones I know I won't use anyway because "maybe I need them later" excuse. Is there a clear answer if this is bad or not? I imagine it is a tiny waste of performance, but does it really matter?

It depends.
I often find that it is just as easy to add properties in later when I know for sure that I need them, and how I will use them.
Depending on what your compiler/interpreter does with the unused properties then you might get (very negligible) performance issues[citation for C#].
If you are just hacking together some code, then you are probably not going to hurt anything. On the other hand, if you are working with a team or even on a public API then having unused variables could be very harmful. Imagine I have some interface like this:
interface IMyInterface {
void DoStuff(); // implemented
bool WarningAsError {get; set;} // just speculative - no implementation yet
}
Anybody using this interface will probably assume that they can set the WarningAsError flag and your interface will treat warnings as errors. This could waste huge amounts of time wondering why the warnings are not treated as errors.

Related

Should Spark transformation be broken to functions

It looks like all the Spark examples found in web are built in as single long function (usually in main)
But it is often the case that it makes sense to break the long call into functions like:
Improve readability
Share code between solution paths
A typical signature would look like this (this is Java code but similar signature would appear in all languages)
private static Dataset<Row> myFiltering(Dataset<Row> data) {
return data.filter(functions.col("firstName").isNotNull()).filter(functions.col("lastName").isNotNull());
}
The problem here is that there is no safety for the content of the Row, as there is no enforcement on the fields, and calling the function becomes not only a matter of matching the singnature but also the content of Row. Which obviously may (and does in my case) cause errors.
What is the best practice you enforce in large scale development environments? do you leave the code as one long function? do you suffer every time you change field names?
Yes, you should split your method into smaller methods. Be aware, that many small functions also are not much readable ;)
My rules:
Split some chain of transformation if we can name it - if we have name, it means that this is some kind of subfunction
If there are many functions of the same domain - extract new class or even package.
KISS: don't split if your function call chain is short and also describes some subfunction, even if some lines may be extracted - it is not as much readable to read many many custom functions
Any larger - not in 1-3 lines - filters, maps I suggest to extract to method and use method reference
Marked as Community Wiki, because this is only my point of view and it's OT for StackOverflow. It someone else has any other suggestions, please share it :)

What is a good common term for errors and warnings?

I'd like to build an OO hierarchy of errors and warnings returned to the client during a, let's say, pricing operation:
interface PricingMessage {}
interface PricingWarning extends PricingMessage {}
interface PricingError extends PricingMessage {}
class NoSuchProductError implements PricingError {
...
}
I'm not very keen on the name PricingMessage. What is the concept that includes errors and warnings?
EDIT: To be clear, I'm looking for a common concept or name for errors and warnings specifically (excluding e.g. general info messages). For instance, compilers also report errors and warnings. What are these?
Some suggestions...
"Result"
An operation has results, which could be some number of errors, warnings, notes and explicit or implied (no errors) success.
PricingResult
"Issue"
An operation ran, but with issues, some of which could be fatal (errors) and some of which might not be (warnings). If there are no issues, success is implied.
PricingIssue
"Condition"
An operation might have or be in an error condition or a warning condition. I think some compilers use "condition" as an abstract term for an error or warning.
PricingCondition
"Diagnostic"
The outcome of an operation might include diagnostics for errors and warnings. Another compiler term, I believe.
PricingDiagnostic
If you were dealing with java, or similar OO languages, the word you are looking for would be Exception. This indicates that you have reached an "exceptional" condition which needs to be handled in a controlled way.
Through looking at a few synonym lists, I found the following:
Anomaly, oddity, deviation
Alert, message, notification
Fault, misstep, failure, glitch
I prefer the name Alert. An alert IMO can have any level of severity, it could be identified as informational, warning, critical or any other level deemed appropriate.
The counter argument I have heard to this naming is the idea that alert the noun follows too closely to alert the verb, making the distinction that the object (noun) may or many not have been brought to the users attention yet (verb). In this context naming them alert could creates a bit of cognitive dissonance, and perhaps confusion for developers reasoning about your code.
The best I can propose would be to create a hard distinction be made in your code base between Alert (the object of exceptional condition) and Notification (the act of bringing the alert to the users attention) to keep things intuitive for programmers moving forward.
In theory these could be defined as events - so you could use that.
This is very subjective, but here are a few suggestions:
Output
LogEntry
In web design, the term "admonition" is sometimes used for a block of text that can be an error, warning, or informational.

Meaning of Leaky Abstraction?

What does the term "Leaky Abstraction" mean? (Please explain with examples. I often have a hard time grokking a mere theory.)
Here's a meatspace example:
Automobiles have abstractions for drivers. In its purest form, there's a steering wheel, accelerator and brake. This abstraction hides a lot of detail about what's under the hood: engine, cams, timing belt, spark plugs, radiator, etc.
The neat thing about this abstraction is that we can replace parts of the implementation with improved parts without retraining the user. Let's say we replace the distributor cap with electronic ignition, and we replace the fixed cam with a variable cam. These changes improve performance but the user still steers with the wheel and uses the pedals to start and stop.
It's actually quite remarkable... a 16 year old or an 80 year old can operate this complicated piece of machinery without really knowing much about how it works inside!
But there are leaks. The transmission is a small leak. In an automatic transmission you can feel the car lose power for a moment as it switches gears, whereas in CVT you feel smooth torque all the way up.
There are bigger leaks, too. If you rev the engine too fast, you may do damage to it. If the engine block is too cold, the car may not start or it may have poor performance. And if you crank the radio, headlights, and AC all at the same time, you'll see your gas mileage go down.
It simply means that your abstraction exposes some of the implementation details, or that you need to be aware of the implementation details when using the abstraction. The term is attributed to Joel Spolsky, circa 2002. See the wikipedia article for more information.
A classic example are network libraries that allow you to treat remote files as local. The developer using this abstraction must be aware that network problems may cause this to fail in ways that local files do not. You then need to develop code to handle specifically errors outside the abstraction that the network library provides.
Wikipedia has a pretty good definition for this
A leaky abstraction refers to any implemented abstraction, intended to reduce (or hide) complexity, where the underlying details are not completely hidden
Or in other words for software it's when you can observe implementation details of a feature via limitations or side effects in the program.
A quick example would be C# / VB.Net closures and their inability to capture ref / out parameters. The reason they cannot be captured is due to an implementation detail of how the lifting process occurs. This is not to say though that there is a better way of doing this.
Here's an example familiar to .NET developers: ASP.NET's Page class attempts to hide the details of HTTP operations, particularly the management of form data, so that developers don't have to deal with posted values (because it automatically maps form values to server controls).
But if you wander beyond the most basic usage scenarios the Page abstraction begins to leak and it becomes hard to work with pages unless you understand the class' implementation details.
One common example is dynamically adding controls to a page - the value of dynamically-added controls won't be mapped for you unless you add them at just the right time: before the underlying engine maps the incoming form values to the appropriate controls. When you have to learn that, the abstraction has leaked.
Well, in a way it is a purely theoretical thing, though not unimportant.
We use abstractions to make things easier to comprehend. I may operate on a string class in some language to hide the fact that I'm dealing with an ordered set of characters that are individual items. I deal with an ordered set of characters to hide the fact that I'm dealing with numbers. I deal with numbers to hide the fact that I'm dealing with 1s and 0s.
A leaky abstraction is one that doesn't hide the details its meant to hide. If call string.Length on a 5-character string in Java or .NET I could get any answer from 5 to 10, because of implementation details where what those languages call characters are really UTF-16 data-points which can represent either 1 or .5 of a character. The abstraction has leaked. Not leaking it though means that finding the length would either require more storage space (to store the real length) or change from being O(1) to O(n) (to work out what the real length is). If I care about the real answer (often you don't really) you need to work on the knowledge of what is really going on.
More debatable cases happen with cases like where a method or property lets you get in at the inner workings, whether they are abstraction leaks, or well-defined ways to move to a lower level of abstraction, can sometimes be a matter people disagree on.
I'll continue in the vein of giving examples by using RPC.
In the ideal world of RPC, a remote procedure call should look like a local procedure call (or so the story goes). It should be completely transparent to the programmer such that when they call SomeObject.someFunction() they have no idea if SomeObject (or just someFunction for that matter) are locally stored and executed or remotely stored and executed. The theory goes that this makes programming simpler.
The reality is different because there's a HUGE difference between making a local function call (even if you're using the world's slowest interpreted language) and:
calling through a proxy object
serializing your parameters
making a network connection (if not already established)
transmitting the data to the remote proxy
having the remote proxy restore the data and call the remote function on your behalf
serializing the return value(s)
transmitting the return values to the local proxy
reassembling the serialized data
returning the response from the remote function
In time alone that's about three orders (or more!) of magnitude difference. Those three+ orders of magnitude are going to make a huge difference in performance that will make your abstraction of a procedure call leak rather obviously the first time you mistakenly treat an RPC as a real function call. Further a real function call, barring serious problems in your code, will have very few failure points outside of implementation bugs. An RPC call has all of the following possible problems that will get slathered on as failure cases over and above what you'd expect from a regular local call:
you might not be able to instantiate your local proxy
you might not be able to instantiate your remote proxy
the proxies may not be able to connect
the parameters you send may not make it intact or at all
the return value the remote sends may not make it intact or at all
So now your RPC call which is "just like a local function call" has a whole buttload of extra failure conditions you don't have to contend with when doing local function calls. The abstraction has leaked again, even harder.
In the end RPC is a bad abstraction because it leaks like a sieve at every level -- when successful and when failing both.
What is abstraction?
Abstraction is a way of simplifying the world.
It means you don't have to worry about what is actually happening under the hood.
Example: Flying a 737/747 is "abstracted" away
Planes are complicated systems: it involves: jet engines, oxygen systems, electrical systems, landing gear systems etc.
...but the pilot doesn't have to worry about it... all of that is "abstracted away". The only thing a pilot needs to focus on is yoke (i.e. steering wheel of the plane).
He pushes the yoke left to go left, and right to go right etc.
....that is in an ideal world. In reality, flying a plane is much more complicated. Because many details ARE NOT "abstracted away".
Leaky Abstractions in 737 Example
Pilots in reality have to worry about a LOT of things: wind speed, thrust, angles of attack, fuel, altitude, weather problems, angles of descent. Computers can help the pilot in these tasks, but not everything is automated / simplified......not everything is "abstracted away".
e.g. If the pilot pulls up too hard on the column - the plane will obey, but then the plane might stall, and that's really bad.
In other words, it is not enough for the pilot to simply control the steering wheel without knowing anything else.........nooooo.......the pilot must know about the underlying risks and limitations of the plane before the pilot flies one.......the pilot must know how the plane works, and how the plane flies; the pilot must know implementation details..... that pulling up too hard will lead to a stall, or that landing too steeply will destroy the plane etc.
Those things are not abstracted away. A lot of things are abstracted, but not everything. The abstraction is "leaky".
Leaky Abstractions in Code
......it's the same thing in your code. If you don't know the underlying implementation details, then you're gonna have problems.
ORMs abstract a lot of the hassle in dealing with database queries, but if you've ever done something like:
User.all.each do |user|
puts user.name # let's print each user's name
end
Then you will realise that's a nice way to kill your app. You need to know that calling User.allwith 25 million users is going to spike your memory usage, and is going to cause problems. You need to know some underlying details. The abstraction is leaky.
An example in the django ORM many-to-many example:
Notice in the Sample API Usage that you need to .save() the base Article object a1 before you can add Publication objects to the many-to-many attribute. And notice that updating the many-to-many attribute saves to the underlying database immediately, whereas updating a singular attribute is not reflected in the db until the .save() is called.
The abstraction is that we are working with an object graph, where single-value attributes and mult-value attributes are just attributes. But the implementation as a relational database backed data store leaks... as the integrity system of the RDBS appears through the thin veneer of an object interface.
The fact that at some point, which will guided by your scale and execution, you will be needed to get familiar with the implementation details of your abstraction framework in order to understand why it behave that way it behave.
For example, consider this SQL query:
SELECT id, first_name, last_name, age, subject FROM student_details;
And its alternative:
SELECT * FROM student_details;
Now, they do look like a logically equivalent solutions, but the performance of the first one is better due the individual column names specification.
It's a trivial example but eventually it comes back to Joel Spolsky quote:
All non-trivial abstractions, to some degree, are leaky.
At some point, when you will reach a certain scale in your operation, you will want to optimize the way your DB (SQL) works. To do it, you will need to know the way relational databases works. It was abstracted to you in the beginning, but it's leaky. You need to learn it at some point.
Assume, we have the following code in a library:
Object[] fetchDeviceColorAndModel(String serialNumberOfDevice)
{
//fetch Device Color and Device Model from DB.
//create new Object[] and set 0th field with color and 1st field with model value.
}
When the consumer calls the API, they get an Object[]. The consumer has to understand that the first field of the object array has color value and second field is the model value. Here the abstraction has leaked from library to the consumer code.
One of the solutions is to return an object which encapsulates Model and Color of the Device. The consumer can call that object to get the model and color value.
DeviceColorAndModel fetchDeviceColorAndModel(String serialNumberOfTheDevice)
{
//fetch Device Color and Device Model from DB.
return new DeviceColorAndModel(color, model);
}
Leaky abstraction is all about encapsulating state. very simple example of leaky abstraction:
$currentTime = new DateTime();
$bankAccount1->setLastRefresh($currentTime);
$bankAccount2->setLastRefresh($currentTime);
$currentTime->setTimestamp($aTimestamp);
class BankAccount {
// ...
public function setLastRefresh(DateTimeImmutable $lastRefresh)
{
$this->lastRefresh = $lastRefresh;
} }
and the right way(not leaky abstraction):
class BankAccount
{
// ...
public function setLastRefresh(DateTime $lastRefresh)
{
$this->lastRefresh = clone $lastRefresh;
}
}
more description here.

Can TDD be a valid alternative to overkill data validation?

Consider these two data validation scenarios:
Check everything everywhere
Make sure that every method that takes one or more arguments actually checks them to ensure that they're syntactically valid.
Pros
Very fine check granularity.
If the code that is being written is for some kind of library we make sure to limit the damage that can be done if the developers that will be using it fail to provide valid data.
Cons
It's costly to always perform checks that most of the time shouldn't be needed.
It's still possible to forget to add a check every now and then.
More code is being written and hence in need of maintenance.
Make use of TDD goodness
Validate data only when it enters your code from the external world.
To make sure that internally data will be always syntactically correct, create tests that check every method that returns a value. To make sure that if valid data enters, valid data exits.
The pros and the cons are practically switched with the ones from the former approach.
As of now I'm using the first approach, but since I'm employing test driven development I thought that maybe I could go with the second one.
The advantages are clear, still, I wonder if it's as secure as the first method.
It sounds like the first method is contract driven, and one aspect of that is that you also need to verify that what you return from any public interface meets the contract.
But, I think that both approaches are valid, but very different.
TDD only partially deals with the public interface, in that it should check that every input is properly validated, unfortunately, unless you have all your validation in separate functions, to adequately test, it becomes very difficult to ensure that this function of 3 or 4 parameters is being properly tested for validity. The number of tests you have to write is quite high, in either approach.
If you are using a library, then in every function that can be called directly from the outside (outside being outside the library) then you will need to check that every input is valid, and that invalid input is handled as per the contract, either returning a null or throwing an exception. But, it must be in agreement with the documentation.
Once you have verified it, then there is no reason to force the verification on private functions as those can only be called from within the library, and you should be verifying that you are only dealing with valid data.
Lots of tests will be needed, regardless, unfortunately. All these tests do is to ensure that you don't have any surprise problems, but that should generally help justify the cost of writing and maintaining them.
As to your question, if your tests are really well written, and you ensure that all validity checks are done completely, then it should be as secure, but the risk is that if you believe it is secure and you have poorly written tests then it will actually be worse than no tests, as there is an assumption that your tests are well-written.
I would use both methods, until you know your tests are well-written then just go with TDD.
My opinion is that in the first scenario, two of your Cons outweigh everything else:
It's costly to always perform checks
that most of the time shouldn't be
needed.
More code is being written and hence
in need of maintenance.
Also, technically TDD has no bearing on this question, because it is not a testing technique. More later...
To mitigate the Cons I would strongly advocate (as I think you say) splitting the code into an outside and an inside: The outside is where all the validation occurs. Hopefully this is but a thin wrapper around the inside, to prevent GIGO. Once inside, data never needs to be validated again.
As for TDD, I would strongly advocate (as you are now doing) employing it to develop your code, with the added benefit of leaving a trail of tests that become a regression test suite. Now you will naturally develop your outside code to perform robust validation, with the promise of easily adding any checks that you might initially forget. Your inside code can be developed assuming it will only handle valid data, but TDD will still give you the confidence that it will function to spec.
I'm saying that I would go with the second approach, as I've described, independently of whether I'm developing with TDD, or not (but TDD is always my first choice).
The advantages are clear, still, I wonder if it's as secure as the first method.
This completely depends on how well you test it.
This could be just as secure, if the following two criteria are met:
Every publicly exposed means of adding data to the system are validated completely
Every internal method that translates data is completely and adequately tested
However, I question that this would be easier or that it would require less code. The amount of code required to check every public entry point is going to be very similar to the amount of code required to validate each method. You're going to need more checks in the entry points, since they'll have to check things that might otherwise be checked internally.
For the second method, you need two good sets of tests. You must not only check that
To make sure that if valid data
enters, valid data exits.
You must also check that if Invalid data enters, an exception is thrown. I suppose you still have to validate data and kick out if you have invalid data. This is really the only way if you don't want pesky ArgumentNullException s or other cryptic errors in your production application. However TDD can really toughen up the quality of all that checking (especially with Fuzz Testing).
One item is missing from your list of Pros and Cons and that is something important enough to make unit testing a much more safer method than maniac parameters checking.
You just have to consider the When and the Where.
For unit testing the when and the where are:
when: at design time
where: in a dedicated source file outside of the application code
For overkill data checking they are:
when: at runtime
where: entangled in the application source code, typically using asserts.
That is the point: code covered by unit testing detects errors at design time when you run the tests, if you are the paranoid and schizofrenic kind of tester (the bests) you write tests designed to break whatever can be, checking each data boundary and perverse input. You also use code coverage tools to ensure every branch of every alternative is tested. You have no limit : tests lies in their own files and do not clutter application. Doesn't matter if you get ten times as many test lines than the actual application code, no run time penalty, no readability penalty.
On the other hand integrated overkill testing detects errors at runtime. In the worst-case it will detects errors on the user system, where you can do nothing about it (if even you ever heard of this error happening). Also even if you are the paranoid kind you will have to limit your testing. Assertion just can't be 90 percents of the application code. It raise readability issues, maintenance, often heavy performances penalty. Where will you stop then: only checking parameters for external input ? Checking every possible or impossible inputs of inner functions ? Checking every loop invariant ? Also testing behavior when out of flow data (globals, system files, etc) is changed ? You must also be conscious that assertion code can also contain some bugs. What if the formula of an assertion perform a divide. You must ensure it will not lead by a DIVIDE-BY-ZERO error or such ?
Another problem is that in many cases you just don't know what can be done when an assertion failure. If you are at a real entry point you can return back something understandable for your user or the lib user... when you are checking innner functions

What do you think of the Managed Contract Tools library

I recently saw this video
http://channel9.msdn.com/pdc2008/TL51/ about the managed Contract tools library which certainly looks very interesting. Sadly it seems they won't include this into the language itself which would have been more elegant as in Spec#. It would be nice to have is as both options in C#4.0 in fact as the Contracts adds a lot of noise to the business code.
Have anybody here used it and have some real world feedback? Could you also add contracts to class Properties and even variables? Something like
decimal Percentage (min 0, max 1)
string NotNullString (not null, regex("??"))
would maybe been nice.
I'm trying it, but I think that the library is too young to be used seriously on big projects, at least with the static checks enabled: compiling is very slow and the specific warning are not very clear to read.
Runtime checkings can be used without problems, because they seems to be implemented as Debug.Assert. At least you document methods.
In order to add contracts to properties i would add constraints to the set property, but in this specific case I think it would be better writing a class that could actually encapsulate the requirements, allowing the construction of good objects only. Anyway:
private decimal _Percentage;
decimal Percentage
{
get{ return _Percentage;}
set
{
CodeContract.RequiresAlways(value <= 1);
CodeContract.RequiresAlways(value >= 0);
_Percentage = value;
}
}
p.s: It seems to me that the trend in C# goes in the dynamic typing direction, instead of going to the strict and strong typing coding methods. I think that DbC works better with strong typing, at least because it let you add more requirements to types and to functions.
It seems the Contract Library could be a nice fit for Linq2Sql. Define fields and constraint in your sql database and contracts could be generated for you. This could be a nice introduction to Contracts.

Resources