How do you evaluate reliability in software? - reliability

We are currently setting up the evaluation criteria for a trade study we will be conducting.
One of the criteria we selected is reliability (and/or robustness - are these the same?).
How do you assess that software is reliable without being able to afford much time evaluating it?
Edit: Along the lines of the response given by KenG, to narrow the focus of the question:
You can choose among 50 existing software solutions. You need to assess how reliable they are, without being able to test them (at least initially). What tangible metrics or other indicators can you use to evaluate said reliability?

Reliability and robustness are two different attributes of a system:
Reliability
The IEEE defines it as "... the ability of a system or component to perform its required functions under stated conditions for a specified period of time."
Robustness
A system is robust if it continues to operate despite abnormalities in input, calculations, etc.
So a reliable system performs its functions as it was designed to, within constraints; a robust system continues to operate when the unexpected/unanticipated occurs.
If you have access to any history of the software you're evaluating, some idea of reliability can be inferred from reported defects, number of 'patch' releases over time, even churn in the code base.
Does the product have automated test processes? Test coverage can be another indication of confidence.
Some projects using agile methods may not fit these criteria well - frequent releases and a lot of refactoring are expected
Check with current users of the software/product for real world information.
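If the code is available, a rough way to put numbers on "churn in the code base" is to mine the version history. Here is a minimal sketch in Python, assuming a local git checkout and the standard git CLI (the thresholds you draw from it are up to you):

import subprocess
from collections import defaultdict

def churn_by_month(repo_path="."):
    # One date line per commit, followed by "<added>\t<deleted>\t<file>" lines.
    out = subprocess.run(
        ["git", "log", "--numstat", "--date=short", "--pretty=format:%ad"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    churn = defaultdict(int)
    month = None
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            added, deleted = parts[0], parts[1]
            if added.isdigit() and deleted.isdigit():   # '-' appears for binary files
                churn[month] += int(added) + int(deleted)
        elif line.strip():
            month = line.strip()[:7]                    # e.g. "2009-06"
    return dict(churn)

if __name__ == "__main__":
    for month, lines in sorted(churn_by_month().items()):
        print(month, lines)

A steadily shrinking churn figure (and a shrinking stream of 'patch' releases) is one crude signal of a stabilising code base.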

It depends on what type of software you're evaluating. A website's main (and maybe only) criteria for reliability might be its uptime. NASA will have a whole different definition for reliability of its software. Your definition will probably be somewhere in between.
If you don't have a lot of time to evaluate reliability, it is absolutely critical that you automate your measurement process. You can use continuous integration tools to make sure that you only ever have to manually find a bug once.
I recommend that you or someone in your company read Continuous Integration: Improving Software Quality and Reducing Risk. I think it will help lead you to your own definition of software reliability.

Talk to people already using it. You can test yourself for reliability, but it's difficult, expensive, and can be very unreliable depending on what you're testing, especially if you're short on time. Most companies will be willing to put you in contact with current clients if it will help sell you their software and they will be able to give you a real-world idea of how the software handles.

As with anything, if you don't have the time to assess something yourself, then you have to rely on the judgement of others.

Reliability is one of three aspects of something's effectiveness. The other two are Maintainability and Availability.
An interesting paper... http://www.barringer1.com/pdf/ARMandC.pdf discusses this in more detail, but generally,
Reliability is based on the probability that a system will break, i.e., the more likely it is to break, the less reliable it is. In systems other than software it is often measured as Mean Time Between Failures (MTBF); this is a common metric for things like a hard disk (10,000 hrs MTBF). In software, I guess you could measure it as the mean time between critical system failures, between application crashes, between unrecoverable errors, or between errors of any kind that impede or adversely affect normal system productivity.
Maintainability is a measure of how long/how expensive (how many man-hours and/or other resources) it takes to fix it when it does break. In software, you could add to this concept how long/how expensive it is to enhance or extend the software (if that is an ongoing requirement).
Availability is a combination of the first two, and answers the planner's question: if I had 100 of these things running for ten years, after figuring in the failures and how long each failed unit was unavailable while it was being fixed or repaired, how many of the 100, on average, would be up and running at any one time? 20%, or 98%?
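To make that concrete, the usual back-of-the-envelope relationship is availability = MTBF / (MTBF + MTTR). A tiny Python illustration (the figures are made up):

def availability(mtbf_hours, mttr_hours):
    # Steady-state availability: the fraction of time a unit is up.
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A unit that fails every 500 hours on average and takes 4 hours to repair:
fleet = 100
a = availability(mtbf_hours=500, mttr_hours=4)
print(f"availability = {a:.2%}; roughly {a * fleet:.0f} of {fleet} units up at any time")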

Well, the keyword 'reliable' can lead to different answers... When thinking of reliability, I think of two aspects:
always giving the right answer (or the best answer)
always giving the same answer
Either way, I think it boils down to some repeatable tests. If the application in question is not built with a strong suite of unit and acceptance tests, you can still come up with a set of manual or automated tests to perform repeatedly.
The fact that the tests always return the same results will show that aspect #2 is taken care of. For aspect #1 it really is up to the test writers: come up with good tests that would expose bugs or imperfections.
I can't be more specific without knowing what the application is about, sorry. For instance, a messaging system would be reliable if messages were always delivered, never lost, never contain errors, etc etc... a calculator's definition of reliability would be much different.
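Coming back to aspect #2: that check can be made almost entirely mechanical regardless of what the application does. A minimal sketch in Python; the command and its arguments are placeholders for whatever you are evaluating:

import subprocess

def is_repeatable(cmd, runs=20):
    # Run the same command with the same input repeatedly and
    # verify the exit code and output never change.
    outputs = set()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        outputs.add((result.returncode, result.stdout))
    return len(outputs) == 1

if __name__ == "__main__":
    print(is_repeatable(["./app", "--input", "fixture.dat"]))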

My advice is to follow SRE methodology around SLI, SLO and SLA, best summarized in free ebooks:
Site Reliability Engineering, which provides the principal introduction
The Site Reliability Workbook, which comes with concrete examples
Looking at reliability more from a tooling perspective, you need:
monitoring infrastructure (I recommend Prometheus)
alerting (I recommend Prometheus AlertManager, OpsGenie or PagerDuty)
SLO computation tooling for instance slo-exporter
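As a toy illustration of the SLO/error-budget arithmetic those books describe (the SLO value and the 30-day window are made up for the example):

def error_budget(slo, window_days=30):
    # Allowed 'bad' fraction and the equivalent downtime for the window.
    budget_fraction = 1.0 - slo
    budget_minutes = budget_fraction * window_days * 24 * 60
    return budget_fraction, budget_minutes

fraction, minutes = error_budget(slo=0.999)
print(f"99.9% over 30 days = a budget of about {minutes:.0f} minutes of downtime")

observed_error_rate = 0.004   # e.g. 0.4% of requests failing right now
print(f"burn rate = {observed_error_rate / fraction:.0f}x the budget")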

You will have to go into the process by understanding and fully accepting that you will be making a compromise, which could have negative effects if reliability is a key criterion and you don't have (or are unwilling to commit) the resources to appropriately evaluate based on that.
Having said that - determine what the key requirements are that make software reliability critical, then devise tests to evaluate based on those requirements.
Robustness and reliability cross in their relationship to each other, but are not necessarily the same.
If you have a data server that cannot handle more than 10 connections and you expect 100000 connections - it is not robust. It will be unreliable if it dies at > 10 connections. If that same server can handle the number of required connections but intermittently dies, you could say that it is still not robust and not reliable.
My suggestion is that you consult with an experienced QA person who is knowledgeable in the field for the study you will conduct. That person will be able to help you devise tests for key areas -hopefully within your resource constraints. I'd recommend a neutral 3rd party (rather than the software writer or vendor) to help you decide on the key features you'll need to test to make your determination.

If you can't test it, you'll have to rely on the reputation of the developer(s) along with how well they followed the same practices on this application as their other tested apps. Example: Microsoft does not do a very good job with version 1 of their applications, but versions 3 & 4 are usually pretty good (Windows ME was version 0.0001).

Depending on the type of service you are evaluating, you might get reliability metrics or SLI - service level indicators - metrics capturing how well the service/product is doing. For example - process 99% of requests under 1sec.
Based on the SLI you might set up service level agreements - a contract between you and the software provider on what SLOs (service level objectives) you would like, with the consequences of them not delivering those.
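For instance, the "process 99% of requests under 1 sec" SLI above could be computed from a latency sample like this (the sample values are invented; in practice they would come from your monitoring system):

latencies_s = [0.12, 0.35, 0.08, 1.40, 0.22, 0.95, 0.30, 2.10, 0.18, 0.27]

threshold_s = 1.0
sli = sum(1 for t in latencies_s if t <= threshold_s) / len(latencies_s)
print(f"SLI: {sli:.1%} of requests under {threshold_s}s")
print("SLO met" if sli >= 0.99 else "SLO missed")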

Related

How can the reliability of Software be checked through analysis?

How can we analyze the software reliability? How to check the reliability of any application or product?
First try to define "software reliability" and the way to quantify it.
If you accomplish this task, you will probably be able to "check" this characteristic.
The most effective way to check reliability is going to be to run your software and gather statistics on its actual reliability. There are too many variables in play, both at the hardware and software levels, to realistically analyze reliability prior to execution, with the possible exception of groups with massive resources like NASA.
There are various methods for determining whether a piece of software meets a specification, but most of the really productive ones do this by construction, i.e., by constraining the way in which the software is written so that it can be easily shown to be correct. Check out VDM, Z and the B toolkit for schemes for doing this sort of thing. Note that these tend to be expensive ways to program if you're not in a safety-critical systems environment.
Proving the correctness of the specification itself is really non-trivial!
Reliability is about continuity of correct service.
The best approach to assess the reliability of software is dynamic analysis, in other words: testing.
In order to reduce your testing time you may want to apply input profiles different from the operational one.
Apply various input distributions, measure how much time your software runs without failure. Then find out how far your input distributions are from the operational profile and draw your conclusion about how long the software would have run with the operational profile.
This involves modeling techniques such as Markov chains or stochastic Petri nets.
For further digging, useful keywords are: fault forecasting and statistical testing.
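A very rough sketch of the statistical-testing idea in Python; run_once is a stand-in for however you actually drive the system under test, and the input generator is a placeholder for one of your input profiles:

import random

def run_once(x):
    # Stand-in for invoking the real system; here we fake a roughly 1-in-10,000 failure.
    return random.random() > 1e-4

def estimate_failure_rate(input_generator, trials=10_000):
    failures = sum(0 if run_once(input_generator()) else 1 for _ in range(trials))
    return failures / trials

rate = estimate_failure_rate(lambda: random.randint(1, 1_000_000))
print(f"estimated failure probability per input: {rate:.2e}")

The harder (and more interesting) part is the last step the answer mentions: relating the profiles you tested with back to the operational profile.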

Does a disaster-proof language exist?

When creating system services which must have a high reliability, I often end up writing a lot of 'failsafe' mechanisms for cases like: communications which are gone (for instance communication with the DB); what would happen if the power is lost and the service restarts; how to pick up the pieces and continue in a correct way (remembering that while picking up the pieces the power could go out again...); etc.
I can imagine that for not too complex systems, a language which would cater for this would be very practical. So a language which would remember its state at any given moment, no matter if the power gets cut off, and continues where it left off.
Does this exist yet? If so, where can I find it? If not, why can't this be realized? It would seem to me very handy for critical systems.
p.s. In case the DB connection is lost, it would signal that a problem arose, and manual intervention is needed. The moment the connection is restored, it would continue where it left off.
EDIT:
Since the discussion seems to have died off, let me add a few points (while waiting until I can add a bounty to the question):
The Erlang response seems to be top rated right now. I'm aware of Erlang and have read the pragmatic book by Armstrong (the principal creator). It's all very nice (although functional languages make my head spin with all the recursion), but the 'fault tolerant' bit doesn't come automatically. Far from it. Erlang offers a lot of supervisors and other methodologies to supervise a process, and restart it if necessary. However, to properly make something which works with these structures, you need to be quite the Erlang guru, and need to make your software fit all these frameworks. Also, if the power drops, the programmer too has to pick up the pieces and try to recover the next time the program restarts.
What I'm searching is something far simpler:
Imagine a language (as simple as PHP for instance), where you can do things like DB queries, act on them, perform file manipulations, perform folder manipulations, etc.
Its main feature, however, should be: if the power dies and the thing restarts, it takes off where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume, etc.
Last but not least, if the DB connection drops and can't be restored, the language just halts, and signals (syslog perhaps) for human intervention, and then carries on where it left off.
A language like this would make a lot of services programming a lot easier.
EDIT:
It seems (judging by all the comments and answers) that such a system doesn't exist. And probably will not in the near foreseeable future due to it being (near?) impossible to get right.
Too bad... Again, I'm not looking for this language (or framework) to get me to the moon, or to use it to monitor someone's heart rate. But for small periodic services/tasks which always end up having loads of code handling border cases (power failure somewhere in the middle, connections dropping and not coming back up), a pause-here, fix-the-issues, continue-where-you-left-off approach would work well.
(or a checkpoint approach as one of the commenters pointed out (like in a videogame). Set a checkpoint.... and if the program dies, restart here the next time.)
Bounty awarded:
At the last possible minute, when everyone was coming to the conclusion it can't be done, Stephen C comes up with Napier88, which seems to have the attributes I was looking for.
Although it is an experimental language, it does prove it can be done, and it is something which is worth investigating more.
I'll be looking at creating my own framework (with persistent state and snapshots perhaps) to add the features I'm looking for in .Net or another VM.
Everyone thanks for the input and the great insights.
Erlang was designed for use in Telecommunication systems, where high-rel is fundamental. I think they have standard methodology for building sets of communicating processes in which failures can be gracefully handled.
ERLANG is a concurrent functional language, well suited for distributed, highly concurrent and fault-tolerant software. An important part of Erlang is its support for failure recovery. Fault tolerance is provided by organising the processes of an ERLANG application into tree structures. In these structures, parent processes monitor failures of their children and are responsible for their restart.
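This is not Erlang, but a minimal Python sketch of the supervision idea, just to make the restart-on-failure structure concrete (the worker and the restart limit are illustrative):

import multiprocessing
import time

def worker():
    # Placeholder workload that crashes after a while to exercise the supervisor.
    time.sleep(1)
    raise RuntimeError("simulated fault")

def supervise(target, max_restarts=5):
    restarts = 0
    while restarts < max_restarts:
        proc = multiprocessing.Process(target=target)
        proc.start()
        proc.join()                                  # wait for the worker to exit or crash
        restarts += 1
        print(f"worker exited with code {proc.exitcode}; restart {restarts}/{max_restarts}")
    print("restart limit reached; escalating to a human")

if __name__ == "__main__":
    supervise(worker)

Real Erlang supervisors add a great deal more (restart strategies, supervision trees, links and monitors), but the shape is the same.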
Software Transactional Memory (STM) combined with nonvolatile RAM would probably satisfy the OP's revised question.
STM is a technique for implementing "transactions", e.g., sets of actions that are done effectively as an atomic operation, or not at all. Normally the purpose of STM is to enable highly parallel programs to interact over shared resources in a way which is easier to understand than traditional lock-that-resource programming, and has arguably lower overhead by virtue of having a highly optimistic lock-free style of programming.
The fundamental idea is simple: all reads and writes inside a "transaction" block are recorded (somehow!); if any two threads conflict on these sets (read-write or write-write conflicts) at the end of either of their transactions, one is chosen as the winner and proceeds, and the other is forced to roll back its state to the beginning of the transaction and re-execute.
If one insisted that all computations were transactions, and the state at the beginning(/end) of each transaction was stored in nonvolatile RAM (NVRAM), a power fail could be treated as a transaction failure resulting in a "rollback". Computations would proceed only from transacted states in a reliable way. NVRAM these days can be implemented with Flash memory or with battery backup. One might need a LOT of NVRAM, as programs have a lot of state (see minicomputer story at end). Alternatively, committed state changes could be written to log files that were written to disk; this is the standard method used by most databases and by reliable filesystems.
The current question with STM is, how expensive is it to keep track of the potential transaction conflicts? If implementing STM slows the machine down by an appreciable amount, people will live with existing slightly unreliable schemes rather than give up that performance. So far the story isn't good, but then the research is early.
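To make the read/write-set idea concrete, here is a toy optimistic-transaction sketch in Python; a single coarse version counter stands in for real per-location conflict detection, so this is an illustration of the rollback-and-retry shape rather than a real STM:

import threading

_lock = threading.Lock()
_version = 0
_state = {"balance": 100}

def atomically(update):
    # Retry `update(snapshot) -> new_state` until it commits without a conflict.
    global _version, _state
    while True:
        with _lock:
            seen, snapshot = _version, dict(_state)
        new_state = update(snapshot)              # compute outside the lock
        with _lock:
            if _version == seen:                  # nobody committed in the meantime
                _state, _version = new_state, _version + 1
                return new_state
            # else: conflict -> roll back (discard new_state) and retry

def deposit(amount):
    return atomically(lambda s: {**s, "balance": s["balance"] + amount})

threads = [threading.Thread(target=deposit, args=(1,)) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(_state)   # {'balance': 200}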
People haven't generally designed languages for STM; for research purposes, they've mostly enhanced Java with STM (see Communications of ACM article in June? of this year). I hear MS has an experimental version of C#. Intel has an experimental version for C and C++. The Wikipedia page has a long list. And the functional programming guys are, as usual, claiming that the side-effect-free property of functional programs makes STM relatively trivial to implement in functional languages.
If I recall correctly, back in the 70s there was considerable early work in distributed operating systems, in which processes (code+state) could travel trivially from machine to machine. I believe several such systems explicitly allowed node failure, and could restart a process in a failed node from saved state in another node. Early key work was on the Distributed Computing System by Dave Farber. Because designing languages back in the 70s was popular, I recall DCS had its own programming language, but I don't remember the name. If DCS didn't allow node failure and restart, I'm fairly sure the follow-on research systems did.
EDIT: A 1996 system which appears on first glance to have the properties you desire is documented here.
Its concept of atomic transactions is consistent with the ideas behind STM.
(Goes to prove there isn't a lot new under the sun.)
A side note: back in the 70s, Core Memory was still king. Core, being magnetic, was nonvolatile across power fails, and many minicomputers (and I'm sure the mainframes) had power fail interrupts that notified the software some milliseconds ahead of loss of power. Using that, one could easily store the register state of the machine and shut it down completely. When power was restored, control would return to a state-restoring point, and the software could proceed. Many programs could thus survive power blinks and reliably restart. I personally built a time-sharing system on a Data General Nova minicomputer; you could actually have it running 16 teletypes full blast, take a power hit, and come back up and restart all the teletypes as if nothing happened. The change from cacophony to silence and back was stunning; I know, because I had to repeat it many times to debug the power-failure management code, and it of course made a great demo (yank the plug, deathly silence, plug back in...). The name of the language that did this was, of course, Assembler :-}
From what I know¹, Ada is often used in safety critical (failsafe) systems.
Ada was originally targeted at embedded and real-time systems. Notable features of Ada include: strong typing, modularity mechanisms (packages), run-time checking, parallel processing (tasks), exception handling, and generics. Ada 95 added support for object-oriented programming, including dynamic dispatch.
Ada supports run-time checks in order to protect against access to unallocated memory, buffer overflow errors, off-by-one errors, array access errors, and other detectable bugs. These checks can be disabled in the interest of runtime efficiency, but can often be compiled efficiently. It also includes facilities to help program verification.
For these reasons, Ada is widely used in critical systems, where any anomaly might lead to very serious consequences, i.e., accidental death or injury. Examples of systems where Ada is used include avionics, weapon systems (including thermonuclear weapons), and spacecraft.
N-Version programming may also give you some helpful background reading.
¹That's basically one acquaintance who writes embedded safety-critical software.
I doubt that the language features you are describing are possible to achieve.
And the reason for that is that it would be very hard to define common and general failure modes and how to recover from them. Think for a second about your sample application - some website with some logic and database access. And let's say we have a language that can detect a power shutdown and subsequent restart, and somehow recover from it. The problem is that it is impossible for the language to know how to recover.
Let's say your app is an online blog application. In that case it might be enough to just continue from the point where we failed and all will be ok. However, consider a similar scenario for an online bank. Suddenly it's no longer smart to just continue from the same point. For example, if I was trying to withdraw some money from my account, and the computer died right after the checks but before it performed the withdrawal, and it then comes back one week later, it will give me the money even though my account is now in the negative.
In other words, there is no single correct recovery strategy, so this is not something that can be implemented into the language. What language can do is to tell you when something bad happens - but most languages already support that with exception handling mechanisms. The rest is up to application designers to think about.
There are a lot of technologies that allow designing fault tolerant applications. Database transactions, durable message queues, clustering, hardware hot swapping and so on and on. But it all depends on concrete requirements and how much the end user is willing to pay for it all.
There is an experimental language called Napier88 that (in theory) has some attributes of being disaster-proof. The language supports Orthogonal Persistence, and in some implementations this extended to include the state of the entire computation. Specifically, when the Napier88 runtime system check-pointed a running application to the persistent store, the current thread state would be included in the checkpoint. If the application then crashed and you restarted it in the right way, you could resume the computation from the checkpoint.
Unfortunately, there are a number of hard issues that need to be addressed before this kind of technology is ready for mainstream use. These include figuring out how to support multi-threading in the context of orthogonal persistence, figuring out how to allow multiple processes to share a persistent store, and scalable garbage collection of persistent stores.
And there is the problem of doing Orthogonal Persistence in a mainstream language. There have been attempts to do OP in Java, including one that was done by people associated with Sun (the Pjama project), but there is nothing active at the moment. The JDO / Hibernate approaches are more favoured these days.
I should point out that Orthogonal Persistence isn't really disaster-proof in the large sense. For instance, it cannot deal with:
reestablishment of connections, etc with "outside" systems after a restart,
application bugs that cause corruption of persisted data, or
loss of data due to something bringing down the system between checkpoints.
For those, I don't believe there are general solutions that would be practical.
The majority of such efforts - termed 'fault tolerance' - are around the hardware, not the software.
The extreme example of this is Tandem, whose 'nonstop' machines have complete redundancy.
Implementing fault tolerance at a hardware level is attractive because a software stack is typically made from components sourced from different providers - your high availability software application might be installed alongside some decidedly shaky other applications and services, on top of an operating system that is flaky and using hardware device drivers that are decidedly fragile.
But at a language level, almost all languages offer the facilities for proper error checking. However, even with RAII, exceptions, constraints and transactions, these code-paths are rarely tested correctly and rarely tested together in multiple-failure scenarios, and it's usually in the error handling code that the bugs hide. So it's more about programmer understanding, discipline and trade-offs than about the languages themselves.
Which brings us back to the fault tolerance at the hardware level. If you can avoid your database link failing, you can avoid exercising the dodgy error handling code in the applications.
No, a disaster-proof language does not exist.
Edit:
Disaster-proof implies perfection. It brings to mind images of a process which applies some intelligence to resolve unknown, unspecified and unexpected conditions in a logical manner. There is no manner by which a programming language can do this. If you, as the programmer, can not figure out how your program is going to fail and how to recover from it then your program isn't going to be able to do so either.
Disaster from an IT perspective can arise in so many fashions that no one process can resolve all of those different issues. The idea that you could design a language to address all of the ways in which something could go wrong is just incorrect. Due to the abstraction from the hardware many problems don't even make much sense to address with a programming language; yet they are still 'disasters'.
Of course, once you start limiting the scope of the problem; then we can begin talking about developing a solution to it. So, when we stop talking about being disaster-proof and start speaking about recovering from unexpected power surges it becomes much easier to develop a programming language to address that concern even when, perhaps, it doesn't make much sense to handle that issue at such a high level of the stack. However, I will venture a prediction that once you scope this down to realistic implementations it becomes uninteresting as a language since it has become so specific. i.e. Use my scripting language to run batch processes overnight that will recover from unexpected power surges and lost network connections (with some human assistance); this is not a compelling business case to my mind.
Please don't misunderstand me. There are some excellent suggestions within this thread but to my mind they do not rise to anything even remotely approaching disaster-proof.
Consider a system built from non-volatile memory. The program state is persisted at all times, and should the processor stop for any length of time, it will resume at the point it left when it restarts. Therefore, your program is 'disaster proof' to the extent that it can survive a power failure.
This is entirely possible, as other posts have outlined when talking about Software Transactional Memory, 'fault tolerance', etc. Curiously, nobody has mentioned 'memristors', as they would offer a future architecture with these properties, and one that is perhaps not completely a von Neumann architecture either.
Now imagine a system built from two such discrete systems - for a straightforward illustration, one is a database server and the other an application server for an online banking website.
Should one pause, what does the other do? How does it handle the sudden unavailability of its co-worker?
It could be handled at the language level, but that would mean lots of error handling and such, and that's tricky code to get right. That's pretty much no better than where we are today, where machines are not check-pointed but the languages try and detect problems and ask the programmer to deal with them.
It could pause too - at the hardware level they could be tied together, such that from a power perspective they are one system. But that's hardly a good idea; better availability would come from a fault-tolerant architecture with backup systems and such.
Or we could use persistent message queues between the two machines. However, at some point these messages get processed, and they could at that point be too old! Only application logic can really work out what to do in those circumstances, and there we are back to languages delegating to the programmer again.
So it seems that the disaster-proofing is better in the current form - uninterrupted power supplies, hot backup servers ready to go, multiple network routes between hosts, etc. And then we only have to hope that our software is bug-free!
Precise answer:
Ada and SPARK were designed for maximum fault-tolerance and to move all bugs possible to compile-time rather than runtime. Ada was designed by the US Dept of Defense for military and aviation systems, running on embedded devices in such things as airplanes. SPARK is its descendant. There's another language used in the early US space program, HAL/S, geared to handling HARDWARE failure and memory corruption due to cosmic rays.
Practical answer:
I've never met anyone who can code Ada/Spark. For most users the best answer is SQL variants on a DBMS with automatic failover and clustering of servers. Integrity checks guarantee safety. Something like T-SQL or PL/SQL has full transactional security, is Turing-complete, and is pretty tolerant of problems.
Reason there isn't a better answer:
For performance reasons, you can't provide durability for every program operation. If you did, the processing would slow to the speed of your fastest nonvolatile storage. At best, your performance will drop by a thousand or million fold, because of how much slower ANYTHING is than CPU caches or RAM.
It would be the equivalent of going from a Core 2 Duo CPU to the ancient 8086 CPU -- at most you could do a couple hundred operations per second. Except, this would be even SLOWER.
In cases where frequent power cycling or hardware failures exist, you use something like a DBMS, which guarantees ACID for every important operation. Or, you use hardware that has fast, nonvolatile storage (flash, for example) -- this is still much slower, but if the processing is simple, this is OK.
At best your language gives you good compile-time safety checks for bugs, and will throw exceptions rather than crashing. Exception handling is a feature of half the languages in use now.
There are several commercially available frameworks - Veritas, Sun's HA, IBM's HACMP, etc. - which will automatically monitor processes and start them on another server in the event of failure.
There is also expensive hardware like HP's Tandem NonStop range which can survive internal hardware failures.
However, software is built by people, and people love to get it wrong. Consider the cautionary tale of the IEFBR14 program shipped with IBM's MVS. It is basically a NOP dummy program which allows the declarative bits of JCL to happen without really running a program. This is the entire original source code:
IEFBR14 START
BR 14 Return addr in R14 -- branch at it
END
Nothing could be simpler, right? During its long life this program has actually accumulated bug reports and is now on version 4.
That's one bug per three lines of code, and the current version is four times the size of the original.
Errors will always creep in, just make sure you can recover from them.
This question forced me to post this text
(It's quoted from HGTTG by Douglas Adams:)
Click, hum.
The huge grey Grebulon reconnaissance ship moved silently through the black void. It was travelling at fabulous, breathtaking speed, yet appeared, against the glimmering background of a billion distant stars to be moving not at all. It was just one dark speck frozen against an infinite granularity of brilliant night.
On board the ship, everything was as it had been for millennia, deeply dark and Silent.
Click, hum.
At least, almost everything.
Click, click, hum.
Click, hum, click, hum, click, hum.
Click, click, click, click, click, hum.
Hmmm.
A low level supervising program woke up a slightly higher level supervising program deep in the ship's semi-somnolent cyberbrain and reported to it that whenever it went click all it got was a hum.
The higher level supervising program asked it what it was supposed to get, and the low level supervising program said that it couldn't remember exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.
The higher level supervising program considered this and didn't like it. It asked the low level supervising program what exactly it was supervising and the low level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher level supervising program to the problem.
The higher level supervising program went to consult one of its own look-up tables to find out what the low level supervising program was meant to be supervising.
It couldn't find the look-up table.
Odd.
It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.
The sector function supervisor hit immediate problems. It called its supervising agent which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.
Small modules of software — agents — surged through the logical pathways, grouping, consulting, re-grouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.
This made the whole problem very simple to deal with. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.
Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.
This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.
This provided the first major clue as to what it was that was wrong.
Further investigation quickly established what it was that had happened. A meteorite had knocked a large hole in the ship. The ship had not previously detected this because the meteorite had neatly knocked out that part of the ship's processing equipment which was supposed to detect if the ship had been hit by a meteorite.
The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship's sensors couldn't see that there was a hole, and the supervisors which should have said that the sensors weren't working properly weren't working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain, which would have enabled it to see the hole, with them.
The ship tried to think intelligently about this, failed, and then blanked out completely for a bit. It didn't realise it had blanked out, of course, because it had blanked out. It was merely surprised to see the stars jump. After the third time the stars jumped the ship finally realised that it must be blanking out, and that it was time to take some serious decisions.
It relaxed.
Then it realised it hadn't actually taken the serious decisions yet and panicked. It blanked out again for a bit. When it awoke again it sealed all the bulkheads around where it knew the unseen hole must be.
It clearly hadn't got to its destination yet, it thought, fitfully, but since it no longer had the faintest idea where its destination was or how to reach it, there seemed to be little point in continuing. It consulted what tiny scraps of instructions it could reconstruct from the tatters of its central mission module.
"Your !!!!! !!!!! !!!!! year mission is to !!!!! !!!!! !!!!! !!!!!, !!!!! !!!!! !!!!! !!!!!, land !!!!! !!!!! !!!!! a safe distance !!!!! !!!!! ..... ..... ..... .... , land ..... ..... ..... monitor it. !!!!! !!!!! !!!!!..."
All of the rest was complete garbage.
Before it blanked out for good the ship would have to pass on those instructions, such as they were, to its more primitive subsidiary systems.
It must also revive all of its crew.
There was another problem. While the crew was in hibernation, the minds of all of its members, their memories, their identities and their understanding of what they had come to do, had all been transferred into the ship's central mission module for safe keeping. The crew would not have the faintest idea of who they were or what they were doing there. Oh well.
Just before it blanked out for the final time, the ship realised that its engines were beginning to give out too.
The ship and its revived and confused crew coasted on under the control of its subsidiary automatic systems, which simply looked to land wherever they could find to land and monitor whatever they could find to monitor.
Try taking an existing open source interpreted language and see if you could adapt its implementation to include some of these features. Python's default C implementation embeds an internal lock (called the GIL, Global Interpreter Lock) that is used to "handle" concurrency among Python threads by taking turns every 'n' VM instructions. Perhaps you could hook into this same mechanism to checkpoint the code state.
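Short of patching the interpreter, a crude approximation of the same idea can be done in ordinary Python: keep all resumable state in one object, write it out after every step, and reload it on startup. The file name and the "work" below are illustrative only:

import os
import pickle

CHECKPOINT = "job.ckpt"

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_item": 0, "results": []}

def save_state(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes are on disk before renaming
    os.replace(tmp, CHECKPOINT)   # atomic rename: either the old or the new checkpoint exists

state = load_state()
work_items = list(range(1000))
for i in range(state["next_item"], len(work_items)):
    state["results"].append(work_items[i] * 2)     # the "real" work goes here
    state["next_item"] = i + 1
    save_state(state)             # if the power dies here, we resume from the last saved item

This obviously checkpoints only what you put in the state object, which is exactly the limitation the orthogonal-persistence systems mentioned elsewhere try to remove.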
For a program to continue where it left off if the machine loses power, not only would it need to save state to somewhere, the OS would also have to "know" to resume it.
I suppose implementing a "hibernate" feature in a language could be done, but having that happen constantly in the background so it's ready in the event anything bad happens sounds like the OS' job, in my opinion.
Its main feature, however, should be: if the power dies and the thing restarts, it takes off where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume, etc.
... ...
I've looked at Erlang in the past. However nice its fault-tolerance features are... it doesn't survive a power cut. When the code restarts you'll have to pick up the pieces.
If such a technology existed, I'd be VERY interested in reading about it. That said, The Erlang solution would be having multiple nodes--ideally in different locations--so that if one location went down, the other nodes could pick up the slack. If all of your nodes were in the same location and on the same power source (not a very good idea for distributed systems), then you'd be out of luck as you mentioned in a comment follow-up.
The Microsoft Robotics Group has introduced a set of libraries that appear to be applicable to your question.
What is Concurrency and Coordination Runtime (CCR)?
Concurrency and Coordination Runtime (CCR) provides a highly concurrent programming model based on message-passing with powerful orchestration primitives enabling coordination of data and work without the use of manual threading, locks, semaphores, etc. CCR addresses the need of multi-core and concurrent applications by providing a programming model that facilitates managing asynchronous operations, dealing with concurrency, exploiting parallel hardware and handling partial failure.
What is Decentralized Software Services (DSS)?
Decentralized Software Services (DSS) provides a lightweight, state-oriented service model that combines representational state transfer (REST) with a formalized composition and event notification architecture enabling a system-level approach to building applications. In DSS, services are exposed as resources which are accessible both programmatically and for UI manipulation. By integrating service composition, structured state manipulation, and event notification with data isolation, DSS provides a uniform model for writing highly observable, loosely coupled applications running on a single node or across the network.
Most of the answers given are general purpose languages. You may want to look into more specialized languages that are used in embedded devices. The robot is a good example to think about. What would you want and/or expect a robot to do when it recovered from a power failure?
In the embedded world, this can be implemented through a watchdog interrupt and a battery-backed RAM. I've written such myself.
Depending upon your definition of a disaster, it can range from 'difficult' to 'practically impossible' to delegate this responsibility to the language.
Other examples given include persisting the current state of the application to NVRAM after each statement is executed. This only works so long as the computer doesn't get destroyed.
How would a language level feature know to restart the application on a new host?
And in the situation of restoring the application to a host - what if significant time had passed and assumptions/checks made previously were now invalid?
T-SQL, PL/SQL and other transactional languages are probably as close as you'll get to 'disaster proof' - they either succeed (and the data is saved), or they don't. Excluding disabling transactional isolation, it's difficult (but probably not impossible if you really try hard) to get into 'unknown' states.
You can use techniques like SQL Mirroring to ensure that writes are saved in at least two locations concurrently before a transaction is committed.
You still need to ensure you save your state every time it's safe (commit).
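A small illustration of the "commit or it didn't happen" property, shown here with Python and SQLite for brevity (names and amounts are made up; any ACID database behaves the same way):

import sqlite3

# isolation_level=None: we manage transactions ourselves with explicit BEGIN/COMMIT.
conn = sqlite3.connect("bank.db", isolation_level=None)
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('a', 100), ('b', 100)")

def transfer(conn, src, dst, amount):
    try:
        conn.execute("BEGIN")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute("COMMIT")   # both updates become visible together, or not at all
    except Exception:
        conn.execute("ROLLBACK")
        raise

transfer(conn, "a", "b", 25)
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())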
If I understand your question correctly, I think that you are asking whether it's possible to guarantee that a particular algorithm (that is, a program plus any recovery options provided by the environment) will complete (after any arbitrary number of recoveries/restarts).
If this is correct, then I would refer you to the halting problem:
Given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.
I think that classifying your question as an instance of the halting problem is fair considering that you would ideally like the language to be "disaster proof" -- that is, imparting a "perfectness" to any flawed program or chaotic environment.
This classification reduces any combination of environment, language, and program down to "program and a finite input".
If you agree with me, then you'll be disappointed to read that the halting problem is undecidable. Therefore, no "disaster proof" language or compiler or environment could be proven to be so.
However, it is entirely reasonable to design a language that provides recovery options for various common problems.
In the case of power failure.. sounds like to me: "When your only tool is a hammer, every problem looks like a nail"
You don't solve power failure problems within a program. You solve this problem with backup power supplies, batteries, etc.
If the mode of failure is limited to hardware failure, VMware Fault Tolerance claims to do something similar to what you want. It runs a pair of virtual machines across multiple clusters, and using what they call vLockstep, the primary VM sends all state to the secondary VM in real time, so in case of primary failure, the execution transparently flips to the secondary.
My guess is that this wouldn't help communication failure, which is more common than hardware failure. For serious high availability, you should consider distributed systems like Birman's process group approach (paper in pdf format, or book Reliable Distributed Systems: Technologies, Web Services, and Applications ).
The closest approximation appears to be SQL. It's not really a language issue though; it's mostly a VM issue. I could imagine a Java VM with these properties; implementing it would be another matter.
A quick&dirty approximation is achieved by application checkpointing. You lose the "die at any moment" property, but it's pretty close.
I think it's a fundamental mistake for recovery not to be a salient design issue. Punting responsibility exclusively to the environment leads to a generally brittle solution intolerant of internal faults.
If it were me I would invest in reliable hardware AND design the software in a way that it was able to recover automatically from any possible condition. Per your example, database session maintenance should be handled automatically by a sufficiently high-level API. If you have to manually reconnect you are likely using the wrong API.
As others have pointed out, procedural languages embedded in modern RDBMSs are the best you are going to get without the use of an exotic language.
VMs in general are designed for this sort of thing. You could use a VM vendor's (VMware et al.) API to control periodic checkpointing within your application as appropriate.
VMWare in particular has a replay feature (Enhanced Execution Record) which records EVERYTHING and allows point in time playback. Obviously there is a massive performance hit with this approach but it would meet the requirements. I would just make sure your disk drives have a battery backed write cache.
You would most likely be able to find similar solutions for Java bytecode run inside a Java virtual machine. Google "fault tolerant JVM" and "virtual machine checkpointing".
If you do want the program information saved, where would you save it?
It would need to be saved e.g. to disk. But this wouldn't help you if the disk failed, so already it's not disaster-proof.
You are only going to get a certain level of granularity in your saved state. If you want something like this, then probably the best approach is to define your granularity level in terms of what constitutes an atomic operation, and save state to the database before each atomic operation. Then you can restore to the point of the last completed atomic operation.
I don't know of any language that would do this automatically, since the cost of saving state to secondary storage is extremely high. Therefore, there is a tradeoff between level of granularity and efficiency, which would be hard to define in an arbitrary application.
First, implement a fault tolerant application. One where, if you have 8 features and 5 failure modes, you have done the analysis and test to demonstrate that all 40 combinations work as intended (and as desired by the specific customer: no two will likely agree).
Second, add a scripting language on top of the supported set of fault-tolerant features. It needs to be as near to stateless as possible, so almost certainly something non-Turing-complete.
Finally, work out how to handle restoration and repair of scripting language state adapted to each failure mode.
And yes, this is pretty much rocket science.
Windows Workflow Foundation may solve your problem. It's .Net based and is designed graphically as a workflow with states and actions.
It allows for persistence to the database (either automatically or when prompted). You could do this between states/actions. This serialises the entire instance of your workflow into the database. It will be rehydrated and execution will continue when any of a number of conditions is met (certain time, rehydrated programmatically, event fires, etc...)
When a WWF host starts, it checks the persistence DB and rehydrates any workflows stored there. It then continues to execute from the point of persistence.
Even if you don't want to use the workflow aspects, you can probably still just use the persistence service.
As long as your steps were atomic this should be sufficient - especially since I'm guessing you have a UPS so could monitor for UPS events and force persistence if a power issue is detected.
If I were going about solving your problem, I would write a daemon (probably in C) that did all database interaction in transactions so you won't get any bad data inserted if it gets interrupted. Then have the system start this daemon at startup.
Obviously developing web stuff in C is quite a bit slower than doing it in a scripting language, but it will perform better and be more stable (if you write good code, of course :).
Realistically, I'd write it in Ruby (or PHP or whatever) and have something like Delayed Job (or cron or whatever scheduler) run it every so often, because I wouldn't need stuff updating every clock cycle.
Hope that makes sense.
To my mind, the concept of failure recovery is, most of the time, a business problem, not a hardware or language problem.
Take an example : you have one UI Tier and one subsystem.
The subsystem is not very reliable but the client on the UI tier should perceive it as if it were.
Now, imagine that somehow your subsystem crashes; do you really think that the language you imagine can work out for you how to handle the UI tier that depends on this subsystem?
Your user should be explicitly aware that the subsystem is not reliable. If you use messaging to provide high reliability, the client MUST know that (if he isn't aware, the UI can just freeze waiting for a response which may eventually come two weeks later). If he has to be aware of this, it means that any abstractions to hide it will eventually leak.
By client, I mean end user. And the UI should reflect this unreliability and not hide it, a computer cannot think for you in that case.
"So a language which would remember it's state at any given moment, no matter if the power gets cut off, and continues where it left off."
"continues where it left off" is often not the correct recovery strategy. No language or environment in the world is going to attempt to guess how to recover from a particular fault automatically. The best it can do is provide you with tools to write your own recovery strategy in a way that doesn't interfere with your business logic, e.g.
Exception handling (to fail fast and still ensure consistency of state)
Transactions (to roll back incomplete changes)
Workflows (to define recovery routines that are called automatically)
Logging (for tracking down the cause of a fault)
AOP/dependency injection (to avoid having to manually insert code to do all the above)
These are very generic tools and are available in lots of languages and environments.
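As a small illustration of keeping the recovery strategy out of the business logic, here is a sketch of a retry-with-logging decorator in Python (the poor man's AOP; all names are illustrative):

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def with_recovery(retries=3, delay_s=1.0):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    logging.exception("attempt %d/%d of %s failed", attempt, retries, fn.__name__)
                    if attempt == retries:
                        raise                 # fail fast once the strategy is exhausted
                    time.sleep(delay_s)
        return wrapper
    return decorate

@with_recovery(retries=3)
def fetch_report(url):
    # Business logic only; no recovery code in here.
    raise ConnectionError(f"could not reach {url}")

# fetch_report("https://example.invalid/report")   # logs three failures, then re-raises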

Building an Aircraft using Agile? [closed]

Developers can learn a lot from other industries. As a thought exercise, is it possible to build a passenger aircraft using agile techniques?
Forgetting cost for now, how feasible is it to use iterative and incremental development for both the hardware (fuselage, wings, etc) as well as software, and still come out with a working and safe product which meets the customer's requirements at the time of delivery?
Does it make sense to refactor a plane?
Agile in software and Agile in manufacturing are really quite different, although they share similar principles and values.
Agile in manufacturing emerged in Japan in the 1950s. Read up on W.E. Deming and the Toyota Production System to find out more. It's all about constantly improving the process whereby a product is reproduced.
Agile in software evolved in the early 1990s as a rapid development model. It's all about constantly improving the product.
You can certainly build a plane using Agile manufacturing methods; I've no doubt that some already are. Anything built in Japan definitely will be, as Agile manufacturing is very well established there (it's taught in primary schools).
You couldn't build a plane using Agile software methods because you can't afford to rapidly change the product - in software changes and mistakes are cheap and reproduction is free. This isn't the case for aviation.
You could design a prototype plane using something like Agile software methods - but it would have to be standardised in order to be reproduced (a design task in itself).
How would you work using Test Driven Development? Would you automatically build and test a plane every iteration? Would you be able to make a ten-minute build? How easy is it to make changes to the airplane? Even if you are really flexible in your design, building some components means they need to be sent to special factories, so there is no immediate feedback.
From the design in CAD software you need to make a mould, take the piece of fiber, put it in the plane, etc. So here a trivial change has a non-trivial cost. In Agile you can make a very little change and have it tested, built and ready to ship in 20 minutes. If small changes are expensive then the short development cycle and refactoring won't be so useful. Your feedback can take longer than a week, so there is a strong reason for thinking in advance, as in the waterfall model. And every attempt has a cost in physical materials unless you are programming. The idea is not new. Carpenters measure twice. Programmers just first code and then test.
In summary: there may be some similarities, but it will definitely not be the same.
I'm going to say "kind of". In fact there's one example right now that I think is pretty close to answering this question.
Boeing is attempting to do this now with the new 787 - see following: Boeing 787 - Specification vs. Collaboration (From the 777 to the 787, the initial specifications document supposedly went from 2500 pages to 20 pages with the change in technique.) Suppliers from around the world are working independently to develop the components for this aircraft. (We'll call this the "teams".)
So, I want to say yes, but at the same time, iterations in creating the aircraft have resulted in delays of 2+ years and stories like this one - (787 Delayed for 5th Time)
Will the airplane ever get built? Yes, of course it will. But when you look at the rubber hitting the road here, it seems like "integration test" is having one heck of a time.
Edit: At the same time, this shift in technique has resulted in building a new breed of aircraft built out of entirely new materials that will arguably be one of the most advanced in the world. This may be a direct result of the more Agile approach. Maybe that's actually the question - not a "can you?" but a "if Agile delays complex systems, does it provide a more innovative product in the payoff?"
Toyota pioneered Lean Production, which Agile methodologies followed on from. For the building of the hardware of the aircraft, lean production would be the way to go, and for the software an agile methodology would be the way to go.
Pick the right tools for the job.
A great book following how TPS was created and works
http://www.amazon.com/Machine-That-Changed-World-Production/dp/0060974176
http://en.wikipedia.org/wiki/Toyota_Production_System
I think in this case you are thinking too big. Agile is about breaking things down into more manageable pieces and then working against that. The whole idea of Agile (XP in particular) is that you do your testing first so that you cut down the number of bugs, and because plane software needs to have very high code coverage for its testing, it fits in quite neatly, I think.
You aren't going to 'refactor' the mechanics of the plane but you will tweak them if they are unsafe, and that's the whole iterative approach for you.
I have heard of Air Traffic Control software written with Agile Methodologies pushing it forward.
This is taken from http://requirements.seilevel.com/blog/2008/06/incose-2008-can-you-build-airplane-with.html
***Actually, that’s not true,***
My first guess - this probably relates to some of the core differences between systems and software engineering. I am going to oversimplify this and just say that it is about scale. Systems projects are just a superset of software and hardware projects, integrating and deploying some combination of these. The teams of people deploying systems projects are quite large. And many of the projects discussed here are for government or regulated systems where specification and traceability is necessary. I could see how subsets of systems projects could in fact be developed using agile (pure software components), but I'm not convinced that an entire end-to-end project can. To put this in context, imagine you are building an airplane - a very commonly referenced type of systems engineering project. Can you see this working using agile?
All skepticism aside, I do think that iterative development most certainly could work well on systems projects, and most people here would not argue with that. In fact, I would love it if we could find examples of agile working on systems projects, because the number one feeling I get at systems engineering conferences is a craving for lighter processes.
I decided to do a little research outside the conference walls and, lo and behold, I found a great article on this exact topic - "Toward Agile Systems Engineering Processes" by Dr. Richard Turner of the Systems and Software Consortium. The article is very well laid out, and I highly recommend reading it. He defines what agile is and what he believes is the reason most systems engineering projects are not agile. For example, he suggests that executives and program managers tend to believe that the teams involved have perfect knowledge about the systems being built, so the work can be planned out in advance and executed against a perfect schedule.
Agile Can Work With Complex Systems
He talks about how the agile concepts can work in systems projects. Here are a few examples summarized from his article:
Learning based: The traditional V-model implies a one-time trip through the lifecycle, and therefore only one opportunity to learn. However, perhaps the model can be re-interpreted to allow multiple iterations through it to fulfill this.
Customer focus: Typically systems engineering processes do not support multiple interactions with the customer throughout the project (just up front to gather requirements). That said, he references a study indicating the known issues with that on systems projects. Therefore, perhaps processes should be adapted to allow for this, particularly allowing for them to help prioritize requirements throughout projects.
Short iterations: Iterations are largely unheard of because the V-model is a one-time pass through the development lifecycle. That said, iterations of prototyping through testing could be done in systems engineering in many cases. The issue is in delivering something complete at the end of each iteration. He suggests that this is not as important to the customer in large deployments as reducing risk, validating requirements, etc. This is a great point at which to remember the airplane example! Could we have even a working part of an airplane after 2 weeks? What about even the software to run a subsystem on the aircraft?
Team ownership: Systems engineering is very process driven, so this one is tricky. Dr. Turner puts forward the idea that letting the systems engineers drive the process, instead of the process driving them, while more uncomfortable for management, might produce more effective results.
There is this story of an aircraft engine plant (September 1999). Their methods seem quite agile:
http://www.fastcompany.com/magazine/28/ge.html
Yes, you could do it. If you followed Agile Software Development techniques too closely, however, it would be astronomically expensive, because of the varying costs of activities.
Consider the relative costs of design and build. If we include coding as part of the software design process, then design is definitely the expensive part and build is ridiculously easy and cheap. Most Agile projects would plan to release every few iterations at least. So we can work in small iterations with a continuous build process. Not so easy when you have to assemble a plane once a fortnight. Worse if you actually plan on "releasing" it. You'd probably need to get the airworthiness & safety people on to an Agile process too.
I'd truly love to see it tried.
Yes, you can use agile techniques for building complex systems, but I don't know if I'd use it for this particular system.
The problem with aircraft is the issue of safety. This means every precaution needs to be taken, from correctly identifying and interpreting the requirements to verifying and validating each and every single line of code.
Additionally, formal methods should probably be used to verify that the programming logic is sound and satisfies its conditions, so the system can be shown to be truly safe.
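Formal verification itself is far beyond a snippet, but even lightweight pre- and post-condition checks move in that direction. A minimal sketch in Python, assuming a hypothetical set_flap_angle function with made-up limits:

    # Lightweight design-by-contract style checks (hypothetical function and limits).
    # Real avionics work would use formal specification and proof tools instead;
    # this only illustrates stating and enforcing conditions explicitly.
    def set_flap_angle(requested_deg):
        # Precondition: the request must lie within the physically valid range.
        assert 0 <= requested_deg <= 40, "flap angle out of range"
        commanded = round(requested_deg)
        # Postcondition: the commanded value must still satisfy the same bounds.
        assert 0 <= commanded <= 40
        return commanded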
I'm fairly certain the answer is irrelevant. Even if you could, you would not be allowed to. There are too many safety requirements. You would not even be allowed to develop the flight software using Agile. For instance, you do not have a Software Requirements Specification (SRS) in Agile. Yet, for any avionics software onboard an airplane that can affect flight safety you will need an SRS.
Of course one can refactor a plane.
When one refactors, one modifies the source code, not the binaries. With a plane it's exactly the same thing: one modifies the blueprints, not the plane itself.

Useful and unuseful real life development techniques

I work at a medium/large company that follows what I think are some good practices for development - maybe not the best ones, but good enough.
We have some development resources that get adopted on the basis of "do it, test it, if it's useful then use it, else throw it away". I've found that most of the so-called best practices are sometimes ideally great but unfeasible, or even harmful, in real life.
For example, we used to have a dotproject website for our team. The idea was to track tasks, update progress and so on. We applied the "do, test, if" approach to it and the result was... we threw it away and just kept the forum, which was extremely useful for communicating between us and keeping track of meeting conclusions and TO DO lists. Tracking each task, on the other hand, proved to be both time consuming and unrealistic.
First of all, nobody was doing it. It didn't take a lot of time, but developers hated it and it made them unhappy, because they had to remember to update every task, and the time estimates for the tasks proved unrealistic most of the time.
So my question is: what development techniques have you tried and found useful/unuseful?
I mean that in the real-life sense - not some theoretical best practice, but hands-on experience. I'm looking to explore new techniques (or tools, or whatever) and I'd like opinions on what to do next. Our current status:
Internal issue tracking system (Useful)
Semiautomated builds (every developer has to maintain the equivalent of a makefile in order for the system to be able to make them).
NO automated testing. Tests are performed by a test team. We have integration tests and system-wide tests.
Two test labs, one for the test team, the other for the developers (in case they need to perform tests that involve more than one machine, or to test outside their development machines).
No unit testing in general. Some libraries have unit tests, but usually each developer tests their units as they see fit.
Full specification using DOORS.
Test protocols. Formal, written in Word.
Source control (ClearCase). Usually everything is done in the main branch; whenever we ship a version it gets labeled, and if needed a branch is made for fixes to that version.
Note: when you can (if you don't mind :P), could you try to justify your proposal? How and why was it useful? How did it improve your work?
One of the most useful things we introduced was a project wiki - an extremely useful dumping ground for all the little titbits of information floating around in people's heads but too trivial to record in a full document.
Having been involved in all sorts of development teams with different methodologies, my experience is that most of the agile principles tend to work great. It generally makes development more fun, since everyone is more engaged. In the larger development environments the basic principle of co-locating all team members brings great benefits, especially when you have separate information analysts and testers next to developers. Have information analysts, testers and developers work together per feature. This creates the best flow of information, with as little overhead as possible. You can take this even further to a Lean development process.
In general, the things that gave us the greatest benefits were the things that lowered the barriers of communication. In a practical sense, a company wiki helped out a great deal as well, lowering the barrier for sharing information. A good bug/features/RFC tracking tool also helped a great deal to have a joint understanding amongst stakeholders of what direction the project is heading. And such a tracking tool does not just have to be internal: lower the barriers to your customers as well. This also helps a great deal in managing expectations.
I feel I'm just getting started here. Others will no doubt come with more suggestions.
Pascal.
Follow this link...
And personally, as a developer I prefer to focus on things that improve my performance. I don't mind visiting some bug reporting site to check whether new bugs have been reported, but I need to be able to view them quickly, without having to go through a few dozen pages or dozens of clicks.
I don't mind writing technical designs before writing code either, as long as I have the tools to write them properly. These tools must be built to increase performance, with a minimum of fluffy features that no one uses. For example, in the past I've used Enterprise Architect to create UML models before writing code. It worked fine, but the application has some flaws and lots of features that I don't need. When I discovered Altova UModel, I quickly switched to a much leaner tool for UML generation that offered me exactly what I need. Nothing more, nothing less.
Basically, you have to keep people focused on the final goal. And the final goal is creating a product to be used by your users. Many development teams get lost because they focus on other things instead. None of your users will care about how you created the thing they use. They just need your product so they can reach their goals.
Thus, the best practice is whichever one makes your team the most comfortable, including any new team member who joins halfway through a project.
Personally, I'm a very big fan of automated testing (both unit tests and integration tests). In my view it's like a safety net - developers feel safer changing code when they know there's a test harness that will catch them if they break something. This lets you introduce new features faster, and also removes the 'fear' of refactoring, for example.
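As a small illustration of that safety-net effect, a test pinned to the current observable behaviour lets the implementation underneath be refactored freely as long as the test stays green. The total_price function below is a made-up example, not from any particular codebase:

    # Hypothetical example: the test pins the behaviour, so total_price can be
    # rewritten (optimised, restructured) and the test immediately flags a regression.
    def total_price(items):
        """Sum quantity * unit_price over a list of (quantity, unit_price) pairs."""
        return sum(qty * price for qty, price in items)

    def test_total_price():
        assert total_price([(2, 1.50), (1, 3.00)]) == 6.00
        assert total_price([]) == 0

    if __name__ == "__main__":
        test_total_price()
        print("total_price behaviour unchanged")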
As management for our team: SCRUM

Resistance to performance testing for a big bang rewrite?

I've been working on a "big bang" rewrite for, literally, over two years. The management has consistently and relentlessly ignored and belittled my calls to allocate time / resources for performance measurement, capacity planning, and optimization before the app replaces their mega-millions money maker flagship web app.
Finally, they have agreed to do it (and we successfully prevented them from big-banging by bringing up a parallel beta server that is in production now and will be the target of the tests). I don't like that they waited until the end to prioritize this, but it's better late than never.
What suggestions does everyone have for dealing with situations like these in the future? What is the best way to educate managers/clients about the need for these kinds of tests?
I've shown them Microsoft's performance guide on CodePlex, complete with its stark warnings from seasoned professionals in the opening pages. I've also shown them the book "Release It!" and the guidance its author gives about "the 3 am call". That has finally convinced them reluctantly, but the truth is that this should have been prioritized into the development and partly measured during development prior to final complete system testing.
Many managers and old-school engineers who wrote ASP only, but never did .NET, are used to coding everything themselves and don't understand all the options for caching, tuning, and health monitoring in newer .NET apps.
Thanks
What you didn't realize (and many engineers don't) is that this was a "sales situation", not an engineering one. It doesn't matter if the customer is in-house or not, the process is largely the same.
Sales is all about finding out what kind of problems drive your customers and then showing how your product solves one or more of their problems. If they don't think they have a performance problem, then they don't -- it's that simple. Although you may be able to educate them to the point where they see things your way, "educational selling" is expensive in time and money, and many customers resent being told "something they already know." It sounds like you had to educate this group by beating them over the head with the book, but there may have been easier ways to accomplish your goal.
What would it have been? I don't know, but they do, so ask them. Ask what it was that ultimately pushed them to making the decision. It might have been a sudden realization that you were right, but more likely it was something more basic, like a growing fear of being humiliated in the boardroom or the marketplace. They are unlikely to say so directly, but if you really listen to their answers you may be able to read between the lines. In sales, doing a postmortem on a sales call (successful or not) is critical to understanding what motivates your customer and how you can tune your own skills in presenting ideas.
And, next time, you will know to ask open-ended questions about what your customer wants to achieve, and what his/her problems are now and in the long run. Will it always work? Of course not, but learning to deal with the social side of engineering issues is a valuable skill to acquire.
Get them to agree on solid numbers for what they expect the system to be able to support (number of concurrent users/tasks/etc.); then it becomes an obvious part of the development work to make certain the system can meet those requirements.
Don't discuss this as an open-ended performance tuning and benchmarking process, as that will make older managers concerned that you're on a fishing expedition or gold-plating the system.
Instead, discuss it as a certification exercise. Identify your current traffic levels, add in a safety margin, and explain that your testing is intended to certify that the system will stand up to real life.
You can still do the performance hotspot work; you just need to give the pointy-haired bosses comfort that all of your work goes toward tangible business objectives.
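As a back-of-the-envelope sketch of how such a certification target might be derived (all of the figures below - peak users, growth rate, safety margin - are made up for illustration):

    # Derive a certification target from observed traffic plus headroom (made-up numbers).
    current_peak_users = 1200      # observed peak concurrent users on the existing app
    annual_growth = 0.20           # expected year-on-year growth
    safety_margin = 2.0            # headroom so a traffic spike does not take the site down

    target_concurrent_users = int(current_peak_users * (1 + annual_growth) * safety_margin)
    print(f"Certify the new system for {target_concurrent_users} concurrent users")
    # -> Certify the new system for 2880 concurrent users

Framing the number this way keeps the conversation about "will the site survive real life plus growth" rather than about open-ended tuning.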
There are all sorts of ways of convincing people - the examples you mention are "invoke higher authority". Most managers, however, would not necessarily be persuaded by technical guidance.
For situations like this, I've used a risk-based approach. For each project, I keep a risk log, identifying the biggest risks to the project, their likelihood, impact, and mitigation options. Often, you can quantify those items - and that allows managers to make a good decision.
At the very start of the re-write, your risk log might have had the following entry:
Risk: System performance fails to meet user expectations
Likelihood: unknown
Impact: end users abandon the website due to excessive load times. Project fails.
Cost of impact: $$$whatever your project cost.
Mitigation: fortnightly performance tests.
Mitigation cost: $$$whatever you think it would cost in time and money
Recommendation: run performance test to quantify the risk.
Most managers would be very uncomfortable with a risk whose likelihood is unknown, but whose cost is the failure of the project. On the other hand, you're not asking for a huge commitment - just enough to quantify the risk.
I like to review the risk log regularly with the project stakeholders - at least monthly. I always start with the "high impact/high likelihood" risks, but then move to the "high impact/unknown likelihood" risks. It's also a good idea to distribute meeting notes, recording the stakeholder decisions on each risk. Again, a manager who sees their name attached to a decision to ignore a high-impact risk, in a written record, will think carefully about the decision.
Once you can quantify the risk - by running some performance tests - you can make further risk-based decisions, based on the cost and likelihood of performance problems. This is also a good way to manage the other classic non-functional issues like security, accessibility and scalability.
By quantifying the issue, you turn it into a business decision, not an engineering decision.
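One way to make that quantification concrete is to rank each risk by its exposure (likelihood times cost of impact) and review the log in that order. The entries below are illustrative only, not figures from the project described here:

    # Rank risks by exposure = likelihood * cost of impact (illustrative entries only).
    risks = [
        {"name": "Performance fails to meet user expectations", "likelihood": 0.30, "impact_cost": 2_000_000},
        {"name": "Third-party API is discontinued",             "likelihood": 0.05, "impact_cost": 500_000},
    ]

    for risk in sorted(risks, key=lambda r: r["likelihood"] * r["impact_cost"], reverse=True):
        exposure = risk["likelihood"] * risk["impact_cost"]
        print(f"{risk['name']}: exposure ${exposure:,.0f}")

Even rough numbers like these give stakeholders something to compare against the cost of the mitigation, which is exactly the business decision being argued for.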
Take careful notes about this development project, including what performance problems crop up after deployment. People will bemoan the problems, and you can tactfully suggest that they prioritize that sort of problem higher earlier. Some people will only accept direct first-person evidence.

Resources