Use SimPy to simulate Chord distributed system - modeling

I am doing some research on several distributed systems such as Chord, and I would like to be able to write algorithms and run simulations of the distributed system with just my desktop.
In the simulation, I need to be able to have each node execute independently and communicate with each other, while manually inducing elements such as lag, packet loss, random crashes etc. And then collect data to estimate the performance of the system.
After some searching, I find SimPy to be a good candidate for my purpose.
Would SimPy be a suitable library for this task?
If yes, what are some suggestions/caveats for implementing such a system?

I would say yes.
I used SimPy (version 2) for simulating arbitrary communication networks as part of my doctorate. You can see the code here:
https://github.com/IncidentNormal/CommNetSim
It is, however, a bit dense and not very well documented. Also it should really be translated to SimPy version 3, as 2 is no longer supported (and 3 fixes a bunch of limitations I found with 2).
Some concepts/ideas I found to be useful:
Work out what you want out of the simulation before you start implementing it; communication network simulations are incredibly sensitive to small design changes, as you are effectively trying to monitor/measure emergent behaviours from the system.
It's easy to start over-engineering the simulation; using native SimPy objects is almost always sufficient when you strip away the noise from your design.
Use Stores to simulate mediums for transferring packets/payloads. There is an example like this for simulating latency in the SimPy docs: https://simpy.readthedocs.io/en/latest/examples/latency.html
Events are tricky: they can only fire once per simulation step, so behaviour is effectively lost if multiple things fire the same event in a step, and this is often a source of bugs. For robustness, try not to use them to represent behaviour in communication networks (you rarely need something that low-level). As mentioned above, use Stores instead, as these act like queues by design.
Pay close attention to the probability distributions you use to generate randomness. Exponential distributions (random.expovariate in Python) are usually closer to natural systems than uniform distributions, but make sure to check every distribution you use for sanity. Network traffic arrivals usually follow a Poisson process, for example, and data volume often follows a power-law (Pareto) distribution.

Related

What is the difference between MultiAgent Systems and Distributed Computing

I'm curious about differences between distributed and multi-agent systems. I have seen many fundamental similarities and my mind is confused.
Similarities:
1- there are multiple processing units
2- both are used for computing and simulation applications
3- processing units interacting
4- processing units work collectively and become a powerful machine
5- units work with their own properties like own specific clock, own specific processor speed, own memory etc..
So what is the difference(s)?
It is a matter of abstraction and purpose. Multi-agent systems employ powerful high-level abstractions, based on complex (i.e. intelligent) components, which are usually not found in regular distributed systems created only to split simple number-crunching algorithms over different machines. Multi-agent systems can be used to solve problems that are difficult or impossible for an individual agent or a monolithic system to solve. Distributed computing can be used to solve problems that are embarrassingly parallel. Sure, there are similarities, but if you look closely at their abstractions, they can contrast profoundly, drawing on different algorithms and data structures.
In my perspective the key is the definition of (intelligent) agent. S. Russell and P. Norvig in their "Artificial Intelligence: A Modern Approach" defined:
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
So a multi-agent system will be formed by a collection of agents that perceive the environment and act upon it but remain to some degree independent and decentralized, with a local view of the environment.
A distributed system is (usually) defined as a collection of nodes performing distributed calculations, linked together to multiply processing power.
In a way a MAS is a distributed system, but has some characteristics that make it unique. It depends on the usage and the particular implementation of the system but in some way those definitions overlap a bit.
The question is a bit old but I will still take a shot at it.
We can start by looking at definitions.
Distributed system [1]:
We define a distributed system as one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages. This simple definition covers the entire range of systems in which networked computers can usefully be deployed.
Multiagent system [2]:
Multiagent systems are those systems that include multiple autonomous entities with either diverging information or diverging interests, or both.
So, fundamentally, "Distributed" is concerned with the architecture of a system while "Multiagent" is concerned with a specific method of problem solving employed in a system.
By virtue of being distributed, a system is made up of several networked computers. A multiagent system, on the other hand, can exist in a networked environment or on a single non-networked computer.
References
[1] G. Coulouris, J. Dollimore, T. Kindberg, G. Blair, Distributed Systems: Concepts and Design (Fifth Edition), 2012, Addison-Wesley.
[2] Y. Shoham, K. Leyton-Brown, Multiagent Systems: Algorithmic Game-Theoretic and Logical Foundations (Revision 1.1), 2010, Cambridge Univ. Press.
When I think about distributed computing, the load is distributed over multiple parts, be they multiple threads or multiple computers. In distributed computing every part works in parallel, and the parts are almost all the same. The final parts that collect and summarize the results of the others may differ from the rest.
A multi-agent system, as its name implies, has multiple agents that work together to accomplish a goal. Unlike distributed computing, a multi-agent system may run on a single computer, but it will certainly have more than one agent. These agents may be, for example, a collector agent, a reporter agent, a computing agent, and so on.

FMU co-simulation using openMP or pThread

Say I have a vehicle model. The chassis will be used as the master FMU; its engine, transmission, tires, etc. are from third parties and I want to use them as slave FMUs. I want to parallelize the model in this way: the master FMU is put on the main thread, and everything else is forked onto other threads.
I want to know if this simple idea is achievable by using FMUs exported from Dymola...
If possible, is it worthwhile doing it? I wonder if the parallel model is as efficient as a sequential one at the physics level. (I understand that a badly parallelized program is slower than a sequential one, but I just need to know if it is physically slower or faster)
The latest Dymola has built-in OpenMP features; has anyone ever used them? What do they look like?
I found a paper about this: Master for Co-Simulation Using FMI http://www.ep.liu.se/ecp/063/014/ecp11063014.pdf
I think it can make perfect sense to launch several FMUs in parallel if they can do their job separately. What is difficult in co-simulation is understanding when the simulators must be synchronized (for instance, to exchange information). These synchronizations should be kept to a minimum to increase efficiency, but there must be enough of them to avoid having to roll back the simulator states (when that is even possible). Also, it has a chance of working when you have causal relations between your FMUs. If you have acausal relations, this is a different story...
Technically, I would say:
for 1), you can always launch an FMU in a thread if you want, no problem with that
for 2), it mainly depends on the number and frequency of the synchronizations required between the different FMUs
for 3), I do not know, but I think you should distinguish between launching different FMUs in parallel and making one FMU parallel...
my two cents
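To illustrate the "step in parallel, then synchronize" idea, here is a minimal sketch of a fixed-step master loop. It does not use the FMI API or anything exported from Dymola: ToySlave is a hypothetical stand-in for a slave FMU (a first-order model stepped with explicit Euler), and the ring coupling between slaves is an assumption made purely for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class ToySlave:
    """Hypothetical stand-in for a slave FMU: x' = -rate * x + u."""

    def __init__(self, x0, rate):
        self.x = x0
        self.rate = rate
        self.u = 0.0  # input, set by the master at each communication point

    def do_step(self, h):
        # One explicit Euler step of size h using the input held from the last sync.
        self.x += h * (-self.rate * self.x + self.u)
        return self.x


def co_simulate(slaves, h, steps):
    """Step all slaves in parallel, then exchange outputs at each communication point."""
    with ThreadPoolExecutor(max_workers=len(slaves)) as pool:
        for _ in range(steps):
            # Parallel phase: each slave only touches its own state.
            outputs = list(pool.map(lambda s: s.do_step(h), slaves))
            # Synchronization point: each slave's input becomes its neighbour's output.
            for i, slave in enumerate(slaves):
                slave.u = outputs[(i + 1) % len(slaves)]
    return [s.x for s in slaves]


if __name__ == "__main__":
    print(co_simulate([ToySlave(1.0, 0.5), ToySlave(0.0, 2.0)], h=0.01, steps=100))
```

The cost of the synchronization point is paid once per communication step, which is why the number and frequency of synchronizations dominates whether the parallel version pays off.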

Are there concurrent designs that the actor model isn't good for?

I've noticed that all designs I have come across can be multi-threaded using the actor model - separating each work module into a different actor and using a message queue (for me a .NET ConcurrentQueue) to pass messages. What other good multi-threaded models exist?
Communicating Sequential Processes is, I think, a far better model for concurrency than the actor model. It addresses a number of problems with the actor model (and other models) such as deadlock, livelock, starvation. Take a look at this and, more practically useful, this.
The main difference is as follows. In the actor model a message is sent asynchronously. However in CSP messages are sent synchronously; the sender cannot send until the receiver is ready to receive.
This one simple restriction makes the world of difference. If you've got an incorrect design with deadlock potential then in the actor model it may or may not occur (and it usually occurs only when demo-ing to the boss...). However in CSP the deadlock will always occur, leaving you in no doubt that your design is incorrect. Ok, so you've still got to fix it but that's OK; fixing problems you know are there is much easier than attempting to exhaustively test for the absence of problems (your only choice in the actor model).
The strictly synchronous approach of CSP seems like it will cause problems with response times; for example one fears that a GUI thread can't move on because it's not been able to send a message to a busy worker thread that's not got as far as its 'read'. What you have to do is to ensure that the workload is spread across enough threads so that they can all get back to waiting for new messages within an acceptable period of time. CSP doesn't let you get away with it. The actor model does, however don't be deceived; you're just building up future problems.
In .NET a ConcurrentQueue is not the right primitive for CSP, not unless you layer a synchronising mechanism on top. I've added strict synchronisation on top of TCP sockets too. In fact I generally end up writing some sort of library that abstracts both sockets and pipes so that it becomes immaterial as to whether a 'Process' (as they're known in CSP parlance) is a thread on this machine or a whole other process on another machine at the end of a network connection. Nice - scalability built in from the very beginning.
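As a rough illustration of "layering a synchronising mechanism on top" of an ordinary queue, here is a minimal Python sketch of a rendezvous channel. RendezvousChannel and demo are names invented for this example; it is one possible way to get CSP-style blocking sends, not the library described above.

```python
import queue
import threading
import time


class RendezvousChannel:
    """send() does not return until a receiver has actually taken the item."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._send_lock = threading.Lock()

    def send(self, item):
        with self._send_lock:      # only one sender may use the slot at a time
            self._slot.put(item)
            self._slot.join()      # block until the receiver calls task_done()

    def recv(self):
        item = self._slot.get()
        self._slot.task_done()     # releases the blocked sender
        return item


def demo():
    channel = RendezvousChannel()
    log = []
    received = []

    def worker():
        time.sleep(0.05)           # the worker is busy for a while
        log.append("worker ready")
        received.append(channel.recv())

    thread = threading.Thread(target=worker)
    thread.start()
    channel.send("job-1")          # blocks here until the worker reads
    log.append("send returned")
    thread.join()
    return log, received


if __name__ == "__main__":
    print(demo())
```

With a plain unbounded queue the sender would return immediately and never notice that the worker was busy; here a slow receiver shows up directly as a slow sender.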
I've been doing it the CSP way for 23 years now, I won't do it any other way. Built some big systems with thousands of threads that way.
==EDIT==
It seems this answer is still attracting some attention, so I thought I'd add to it. For Windows developers there is the Dataflow namespace of the Task Parallel Library. It has to be downloaded separately. Microsoft describe it thusly: "This dataflow model promotes actor-based programming by providing in-process message passing for coarse-grained dataflow and pipelining tasks." Excellent! It uses classes like BufferBlock as communication channels. The important thing is that a BufferBlock has a BoundedCapacity option (set through the DataflowBlockOptions passed to it) that defaults to Unbounded, which fits the actor model. Set this to a value of 1, and you have now transformed it into a CSP-style communication channel.
To add to my last, there are various other multi threading models beyond CSP. This Wikipedia page lists several others like CCS, ACP, and LOTOS. Reading those articles hints at a deep and dark cavern where academics roam, waiting to pounce on a stray software developer.
The problem is that academic obscurity often means a complete lack of tools and libraries at the practical, usable level. It takes a lot of effort to convert a sound, proven academic study into a set of libraries and tools. There's little real incentive for the wider software community to take up a theoretical paper and turn it into a practical reality.
I like CSP because it's actually dead simple to implement your own CSP library based on select() or pselect(). I've done that several times now (I must learn about code re-use), plus the nice people at Kent University put together JCSP for those who like Java. I don't recommend developing in Occam (though it's still just about possible); support and maintainability are going to be issues going forward. CSP is probably the easiest one to get into, and given its good characteristics it's well worthwhile.
#JeremyFriesner
Future Problems
To expand on what I meant by "future problems", I was referring to the fact that in an asynchronous system the sender of messages has no knowledge as to whether the receiver is actually keeping up with the demand. The sender doesn't know because all it knows is that some message buffer has accepted the message. The transport underneath (e.g. tcp) then gets on with the job of pushing the message over as and when the receiver is willing to accept it.
Thus it might be that when under stress the system fails to perform as required, because the message transport will inevitably have a limited capacity to absorb messages that the receiver can't accept yet. The sender only finds this out after the problem has already begun to develop, by which time it might be too late to do anything about it.
Testing of course can reveal this problem, but you have to be careful that the testing really has exhausted the transport's ability to absorb messages. Just a quick blast at full speed might be deceiving.
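A back-of-the-envelope sketch of why a quick blast can be deceiving, assuming a fixed-size transport buffer and constant send and receive rates. The function name and the numbers are made up for the example.

```python
def first_overflow_tick(send_rate, recv_rate, capacity, max_ticks):
    """Return the first tick at which the backlog exceeds capacity, or None."""
    backlog = 0
    for tick in range(1, max_ticks + 1):
        # Messages accepted by the transport but not yet read by the receiver.
        backlog = max(0, backlog + send_rate - recv_rate)
        if backlog > capacity:
            return tick
    return None


if __name__ == "__main__":
    # Sender is only 10% faster than the receiver; the buffer holds 1000 messages.
    print(first_overflow_tick(10, 9, 1000, max_ticks=100))    # short test run
    print(first_overflow_tick(10, 9, 1000, max_ticks=5000))   # long test run
```

The short run never fills the buffer and so reports nothing wrong, while the long run eventually overflows; the sender sees no difference between the two until the overflow happens.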
Of course, a synchronous system imposes an overhead ("are you ready yet?", "no, not yet", "now?", "yes!", "here you are then") which just doesn't happen in an asynchronous system. So on average the asynchronous system will be more efficient, might actually have a higher throughput, etc. Which is why most of the world's systems are actually asynchronous, but also the reason why systems don't always reach the full capacity that the raw network bandwidths / processing times might suggest. When approaching full capacity asynchronous systems tend not to limit gracefully, in my opinion. Token Bus (nb not Token Ring) was a good example of a synchronous network with totally dependable and deterministic throughput but was just a little bit slower than Ethernet and Token Ring...
Having always been blessed with a surfeit of bandwidth in my problems I've chosen the synchronous route for certainty-of-success reasons; I'm not really losing out much on bandwidth, but I am losing tons of risk, which is good.
Convert from Synchronous to Asynchronous
Maybe, but it's possibly of little value. In a synchronous system it only works as per the requirement if you have successfully balanced the division of labour between threads. That is, there are enough threads doing the slow bits so that the fast bits aren't held back. Get that wrong and the system definitely isn't quick enough.
But having done that you have a system where every component is able to send messages onwards with no delay, because everything it is sending to is ready and waiting (because of your skill and judgement at balancing out the workloads). So if you did then convert to an asynchronous message transport all you're doing is saving fractionally small amounts of time in the transport of those messages. You're not making changes that will result in the workloads getting processed quicker. However, if saving bandwidth is the goal then perhaps it's worthwhile.
Of course, doing this balancing can be a difficult thing, and dealing with variabilities like HDD access times, networks, etc can be difficult to overcome. I've often had to implement a 'next available' workload sharing scheme. But certainly in real time signal processing systems like the ones I play with you're basically dealing with a very dependable transport like OpenVPX's RapidIO, you're only doing sums on the data (not dealing with databases, disks, etc), and the data rates are very high (1GByte/sec is perfectly doable these days, and in fact I was handling data rates that high 13 years ago; that was haaard work). Being strictly synchronous means that you're either definitely keeping up with the data rate or definitely not. With asynchronous, it's more of a maybe...
Real Time OS for Everyone!
Having a real time OS is an essential component too, and these days it seems to be the PREEMPT_RT patch set for Linux that does the job for a lot of people in the trade. Red Hat do a prepackaged spin of that (Red Hat MRG), but for a freebie Scientific Linux from the nice people at CERN is good and free! I strongly suspect that a lot of systems would work much more smoothly near their capacity limits if PREEMPT_RT was used - it does a good job of smoothing things out.
Concurrency is a fascinating topic with a lot of approaches to implementation with the fundamental question being - "How do I coordinate parallel computations?".
Some models of concurrency are:
Futures
Futures, also known as Promises or Tasks, are objects that act as proxies for an asynchronously calculated result. When the value is actually needed for a calculation, the requesting thread blocks until the calculation is complete, and thus synchronization is achieved.
Futures are the preferred concurrency model for .NET and ES6.
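For readers who have not used futures, a minimal sketch using Python's standard concurrent.futures module (Python rather than .NET or ES6, purely to keep the example short). slow_square and sum_of_squares are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    time.sleep(0.05)  # stands in for a slow calculation
    return x * x


def sum_of_squares(values):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately with a Future for each pending result.
        futures = [pool.submit(slow_square, v) for v in values]
        # result() blocks until that particular calculation has finished.
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    print(sum_of_squares([1, 2, 3, 4]))
```

The only synchronization the caller sees is the call to result().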
Software Transactional Memory
Software Transactional Memory (STM) synchronizes access to shared memory (much like locks) by grouping actions into transactions. Any single transaction only sees a single view of the shared memory and is atomic. This is conceptually similar to how many databases deal with concurrency.
STM is the preferred concurrency model for Clojure and Haskell.
The Actor Model
The Actor Model focuses on message passing. An actor receives a message and can decide to send a message in response, spawn other actors, make local changes etc. This is, probably, the least tightly coupled model of those discussed, as Actors exchange messages only and nothing else.
The Actor Model is the preferred concurrency model for Erlang and Rust.
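A minimal sketch of the actor idea in Python, assuming one thread and one mailbox queue per actor. CounterActor and count_with_actor are invented for this example; real actor runtimes such as Erlang's add supervision, addressing and distribution on top of this.

```python
import queue
import threading


class CounterActor:
    """Owns its state; other threads can only affect it by sending messages."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message, reply_to=None):
        self._mailbox.put((message, reply_to))

    def stop(self):
        self._mailbox.put((self._STOP, None))
        self._thread.join()

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message is self._STOP:
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)


def count_with_actor(n_senders, per_sender):
    actor = CounterActor()

    def sender():
        for _ in range(per_sender):
            actor.send("increment")

    threads = [threading.Thread(target=sender) for _ in range(n_senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    reply = queue.Queue()
    actor.send("get", reply_to=reply)  # queued behind every increment
    total = reply.get()
    actor.stop()
    return total


if __name__ == "__main__":
    print(count_with_actor(4, 1000))
```

No lock is needed around the counter because only the actor's own thread ever reads or writes it.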
Note that unlike the languages mentioned above, most languages don't have a canonical or preferred concurrency model, and even those languages that show a strong preference for one model usually have the other ones implemented as libraries.
My personal opinion is that Futures outclass STM and Actors in simplicity of use and reasoning, but none of these models is inherently "wrong" and I can think of no disadvantages for any of them. You could use whichever you preferred with no consequences.
The most general model for parallel processing is Petri Nets. It represents computation as a pure data dependency graph, which expresses maximum parallelism. All other models stem from it.
Dataflow Computing model http://www.cs.colostate.edu/cameron/dataflow.html, http://en.wikipedia.org/wiki/Dataflow_programming is almost as powerful. It restricts Petri Net places to have only one output arc. In practice, this is useful, as places with multiple output arcs are hard to implement, cause indeterminism, and are rarely needed.
The Actor model is a dataflow model where nodes may have only 2 input edges - one for input messages and one for the actor's state. This is a serious restriction if you want to program functions with side effects and more than one argument.

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines, etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.
Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
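A small sketch of the edge-length effect, under the simplifying assumption (mine, not necessarily how any real model is decomposed) of a square n x n grid cut into k x k equal square blocks, one block per processor. It computes the fraction of each block's cells that sit on the block boundary and therefore depend on data owned by a neighbour.

```python
def boundary_fraction(n, k):
    """Fraction of a block's cells that lie on its edge, for an n x n grid in k x k blocks."""
    if n % k != 0:
        raise ValueError("this sketch assumes n is divisible by k")
    side = n // k                      # block side length in cells
    if side == 1:
        return 1.0                     # every cell is an edge cell
    edge_cells = 4 * side - 4          # perimeter cells, corners counted once
    return edge_cells / (side * side)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k * k, "processors:", boundary_fraction(1200, k))
```

The fraction grows roughly in proportion to k, so each doubling of k (four times the processors) about doubles the share of every block that needs communication.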
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
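Amdahl's law is simple enough to put into a few lines. If a fraction p of the run time can be parallelised and n processors are used, the speedup is 1 / ((1 - p) + p / n). The value p = 0.95 below is a made-up figure for illustration; the real one has to be measured by profiling.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Maximum theoretical speedup under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    p = 0.95  # assumed fraction of the run time that parallelises
    for n in (2, 6, 12, 24):
        print(n, "processors:", round(amdahl_speedup(p, n), 2))
```

Even with 95% of the work parallelised, the speedup can never exceed 1 / (1 - 0.95) = 20 however many processors are added, and on 24 processors it comes out at roughly 11.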
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawrence Livermore National Laboratory.
Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.
If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably C# has an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
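The shape of that approach, sketched in Python with the standard library rather than in C# or Java. process_chunk is a hypothetical stand-in for the real per-chunk calculation, and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive independent slices of the data set."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_chunk(chunk):
    # Hypothetical work unit: depends only on its own chunk.
    return sum(x * x for x in chunk)


def run_in_pool(items, chunk_size=1000, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk becomes one task; the pool hands tasks to free workers.
        partial_results = pool.map(process_chunk, chunked(items, chunk_size))
        return sum(partial_results)


if __name__ == "__main__":
    print(run_in_pool(list(range(10000))))
```

In CPython, threads will not speed up pure-Python number crunching; a process pool (ProcessPoolExecutor, which has the same interface) would be the usual substitute there, whereas .NET and Java threads do run in parallel.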
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
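A minimal sketch of such a pipeline, with one thread per stage and bounded queues between stages. The names stage and run_pipeline, the queue size and the sentinel object are choices made for this example only.

```python
import queue
import threading

_DONE = object()  # sentinel that tells a stage there is no more input


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            break
        outbox.put(func(item))


def run_pipeline(items, funcs):
    # One bounded queue before each stage, plus one for the final output.
    queues = [queue.Queue(maxsize=8) for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]

    def feed():
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2]))
```

With a single thread per stage the output order matches the input order; fanning a slow stage out over several threads (the "threads 3 through 8" case) gives up that guarantee unless items are re-ordered afterwards.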
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.
For a 6 month project I'd say it definitely pays off to start by reading a good book about the subject. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I had been writing multithreaded programs for 10 years when I discovered this gem and still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.
There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
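The ++i problem is not specific to C++ or C#. As a small illustration in Python, where value += 1 is likewise a read-modify-write sequence, here is a counter that serialises the update with a lock. SafeCounter and count_in_parallel are hypothetical names; in .NET the equivalent tools would be lock or Interlocked.Increment.

```python
import threading


class SafeCounter:
    """Counter whose increment is protected by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read, the add and the write happen as one unit under the lock.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def count_in_parallel(n_threads, per_thread):
    counter = SafeCounter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_in_parallel(8, 10000))
```

Without the lock, two threads can read the same old value and both write back old + 1, silently losing an update.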
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you)
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.
One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.
There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.
The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is to get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications".
Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.
Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first. Threading adds a lot of complexity, so you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although it does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).
I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
Sorry, I just want to add a pessimistic, or rather realistic, answer here.
You are under time pressure. You have a 6-month deadline and you don't even know for sure what language this system is written in, what it does, and how it is organized. If it is not a trivial calculation, then that is a very bad start.
Most importantly: you say you have never done multithreaded programming before. This is where I get four alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn if you want to do it right, and you need to do it right if you want to win a huge speed increase. Debugging is extremely nasty even with good tools like the TotalView debugger or Intel's VTune.
Then you say you want to rewrite the app in another language. Well, this isn't so bad, as you have to rewrite it anyway. The chance of turning a single-threaded program into a well-working multithreaded one without a total redesign is almost zero.
But learning multithreading and a new language (how are your C++ skills?) with a timeline of 3 months (you have to write a throwaway prototype, so I cut the time span in half) is extremely challenging.
My advice here is simple and you will not like it: learn multithreading now, because it is a required skill set for the future, but leave this job to someone who already has experience. Unless, that is, you don't care about the program being successful and are just looking for 6 months of pay.
If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.
You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously
Whatever technology you're going to write this in, take a look at the must-read book on concurrency, "Concurrent Programming in Java". For .NET, I highly recommend the Retlang library for concurrent apps.
I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multiple threaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...

Does a disaster-proof language exist?

When creating system services which must have high reliability, I often end up writing a lot of 'failsafe' mechanisms for things like: communications which are gone (for instance communication with the DB), what would happen if the power is lost and the service restarts, how to pick up the pieces and continue in a correct way (remembering that while picking up the pieces the power could go out again), etc.
I can imagine that for not too complex systems, a language which caters for this would be very practical. So: a language which remembers its state at any given moment, no matter if the power gets cut off, and continues where it left off.
Does this exist yet? If so, where can I find it? If not, why can't this be realized? It would seem to me very handy for critical systems.
p.s. In case the DB connection is lost, it would signal that a problem arose and that manual intervention is needed. The moment the connection is restored, it would continue where it left off.
EDIT:
Since the discussion seems to have died off, let me add a few points (while waiting before I can add a bounty to the question).
The Erlang response seems to be top rated right now. I'm aware of Erlang and have read the pragmatic book by Armstrong (the principal creator). It's all very nice (although functional languages make my head spin with all the recursion), but the 'fault tolerant' bit doesn't come automatically. Far from it. Erlang offers a lot of supervisors and other methodologies to supervise a process and restart it if necessary. However, to properly make something which works with these structures, you need to be quite the Erlang guru, and you need to make your software fit all these frameworks. Also, if the power drops, the programmer still has to pick up the pieces and try to recover the next time the program restarts.
What I'm searching is something far simpler:
Imagine a language (as simple as PHP for instance), where you can do things like do DB queries, act on it, perform file manipulations, perform folder manipulations, etc.
Its main feature, however, should be: if the power dies and the thing restarts, it picks up where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume, and so on.
Last but not least, if the DB connection drops and can't be restored, the language just halts, and signals (syslog perhaps) for human intervention, and then carries on where it left off.
A language like this would make a lot of services programming a lot easier.
EDIT:
It seems (judging by all the comments and answers) that such a system doesn't exist. And probably will not in the near foreseeable future due to it being (near?) impossible to get right.
Too bad. Again, I'm not looking for this language (or framework) to get me to the moon, or to use it to monitor someone's heart rate. But for small periodic services/tasks, which always end up having loads of code handling border cases (power failure somewhere in the middle, connections dropping and not coming back up), a "pause here, fix the issues, and continue where you left off" approach would work well.
(or a checkpoint approach as one of the commenters pointed out (like in a videogame). Set a checkpoint.... and if the program dies, restart here the next time.)
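The checkpoint idea can be approximated in an ordinary language for simple batch-style services. The following Python sketch is only an illustration under stated assumptions: the function names and the {"next_index": ...} state layout are invented for the example, the state must be JSON-serialisable, and each work item must be safe to repeat, because a crash between handling an item and saving the checkpoint will replay that item.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_path, path)  # swap in the new checkpoint in one step


def load_checkpoint(path, default):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process_items(items, path, handle):
    """Resume from the last checkpoint; each item must be safe to repeat."""
    state = load_checkpoint(path, {"next_index": 0})
    for i in range(state["next_index"], len(items)):
        handle(items[i])
        save_checkpoint(path, {"next_index": i + 1})
```

Running process_items again after a crash skips everything that was already checkpointed. What it does not give you is the automatic capture of arbitrary variable state asked for above; the programmer still decides what goes into the checkpoint.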
Bounty awarded:
At the last possible minute, when everyone was coming to the conclusion that it can't be done, Stephen C comes with Napier88, which seems to have the attributes I was looking for.
Although it is an experimental language, it does prove it can be done, and it is something which is worth investigating more.
I'll be looking at creating my own framework (with persistent state and snapshots perhaps) to add the features I'm looking for in .Net or another VM.
Thanks, everyone, for the input and the great insights.
Erlang was designed for use in telecommunication systems, where high reliability is fundamental. I think they have a standard methodology for building sets of communicating processes in which failures can be gracefully handled.
ERLANG is a concurrent functional language, well suited for distributed, highly concurrent and fault-tolerant software. An important part of Erlang is its support for failure recovery. Fault tolerance is provided by organising the processes of an ERLANG application into tree structures. In these structures, parent processes monitor failures of their children and are responsible for their restart.
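The restart behaviour described in the quote above can be mimicked, very loosely, in a few lines of Python. This is not Erlang's or OTP's API; supervise and max_restarts are hypothetical names, and the sketch supervises a plain function call inside one process rather than an isolated Erlang process.

```python
def supervise(child, max_restarts=3):
    """Run child(); restart it on failure, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return child()
        except Exception as exc:
            if restarts >= max_restarts:
                # Give up and let our own parent (if any) deal with it,
                # which is how failures travel up a supervision tree.
                raise RuntimeError("child kept failing; escalating") from exc
            restarts += 1
```

A supervisor can itself be run under another supervise call, giving the tree structure from the quote. Note that nothing here survives the whole process dying; the restart logic lives in the same memory as the thing it restarts.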
Software Transactional Memory (STM) combined with nonvolatile RAM would probably satisfy the OP's revised question.
STM is a technique for implementing "transactions", i.e., sets of actions that are done effectively as an atomic operation, or not at all. Normally the purpose of STM is to enable highly parallel programs to interact over shared resources in a way which is easier to understand than traditional lock-that-resource programming, and has arguably lower overhead by virtue of having a highly optimistic lock-free style of programming.
The fundamental idea is simple: all reads and writes inside a "transaction" block are recorded (somehow!); if any two threads conflict on these sets (read-write or write-write conflicts) at the end of either of their transactions, one is chosen as the winner and proceeds, and the other is forced to roll back its state to the beginning of the transaction and re-execute.
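A toy version of this record, validate and re-execute cycle might look like the Python sketch below. It is an illustration only, not a real STM library: TVar, Transaction, Conflict and atomically are names chosen for the example, commits are serialised by one global lock, and a conflict is detected only on variables the transaction has read.

```python
import threading


class TVar:
    """Hypothetical transactional variable: a value plus a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Conflict(Exception):
    pass


_commit_lock = threading.Lock()


class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version seen at first read
        self.writes = {}  # TVar -> pending new value

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        # Record the version before taking the value, so a commit that
        # sneaks in between is caught by validation later.
        self.reads.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value  # buffered; invisible until commit

    def commit(self):
        with _commit_lock:
            # Validate: nothing we read has been committed to since.
            for tvar, seen in self.reads.items():
                if tvar.version != seen:
                    raise Conflict()
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1


def atomically(body):
    """Run body(tx); re-execute it from the start whenever it conflicts."""
    while True:
        tx = Transaction()
        result = body(tx)
        try:
            tx.commit()
            return result
        except Conflict:
            continue  # "roll back" is just discarding the buffered writes
```

For example, atomically(lambda tx: tx.write(counter, tx.read(counter) + 1)) increments a shared TVar safely from several threads. Since writes are buffered until commit, rolling back costs nothing here; the price is paid in the bookkeeping on every read and write.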
If one insisted that all computations were transactions, and the state at the beginning(/end) of each transaction was stored in nonvolatile RAM (NVRAM), a power fail could be treated as a transaction failure resulting in a "rollback". Computations would proceed only from transacted states in a reliable way. NVRAM these days can be implemented with Flash memory or with battery backup. One might need a LOT of NVRAM, as programs have a lot of state (see minicomputer story at end). Alternatively, committed state changes could be written to log files that were written to disk; this is the standard method used by most databases and by reliable filesystems.
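The log-file alternative mentioned at the end of the paragraph above can be sketched in Python as follows. This is a simplified illustration, not how any particular database does it: LoggedStore and its JSON-lines log format are assumptions made for the example.

```python
import json
import os


class LoggedStore:
    """Hypothetical key-value store that logs each change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()  # rebuild in-memory state from the log on startup
        self._log = open(log_path, "a")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                try:
                    key, value = json.loads(line)
                except ValueError:
                    break  # torn final record from a crash mid-write
                self.data[key] = value

    def set(self, key, value):
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # the change is durable once this returns
        self.data[key] = value

    def close(self):
        self._log.close()
```

After a power failure, constructing LoggedStore on the same path replays every committed change. A more careful implementation would also truncate a torn tail before appending and add checksums to each record; the fsync on every set is also exactly the performance cost discussed elsewhere in this thread.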
The current question with STM is, how expensive is it to keep track of the potential transaction conflicts? If implementing STM slows the machine down by an appreciable amount, people will live with existing slightly unreliable schemes rather than give up that performance. So far the story isn't good, but then the research is early.
People haven't generally designed languages for STM; for research purposes, they've mostly enhanced Java with STM (see Communications of ACM article in June? of this year). I hear MS has an experimental version of C#. Intel has an experimental version for C and C++. The Wikipedia page has a long list. And the functional programming guys are, as usual, claiming that the side-effect free property of functional programs makes STM relatively trivial to implement in functional languages.
If I recall correctly, back in the 70s there was considerable early work in distributed operating systems, in which processes (code+state) could travel trivially from machine to machine. I believe several such systems explicitly allowed node failure, and could restart a process in a failed node from saved state in another node. Early key work was on the Distributed Computing System by Dave Farber. Because designing languages back in the 70s was popular, I recall DCS had its own programming language, but I don't remember the name. If DCS didn't allow node failure and restart, I'm fairly sure the follow-on research systems did.
EDIT: A 1996 system which appears at first glance to have the properties you desire is documented here.
Its concept of atomic transactions is consistent with the ideas behind STM.
(Goes to prove there isn't a lot new under the sun).
A side note: Back in the 70s, core memory was still king. Core, being magnetic, was nonvolatile across power fails, and many minicomputers (and I'm sure the mainframes) had power fail interrupts that notified the software some milliseconds ahead of loss of power. Using that, one could easily store the register state of the machine and shut it down completely. When power was restored, control would return to a state-restoring point, and the software could proceed. Many programs could thus survive power blinks and reliably restart. I personally built a time-sharing system on a Data General Nova minicomputer; you could actually have it running 16 teletypes full blast, take a power hit, and come back up and restart all the teletypes as if nothing happened. The change from cacophony to silence and back was stunning. I know, because I had to repeat it many times to debug the power-failure management code, and it of course made a great demo (yank the plug, deathly silence, plug back in...). The name of the language that did this was, of course, Assembler :-}
From what I know¹, Ada is often used in safety critical (failsafe) systems.
Ada was originally targeted at embedded and real-time systems. Notable features of Ada include: strong typing, modularity mechanisms (packages), run-time checking, parallel processing (tasks), exception handling, and generics. Ada 95 added support for object-oriented programming, including dynamic dispatch.
Ada supports run-time checks in order to protect against access to unallocated memory, buffer overflow errors, off-by-one errors, array access errors, and other detectable bugs. These checks can be disabled in the interest of runtime efficiency, but can often be compiled efficiently. It also includes facilities to help program verification.
For these reasons, Ada is widely used in critical systems, where any anomaly might lead to very serious consequences, i.e., accidental death or injury. Examples of systems where Ada is used include avionics, weapon systems (including thermonuclear weapons), and spacecraft.
N-Version programming may also give you some helpful background reading.
¹That's basically one acquaintance who writes embedded safety critical software
I doubt that the language features you are describing are possible to achieve.
And the reason for that is that it would be very hard to define common and general failure modes and how to recover from them. Think for a second about your sample application: some website with some logic and database access. And let's say we have a language that can detect power shutdown and subsequent restart, and somehow recover from it. The problem is that it is impossible for the language to know how to recover.
Let's say your app is an online blog application. In that case it might be enough to just continue from the point where we failed and all would be OK. However, consider a similar scenario for an online bank. Suddenly it's no longer smart to just continue from the same point. For example, if I was trying to withdraw some money from my account, and the computer died right after the checks but before it performed the withdrawal, and it then comes back one week later, it will give me the money even though my account is in the negative by now.
In other words, there is no single correct recovery strategy, so this is not something that can be built into the language. What a language can do is tell you when something bad happens, but most languages already support that with exception handling mechanisms. The rest is up to application designers to think about.
There are a lot of technologies that allow designing fault tolerant applications. Database transactions, durable message queues, clustering, hardware hot swapping and so on and on. But it all depends on concrete requirements and how much the end user is willing to pay for it all.
There is an experimental language called Napier88 that (in theory) has some attributes of being disaster-proof. The language supports Orthogonal Persistence, and in some implementations this extends (extended) to include the state of the entire computation. Specifically, when the Napier88 runtime system check-pointed a running application to the persistent store, the current thread state would be included in the checkpoint. If the application then crashed and you restarted it in the right way, you could resume the computation from the checkpoint.
Unfortunately, there are a number of hard issues that need to be addressed before this kind of technology is ready for mainstream use. These include figuring out how to support multi-threading in the context of orthogonal persistence, figuring out how to allow multiple processes to share a persistent store, and scalable garbage collection of persistent stores.
And there is the problem of doing Orthogonal Persistence in a mainstream language. There have been attempts to do OP in Java, including one that was done by people associated with Sun (the Pjama project), but there is nothing active at the moment. The JDO / Hibernate approaches are more favoured these days.
I should point out that Orthogonal Persistence isn't really disaster-proof in the large sense. For instance, it cannot deal with:
reestablishment of connections, etc with "outside" systems after a restart,
application bugs that cause corruption of persisted data, or
loss of data due to something bringing down the system between checkpoints.
For those, I don't believe there are general solutions that would be practical.
The majority of such efforts - termed 'fault tolerance' - are around the hardware, not the software.
The extreme example of this is Tandem, whose 'nonstop' machines have complete redundancy.
Implementing fault tolerance at a hardware level is attractive because a software stack is typically made from components sourced from different providers: your high availability software application might be installed alongside some decidedly shaky other applications and services, on top of an operating system that is flaky and using hardware device drivers that are decidedly fragile.
But at a language level, almost all languages offer the facilities for proper error checking. However, even with RAII, exceptions, constraints and transactions, these code paths are rarely tested correctly and rarely tested together in multiple-failure scenarios, and it's usually in the error handling code that the bugs hide. So it's more about programmer understanding, discipline and trade-offs than about the languages themselves.
Which brings us back to the fault tolerance at the hardware level. If you can avoid your database link failing, you can avoid exercising the dodgy error handling code in the applications.
No, a disaster-proof language does not exist.
Edit:
Disaster-proof implies perfection. It brings to mind images of a process which applies some intelligence to resolve unknown, unspecified and unexpected conditions in a logical manner. There is no manner by which a programming language can do this. If you, as the programmer, can not figure out how your program is going to fail and how to recover from it then your program isn't going to be able to do so either.
Disaster from an IT perspective can arise in so many fashions that no one process can resolve all of those different issues. The idea that you could design a language to address all of the ways in which something could go wrong is just incorrect. Due to the abstraction from the hardware many problems don't even make much sense to address with a programming language; yet they are still 'disasters'.
Of course, once you start limiting the scope of the problem, we can begin talking about developing a solution to it. So, when we stop talking about being disaster-proof and start speaking about recovering from unexpected power surges, it becomes much easier to develop a programming language to address that concern, even if it doesn't make much sense to handle that issue at such a high level of the stack. However, I will venture a prediction that once you scope this down to realistic implementations, it becomes uninteresting as a language because it has become so specific. For example: "Use my scripting language to run batch processes overnight that will recover from unexpected power surges and lost network connections (with some human assistance)" is not a compelling business case to my mind.
Please don't misunderstand me. There are some excellent suggestions within this thread but to my mind they do not rise to anything even remotely approaching disaster-proof.
Consider a system built from non-volatile memory. The program state is persisted at all times, and should the processor stop for any length of time, it will resume at the point it left when it restarts. Therefore, your program is 'disaster proof' to the extent that it can survive a power failure.
This is entirely possible, as other posts have outlined when talking about Software Transactional Memory, and 'fault tolerance' etc. Curious nobody mentioned 'memristors', as they would offer a future architecture with these properties and one that is perhaps not completely von Neumann architecture too.
Now imagine a system built from two such discrete systems - for a straightforward illustration, one is a database server and the other an application server for an online banking website.
Should one pause, what does the other do? How does it handle the sudden unavailability of its co-worker?
It could be handled at the language level, but that would mean lots of error handling and such, and that's tricky code to get right. That's pretty much no better than where we are today, where machines are not check-pointed but the languages try and detect problems and ask the programmer to deal with them.
It could pause too - at the hardware level they could be tied together, such that from a power perspective they are one system. But that's hardly a good idea; better availability would come from a fault-tolerant architecture with backup systems and such.
Or we could use persistent message queues between the two machines. However, at some point these messages get processed, and they could at that point be too old! Only application logic can really work out what to do in those circumstances, and there we are back to languages delegating to the programmer again.
So it seems that the disaster-proofing is better in the current form - uninterrupted power supplies, hot backup servers ready to go, multiple network routes between hosts, etc. And then we only have to hope that our software is bug-free!
Precise answer:
Ada and SPARK were designed for maximum fault-tolerance and to move as many bugs as possible to compile time rather than run time. Ada was designed for the US Department of Defense for military and aviation systems, running on embedded devices in such things as airplanes. SPARK is its descendant. There's another language, used in the early US space program, HAL/S, which is geared to handling hardware failure and memory corruption due to cosmic rays.
Practical answer:
I've never met anyone who can code Ada/Spark. For most users the best answer is SQL variants on a DBMS with automatic failover and clustering of servers. Integrity checks guarantee safety. Something like T-SQL or PL/SQL has full transactional security, is Turing-complete, and is pretty tolerant of problems.
Reason there isn't a better answer:
For performance reasons, you can't provide durability for every program operation. If you did, the processing would slow to the speed of your fastest nonvolatile storage. At best, your performance will drop by a thousand or million fold, because of how much slower ANYTHING is than CPU caches or RAM.
It would be the equivalent of going from a Core 2 Duo CPU to the ancient 8086 CPU -- at most you could do a couple hundred operations per second. Except, this would be even SLOWER.
In cases where frequent power cycling or hardware failures exist, you use something like a DBMS, which guarantees ACID for every important operation. Or, you use hardware that has fast, nonvolatile storage (flash, for example) -- this is still much slower, but if the processing is simple, this is OK.
At best your language gives you good compile-time safety checks for bugs, and will throw exceptions rather than crashing. Exception handling is a feature of half the languages in use now.
There are several commercially available frameworks (Veritas, Sun's HA, IBM's HACMP, etc.) which will automatically monitor processes and start them on another server in the event of failure.
There is also expensive hardware like HP's Tandem NonStop range which can survive internal hardware failures.
However, software is built by people, and people love to get it wrong. Consider the cautionary tale of the IEFBR14 program shipped with IBM's MVS. It is basically a NOP dummy program which allows the declarative bits of JCL to happen without really running a program. This is the entire original source code:
IEFBR14 START
BR 14 Return addr in R14 -- branch at it
END
Nothing could be simpler? During its long life this program has actually accumulated a bug report and is now on version 4.
That's 1 bug to three lines of code; the current version is four times the size of the original.
Errors will always creep in, just make sure you can recover from them.
This question forced me to post this text.
(It's quoted from HGTTG by Douglas Adams:)
Click, hum.
The huge grey Grebulon reconnaissance ship moved silently through the black void. It was travelling at fabulous, breathtaking speed, yet appeared, against the glimmering background of a billion distant stars to be moving not at all. It was just one dark speck frozen against an infinite granularity of brilliant night.
On board the ship, everything was as it had been for millennia, deeply dark and Silent.
Click, hum.
At least, almost everything.
Click, click, hum.
Click, hum, click, hum, click, hum.
Click, click, click, click, click, hum.
Hmmm.
A low level supervising program woke up a slightly higher level supervising program deep in the ship's semi-somnolent cyberbrain and reported to it that whenever it went click all it got was a hum.
The higher level supervising program asked it what it was supposed to get, and the low level supervising program said that it couldn't remember exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.
The higher level supervising program considered this and didn't like it. It asked the low level supervising program what exactly it was supervising and the low level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher level supervising program to the problem.
The higher level supervising program went to consult one of its own look-up tables to find out what the low level supervising program was meant to be supervising.
It couldn't find the look-up table.
Odd.
It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.
The sector function supervisor hit immediate problems. It called its supervising agent which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.
Small modules of software — agents — surged through the logical pathways, grouping, consulting, re-grouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.
This made the whole problem very simple to deal with. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.
Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.
This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.
This provided the first major clue as to what it was that was wrong.
Further investigation quickly established what it was that had happened. A meteorite had knocked a large hole in the ship. The ship had not previously detected this because the meteorite had neatly knocked out that part of the ship's processing equipment which was supposed to detect if the ship had been hit by a meteorite.
The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship's sensors couldn't see that there was a hole, and the supervisors which should have said that the sensors weren't working properly weren't working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain, which would have enabled it to see the hole, with them.
The ship tried to think intelligently about this, failed, and then blanked out completely for a bit. It didn't realise it had blanked out, of course, because it had blanked out. It was merely surprised to see the stars jump. After the third time the stars jumped the ship finally realised that it must be blanking out, and that it was time to take some serious decisions.
It relaxed.
Then it realised it hadn't actually taken the serious decisions yet and panicked. It blanked out again for a bit. When it awoke again it sealed all the bulkheads around where it knew the unseen hole must be.
It clearly hadn't got to its destination yet, it thought, fitfully, but since it no longer had the faintest idea where its destination was or how to reach it, there seemed to be little point in continuing. It consulted what tiny scraps of instructions it could reconstruct from the tatters of its central mission module.
"Your !!!!! !!!!! !!!!! year mission is to !!!!! !!!!! !!!!! !!!!!, !!!!! !!!!! !!!!! !!!!!, land !!!!! !!!!! !!!!! a safe distance !!!!! !!!!! ..... ..... ..... .... , land ..... ..... ..... monitor it. !!!!! !!!!! !!!!!..."
All of the rest was complete garbage.
Before it blanked out for good the ship would have to pass on those instructions, such as they were, to its more primitive subsidiary systems.
It must also revive all of its crew.
There was another problem. While the crew was in hibernation, the minds of all of its members, their memories, their identities and their understanding of what they had come to do, had all been transferred into the ship's central mission module for safe keeping. The crew would not have the faintest idea of who they were or what they were doing there. Oh well.
Just before it blanked out for the final time, the ship realised that its engines were beginning to give out too.
The ship and its revived and confused crew coasted on under the control of its subsidiary automatic systems, which simply looked to land wherever they could find to land and monitor whatever they could find to monitor.
Try taking an existing open source interpreted language and see if you could adapt its implementation to include some of these features. Python's default C implementation embeds an internal lock (called the GIL, Global Interpreter Lock) that is used to "handle" concurrency among Python threads by taking turns every 'n' VM instructions. Perhaps you could hook into this same mechanism to checkpoint the code state.
For a program to continue where it left off if the machine loses power, not only would it need to save state to somewhere, the OS would also have to "know" to resume it.
I suppose implementing a "hibernate" feature in a language could be done, but having that happen constantly in the background so it's ready in the event anything bad happens sounds like the OS' job, in my opinion.
Its main feature, however, should be: if the power dies and the thing restarts, it picks up where it left off (so it remembers not only where it was, but the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume that too, etc.
... ...
I've looked at Erlang in the past. However nice its fault-tolerant features are, it doesn't survive a power cut. When the code restarts, you'll have to pick up the pieces.
If such a technology existed, I'd be VERY interested in reading about it. That said, the Erlang solution would be to have multiple nodes, ideally in different locations, so that if one location went down, the other nodes could pick up the slack. If all of your nodes were in the same location and on the same power source (not a very good idea for distributed systems), then you'd be out of luck, as you mentioned in a comment follow-up.
The Microsoft Robotics Group has introduced a set of libraries that appear to be applicable to your question.
What is Concurrency and Coordination Runtime (CCR)?
Concurrency and Coordination Runtime (CCR) provides a highly concurrent programming model based on message-passing with powerful orchestration primitives enabling coordination of data and work without the use of manual threading, locks, semaphores, etc. CCR addresses the need of multi-core and concurrent applications by providing a programming model that facilitates managing asynchronous operations, dealing with concurrency, exploiting parallel hardware and handling partial failure.
What is Decentralized Software Services (DSS)?
Decentralized Software Services (DSS) provides a lightweight, state-oriented service model that combines representational state transfer (REST) with a formalized composition and event notification architecture enabling a system-level approach to building applications. In DSS, services are exposed as resources which are accessible both programmatically and for UI manipulation. By integrating service composition, structured state manipulation, and event notification with data isolation, DSS provides a uniform model for writing highly observable, loosely coupled applications running on a single node or across the network.
Most of the answers given are general purpose languages. You may want to look into more specialized languages that are used in embedded devices. The robot is a good example to think about. What would you want and/or expect a robot to do when it recovered from a power failure?
In the embedded world, this can be implemented through a watchdog interrupt and a battery-backed RAM. I've written such myself.
Depending upon your definition of a disaster, it can range from 'difficult' to 'practically impossible' to delegate this responsibility to the language.
Other examples given include persisting the current state of the application to NVRAM after each statement is executed. This only works so long as the computer doesn't get destroyed.
How would a language level feature know to restart the application on a new host?
And in the situation of restoring the application to a host - what if significant time had passed and assumptions/checks made previously were now invalid?
T-SQL, PL/SQL and other transactional languages are probably as close as you'll get to 'disaster proof': they either succeed (and the data is saved), or they don't. Unless you disable transaction isolation, it's difficult (but probably not impossible if you really try hard) to get into 'unknown' states.
You can use techniques like SQL Mirroring to ensure that writes are saved in at least two locations concurrently before a transaction is committed.
You still need to ensure you save your state every time it's safe (commit).
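To make the "succeed or don't" behaviour concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than T-SQL or PL/SQL. The `accounts` table and the `make_db`, `transfer` and `balances` helpers are made up for illustration, and the "crash" is just an exception raised between two updates.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                # Stand-in for a failure between the two writes.
                raise RuntimeError("simulated crash between the two updates")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If `transfer` fails midway, the first update is rolled back, so the data is never left half-changed. The application state is only as fresh as the last commit.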
If I understand your question correctly, I think that you are asking whether it's possible to guarantee that a particular algorithm (that is, a program plus any recovery options provided by the environment) will complete (after any arbitrary number of recoveries/restarts).
If this is correct, then I would refer you to the halting problem:
Given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.
I think that classifying your question as an instance of the halting problem is fair considering that you would ideally like the language to be "disaster proof" -- that is, imparting a "perfectness" to any flawed program or chaotic environment.
This classification reduces any combination of environment, language, and program down to "program and a finite input".
If you agree with me, then you'll be disappointed to read that the halting problem is undecidable. Therefore, no "disaster proof" language or compiler or environment could be proven to be so.
However, it is entirely reasonable to design a language that provides recovery options for various common problems.
In the case of power failure, this sounds to me like: "When your only tool is a hammer, every problem looks like a nail."
You don't solve power failure problems within a program. You solve this problem with backup power supplies, batteries, etc.
If the mode of failure is limited to hardware failure, VMware Fault Tolerance claims to do something similar to what you want. It runs a pair of virtual machines across multiple clusters and, using what they call vLockstep, the primary VM sends all state to the secondary VM in real time, so if the primary fails, execution transparently flips to the secondary.
My guess is that this wouldn't help with communication failures, which are more common than hardware failures. For serious high availability, you should consider distributed systems like Birman's process group approach (paper in pdf format, or book Reliable Distributed Systems: Technologies, Web Services, and Applications).
The closest approximation appears to be SQL. It's not really a language issue though; it's mostly a VM issue. I could imagine a Java VM with these properties; implementing it would be another matter.
A quick&dirty approximation is achieved by application checkpointing. You lose the "die at any moment" property, but it's pretty close.
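As a rough illustration of application checkpointing, here is a minimal Python sketch. The `save_checkpoint`, `load_checkpoint` and `run_sum` names are hypothetical, the state is a plain dict pickled to a file, and the "power cut" is simulated with an exception. It assumes the checkpoint file sits on a filesystem where `os.replace` is an atomic rename.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then swap it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    os.replace(tmp, path)     # readers see either the old or the new file

def load_checkpoint(path, default):
    """Return the last saved state, or `default` if none exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def run_sum(path, n, crash_at=None):
    """Sum 0..n-1, checkpointing after every step; optionally 'crash'."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated power cut")
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum` again with the same path after a crash resumes from the last saved step instead of starting over. Anything done since the last checkpoint is lost, which is the "die at any moment" property you give up.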
I think it's a fundamental mistake for recovery not to be a salient design issue. Punting responsibility exclusively to the environment leads to a generally brittle solution that is intolerant of internal faults.
If it were me, I would invest in reliable hardware AND design the software so that it was able to recover automatically from any possible condition. Per your example, database session maintenance should be handled automatically by a sufficiently high-level API. If you have to reconnect manually, you are likely using the wrong API.
As others have pointed out, procedural languages embedded in modern RDBMS systems are the best you are going to get without using an exotic language.
VMs in general are designed for this sort of thing. You could use a VM vendor's (VMware et al.) API to control periodic checkpointing within your application as appropriate.
VMware in particular has a replay feature (Enhanced Execution Record) which records EVERYTHING and allows point-in-time playback. Obviously there is a massive performance hit with this approach, but it would meet the requirements. I would just make sure your disk drives have a battery-backed write cache.
You would most likely be able to find similar solutions for Java bytecode run inside a Java virtual machine. Google "fault tolerant JVM" and "virtual machine checkpointing".
If you do want the program information saved, where would you save it?
It would need to be saved e.g. to disk. But this wouldn't help you if the disk failed, so already it's not disaster-proof.
You are only going to get a certain level of granularity in your saved state. If you want something like this, then probably the best approach is to define your granularity level in terms of what constitutes an atomic operation, and save state to the database before each atomic operation. Then you can restore to the point of the last atomic operation.
I don't know of any language that would do this automatically, since the cost of saving state to secondary storage is extremely high. Therefore, there is a tradeoff between level of granularity and efficiency, which would be hard to define in an arbitrary application.
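The tradeoff can be put in numbers with a small back-of-the-envelope sketch. The `simulate` function below is purely illustrative: it assumes every operation costs the same, a checkpoint is taken after every `every` completed operations, and a single crash happens after `crash_at` operations have completed.

```python
def simulate(n_ops, every, crash_at):
    """Return (checkpoint_writes, ops_redone) for one run with one crash.

    checkpoint_writes: saves needed over a full, crash-free run of n_ops.
    ops_redone: completed operations lost at the crash and repeated after restart.
    """
    writes = n_ops // every
    last_saved = (crash_at // every) * every  # ops covered by the last checkpoint
    redone = crash_at - last_saved
    return writes, redone

# Saving after every operation: many writes, nothing to redo.
print(simulate(1000, 1, 537))    # (1000, 0)
# Saving every 100 operations: few writes, some work repeated.
print(simulate(1000, 100, 537))  # (10, 37)
```

Finer granularity means more writes to secondary storage; coarser granularity means more work repeated after a failure.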
First, implement a fault-tolerant application: one where, if you have 8 features and 5 failure modes, you have done the analysis and testing to demonstrate that all 40 combinations work as intended (and as desired by the specific customer: no two will likely agree).
Second, add a scripting language on top of the supported set of fault-tolerant features. It needs to be as near to stateless as possible, so almost certainly something non-Turing-complete.
Finally, work out how to handle restoration and repair of scripting language state, adapted to each failure mode.
And yes, this is pretty much rocket science.
Windows Workflow Foundation may solve your problem. It's .Net based and is designed graphically as a workflow with states and actions.
It allows for persistence to the database (either automatically or when prompted). You could do this between states/actions. This serialises the entire instance of your workflow into the database. It will be rehydrated, and execution will continue, when any of a number of conditions is met (a certain time, rehydrated programmatically, an event fires, etc.).
When a WWF host starts, it checks the persistence DB and rehydrates any workflows stored there. It then continues to execute from the point of persistence.
Even if you don't want to use the workflow aspects, you can probably still just use the persistence service.
As long as your steps were atomic this should be sufficient - especially since I'm guessing you have a UPS so could monitor for UPS events and force persistence if a power issue is detected.
If I were going about solving your problem, I would write a daemon (probably in C) that did all database interaction in transactions so you won't get any bad data inserted if it gets interrupted. Then have the system start this daemon at startup.
Obviously developing web stuff in C is quite a bit slower than doing it in a scripting language, but it will perform better and be more stable (if you write good code, of course :).
Realistically, I'd write it in Ruby (or PHP or whatever) and have something like Delayed Job (or cron or whatever scheduler) run it every so often, because I wouldn't need stuff updating every clock cycle.
Hope that makes sense.
To my mind, failure recovery is, most of the time, a business problem, not a hardware or language problem.
Take an example: you have one UI tier and one subsystem.
The subsystem is not very reliable, but the client on the UI tier should perceive it as if it were.
Now, imagine that your subsystem somehow crashes. Do you really think that the language you imagine can work out for you how to handle the UI tier that depends on this subsystem?
Your user should be explicitly aware that the subsystem is not reliable. If you use messaging to provide high reliability, the client MUST know that (if he isn't aware, the UI can just freeze waiting for a response which may eventually come 2 weeks later). If he should be aware of this, then any abstraction that hides it will eventually leak.
By client, I mean end user. The UI should reflect this unreliability and not hide it; a computer cannot think for you in that case.
"So a language which would remember it's state at any given moment, no matter if the power gets cut off, and continues where it left off."
"continues where it left off" is often not the correct recovery strategy. No language or environment in the world is going to attempt to guess how to recover from a particular fault automatically. The best it can do is provide you with tools to write your own recovery strategy in a way that doesn't interfere with your business logic, e.g.
Exception handling (to fail fast and still ensure consistency of state)
Transactions (to roll back incomplete changes)
Workflows (to define recovery routines that are called automatically)
Logging (for tracking down the cause of a fault)
AOP/dependency injection (to avoid having to manually insert code to do all the above)
These are very generic tools and are available in lots of languages and environments.

Resources