Mixing Erlang and Haskell

Mixing Erlang and Haskell - haskell

If you've bought into the functional programming paradigm, the chances are that you like both Erlang and Haskell. Both have purely functional cores and other goodness such as lightweight threads that make them a good fit for a multicore world. But there are some differences too.
Erlang is a commercially proven fault-tolerant language with a mature distribution model. It has a seemingly unique feature in its ability to upgrade its version at runtime via hot code loading. (Way cool!)
Haskell, on the otherhand, has the most sophisticated type system of any mainstream language. (Where I define 'mainstream' to be any language that has a published O'Reilly book so Haskell counts.) Its straightline single threaded performance looks superior to Erlang's and its lightweight threads look even lighter too.
I am trying to put together a development platform for the rest of my coding life and was wondering whether it was possible to mix Erlang and Haskell to achieve a best of breed platform. This question has two parts:
I'd like to use Erlang as a kind of fault tolerant MPI to glue GHC runtime instances together. There would be one Erlang process per GHC runtime. If "the impossible happened" and the GHC runtime died, then the Erlang process would detect that somehow and die too. Erlang's hot code loading and distribution features would just continue to work. The GHC runtime could be configured to use just one core, or all cores on the local machine, or any combination in between. Once the Erlang library was written, the rest of the Erlang level code should be purely boilerplate and automatically generated on a per application basis. (Perhaps by a Haskell DSL for example.) How does one achieve at least some of these things?
I'd like Erlang and Haskell to be able to share the same garabage collector. (This is a much further out idea than 1.) Languages that run on the JVM and the CLR achieve greater mass by sharing a runtime. I understand there are technical limitations to running Erlang (hot code loading) and Haskell (higher kinded polymorphism) on either the JVM or the CLR. But what about unbundling just the garbage collector? (Sort of the start of a runtime for functional languages.) Allocation would obviously still have to be really fast, so maybe that bit needs to be statically linked in. And there should be some mechansim to distinguish the mutable heap from the immutable heap (incuding lazy write once memory) as GHC needs this. Would it be feasible to modify both HIPE and GHC so that the garbage collectors could share a heap?
Please answer with any experiences (positive or negative), ideas or suggestions. In fact, any feedback (short of straight abuse!) is welcome.
Update
Thanks for all 4 replies to date - each taught me at least one useful thing that I did not know.
Regarding the rest of coding life thing - I included it slightly tongue in cheek to spark debate, but it is actually true. There is a project that I have in mind that I intend to work on until I die, and it needs a stable platform.
In the platform I have proposed above, I would only write Haskell, as the boilerplate Erlang would be automatically generated. So how long will Haskell last? Well Lisp is still with us and doesn't look like it is going away anytime soon. Haskell is BSD3 open source and has achieved critical mass. If programming itself is still around in 50 years time, I would expect Haskell, or some continuous evolution of Haskell, will still be here.
Update 2 in response to rvirding's post
Agreed - implementing a complete "Erskell/Haslang" universal virtual machine might not be absolutely impossible, but it would certainly be very difficult indeed. Sharing just the garbage collector level as something like a VM, while still difficult, sounds an order of magnitude less difficult to me though. At the garbage collection model, functional languages must have a lot in common - the unbiquity of immutable data (including thunks) and the requirement for very fast allocation. So the fact that commonality is bundled tightly with monolithic VMs seems kind of odd.
VMs do help achieve critical mass. Just look at how 'lite' functional languages like F# and Scala have taken off. Scala may not have the absolute fault tolerance of Erlang, but it offers an escape route for the very many folks who are tied to the JVM.
While having a single heap makes
message passing very fast it
introduces a number of other problems,
mainly that doing GC becomes more
difficult as it has to be interactive
and globally non-interruptive so you
can't use the same simpler algorithms
as the per-process heap model.
Absolutely, that makes perfect sense to me. The very smart people on the GHC development team appear to be trying to solve part of the problem with a parallel "stop the world" GC.
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf
(Obviously "stop the world" would not fly for general Erlang given its main use case.) But even in the use cases where "stop the world" is OK, their speedups do not appear to be universal. So I agree with you, it is unlikely that there is a universally best GC, which is the reason I specified in part 1. of my question that
The GHC runtime could be configured to
use just one core, or all cores on the
local machine, or any combination in
between.
In that way, for a given use case, I could, after benchmarking, choose to go the Erlang way, and run one GHC runtime (with a singlethreaded GC) plus one Erlang process per core and let Erlang copy memory between cores for good locality.
Alternatively, on a dual processor machine with 4 cores per processor with good memory bandwidth on the processor, benchmarking might suggest that I run one GHC runtime (with a parallel GC) plus one Erlang process per processor.
In both cases, if Erlang and GHC could share a heap, the sharing would probably be bound to a single OS thread running on a single core somehow. (I am getting out of my depth here, which is why I asked the question.)
I also have another agenda - benchmarking functional languages independently of GC. Often I read of results of benchmarks of OCaml v GHC v Erlang v ... and wonder how much the results are confounded by the different GCs. What if choice of GC could be orthogonal to choice of functional language? How expensive is GC anyway? See this devil advocates blog post
http://john.freml.in/garbage-collection-harmful
by my Lisp friend John Fremlin, which he has, charmingly, given his post title "Automated garbage collection is rubbish". When John claims that GC is slow and hasn't really sped up that much, I would like to be able to counter with some numbers.

A lot of Haskell and Erlang people are interested in the model where Erlang supervises distribution, while Haskell runs the shared memory nodes in parallel doing all the number crunching/logic.
A start towards this is the haskell-erlang library: http://hackage.haskell.org/package/erlang
And we have similar efforts in Ruby land, via Hubris: http://github.com/mwotton/Hubris/tree/master
The question now is to find someone to actually push through the Erlang / Haskell interop to find out the tricky issues.

You're going to have an interesting time mixing GC between Haskell and Erlang. Erlang uses a per-process heap and copies data between processes -- as Haskell doesn't even have a concept of processes, I'm not sure how you would map this "universal" GC between the two. Furthermore, for best performance, Erlang uses a variety of allocators, each with slightly tweaked behaviours that I'm sure would affect the GC sub-system.
As with all things in software, abstraction comes at a cost. In this case, I rather suspect you'd have to introduce so many layers to get both languages over their impedance mismatch that you'd wind up with a not very performant (or useful) common VM.
Bottom line -- embrace the difference! There are huge advantages to NOT running everything in the same process, particularly from a reliability standpoint. Also, I think it's a little naive to expect one language/VM to last you for the rest of your life (unless you plan on a.) living a short time or b.) becoming some sort of code monk that ONLY works on a single project). Software development is all about mental agility and being willing to use the best available tools to build fast, reliable code.

Although this is a pretty old thread, if readers are still interested then it's worth taking a look at Cloud Haskell, which brings Erlang style concurrency and distribution to the GHC stable.
The forthcoming distributed-process-platform library adds support for OTP-esque constructs like gen_servers, supervision trees and various other "haskell flavoured" abstractions borrowed from and inspired by Erlang/OTP.

You could use an OTP gen_supervisor process to monitor Haskell instances that you spawn with open_port(). Depending on how the "port" exited, you would then be able to restart it or decide that it stopped on purpose and let the corresponding Erlang process die, too.
Fugheddaboudit. Even these language-independent VMs you speak of have trouble with data passed between languages sometimes. You should just serialize data between the two somehow: database, XML-RPC, something like that.
By the way, the idea of a single platform for the rest of your life is probably impractical, too. Computing technology and fashion change too often to expect that you can keep using just one language forever. Your very question points this out: no one language does everything we might wish, even today.

As dizzyd mentioned in his comment not all data in messages is copied, large binaries exist outside of the process heaps and are not copied.
Using a different memory structure to avoid having separate per-process heaps is certainly possible and has been done in a number of earlier implementations. While having a single heap makes message passing very fast it introduces a number of other problems, mainly that doing GC becomes more difficult as it has to be interactive and globally non-interruptive so you can't use the same simpler algorithms as the per-process heap model.
As long as we use have immutable data-structures there is no problem with robustness and safety. Deciding on which memory and GC models to use is a big trade-off, and unfortunately there universally best model.
While Haskell and Erlang are both functional languages they are in many respects very different languages and have very different implementations. It would difficult to come up with an "Erskell" (or Haslang) machine which could handle both languages efficiently. I personally think it is much better to keep them separate and to make sure you have a really good interface between them.

The CLR supports tail call optimization with an explicit tail opcode (as used by F#), which the JVM doesn't (yet) have an equivalent, which limits the implementation of such a style of language. The use of separate AppDomains does allow the CLR to hot-swap code (see e.g. this blog post showing how it can be done).
With Simon Peyton Jones working just down the corridor from Don Syme and the F# team at Microsoft Research, it would be a great disappointment if we didn't eventually see an IronHaskell with some sort of official status. An IronErlang would be an interesting project -- the biggest piece of work would probably be porting the green-threading scheduler without getting as heavyweight as the Windows Workflow engine, or having to run a BEAM VM on top the CLR.

Related

How Haskell handles parallel computing on a multicore machine/cluster

I'm considering a new language to learn those days to be used in high performance computing on a cluster of computers we have, among those languages, I'm considering Haskell.
I have read some about Haskell, but still have questions about using Haskell in high performance and distributed computing, which the language is known for, but I read some debates about Haskell is not good for those type of systems due to laziness, I can summarize my questions in the following lines:
Haskell uses green threads, which is great for handling big number of concurrent connections, but what happens when one of tasks takes longer than average and blocks the rest, does the whole thread block (Node.js style), forward the next task to another processor/thread (Golang), use reductions technique (Erlang), which kicks the task out of processing context after a pre-determined number of ticks, or else?
In a distributed computing environment, what happens to lazily-evaluated functions, do they have to be forced strict?
If one function/module requires strict evaluation, but it depends on other lazy functions/modules, shall I modify the code of other functions/modules to make them strict as well, or the compiler will handle this to me and force everything in that chain to strict or lazy.
When processing a very large sequence of data, how does Haskell handle parallel processing, is it by following some kind of implicit map-reduce technique, or I have do it by myself.
Is there a clustering abstract in the language, that handles the computing power for me, that automatically forwards the next task to the free processor wherever it is, be it on the same computer or another computer in the same cluster.
How does Haskell ensure fair-share of work is evenly distributed to all the available cores on the same computer or on the available cluster.

GHC uses a pool of available work (called sparks) and a work-stealing system: when a thread runs out of work, it will look for work in the pool or on the work queues of other threads that it can steal.
There is no built-in support for distributed computing as there is in (say) Erlang. The semantics are whatever your implementation defines. There are existing implementations like Cloud Haskell that you can look at for examples.
Neither. Haskell will automatically do whatever work is necessary to provide a value that is demanded and no more.
Haskell (and GHC in particular) does not do anything to automatically parallelize evaluation because there is no known universal strategy for parallelizing that is strictly better than not parallelizing. See Why is there no implicit parallelism in Haskell? for more info.
No. See (2).
For the same machine, it uses the pool of sparks and the work-stealing system described above. There is no notion of "clustering".
For an overview of parallel and concurrent programming in Haskell, see the free book of the same name by Simon Marlow, a primary author of GHC's runtime system.

Multithreading
As far as SMP parallelism† is concerned, Haskell is very effective. It's not quite automatic, but the parallel library makes it really easy to parallelise just about anything. Because the sparks are so cheap, you can be pretty careless and just ask for lots of parallelism; the runtime will then know what to do.
Unlike in most other languages, it is not a big problem if you have highly branched data structures, tricky dynamic algorithms etc. – thanks to the purely functional paradigm, parallel Haskell never needs to worry about locks when accessed data is shared.
I think the biggest caveat is memory: GHC's garbage collector is not concurrent, and the functional style is rather allocation-happy.
Apart from that, it's possible to write programs that look like they're parallel, but really don't do any work at all but just start and immediately return because of laziness. Some testing and experience is still necessary. But laziness and parallelism are not incompatible; at least not if you make sure you have big enough “chunks” of strictness in it. And forcing something strict is largely trivial.
Simpler, common parallelism tasks (which could be expressed in a map-reduce manner, or the classic array-vector stuff – the ones which are also easy in many languages) can generally be handled even easier in Haskell with libraries that parallelise the data structures; the best-known of these is repa.
Distributed computing
There has been quite some work on Cloud Haskell, which is basically Erlang in library form. This kind of task is less straightforward: the idea of any explicit message sending is a bit against Haskell's grain, and many aspects of the workflow become more cumbersome if the language is so heavily focused on its strong static typing (which is in Haskell otherwise often a huge bonus that doesn't just improve safety and performance but also makes it easier to write).
I think it's not far off to use Haskell in a distributed concurrent manner, but we can't say it's mature in that role yet. For distributed concurrent tasks, Erlang itself is certainly the way to go.
Clusters
Honestly, Haskell won't help you much at all here. A cluster is of course in principle a special case of a distributed setup, so you could employ Cloud Haskell; but in practice the needs are very different. The HPC world today (and probably quite some time into the future) hinges on MPI, and though there is a bit of existing work on MPI bindings, I haven't found them usable, at least not just like that.
MPI is definitely also quite against Haskell's grain, what with it's FORTRAN-oriented array centrism, weird ways of handling types and so on. But unless you go nuts with Haskell's cool features (though often it is so tempting!) there is no reason you couldn't write typical number-crunching code also in Haskell. The only problem is again support/maturity, but it's a considerable problem; so for cluster computing I'd recommend C++, Python or Julia instead.
An interesting alternative is to generate MPI-parallelised C or C++ code from Haskell. Paraiso is one nice project that does this.
Pipe dreams
I have often though about what could be done to make the distributed computing feasible in idiomatic Haskell. In principle I believe laziness could be a big help there. The model I'd envision is to let all machines compute independently the same program, but make use of the fact that Haskell evaluation has generally no predetermined order. The order would be randomised on each machine. Also the runtime would track how long some computation branch took to complete, and how big the result is. If a result is deemed both expensive and compact enough to warrant it, it would then be broadcast to the other nodes, together with some suitable hash that would allow them to shortcut that computation.
Such a system would never be quite as efficient as a hand-optimised MPI application, but it could at least offer the same asymptotics in many cases. And it could handle vastly more complex algorithms with ease.
But again, that's totally just my vague hopes for the not-so-near future.
†You said concurrency (which isn't so much about computation as about interaction), but it seems your question is in essence about pure computations?

Haskell for mission-critical systems [duplicate]

I've been curious to understand if it is possible to apply the power of Haskell to embedded realtime world, and in googling have found the Atom package. I'd assume that in the complex case the code might have all the classical C bugs - crashes, memory corruptions, etc, which would then need to be traced to the original Haskell code that
caused them. So, this is the first part of the question: "If you had the experience with Atom, how did you deal with the task of debugging the low-level bugs in compiled C code and fixing them in Haskell original code ?"
I searched for some more examples for Atom, this blog post mentions the resulting C code 22KLOC (and obviously no code:), the included example is a toy. This and this references have a bit more practical code, but this is where this ends. And the reason I put "sizable" in the subject is, I'm most interested if you might share your experiences of working with the generated C code in the range of 300KLOC+.
As I am a Haskell newbie, obviously there may be other ways that I did not find due to my unknown unknowns, so any other pointers for self-education in this area would be greatly appreciated - and this is the second part of the question - "what would be some other practical methods (if) of doing real-time development in Haskell?". If the multicore is also in the picture, that's an extra plus :-)
(About usage of Haskell itself for this purpose: from what I read in this blog post, the garbage collection and laziness in Haskell makes it rather nondeterministic scheduling-wise, but maybe in two years something has changed. Real world Haskell programming question on SO was the closest that I could find to this topic)
Note: "real-time" above is would be closer to "hard realtime" - I'm curious if it is possible to ensure that the pause time when the main task is not executing is under 0.5ms.

At Galois we use Haskell for two things:
Soft real time (OS device layers, networking), where 1-5 ms response times are plausible. GHC generates fast code, and has plenty of support for tuning the garbage collector and scheduler to get the right timings.
for true real time systems EDSLs are used to generate code for other languages that provide stronger timing guarantees. E.g. Cryptol, Atom and Copilot.
So be careful to distinguish the EDSL (Copilot or Atom) from the host language (Haskell).
Some examples of critical systems, and in some cases, real-time systems, either written or generated from Haskell, produced by Galois.
EDSLs
Copilot: A Hard Real-Time Runtime Monitor -- a DSL for real-time avionics monitoring
Equivalence and Safety Checking in Cryptol -- a DSL for cryptographic components of critical systems
Systems
HaLVM -- a lightweight microkernel for embedded and mobile applications
TSE -- a cross-domain (security level) network appliance

It will be a long time before there is a Haskell system that fits in small memory and can guarantee sub-millisecond pause times. The community of Haskell implementors just doesn't seem to be interested in this kind of target.
There is healthy interest in using Haskell or something Haskell-like to compile down to something very efficient; for example, Bluespec compiles to hardware.
I don't think it will meet your needs, but if you're interested in functional programming and embedded systems you should learn about Erlang.

Andrew,
Yes, it can be tricky to debug problems through the generated code back to the original source. One thing Atom provides is a means to probe internal expressions, then leaves if up to the user how to handle these probes. For vehicle testing, we build a transmitter (in Atom) and stream the probes out over a CAN bus. We can then capture this data, formated it, then view it with tools like GTKWave, either in post-processing or realtime. For software simulation, probes are handled differently. Instead of getting probe data from a CAN protocol, hooks are made to the C code to lift the probe values directly. The probe values are then used in the unit testing framework (distributed with Atom) to determine if a test passes or fails and to calculate simulation coverage.

I don't think Haskell, or other Garbage Collected languages are very well-suited to hard-realtime systems, as GC's tend to amortize their runtimes into short pauses.
Writing in Atom is not exactly programming in Haskell, as Haskell here can be seen as purely a preprocessor for the actual program you are writing.
I think Haskell is an awesome preprocessor, and using DSEL's like Atom is probably a great way to create sizable hard-realtime systems, but I don't know if Atom fits the bill or not. If it doesn't, I'm pretty sure it is possible (and I encourage anyone who does!) to implement a DSEL that does.
Having a very strong pre-processor like Haskell for a low-level language opens up a huge window of opportunity to implement abstractions through code-generation that are much more clumsy when implemented as C code text generators.

I've been fooling around with Atom. It is pretty cool, but I think it is best for small systems. Yes it runs in trucks and buses and implements real-world, critical applications, but that doesn't mean those applications are necessarily large or complex. It really is for hard-real-time apps and goes to great lengths to make every operation take the exact same amount of time. For example, instead of an if/else statement that conditionally executes one of two code branches that might differ in running time, it has a "mux" statement that always executes both branches before conditionally selecting one of the two computed values (so the total execution time is the same whichever value is selected). It doesn't have any significant type system other than built-in types (comparable to C's) that are enforced through GADT values passed through the Atom monad. The author is working on a static verification tool that analyzes the output C code, which is pretty cool (it uses an SMT solver), but I think Atom would benefit from more source-level features and checks. Even in my toy-sized app (LED flashlight controller), I've made a number of newbie errors that someone more experienced with the package might avoid, but that resulted in buggy output code that I'd rather have been caught by the compiler instead of through testing. On the other hand, it's still at version 0.1.something so improvements are undoubtedly coming.

Why did you decide "against" using Erlang?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
Have you actually "tried" (means programmed in, not just read an article on it) Erlang and decided against it for a project? If so, why? Also, if you have opted to go back to your old language, or to use another functional language like F#, Haskell, Clojure, Scala, or something else then this counts too, and state why.

I returned to Haskell for my personal projects from Erlang for the simple virtue of Haskell's amazing type system. Erlang gives you a ton of tools to handle when things go wrong. Haskell gives you tools to keep you from going wrong in the first place.
When working in a language with a strong type system you are effectively proving free theorems about your code every time you compile.
You also get a bunch of overloading sugar from Haskell's typeclass machinery, but that is largely secondary to me -- even if it does allow me to express a number of abstractions that would be terribly verbose or non-idiomatic and unusable in Erlang (e.g. Haskell's category-extras).
I love Erlang, I love its channels and its effortless scalability. I turn to it when these are the things I need. Haskell isn't a panacea. I give up a better operational understanding of space consumption. I give up the magical one pass garbage collector. I give up OTP patterns and all that effortless scalability.
But its hard for me to give up the security blanket that, as is commonly said, in Haskell, if it typechecks, it is probably correct.

We use Haskell, OCaml and (now) F# so for us it has nothing to do with lack of C-like syntax. Rather we skip Erlang because:
It's dynamically typed (we're fans of Haskell's type system)
Doesn't provide a 'real' string type (I understand why, but it's annoying that this hasn't been corrected at the language level yet)
Tends to have poor (incomplete or unmaintained) database drivers
It isn't batteries included and doesn't appear to have a community working on correcting this. If it does, it isn't highly visible. Haskell at least has Hackage, and I'd guess that's what has us choosing that language over any other. In Windows environments F# is about to have the ultimate advantage here.
There are probably other reasons I can't think of right now, but these are the major points.

The best reason to avoid Erlang is when you cannot commit to the functional way of programming.
I read an anti-Erlang blog rant a few weeks ago, and one of the author's criticisms of Erlang is that he couldn't figure out how to make a function return a different value each time he called it with the same arguments. What he really hadn't figured out is that Erlang is that way on purpose. That's how Erlang manages to run so well on multiple processors without explicit locking. Purely functional programming is side-effect-free programming. You can arm-twist Erlang into working like our ranting blogger wanted, adding side effects, but in doing so you throw away the value Erlang offers.
Pure functional programming is not the only right way to program. Not everything needs to be mathematically rigorous. If you determine your application would be best written in a language that misuses the term "function", better cross Erlang off your list.

I have used Erlang in a few project already. I often use it for restful services. Where I don't use it however is for complex front end web applications where tools like Ruby on Rails are far better. But for the powerbroker behind the scenes I know of no better tool than Erlang.
I also use a few applications written in Erlang. I use CouchDB and RabbitMQ a bit and I have set up a few EJabberd servers. These applications are the most powerful, easiest and flexible tools in their field.
Not wanting to use Erlang because it does not use JVM is in my mind pretty silly. JVM is not some magical tool that is the best in doing everything in the world. In my mind the ability to choose from an arsenal of different tools and not being stuck in a single language or framework is what separates experts from code monkeys.
PS: After reading my comment back in context I noticed it looked like I was calling oxbow_lakes a code monkey. I really wasn't and apologize if he took it like that. I was generalizing about types of programmers and I would never call an individual such a negative name based on one comment by him. He is probably a good programmer even though I encourage him to not make the JVM some sort of a deal breaker.

Whilst I haven't, others on the internet have, e.g.
We investigated the relative merits of
C++ and Erlang in the implementation
of a parallel acoustic ray tracing
algorithm for the U.S. Navy. We found
a much smaller learning curve and
better debugging environment for
parallel Erlang than for
pthreads-based C++ programming. Our
C++ implementation outperformed the
Erlang program by at least 12x.
Attempts to use Erlang on the IBM Cell
BE microprocessor were frustrated by
Erlang's memory footprint. (Source)
And something closer to my heart, which I remember reading back in the aftermath of the ICFP contest:
The coding was very straightforward,
translating pseudocode into C++. I
could have used Java or C#, but I'm at
the point where programming at a high
level in C++ is just as easy, and I
wanted to retain the option of quickly
dropping down into some low-level
bit-twiddling if it came down to it.
Erlang is my other favorite language
for hacking around in, but was worried
about running into some performance
problem that I couldn't extricate
myself from. (Source)

For me, the fact that Erlang is dynamically typed is something that makes me wary. Although I do use dynamically typed languages because some of them are just so very problem-oriented (take Python, I solve a lot of problems with it), I wish they were statically typed instead.
That said, I actually intended to give Erlang a try for some time, and I’ve just started downloading the source. So your “question” achieved something after all. ;-)

I know Erlang since university, but have never used it in my own projects so far. Mainly because I'm mostly developing desktop applications, and Erlang is not a good language for making nice GUIs. But I will soon implement a server application, and I will give Erlang a try, because that's what it's good for. But I'm worring that I need more librarys, so maybe I'll try with Java instead.

A number of reasons:
Because it looks alien from anyone used to the C family of languages
Because I wanted to be able to run on the Java Virtual Machine to take advantage of tools I knew and understood (like JConsole) and the years of effort which have gone into JIT and GC.
Because I didn't want to have to rewrite all the (Java) libraries I've built up over the years.
Because I have no idea about the Erlang "ecosystem" (database access, configuration, build etc).
Basically I am familiar with Java, its platform and ecosystem and I have invested much effort into building stuff which runs on the JVM. It was easier by far to move to scala

I Decided against using Erlang for my project that was going to be run with a lot of shared data on a single multi-processor system and went with Clojure becuase Clojure really gets shared-memory-concurrency. When I have worked on distributed data storage systems Erlang was a great fit because Erlang really shines at distributed message passing systems. I compare the project to the best feature in the language and choose accordingly

Used it for a message gateway for a proprietary, multi-layered, binary protocol. OTP patterns for servers and relationships between services as well as binary pattern matching made the development process very easy. For such a use case I'd probably favor Erlang over other languages again.

The JVM is not a tool, it is a platform. Although I am all in favour of choosing the best tool for the job the platform is mostly already determined. Unless I am developing something standalone, from scratch and without the desire to reuse any existing code/library (three aspects that are unlikely in isolation already) I may be free to choose the platform.
I do use multiple tools and languages but I mainly targetg the JVM platform. That precludes Erlang for most if not all of my projects, as interesting as some of it concepts are.
Silvio

While I liked many design aspects of the Erlang runtime and the OTP platform, I found it to be a pretty annoying program language to develop in. The commas and periods are totally lame, and often require re-writing the last character of many lines of code just to change one line. Also, some operations that are simple in Ruby or Clojure are tedious in Erlang, for example string handling.
For distributed systems relying on a shared database the Mnesia system is really powerful and probably a good option, but I program in a language to learn and to have fun, and Erlang's annoying factor started to outweigh the fun factor once I had gotten past the basic bank account tutorials and started writing plugins for an XMPP server.

I love Erlang from the concurrency standpoint. Erlang really did concurrency right. I didn't end up using erlang primarily because of syntax.
I'm not a functional programmer by trade. I generally use C++, so I'm covet my ability to switch between styles (OOP, imperative, meta, etc). It felt like Erlang was forcing me to worship the sacred cow of immutable-data.
I love it's approach to concurrency, simple, beautiful, scalable, powerful. But the whole time I was programming in Erlang I kept thinking, man I'd much prefer a subset of Java that disallowed data sharing between thread and used Erlangs concurrency model. I though Java would have the best bet of restricting the language the feature set compatible with Erlang's processes and channels.
Just recently I found that the D Programing language offers Erlang style concurrency with familiar c style syntax and multi-paradigm language. I haven't tried anything massively concurrent with D yet, so I can't say if it's a perfect translation.
So professionally I use C++ but do my best to model massively concurrent applications as I would in Erlang. At some point I'd like to give D's concurrency tools a real test drive.

I am not going to even look at Erlang.
Two blog posts nailed it for me:
Erlang machinery walks the whole list to figure out whether they have a message to process, and the only way to get message means walking the whole list (I suspect that filtering messages by pid also involves walking the whole message list)
http://www.lshift.net/blog/2010/02/28/memory-matters-even-in-erlang
There are no miracles, indeed, Erlang does not provide too many services to deal with unavoidable overloads - e.g. it is still left to the application programmer to deal checking for available space in the message queue (supposedly by walking the queue to figure out the current length and I suppose there are no built-in mechanisms to ensure some fairness between senders).
erlang - how to limit message queue or emulate it?
Both (1) and (2) are way below naive on my book, and I am sure there are more software "gems" of similar nature sitting inside Erlang machinery.
So, no Erlang for me.
It seems that once you have to deal with a large system that requires high performance under overload C++ + Boost is still the only game in town.
I am going to look at D next.

I wanted to use Erlang for a project, because of it's amazing scalability with number of CPU'S. (We use other languages and occasionally hit the wall, leaving us with having to tweak the app)
The problem was that we must deliver our application on several platforms: Linux, Solaris and AIX, and unfortunately there is no Erlang install for AIX at the moment.
Being a small operation precludes the effort in porting and maintaining an AIX version of Erlang, and asking our customers to use Linux for part of our application is a no go.
I am still hoping that an AIX Erlang will arrive so we can use it.

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines. etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.

Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawerence Livermore National Laboratory.

Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.

If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.

For a 6 month project I'd say it definitely pays out to start reading a good book about the subject first. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I've written multithreaded programs for 10 years when I discovered this gem and still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.

There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you)
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.

One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.

There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.

The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Application".

Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.

Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first, because threading adds a lot of complexity you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).

I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
2: I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now. At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For and/or PLINQ where possible. If you do this, a lot of the heavy lifting will be done for you in a very efficient way. 1: http://msdn.microsoft.com/en-us/library/dd321424%28VS.100%29.aspx

Sorry i just want to add a pessimistic or better realistic answer here.
You are under time pressure. 6 month deadline and you don't even know for sure what language is this system and what it does and how it is organized. If it is not a trivial calculation then it is a very bad start.
Most importantly: You say you have never done mulitithreading programming before. This is where i get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn it when you want to do it right - and you need to do it right when you want to win a huge speed increase. Debugging is extremely nasty even with good tools like Total Views debugger or Intels VTune.
Then you say you want to rewrite the app in another lanugage - well this isn't as bad as you have to rewrite it anyway. THe chance to turn a single threaded Program into a well working multithreaded one without total redesign is almost zero.
But learning multithreading and a new language (what is your C++ skills?) with a timeline of 3 month (you have to write a throw away prototype - so i cut the timespan into two halfs) is extremely challenging.
My advise here is simple and will not like it: Learn multithreadings now - because it is a required skill set in the future - but leave this job to someone who already has experience. Well unless you don't care about the program being successfull and are just looking for 6 month payment.

If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.

You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously

Whatever technology your going to write this, take a look a this must read book on concurrency "Concurrent programming in Java" and for .Net I highly recommend the retlang library for concurrent app.

I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multiple threaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...

Trivial mathematical problems as language benchmarks

Why do people insist on using trivial mathematical problems like finding numbers in the Fibonacci sequence for language benchmarks? Don't these usually get optimized to relativistic speeds? Isn't the brunt of the bottlenecks usually in I/O, system API calls, operations on strings and structures, processing large quantities of data, abstract object-oriented stuff, etc?

It is a throwback to the old days, when compiler technology for what we would now call basic math was still evolving rapidly.
Now, compiler evolution is more focused on exploiting new instructions for niche operations, 64-bit math, and so on.
Micro-benchmarks such as the ones you mention were useful, though, when evaluating the efficiency of the hotspot compiler when Java was first launched, and in evaluating the efficiency of .NET versus C/C++.
Your suggestion that I/O and system calls are the likely bottlenecks is correct, at least for some space of problems. But I notice you suggested string operations. One person's irrelevant micro-benchmark is another person's critical performance metric.
EDIT: ps, I also remember using linpack and other micro-benchmarks to compare versions of the JVM, and to compare vendors of the JVM. From v4 to v5 there was a big jump in perf, I guess the JIT compiler got more effective. Also, IBM's JVM was ahead of Sun's at that time, on Windows-x86.

Because if you want to benchmark the language/compiler, these "math problems" are good indicators of the "bare speed" of the generated code. Either they use the iterative solution, which is a tight loop and indicates how well can the compiler push the instructions to the processor, or they use the recursive solution, which indicates how does it handle recursive calls of short functions (inlining, tail-recursion etc.) (although the Ackermann function is usually used for that too).
Usually, the benchmark suite for the language contain tests benchmarking other parts as well - eg. gzip compression, text searching, object creation, virtual function call, exception throw/catch benchmarks.
The other things you've noticed, syscalls and IO are usually not included because
syscalls are in fact not that slow - applications don't spend significant porion of the time in the kernel, except for test specifically targeted at them or when something is seriously wrong with the program
syscall and IO performance does not depend on the language, but rather on the OS & hardware

I'd think a simple, well-established algorithm would remove the possibility that the benchmark is biased (whether through ignorance or malice) to favor one language. It is very difficult to write a complex program in two different languages exactly the same. Testing something like the efficiency of a multithreaded application in c# vs java, for example, would require developers skilled in multithreaded development both languages, and there would still be questions as to whether the benchmark app properly represents the general case, or if it is misrepresenting a special case that only one language handles well.

Back when the sieve of eratosthanes was a popular benchmark for C compilers, I thought it would be funny if one of the compiler authors would recognize the sieve code and replace it with a pre-computed lookup.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string