What happened to libgreen? - rust

As far as I understand libgreen is not a part of Rust standard library anymore. Also I can't find a separate libgreen package. There are a few alternatives - coroutine, which does not provide actual green threads for now, and green-rs, which is broken. Do I right understand that for now there is no lightweight Go-like processes in Rust?

You are correct that there's no lightweight tasking library in std (or the rest of the main distribution), that green doesn't compile and that coroutine doesn't seem to fully handle the threading aspect yet. I do not know of any other library in this space.
As for what happened: the RFC linked to by that issue—RFC 230—is the canonical source of information. The summary is that it was found that the method by which green threading/IO was handled (std tried to abstract across both models, allowing them to be used interoperably automagically) was not worth the downsides. Now, std aims to just provide a minimum base-line of useful support: for IO/threading, that means "thin", safe wrappers for operating system functionality.

Read this https://aturon.github.io/blog/2016/08/11/futures/ and also:
Steve Klabnik's response in the comments:
In the beginning, Rust had only green threads. Eventually, it was
decided that a systems language without systems threads is... strange.
So we needed to add them. Why not add choice? Since the interfaces
could be the same, why not abstract over them, and you could just
choose which one you wanted?
At the same time, the problems with green threads by default were
becoming issues. Segmented stacks cause slow C interop. You need a
runtime to manage them, etc. Furthermore, the overall abstraction was
causing an unacceptable cost. The green threads weren't very green.
Plus, with the need to actually release someday looming, decisions
needed to be made regarding tradeoffs. And since Rust is supposed to
be a systems language, having 1:1 threads and basically no runtime
makes more sense than N:M threads and a runtime. . So libgreen was
removed, the interface was re-done to be 1:1 thread centric.
The 'release someday looming' is a big part of it. We want to be
really stable with Rust, and with all the things to do to actually
ship a 1.0, we didn't want to crystallize an interface we weren't
happy with. Heck, we pulled out a lot of libraries that are even less
important for similar reasons, like rand. Engineering is all about
tradeoffs, and we decided to choose minimalism.
mio is a non starter for us, as are most of the other async i/o frameworks for Rust, because we need Windows and besides we don't want
to get locked into an expensive to replace library which may get
orphaned.
Totally understood here, especially in the general case. In the
specific case, mio is going to either have Windows support, or a
windows-specific version of mio is going to be released, with a
higher-level package providing the features for all platforms. And in
this case, it's maintained by one of the people who's currently using
Rust heavily in production, so it's not likely to go away anytime
soon. But, unless you're actively involved, it's hard to know things
like that, which is, of itself an issue.
One of the reasons we were comfortable removing libgreen is that you
can write your own libraries to do different kinds of IO. 1.0 is a
strong core that we feel good about stabilizing forever, not the final
bit. Libraries like https://github.com/carllerche/mio can test out
different ways of handling things like async IO, and, when they're
mature enough, we can always pull them back in the standard library if
need be. But in the meantime, it's just one line to your Cargo.toml to
add them in.
And such text from reddit:
Unfortunately they ended up canning the greenlet support because
theirs were slower than kernel threads which in turn demonstrates
someone didn’t understand how to get a language compiler to generate
stackless coroutines effectively (not surprising, the number of
engineers wired the right way is not many in this world, but see
http://www.reddit.com/r/rust/comments/2l0a4b/do_rust_web_servers_use_libuv_through_libgreen_or/
for more detail). And they canned the async i/o because libuv is
“slow” (which it is only because it is single threaded only, plus
forces a malloc + free per async operation as the buffers must last
until completion occurs, plus it enforces a penalty over synchronous
i/o see
http://blog.kazuhooku.com/2014/09/the-reasons-why-i-stopped-using-libuv.html),
which was a real shame - they should have taken the opportunity to
replace libuv with something better (hint: ASIO + AFIO, and yes I know
they are both C++, but Rust could do with much better C++ interop than
the presently none it currently has) instead of canning
always-async-everything in what could have been an amazing step up
from C++ with most of the benefits of Erlang without the disadvantages
of Erlang.

For newcomers, there is now may, a crate that implements green threads similar to goroutines.

Related

Does pthreads provide any advantages over GCD?

Having recently learned Grand Central Dispatch, I've found multithreaded code to be pretty intuitive(with GCD). I like the fact that no locks are required(and the fact that it uses lockless data structures internally), and that the API is very simple.
Now, I'm beginning to learn pthreads, and I can't help but be a little overwhelmed with the complexity. Thread joins, mutexes, condition variables- all of these things aren't necessary in GCD, but have a lot of API calls in pthreads.
Does pthreads provide any advantages over GCD? Is it more efficient? Are there normal-use cases where pthreads can do things that GCD can not do(excluding kernel-level software)?
In terms of cross-platform compatibility, I'm not too concerned. After all, libdispatch is open source, Apple has submtited their closure changes as patches to GCC, clang supports closures, and already(e.x. FreeBSD), we're starting to see some non-Apple implementations of GCD. I'm mostly interested in use of the API(specific examples would be great!).
I am coming from the other direction: started using pthreads in my application, which I recently replaced with C++11's std::thread. Now, I am playing with higher-level constructs like the pseudo-boost threadpool, and even more abstract, Intel's Threading Building Blocks. I would consider GCD to be at or even higher than TBB.
A few comments:
imho, pthread is not more complex than GCD: at its basic core, pthread actually contains very few commands (just a handful: using just the ones mentioned in the OP will give you 95%+ of the functionality that you ever need). Like any lower-level library, it's how you put them together and how you use it which gives you its power. Don't forget that the ultimately, libraries like GCD and TBB will call a threading library like pthreads or std::thread.
sometimes, it's not what you use, but how you use it, which determines success vs failure. As proponents of the library, TBB or GCD will tell you about all the benefits of using their libraries, but until you try them out in a real application context, all of it is of theoretical benefit. For example, when I read about how easy it was to use a finely-grained parallel_for, I immediately used it in a task for which I thought could benefit from parallelism. Naturally, I, too, was drawn by the fact that TBB would handle all the details about optimal loading balancing and thread allocation. The result? TBB took five times longer than the single-threaded version! But I do not blame TBB: in retrospect, this is obviously a case of a misuse of the parallel_for: when I read the fine-print, I discovered the overhead involved in using parallel_for and posited that in my case, the costs of context-switching and added function calls outweighed the benefits of using multiple threads. So you must profile your case to see which one will run faster. You may have to reorganize your algorithm to use less threading overhead.
why does this happen? How can pthread or no threads be faster than a GCD or a TBB? When a designer designs GCD or TBB, he must make assumptions about the environment in which tasks will run. In fact, the library must be general enough that it can handle strange, unforseen use-cases by the developer. These general implementations will not come for free. On the plus-side, a library will query the hardware and the current running environment to do a better job of load-balancing. Will it work to your benefit? The only way to know is to try it out.
is there any benefit to learning lower-level libraries like std::thread when higher-level libraries are available? The answer is a resounding YES. The advantage of using higher-level libraries is, abstraction from the implementation details. The disadvantage of using higher-level libraries is also abstraction from the implementation details. When using pthreads, I am supremely aware of shared state and lifetimes of objects, because if I let my guard down, especially in a medium to large size project, I can very easily get race conditions or memory faults. Do these problems go away when I use a higher-level library? Not really. It seems like I don't need to think about them, but in fact, if I get sloppy with those details, the library implementation will also crash. So you will find that if you understand the lower-level constructs, all those libraries actually make sense, because at some point, you will be thinking about implementing them yourself, if you use the lower-level calls. Of course, at that point, it's usually better to use a time-tested and debugged library call.
So, let's break down the possible implementations:
TBB/GCD library calls: greatest benefit is for beginners of threading. They have lower barriers to entry compared to learning lower level libraries. However, they also ignore/hide some of the traps of using multi-threading. Dynamic load balancing will make your application more portable without additional coding on your part.
pthread and std::thread calls: there are actually very few calls to learn, but to use them correctly takes attention to detail and deep awareness of how your application works. If you can understand threads at this level, the APIs of higher-level libraries will certainly make more sense.
single-threaded algorithm: let us not forget the benefits of a simple single-threaded segment. For most applications, a single-thread is easier to understand and much less error-prone than multi-threading. In fact, in many cases, it may be the appropriate design choice. The fact of the matter is, a real application goes through various multi-threading phases and single-threading phases: there may be no need to be multi-threaded all the time.
Which one is fastest? The surprising truth is, it could be any of the three of the above. To get speed benefits of multi-threading, you may need to drastically reorganize your algorithms. Whether or not the benefits outweigh the costs is highly case-dependent.
Oh, and the OP asked about cases where a thread_pool is not appropriate. Easy case: if you have a tight loop that does not require many cycles per loop to compute, using thread_pool may cost more than the benefits without serious reworking. Also be aware of the overhead of function calls like lambda through thread pools vs the use of a single tight loop.
For most applications, multi-threading is a kind of optimization, so do it at the right time and in the right places.
That overwhelming feeling that you are experiencing.. that's exactly why GCD was invented.
At the most basic level there are threads, pthreads is a POSIX API for threads so you can write code in any compliant OS and expect it to work. GCD is built on top of threads (although I'm not sure if they actually used pthreads as the API). I believe GCD only works on OS X and iOS — that in a nutshell is its main disadvantage.
Note that projects that make heavy use of threads and require high performance implement their own version of thread pools. GCD allows you to avoid (re)inventing the wheel for the umpteenth time.
GCD is an Apple technology, and not the most cross platform compatible; pthread are available on just about everything from OSX, Linux, Unix, Windows.. including this toaster
GCD is optimized for thread pool parallelism. Pthreads are (as you said) very complex building blocks for parallelism, you are left to develop your own models. I highly recommend picking up a book on the topic if you're interested in learning more about pthreads and different models of parallelism.
As any declarative/assisted approach like openmp or Intel TBB GCD should be very good at embarrassingly parallel problems and will probably easily beat naïve manually pthread-ed parallel sort. I would suggest you still learn pthreads though. You'll understand concurrency better, you'd be able to apply right tool in each particular situation, and if for nothing else - there's ton of pthread-based code out there - you'd be able to read "legacy" code.
Usual: 1 task per Pthread implementations use mutexes (an OS feature).
GCD:
1 task per block, grouped into queues. 1 thread per virtual CPU can get a queue and run without mutexes through all the tasks. This reduces thread management overhead and mutex overhead, which should increase performance.
GCD abstracts threads and gives you dispatch queues. It creates threads as it deems necessary taking into account the number of processor cores available.
GCD is Open Source and is available through the libdispatch library. FreeBSD includes libdispatch as of 8.1. GCD and C Blocks are mayor contributions from Apple to the C programming community. I would never use any OS that doesn't support GCD.

Are there any practical alternatives to threads?

While reading up on SQLite, I stumbled upon this quote in the FAQ: "Threads are evil. Avoid them."
I have a lot of respect for SQLite, so I couldn't just disregard this. I got thinking what else I could, according to the "avoid them" policy, use instead in order to parallelize my tasks. As an example, the application I'm currently working on requires a user interface that is always responsive, and needs to poll several websites from time to time (a process which takes at least 30 seconds for each website).
So I opened up the PDF linked from that FAQ, and essentially it seems that the paper suggests several techniques to be applied together with threads, such as barriers or transactional memory - rather than any techniques to replace threads altogether.
Given that these techniques do not fully dispense with threads (unless I misunderstood what the paper is saying), I can see two options: either the SQLite FAQ does not literally mean what it says, or there exist practical approaches that actually avoid the use of threads altogether. Are there any?
Just a quick note on tasklets/cooperative scheduling as an alternative - this looks great in small examples, but I wonder whether a large-ish UI-heavy application can be practically parallelized in a solely cooperative way. If you have done this successfully or know of such examples this certainly qualifies as a valid answer!
Note: This answer no longer accurately reflects what I think about this subject. I don't like its overly dramatic, somewhat nasty tone. Also, I am not so certain that the quest for provably correct software has been so useless as I seemed to think back then. I am leaving this answer up because it is accepted, and up-voted, and to edit it into something I currently believe would pretty much vandalize it.
I finally got around to reading the paper. Where do I start?
The author is singing an old song, which goes something like this: "If you can't prove the program is correct, we're all doomed!" It sounds best when screamed loudly accompanied by over modulated electric guitars and a rapid drum beat. Academics started singing that song when computer science was in the domain of mathematics, a world where if you don't have a proof, you don't have anything. Even after the first computer science department was cleaved from the mathematics department, they kept singing that song. They are singing that song today, and nobody is listening. Why? Because the rest of us are busy creating useful things, good things out of software that can't be proved correct.
The presence of threads makes it even more difficult to prove a program correct, but who cares? Even without threads, only the most trivial of programs can be proved correct. Why do I care if my non-trivial program, which could not be proved correct, is even more unprovable after I use threading? I don't.
If you weren't sure the author was living in an academic dreamworld, you can be sure of it after he maintains that the coordination language he suggests as an alternative to threads could best be expressed with a "visual syntax" (drawing graphs on the screen). I've never heard that suggestion before, except every year of my career. A language that can only be manipulated by GUI and does not play with any of the programmer's usual tools is not an improvement. The author goes on to cite UML as a shining example of a visual syntax which is "routinely combined with C++ and Java." Routinely in what world?
In the mean time, I and many other programmers go on using threads without all that much trouble. How to use threads well and safely is pretty much a solved problem, as long as you don't get all hung up on provability.
Look. Threading is a big kid's toy, and you do need to know some theory and usage patterns to use them well. Just as with databases, distributed processing, or any of the other beyond-grade-school devices that programmers successfully use every day. But just because you can't prove it correct doesn't mean it's wrong.
The statement in the SQLite FAQ, as I read it, is just a comment on how difficult threading can be to the uninitiated. It is the author's opinion, and it might be a valid one. But saying you should never use threads is throwing the baby out with the bath water, in my opinion. Threads are a tool. Like all tools, they can be used and they can be abused. I can read his paper and be convinced that threads are the devil, but I have used them successfully, without killing kittens.
Keep in mind that SQLite is written to be as lightweight and easy to understand (from a coding standpoint) as possible, so I would imagine that threading is kind of the antithesis to this lightweight approach.
Also, SQLite is not meant to be used in a highly-concurrent environment. If you have one of these, you might be better off working with a more enterprisey database like Postgres.
Evil, but a necessary evil. High level abstractions of threads (Tasks in .NET for example) are becoming more common but for the most part the industry is not trying to find a way to avoid threads, just making it easier to deal with the complexities that come with any kind of concurrent programming.
One trend I've noticed, at least in the Cocoa domain, is help from the framework. Apple has gone to great lengths to help developers with the relatively difficult concept of concurrent programming. Some things I've seen:
Different granularity of threading. Cocoa supports everything from posix threads (low level) to object oriented threading with NSLock and NSThread, to high level parellelism such as NSOperation. Depending on your task, using a high level tool like NSOperation is easier and gets the job done.
Threading behind the scenes via an API. Lots of the UI and animation stuff in cocoa is hidden behind an API. You are responsible for calling an API method and providing an asynchronous callback this executed when the secondary thread completes (for example the end of some animation).
openMP. There are tools like openMP that allow you to provide pragmas that describe to the compiler that some task may be safely parelellized. For example iterating a set of items in an independent way.
It seems like a big push in this industry is to make things simple for the Application developers and leave the gory thread details to the system developers and framework developers. There is a push in academia for formalizing parellel patterns. As mentioned you cant always avoid threading, but there are an increasing number of tools in your arsenal to make it as painless as possible.
If you really want to live without threads, you can, so long as you don't call any functions that can potentially block. This may not be possible.
One alternative is to implement the tasks you would have made into threads as finite state machines. Basically, the task does what it can do immediately, then goes to its next state, waiting for an event, such as input arriving on a file or a timer going off. X Windows, as well as most GUI toolkits, support this style. When something happens, they call a callback, which does what it needs to do and returns. For a FSM, the callback checks to see what state the task is in and what the event is to determine what to do immediately and what the next state will be.
Say you have an app that needs to accept socket connections, and for each connection, parse command lines, execute some code, and return the results. A task would then be what listens to a socket. When select() (or Gtk+, or whatever) tells you the socket has something to read, you read it into a buffer, then check to see if you have enough input buffered to do something. If so, you advance to a "start doing something" state, otherwise you stay in the "reading a line" state. (What you "do" could be multiple states.) When done, your task drops the line from the buffer and goes back to the "reading a line" state. No threads or preemption needed.
This lets you act multithreaded by way of being event-driven. If your state machines are complicated, however, your code can get hard to maintain pretty fast, and you'll need to work up some kind of FSM-management library to separate the grunt work of running the FSM from the code that actually does things.
P.S. Another way to get threads without really using threads is the GNU Pth library. It doesn't do preemption, but it is another option if you really don't want to deal with threads.
Another approach to this may be to use a different concurrency model rather than avoid multithreading altogether (you have to utilize all these CPU cores in parallel somehow).
Take a look at mechanisms used in Clojure (e.g. agents, software transactional memory).
Software Transactional Memory (STM) is a good alternative concurrency control. It scales well with multiple processors and do not have most of the problems of conventional concurrency control mechanisms. It is implemented as part of the Haskell language. It worths giving a try. Although, I do not know how this is applicable in the context of SQLite.
Alternatives to threads:
coroutines
goroutines
mapreduce
workerpool
apple's grand central dispatch+lambdas
openCL
erlang
(interesting to note that half of those technologies were invented or popularised by google.)
Another thing is many web frameworks transparently use multiple threads/processes for handling requests, and usually in such a way that mostly eliminates the problems associated with multithreading (for the user of the framework), or at least makes the threading rather invisible. The web being stateless, the only shared state is session state (which isn't really a problem since by definition, a single session isn't going to be doing concurrent things), and data in a database that already has its multithreading nonsense sorted out for you.
It's somewhat important to note though that these are all abstractions. The underlying implementations of these things still use threads. But this is still incredibly useful. In the same way you wouldn't use assembler to write a web application, you wouldn't use threads directly to write any important application. Designing an application to use threads is too complicated to leave for a human to deal with.
Threading is not the only model of concurrency. The actors model (Erlang, Scala) is an example of a somewhat different approach.
http://www.scala-lang.org/node/242
If your task is really, really easily isolatable, you can use processes instead of threads, like Chrome does for its tabs.
Otherwise, inside a single process, there is no way to achieve real parallelism without threads, because you need at least two coroutines if you want two things to happen at the same time (assuming you're having multiple processors/cores at hand, of course; otherwise real parallelism is simply not possible).
The complexity of threading a program is always relative to the degree of isolation of the tasks the threads will perform. There's no trouble in running several threads if you know for sure these will never use the same variables. Then again, multiple high-level constructs exist in modern languages to help synchronize access to shared resources.
It's really a matter of application. If your task is simple enough to fit in some kind of high-level Task object (depends on your development platform; your mileage may vary), then using a task queue is your best bet. My rule of the thumb is that if you can't find a cool name to your thread, then its task is not important enough to justify a thread (instead of task going on an operation queue).
Threads give you the opportunity to do some evil things, specifically sharing state among different execution paths. But they offer a lot of convenience; you don't have to do expensive communication across process boundaries. Plus, they come with less overhead. So I think they're perfectly fine, used correctly.
I think the key is to share as little data as possible among the threads; just stick to synchronization data. If you try to share more than that, you have to engage in complex code that is hard to get right the first time around.
One method of avoiding threads is multiplexing - in essence you make a lightweight mechanism similar to threads which you manage yourself.
Thing is this is not always viable. In your case the 30s polling time per website - can it be split into 60 0.5s pieces, in between which you can stuff calls to the UI? If not, sorry.
Threads aren't evil, they are just easy to shoot your foot with. If doing Query A takes 30s and then doing Query B takes another 30s, doing them simultaneously in threads will take 120s instead of 60 due to thread overhead, fighting for disk access and various bottlenecks.
But if Operation A consists of 5s of activity and 55 seconds of waiting, mixed randomly, and Operation B takes 60s of actual work, doing them in threads will take maybe 70s, compared to plain 120 when you execute them in sequence.
The rule of thumb is: threads should idle and wait most of the time. They are good for I/O, slow reads, low-priority work and so on. If you want performance, use multiplexing, which requires more work but is faster, more efficient and has way less caveats. (synchronizing threads and avoiding race conditions is a whole different chapter of thread headaches...)

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines. etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.
Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawerence Livermore National Laboratory.
Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.
If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.
For a 6 month project I'd say it definitely pays out to start reading a good book about the subject first. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I've written multithreaded programs for 10 years when I discovered this gem and still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.
There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you)
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.
One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.
There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.
The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Application".
Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.
Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first, because threading adds a lot of complexity you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).
I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
2: I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now. At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For and/or PLINQ where possible. If you do this, a lot of the heavy lifting will be done for you in a very efficient way. 1: http://msdn.microsoft.com/en-us/library/dd321424%28VS.100%29.aspx
Sorry i just want to add a pessimistic or better realistic answer here.
You are under time pressure. 6 month deadline and you don't even know for sure what language is this system and what it does and how it is organized. If it is not a trivial calculation then it is a very bad start.
Most importantly: You say you have never done mulitithreading programming before. This is where i get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn it when you want to do it right - and you need to do it right when you want to win a huge speed increase. Debugging is extremely nasty even with good tools like Total Views debugger or Intels VTune.
Then you say you want to rewrite the app in another lanugage - well this isn't as bad as you have to rewrite it anyway. THe chance to turn a single threaded Program into a well working multithreaded one without total redesign is almost zero.
But learning multithreading and a new language (what is your C++ skills?) with a timeline of 3 month (you have to write a throw away prototype - so i cut the timespan into two halfs) is extremely challenging.
My advise here is simple and will not like it: Learn multithreadings now - because it is a required skill set in the future - but leave this job to someone who already has experience. Well unless you don't care about the program being successfull and are just looking for 6 month payment.
If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.
You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously
Whatever technology your going to write this, take a look a this must read book on concurrency "Concurrent programming in Java" and for .Net I highly recommend the retlang library for concurrent app.
I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multiple threaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...

Future Protections in Managed Languages and Runtimes

In the future, will managed runtimes provide additional protections against subtle data corruption issues?
Managed runtimes such as Java and the .NET CLR reduce or eliminate the possibility of many memory corruption bugs common in native languages like C#. Nonetheless, they are surprisingly not immune from all memory corruption problems. One intuitively expects that a method that validates its input, has no bugs, and robustly handles exceptions will always transform its object from one valid state to another, but this is not the case. (It is more accurate to say that it is not the case using prevailing programming conventions--object implementors need to go out of their way to avoid the problems I describe.)
Consider the following scenarios:
Threading. The caller might share the object with other threads and make concurrent calls on it. If the object does not implement locking, the fields might be corrupted. (Perhaps--unless notified that the object is thread-safe--runtimes should use an interlock on every method call to throw an exception if any method on the same object executing concurrently on another thread. This would be a protection feature and, just like other well-accepted safety features of managed runtimes, it has some cost.)
Re-entrancy. The method makes a callout to an arbitrary function (such as an event handler) that ultimately calls methods on the object that are not designed to be called at that point. This is even trickier than thread safety and many class libraries do not get this right. (Worse yet, class libraries are known to poorly document what re-entrancy is allowed.)
For all of these cases, it can be argued that thorough documentation is a solution. However, documentation also can prescribe how to allocate and deallocate memory in unmanaged languages. We know from experience (e.g., with memory allocation) that the difference between documentation and language/runtime enforcement is night and day.
What can we expect from languages and runtimes in the future to protect us from these problems and other subtle problems like them?
I think languages and runtimes will keep moving forward, keep abstracting away issues from the developer, and keep making our lives easier and more productive.
Take your example - threading. There are some great new features on the horizon in the .NET world to simplify the threading model we use daily. STM.NET may eventually make shared state much, much safer to handle, for example. The parallel extensions in .NET 4 make life very easy for threading compared to current technologies.
I think that transactional memory is promising for addressing some of these issues. I'm not sure if this answers your question in some way but this is an interesting topic in any event:
http://en.wikipedia.org/wiki/Software_transactional_memory
There was an episode of Software Engineering Radio on the topic a year or so ago maybe.
First of all, "managed" is a bit of a misnomer: languages like OCaml, Haskell, and SML achieve such protections and safety while being fully compiled. All relevant "management" occurs at compile time through static analysis, which aids optimization and speed.
Anyway, to answer your question: if you look at languages like Erlang and Haskell, state is isolated and immutable by default. With kind of system, threading and reentrancy is safe by default, and because you have to go out of your way to break these rules, it is obvious to see where unsafe code can arise.
By starting with safe defaults but leaving room for advanced unsafe usage, you get the best of both worlds. It seems reasonable that future systems that are safe by your definition may follow some of these practices as well.
What can we expect in the future?
Nothing. Thread-state and re-entrancy are not problems I see tools/runtimes solving. Instead I think in the future people will move to styles that avoid programming with mutable state to bypass these issues. Languages and libraries can help make these styles of programming more attractive, but the tools are not the solution - changing the way we write code is the solution.

switch to parallel coding

we all writing code for single processor.
i wonder when we all are able to write code on multi processors?
what do we need (software tools, logic, algorithms) for this switching?
edit: in my view, as we do many task parallely, same way we need to convert those real life solutions(algorithms) to computer lang. just as OOPs coding did for procedural coding. OOPs is more real life coding style than procedural one. so i hope for that kind of solutions.
I think the most important requirement is a good language that has native constructs that support parallelism or one that can automatically generate parallel code. There are quite a few languages that fit that description, but none of them is popular enough to really be considered for mainstream use. That, in turn is caused by several things:
By their very nature, these languages are very different from today's imperative languages, and are therefor harder to learn (or at least seem that way).
They often lack good tools and libraries, making them unusable for any "real" project.
Of course, if it were more popular more people would be willing to learn it and there would be more support, so it's a kind of cycle that's pretty hard to break out of. I guess all we can do is hope. :)
An example of a language designed with heavy parallelization in mind is Erlang - and it's actually used in commercial projects.
What we need are natural abstractions for highly-concurrent algorithms. Actors (think: Erlang) go a long way in this direction, but they aren't a one-size-fits-all solution. Some more specific abstractions like fork/join or map/reduce can be even easier to apply to common problems.
The trick with all of these concurrency abstractions is they require functional-style programming. Concurrency doesn't mesh well with shared mutable state. As they say, "Locks considered harmful". Since most developers come from a strictly imperative background, switching to a shared-nothing continuation passing approach is often extremely challenging.
Incidentally, with respect to concurrency abstractions, Clojure has some very interesting features in this direction. Not only does it have sort-of actors, but it also defines a transactional memory model (think: databases) along with a global, atomic references mechanism. These two features allow concurrent operations to share "mutable" state without ever having to worry about locking or race conditions.
In the end, it comes down to education. Much of the needed theoretical work into concurrency abstractions has already been done, we just need to accept it. Unfortunately, as Erlang and Haskell prove, sometimes the best ideas remain relegated to an extremely fringe demographic. Hopefully efforts like Scala and Clojure will succeed in bringing the more advanced abstractions into the mainstream by sneaking them onto an existing, well-supported platform (the JVM).
Unfortunately for massive concurrent programming - unless there is a breakthrough in compilers to help, we will be throwing out a lot of what we know about algorithms (I think Don Knuth even said that). Read about Erlang for a glimpse of this possible future.
There are several tools/languages that are popular or are gaining popularity. If you use FORTRAN, C, or C++, you can use OpenMP (not too hard to implement) or the Message Passing Interface (MPI) libraries (powerful and greatest speedup potential, but also complex and difficult). OpenMP uses preprocessor directives to mark areas that can be parallelized, especially loops. MPI uses messages that pass data back and forth between processes, and the greatest difficulty is keeping everything synchronized without hitting bottlenecks and keeping processes waiting. I would say MPI is definitely on the way out, however. It's become clear in the scientific/high-performance computing communities that the speedup is rarely worth the additional development time.
As for up and coming languages, check out Fortress. It's still being designed, but the goal is to create a language even easier for scientific computing than FORTRAN. Programs will be specified in a very high level mathematical syntax. Additionally, parallelism will be implicit; the programmer will have to work to do things in serial. Plus, it's being championed by Sun and is based on java, so it will be portable.
There is no simple answer, and in many ways even the complex answers are currently inadequate or incomplete. You'll get a better answer if you are more specific about the replies you want: pointers to dev libraries and tools, instructional materials, pointers to current research projects and issues in this area, or something else?
The most important requirement is to be able to split your problem into smaller problems that can be solved independently of each other. Once you've worked out how you're going to do that, everything else is easier to think about and further questions of implementation (e.g. "parts of my calculation depend on other parts - how do I wait for them to have finished?") become concrete, specific things you can research or ask here about.
for java you can now look to Parallel Java Library or DPJ(deterministic Parallel Java!)
It will offer you great help in extracting parallelism from codes!!

Resources