Setting different thread counts for construction heuristics and local search - multithreading

My OptaPlanner version is 7.24.0.
Can I set the thread counts as below?
Construction Heuristics: single thread
Local Search: multi-threaded (for example, 4 threads)
I ask because some data shows different results in the construction heuristic phase.
I found that the result is affected by the thread count.
(With a single thread, it always produces the same result.)
Best regards
Best regards

It is not possible to use a different move thread count in CH and in LS. If you need that, you will have to run the solver twice: first just the CH, then just the LS.
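For example, sketched as two solver configs (a minimal sketch: moveThreadCount is the multithreaded solving switch in 7.x; domain classes, score config and termination are omitted, and the phase details are illustrative):

<!-- solverConfig-ch.xml: construction heuristic only, single-threaded -->
<solver>
  <moveThreadCount>NONE</moveThreadCount>
  <constructionHeuristic/>
</solver>

<!-- solverConfig-ls.xml: local search only, 4 move threads -->
<solver>
  <moveThreadCount>4</moveThreadCount>
  <localSearch/>
</solver>

Run the first solver to completion, then pass its best solution in as the starting solution of the second solver.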
That said, some comments:
What you're describing sounds like a bug. Either a bug in OptaPlanner, or a bug in your domain. Construction Heuristics is a deterministic algorithm.
Are you really using OptaPlanner 7.24? That release is, by now, nearly 4 years old. You are missing out on so many performance improvements and bugfixes, and I don't even remember anymore if the behavior you are describing was something we've seen or fixed since. I strongly recommend you upgrade to the latest version; you'll be surprised by how easy it is, and by how much you gain.

Related

Does a plain read of a variable that is updated with interlocked functions always return the latest value?

If you only change a MyInt: Integer variable in one or more threads with one of the interlocked functions, let's say InterlockedIncrement, can we guarantee that after the InterlockedIncrement is executed, a plain read of the variable in any thread will return the latest updated value? Yes or no, and why?
If not, is it possible to achieve that in Delphi? Note that I'm talking about only one variable, so there's no need to worry about consistency across two or more variables.
The root problem and doubts seem essentially the same as in this SO post, but that one targets C#, and I'm using Delphi 2007, so I have access neither to volatile nor to newer versions of Delphi. In that discussion, two major problems that seem to affect Delphi as well were raised:
1. The cache of the processor reading the variable may not be updated.
2. The compiler may optimize the code in a way that causes problems for the read.
If this really is a problem, I'm very worried about using even a simple counter with InterlockedIncrement, or solutions like the lock-free initialization proposed here, and would just fall back to plain critical sections or TMultiReadExclusiveWriteSynchronizer for safety.
Initial analysis
This is what I've found so far, but feel free to address the problems in other ways if appropriate, or even to raise other, unknown problems, so that the objective of the question can be achieved:
For problem 1, I expected that the "full fence" would also force the caches of other processors to be updated... but from reading around, that seems not to be the case. It looks like the cache would only be updated if a "read barrier" (or whatever it is called) were issued on the processor that will read the variable. If this is true, is there a way to issue such a "read barrier" in Delphi, just before reading the variable? A full fence seems to imply both read and write barriers, so that would also be ok. Since there is no InterlockedRead function, according to the discussion in the first post, could we (just speculating) work around it using something like InterlockedCompareExchange (ugh... writing the variable just to be able to read it smells bad; see the sketch below), or maybe low-level lock-prefixed assembly calls (that could be encapsulated)?
For problem 2: would Delphi's optimizations have an impact on this? Is there any way to avoid that?
Edit: The solution must work in D2007, but I'd prefer not to make a possible future migration to newer Delphi any harder, and to be able to use the same piece of code on ARM as well (this became clear to me after David's comments). So, if possible, it would be nice if the solution were not coupled to the x86/x64 memory model. It would be nice if I only needed to replace the plain Windows.pas interlocked functions with whatever provides the same interlocked functionality in newer Delphi/ARM, without having to review the logic for ARM (one less concern).
But do the interlocked functions provide enough abstraction from the CPU architecture in this case? Problem 1 suggests they don't, but I'm not sure whether that would affect ARM Delphi. Is there any way around it that keeps things simple and still allows significantly better performance than critical sections and similar sync objects?
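For what it's worth, here is a minimal sketch of the InterlockedCompareExchange workaround speculated about above (assuming the 32-bit integer overload declared in D2007's Windows.pas):

function InterlockedRead(var Target: Integer): Integer;
begin
  // Compare-and-swap with Exchange = Comparand = 0: if Target is 0 it is
  // rewritten with 0 (a no-op write), otherwise it is left untouched; either
  // way the original value is returned, and the lock-prefixed instruction
  // acts as a full memory barrier, so the value read is current.
  Result := InterlockedCompareExchange(Target, 0, 0);
end;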

Why do already popped scopes affect the check-sat time in subsequent scopes?

General problem
I've noticed several times that push-pop scopes that have already been popped appear to affect the time that a check-sat in a subsequent scope needs.
That is, assume a program with multiple (potentially arbitrarily nested) push-pop scopes, each of which contains a check-sat command. Furthermore, assume that the second check-sat takes 10s, whereas the first one takes only 0.1s.
...
(push)
(assert (not P))
(check-sat) ; Could be sat, unsat or unknown
(pop)
...
(push)
(assert (not Q))
(check-sat) ; Could be sat, unsat or unknown
(pop)
After commenting the first push-pop scope, the second check-sat only takes 1s. Why is that?
AFAIK, Z3 switches to incremental solvers if push-pop scopes are used. Is there a (conceptual) reason for why they might behave this way?
I was told once that Z3 attributes symbols by importance, which affects proof search (and the importance of a symbol can also change during proof search). Could that be the reason? Is it possible to reset importance (in between scopes)?
Could it be a bug? I found this post where Leonardo mentions a bug that seems related (his answer is from 2012, though).
Particular instance
I unfortunately only have fairly long (automatically generated) SMTLib files that illustrate the behaviour, one of which you can find in this gist. It uses quantifiers and uninterpreted functions, but neither mbqi nor arrays nor bit vectors. The example consists of 148 nested push-pop scopes and 89 check-sats, and it takes Z3 4.3.2 about 8s to process it. The last check-sat (prefixed by an echo) takes by far the longest.
I randomly commented out several push-pop scopes (one at a time, never the last one, and taking care not to comment out symbol declarations), and in most cases the overall run time went down to less than 1s. That is, the last check-sat completed significantly faster.
To provide further details, I compared a run of Z3 with all scopes (slow, 8s) with a run of Z3 where the scope marked by [XXX] had been commented (fast, 0.3s). The results can be seen in this diff (left is slow, right is fast).
The diff shows that all check-sats behave the same (I faked the commented one by echo'ing "unsat"), from which I conclude that the commented scope affects the proof search, but not its final outcome.
I also tried to make some sense out of differences in the obtained statistics, but I know very little about how to interpret the statistics correctly. Here are the few statistics I could make sense of:
grobner (383 vs 36) and nonlinear-horner (342 vs 25), so it seems that the slower run performs more arithmetic-related operations. The commented scope is indeed about non-linear arithmetic (and so are lots of others), but the particular proof in the commented scope should be "trivial": it essentially shows that x != 0 for an x about which 0 < x has been assumed explicitly.
memory (40 vs 7), which I interpret as indicating that Z3 explores a larger search space in the slow version of the program
quant-instantiations (43k vs 51k), which surprised me because the significantly faster run nevertheless triggered notably more quantifier instantiations.
I'm not exactly sure whether this is an observation or a question. Yes, Z3 will behave differently for different inputs, and push/pop are not "innocent", i.e., they will have a major impact on performance. This is most obvious when they can be removed completely, because that allows Z3 to switch to completely different sub-solvers that don't support incrementality (but are often faster). For example, for purely bit-blasted formulas without scoping, Z3 will use a fast, new SAT solver, but if push/pop are required, it falls back to a much older and slower SAT solver (the implementations of the two solvers are completely disjoint).
Further, removing some of the scopes between others may have a huge impact too, because it allows Z3 to keep more intermediate lemmas, as well as heuristic states. If it is for some reason desired that each query behaves as if no others were there, then it's perhaps best to simply generate independent queries and to start Z3 from scratch on each of them.
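For example, instead of one incremental session with push/pop, each query can be emitted as a self-contained script and run in a fresh Z3 process (a sketch; every file must repeat the declarations it needs):

; query1.smt2 -- run as: z3 query1.smt2
(declare-const P Bool)
(assert (not P))
(check-sat)

; query2.smt2 -- run as: z3 query2.smt2
(declare-const Q Bool)
(assert (not Q))
(check-sat)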
More information about the specific issues mentioned:
"Heuristic states" means all kinds of heuristic data Z3 uses, there is a whole plethora of different heuristics at work, not just one particular one like the symbol ordering. Whether it is "good" to keep this information between queries depends entirely on your problems - the heuristics work on some problems, but not on all, as is the nature of heuristics. The whole concept of incrementality is built upon this though: if the heuristics wouldn't help, then we'd be better off running independent queries. However, in some applications, resetting Z3 sometimes is better than no resets or independent queries, for instance, when you have a huge number of tiny queries.
Conceptual reason for switching to a different solver: the first one doesn't support the features you need. See combined_solver.cpp, function check_sat. If solver1 is not used (e.g., if assumptions are provided or if incremental mode is enabled) then solver2 will be used.
combined_solver.solver2_timeout will put a timeout on solver2. What happens when solver2 times out is controlled by the option combined_solver.solver2_unknown. So, yes, you can run solver1 after solver2, but solver1 is also allowed to fail, i.e., return unknown. Looking at the code, it may well be unsound (e.g., ignoring assumptions) if that's used.
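For reference, module parameters like these can be passed on the command line (a sketch; whether a given parameter exists, and its exact semantics, depend on the Z3 version - see the z3 -p listing):

z3 combined_solver.solver2_timeout=5000 file.smt2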
Re: the bug report that was mentioned: That was a soundness bug, not a performance bug; one of the solvers said SAT and the other said UNSAT.

Software to Tune/Calibrate Properties for Heuristic Algorithms

Today I read that there is a program called WinCalibra (scroll a bit down) which can take a text file with properties as input.
This program can then optimize the input properties based on the output values of your algorithm. See this paper or the user documentation for more information (see link above; sadly, the doc is a zipped exe).
Do you know of other software which can do the same and which runs under Linux? (Preferably Open Source.)
EDIT: Since I need this for a Java application: should I invest my research in Java libraries like gaul or watchmaker? The problem is that I don't want to roll my own solution, nor do I have time to do so. Do you have pointers to out-of-the-box applications like CALIBRA? (Internet searches weren't successful; I only found libraries.)
I decided to give away the bounty (otherwise no one would have benefited), although I didn't find a satisfactory solution :-( (an out-of-the-box application).
Some kind of (Metropolis-algorithm-like) probability-selected random walk is a possibility in this instance. Perhaps with simulated annealing to improve the final selection. Though the timing parameters you've supplied are not optimal for getting a really great result this way.
It works like this:
1. Start at some point. Use your existing data to pick one that looks promising (like the highest output value you've got). Set o to the output value at this point.
2. Propose a randomly selected step in the input space, and assign the output value there to n.
3. Accept the step (that is, update the working position) if (a) n > o, or (b) the new value is lower but a random number on [0,1) is less than f(n/o) for some monotonically increasing f() with range and domain on [0,1).
4. Repeat steps 2 and 3 as long as you can afford to, collecting statistics at each step.
5. Finally, compute the result. In your case an average of all points is probably sufficient.
Important frill: This approach has trouble if the space has many local maxima with deep dips between them, unless the step size is big enough to get past the dips; but big steps make the whole thing slow to converge. To fix this you do two things:
Do simulated annealing (start with a large step size and gradually reduce it, thus allowing the walker to move between local maxima early on, but trapping it in one region later on to accumulate precise results).
Use several (many, if you can afford it) independent walkers, so that they can get trapped in different local maxima. The more you use, and the bigger the difference in output values, the more likely you are to find the best maximum.
This is not necessary if you know that you only have one, big, broad, nicely behaved local extreme.
Finally, the selection of f(). You can just use f(x) = x, but you'll get optimal convergence if you use f(x) = exp(-(1/x)).
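Since you mention Java libraries, here is a minimal sketch of the walk above in Java (evaluate() is a hypothetical stand-in for one run of the algorithm being tuned; the toy objective, step size, and iteration count are illustrative):

import java.util.Random;

public class MetropolisWalk {
    static final Random RNG = new Random();

    // Toy objective: stand-in for one run of the algorithm being tuned.
    static double evaluate(double[] p) {
        return Math.exp(-(p[0] * p[0] + p[1] * p[1])); // single peak at (0, 0)
    }

    public static void main(String[] args) {
        double[] x = {2.0, -1.5};   // start at a promising point
        double o = evaluate(x);     // output value at the working position
        double step = 1.0;          // annealed step size
        for (int i = 0; i < 10_000; i++) {
            double[] y = x.clone();
            for (int d = 0; d < y.length; d++) {
                y[d] += step * (RNG.nextDouble() * 2 - 1); // propose a random step
            }
            double n = evaluate(y);
            // accept if better, or with probability f(n/o), where f(x) = exp(-(1/x))
            if (n > o || RNG.nextDouble() < Math.exp(-o / n)) {
                x = y;
                o = n;
            }
            step *= 0.9995; // simulated annealing: slowly shrink the step size
        }
        System.out.printf("final position: (%.3f, %.3f) -> %.4f%n", x[0], x[1], o);
    }
}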
Again, you don't have enough time for a great many steps (though if you have multiple computers, you can run separate instances to get the multiple walkers effect, which will help), so you might be better off with some kind of deterministic approach. But that is not a subject I know enough about to offer any advice.
There is a lot of genetic-algorithm-based software that can do exactly that. I wrote a PhD about it a decade or two ago.
A Google search for "genetic algorithms Linux" shows a load of starting points.
Intrigued by the question, I did a bit of poking around, trying to get a better understanding of the nature of CALIBRA, its standing in academic circles, and the existence of similar software or projects in the Open Source and Linux world.
Please be kind (and, please, edit directly, or suggest edits) about the likely instances where my assertions are incomplete, inexact, or even flat-out incorrect. While I work in related fields, I'm by no means an Operational Research (OR) authority!
The [algorithm] parameter tuning problem is a relatively well-defined problem, typically framed as a solution-search problem whereby the combination of all possible parameter values constitutes a solution space, and the parameter tuning logic's aim is to "navigate" [portions of] this space in search of an optimal (or locally optimal) set of parameters.
The optimality of a given solution is measured in various ways, and such metrics help direct the search. In the case of the parameter tuning problem, the validity of a given solution is measured, directly or through a function, from the output of the algorithm [i.e. the algorithm being tuned, not the algorithm of the tuning logic!].
Framed as a search problem, the discipline of algorithm parameter tuning doesn't differ significantly from other solution-search problems where the solution space is defined by something other than the parameters of a given algorithm. But because it works on algorithms which are in themselves solutions of sorts, this discipline is sometimes referred to as Metaheuristics or Metasearch. (A metaheuristic approach can be applied to various algorithms.)
Certainly there are many features specific to the parameter tuning problem as compared to other optimization applications, but with regard to the solution searching per se, the approaches and problems are generally the same.
Indeed, while well defined, the search problem is generally still broadly unsolved, and is the object of active research in very many different directions, for many different domains. Various approaches offer mixed success depending on the specific conditions and requirements of the domain, and this vibrant and diverse mix of academic research and practical applications is a common trait to Metaheuristics and to Optimization at large.
So... back to CALIBRA...
By its own authors' admission, CALIBRA has several limitations:
Limit of 5 parameters, maximum
Requirement of a range of values for [some of ?] the parameters
Works better when the parameters are relatively independent (but... wait, when that is the case, isn't the whole search problem much easier ;-) )
CALIBRA is based on a combination of approaches, which are repeated in a sequence. A mix of guided search and local optimization.
The paper where CALIBRA was presented is dated 2006. Since then, there have been relatively few references to this paper and to CALIBRA at large. Its two authors have since published several other papers in various disciplines related to Operational Research (OR).
This may be indicative that CALIBRA hasn't been perceived as a breakthrough.
State of the art in that area ("parameter tuning", "algorithm configuration") is the SPOT package in R. You can connect external fitness functions using a language of your choice. It is really powerful.
I am working on adapters for, e.g., C++ and Java that simplify the experimental setup, which requires some getting used to in SPOT. The project goes under the name InPUT, and a first version of the tuning part will be up soon.

Linux kernel scheduling

I wish to know how the old Linux scheduling algorithm SJF (shortest job first) calculated a process's runtime.
This problem is actually one of the major reasons why SJF is rarely used in common environments: the algorithm requires an accurate estimate of the runtime of all processes, which is only available in specialized environments.
In common situations you can only get an estimated, inaccurate process running time, for example by recording the lengths of previous CPU bursts of the same process and using mathematical approximation methods to calculate how long it will run next time.
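For example, the classic exponential-averaging estimate (a standard textbook technique, not something specific to the Linux source) predicts the next CPU burst as a weighted blend of the last measured burst and the previous prediction:

tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n),  with 0 <= alpha <= 1

where t(n) is the measured length of the n-th burst, tau(n) is the previous prediction, and alpha controls how quickly older history is forgotten.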
If you have some bandwidth to burn, you might be able to find the actual code here. Start at 2.0, where I think you'll find it as experimental.
SJF was (IIRC) extremely short lived, for the exact reasons that ZelluX noted.
I think your only hope to understand the method behind its madness lives in the code at this point. You may be able to build it and get it to boot in a simulator.
Edit:
I'm now not completely sure if it ever did go into mainline. If you can't find it, don't blame me :)

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines, etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.
Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Making the decomposition ever finer-grained will eventually become prohibitive.
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
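For reference, Amdahl's law: if a fraction P of the runtime can be parallelised and N processors are used, the best attainable speedup is

speedup(N) = 1 / ((1 - P) + P / N)

so with P = 0.9, even infinitely many processors give at most a 10x speedup.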
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawrence Livermore National Laboratory.
Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.
If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
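To illustrate in Java terms, a minimal sketch using java.util.concurrent (processChunk() is a hypothetical stand-in for the per-chunk computation; the .NET 4.0 Task Parallel Library plays the same role in C#):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkRunner {
    // Hypothetical per-chunk computation on an independent slice of the data.
    static double processChunk(double[] chunk) {
        double sum = 0;
        for (double v : chunk) sum += v * v;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        double[][] chunks = new double[24][100_000]; // pretend: the data set, pre-split
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Double>> results = new ArrayList<>();
        for (double[] chunk : chunks) {
            results.add(pool.submit(() -> processChunk(chunk))); // one task per work unit
        }
        double total = 0;
        for (Future<Double> f : results) {
            total += f.get(); // get() blocks until that task has finished
        }
        pool.shutdown();
        System.out.println("total = " + total);
    }
}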
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
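A single stage of such a pipeline might look like this sketch in Java (the poison-pill shutdown sentinel is a common idiom, assumed here; stages are wired together by giving the out queue of one stage as the in queue of the next, each running on its own thread):

import java.util.concurrent.BlockingQueue;

public class PipelineStage implements Runnable {
    static final String POISON = "POISON"; // shutdown sentinel passed down the chain

    private final BlockingQueue<String> in;
    private final BlockingQueue<String> out;

    PipelineStage(BlockingQueue<String> in, BlockingQueue<String> out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String item = in.take();      // blocks until the upstream stage produces
                if (item == POISON) {         // reference comparison suffices for the sentinel
                    out.put(POISON);          // tell the downstream stage to shut down too
                    return;
                }
                out.put(item.toUpperCase());  // this stage's "work" on the data
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and exit
        }
    }
}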
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.
For a 6-month project I'd say it definitely pays to start by reading a good book on the subject. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know on the subject, and it covers both .NET and native Win32 threading. I'd been writing multithreaded programs for 10 years when I discovered this gem, and I still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multithreading out of the box, which may make your task significantly easier.
There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
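A Java sketch of both pitfalls (the same hazards exist in C# and C++): the unsynchronized counter below routinely loses updates, while the AtomicInteger is incremented atomically and published with the necessary fences.

import java.util.concurrent.atomic.AtomicInteger;

public class LostUpdates {
    static int plain = 0;                             // plain++ is read-modify-write: not atomic
    static final AtomicInteger safe = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain++;                              // racy: concurrent increments get lost
                safe.incrementAndGet();               // atomic, with full memory-barrier semantics
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("plain = " + plain);       // usually less than 200000
        System.out.println("safe  = " + safe.get());  // always 200000
    }
}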
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, that's less of an issue - though it will still be unfamiliar to you).
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.
One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.
There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.
The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications".
Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.
Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first, because threading adds a lot of complexity; you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).
I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
Sorry, I just want to add a pessimistic, or better, realistic answer here.
You are under time pressure: a 6-month deadline, and you don't even know for sure what language this system is written in, what it does, or how it is organized. If it is not a trivial calculation, then it is a very bad start.
Most importantly: you say you have never done multithreaded programming before. This is where I get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn when you want to do it right - and you need to do it right if you want to win a huge speed increase. Debugging is extremely nasty, even with good tools like the TotalView debugger or Intel's VTune.
Then you say you want to rewrite the app in another language - well, this isn't so bad, as you have to rewrite it anyway. The chance of turning a single-threaded program into a well-working multithreaded one without a total redesign is almost zero.
But learning multithreading and a new language (how are your C++ skills?) with a timeline of 3 months (you have to write a throw-away prototype, so I've cut the timespan into two halves) is extremely challenging.
My advice here is simple, and you will not like it: learn multithreading now - because it is a required skill set for the future - but leave this job to someone who already has the experience. Well, unless you don't care about the program being successful and are just looking for 6 months' payment.
If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.
You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously
Whatever technology you're going to write this in, take a look at this must-read book on concurrency, "Concurrent Programming in Java", and for .NET I highly recommend the retlang library for concurrent apps.
I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multithreaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun, but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...
