complex threading with OpenMP

complex threading with OpenMP - multithreading

I need to switch from boost::thread to OpenMP because boss says so.
The problem is quiet simple: the result of a simulation is written to disk every 5 iteration (int it = 5,10,15...). For the sake of simplicity, suppose I have an 8-core CPU. I created 9 threads; thread 0 is used for IO, other 8 for computation. When (it%5 == 0), I check thread 0 to see if it has finished. If yes, I create another thread, call 0, and ask it to write the result to disk. If not, all threads have to wait. Usually, the time it takes to write out a result is less than 5 iterations, so I effectively "hide" the IO cost.
I have spent a few hours looking into OpenMP and I guess the same algorithm can be done with the "task" construct but I don't see how I can synchronize the threads. OpenMP experts please help. Thanks.
The current pseudo code look like this
boost::thread pool[9];
for(int it=0;it<1000;it++)
{
- simulate using pool[1,8]
- if(it%5 == 0)
+ check pool[0]
+ if finished: create new thread, assign to pool[0], write data out
+ if not, wait
}

Intel has a very nice answer to the OpenMP vs Threads dilemma, I would defer to them and ask your boss to get some education.
OpenMP is very loop oriented, you parallize the entire loop rather than have synchronised threads.

Your design seems right to me overall: having a separate thread for I/O and a thread pool for computations is right. You might possibly replace Boost threads in pool[1..8] with OpenMP for the computational part, but I would not go beyond that. If you can't use Boost, use POSIX threads.

Related

Thread synchronisation for very short tasks

I have a C++ application running on winapi. Portability is not an issue. All I want is maximum performance. I have a basic understanding of multithreading and synchronization issues, but limited experience with the multitude of options ranging from winapi over C++ threads to third party libraries.
In the performance critical core of my application I identified a loop, which could be parallelized. I managed to split the loop into 4 parts which do not depend on each other. I would like to delegate the job to 4 threads running in parallel. The main thread should wait until all 4 threads have done their job, before it continues.
Sounds very simple. However, currently the loop takes only about 10 microseconds when running on one thread. I'm afraid that synchronization methods which cause a switch to the kernel (events, mutexes, etc.) would produce more overhead than the parallelization could save. SRWLocks + condition variables claim to be very lightweight, but I didn't find a way to solve my synchronization with these tools.
Of course I could test all kinds of synchronization APIs, but I'm sure this has been done before.
So my question is: Is there a reasonable way to synchronize very short tasks and if so, what are the appropriate tools?

If you simply need to wait for threads to complete you would use WaitForMultipleObjects on the thread handles. The other direct option would be to use a synchronization barrier, a primitive that allows a group of threads to halt until all members of the group have reached the barrier, but that is generally for the case where there is more work for the spawned threads to perform after being released.
Your question of whether this would actually be of benefit in your particular case is one that can only be answered through implementation and timing. And note that if you are going to perform this testing it should be done on a release build with optimizations enabled. It may well be the case that if the amount of work to perform is short enough that the time involved in thread management dwarfs any benefit.

The update algorithm consists of two steps. Each of these steps can be applied to the knots in arbitrary order, but step 1 must be completed before step 2 can start. I can portion the whole net into four (or more) parts and delegate each part to a separate thread. My problem is: Each thread has to pause after step 1 and wait until all threads have finished their job. Then each thread makes step 2, wait for completion of the other threads and so on.
You want to break the work into a large number of small chunks and have a fixed pool of threads take chunks of work. Do not make 8 threads on an 8 core machine and split the work into 8 chunks. That algorithm will work poorly if, for one reason or another, only 7 of those cores winds up doing work for you. Your algorithm will need twice as long as the second half of the time only one core is working.
The easy way is to have an extra dispatch thread. Just keep a "work unit" count somewhere protected by a mutex. When a thread finishes a work unit, have it decrement the "work unit" count. When it hits zero, broadcast a condition variable. That will wake the dispatch thread which will then do whatever it takes to get the worker threads going again. It can start them by setting the "work unit" count to the right level and broadcasting another condition variable that the worker threads wait for.
You can also just keep a count of which node needs to be done next and the number of nodes currently doing work. That will require synchronization after each thread though (to figure out which node to do next) and it may make more sense to have each thread grab some number of nodes, iterate over them, and then synchronize to grab another few nodes.
Avoid breaking the work into large chunks early. That can lead to the problem where you have 8 cores but 2 large work units left at some point. Remember, many modern CPUs run their cores at different speeds based on temperature and power measurements.

Releasing multiple locks without causing priority inversion

Short version: How do I release multiple locks from a single thread, without being preempted halfway through?
I have a program which is designed to run on an N-core machine. It consists of one main thread and N worker threads. Each thread (including the main thread) has a semaphore it can block on. Normally, each worker thread is blocked on decrementing its semaphore, and the main thread is running. Every now and then, though, the main thread should wake up the worker threads to do their thing for a certain amount of time, then block on its own semaphore waiting for them all to go back to sleep. Like so:
def main_thread(n):
for i = 1 to n:
worker_semaphore[i] = semaphore(0)
spawn_thread(worker_thread, i)
main_semaphore = semaphore(0)
while True:
...do some work...
workers_to_wake = foo()
for i in workers_to_wake:
worker_semaphore[i].increment() # wake up worker n
for i in workers_to_wake:
main_semaphore.decrement() # wait for all workers
def worker_thread(i):
while True:
worker_semaphore(i).decrement() # wait to be woken
...do some work...
main_semaphore.increment() # report done with step
All well and good. The problem is, one of the woken workers may end up preempting the main thread halfway through waking the workers: This can happen, for instance, when the Windows scheduler decides to boost that worker's priority. This doesn't lead to deadlock, but it is inefficient, because the remainder of the threads stay asleep until the preempting worker finishes its work. It's basically priority inversion, with the main thread waiting on one of the workers, and some of the worker threads waiting on the main thread.
I can probably figure out OS- and scheduler-specific hacks for this, such as disabling priority boosting under Windows, and fiddling about with thread priorities and processor affinities, but I'd like something cross-platform-ish and robust and clean. So: How can I wake up a bunch of threads atomically?

TL; DR
If you really have to get as much as you can out of your workers, just use an event semaphore, a control block and a barrier instead of your semaphores. Note however, that this is a more fragile solution and so you need to balance any potential gains against this downside.
Context
First I need to summarize the broader context in our discussion...
You have a Windows graphical application. It has a desired frame rate and so you need the main thread to run at that rate, scheduling all your workers at precisely timed intervals so that they have completed their work within the refresh interval. This means you have very tight constraints on the start and execution times for each thread. In addition, your worker threads are not all identical, so you can't just use a single work queue.
The problem
Like any modern operating system, Windows has a variety of synchronization primitives. However, none of these directly provides a mechanism for notifying multiple primitives at once. Looking through other operating systems, I see a similar pattern; they all provide ways of waiting on multiple primitives, but none provide an atomic way of triggering them.
So what can we do instead? The problems you need to solve are:
Precisely timing the start of all required workers.
Prodding the workers that actually need to run in the next frame.
Options
The most obvious solution for issue 1 is just to use a single event semaphore, but you could also use a read/write lock (by acquiring the write lock after the workers have finished and getting the workers to use a read lock). All other options are no longer atomic and so will need further synchronization to force the threads to do what you want - like lossleader's suggestion for locks inside your semaphores.
But we want an optimal solution that reduces context switches as much as possible due to the tight time constraints on your application, so let's see if either of these can be used to solve problem 2... How can you pick which worker threads should run from the main if we just have an event semaphore or read/write lock?
Well... A read/write lock is a great way for one thread to write some critical data to a control block and for many others to read from it. Why not just have a simple array of boolean flags (one for each worker thread) that your main thread updates each frame? Sadly you still need to stop execution of the workers until the timer pops. In short we're back at the semaphore and lock solution again.
However, owing to the nature of your application, you can make one more step. You can rely on the fact that you know your workers are not running outside of your time slicing and use an event semaphore as a crude form of lock instead.
A final optimization (if your environment supports them), is to use a barrier instead of the main semaphore. You know that all n threads need to be idle before you can continue, so just insist on it.
A solution
Applying the above, your pseudo-code would then look something like this:
def main_thread(n):
main_event = event()
for i = 1 to n:
worker_scheduled[i] = False
spawn_thread(worker_thread, i)
main_barrier = barrier(n+1)
while True:
...do some work...
workers_to_wake = foo()
for i in workers_to_wake:
worker_scheduled[i] = True
main_event.set()
main_barrier.enter() # wait for all workers
main_event.reset()
def worker_thread(i):
while True:
main_event.wait()
if worker_scheduled[i]:
worker_scheduled[i] = False
...do some work...
main_barrier.enter() # report finished for this frame.
main_event.reset() # to catch the case that a worker is scheduled before the main thread
Since there is no explicit policing of the worker_scheduled array, this is a much more fragile solution.
I would therefore personally only use it if I had to squeeze every last ounce of processing out of my CPU, but it sounds like that is exactly what you are looking for.

It is not possible when you use multiple synchronization objects (semaphores) when wake-up algorithm complexity is O(n). There are few ways how to solve it though.
release all at once
I'm not sure whether Python has the necessary method (is your question Python-specific?), but generally, semaphores have operations with argument specifying the number to decrements/increments. Thus, you just put all your threads on the same semaphore and wake them all together. Similar approach is to use conditional variable and notify all.
event loops
If you still want to to be able to control each thread individually but like the approach with one-to-many notification, try libraries for asynchronous I/O like libuv (and its Python counterpart). Here, you can create one single event that wakes all the threads at once and also create for each thread its individual event, then just wait on both (or more) event objects in event loops in each thread.
Another library is pevents which implements WaitForMultipleObjects on top of pthreads' conditional variables.
delegate waking up
Another approach is to replace your O(n) algorithm with tree-like algorithm ( O(log n) ) where each thread wakes up only fixed number of other threads but delegates them to wake-up others. In the edge case, main thread can wake up only one other thread which will wake-up everyone else or start the chain-reaction. It can be useful if you want to reduce latency for the main thread at expense of wake-up latenies of other threads.

Reader/Writer Lock
The solution I would normally use on POSIX systems for a one to many relationship is a reader/writer lock. It is a surprise to me that they aren't a complete universal, but most languages either implement a version, or at least have a package available to implement them on whatever primitives exist, for example, python's prwlock:
from prwlock import RWLock
def main_thread(n):
for i = 1 to n:
worker_semaphore[i] = semaphore(0)
spawn_thread(worker_thread, i)
main_lock = RWLock()
while True:
main_lock.acquire_write()
...do some work...
workers_to_wake = foo()
# The above acquire could be moved as low as here,
# depending on how independent the above processing is..
for i in workers_to_wake:
worker_semaphore[i].increment() # wake up worker n
main_lock.release()
def worker_thread(i):
while True:
worker_semaphore(i).decrement() # wait to be woken
main_lock.acquire_read()
...do some work...
main_lock.release() # report done with step
Barriers
Barriers seem like Python's closest intended built-in mechanism to hold up all the threads until they are all alerted, but:
They are a pretty unusual solution, so they would make your code/experience harder to translate to other languages.
I wouldn't like to use them for this case where the number of threads to wake keeps changing. Given that your n sounds small, I would be tempted to use constant Barrier(n) and notify all threads to check if they are running this cycle. But:
I would be concerned that using a barrier would backfire since any of the threads being held up by something external will hold them all up and even a scheduler with resource dependency boosting might miss this relationship. Needing all n to reach the barrier could only make this worse.

Peter Brittain's solution, plus Anton's suggestion of a "tree-like wakeup", led me to another solution: Chained wakeups. Basically, rather than the main thread doing all the wakeups, it only wakes up one thread; and then each thread is then responsible for waking up the next one. The elegant bit here is that there's only ever one suspended thread ready to run, so threads rarely end up switching cores. In fact, this works fine with strict processor affinities, even if one of the worker threads shares affinity with the main thread.
The other thing I did was to use an atomic counter that worker threads decrement before sleeping; that way, only the last one wakes the main thread, so there's also no chance of the main thread being woken several times just to do more semaphore waiting.
workers_to_wake = []
main_semaphore = semaphore(0)
num_woken_workers = atomic_integer()
def main_thread(n):
for i = 1 to n:
worker_semaphore[i] = semaphore(0)
spawn_thread(worker_thread, i)
main_semaphore = semaphore(0)
while True:
...do some work...
workers_to_wake = foo()
num_woken_workers.atomic_set(len(workers_to_wake)) # set completion countdown
one_to_wake = workers_to_wake.pop()
worker_semaphore[one_to_wake].increment() # wake the first worker
main_semaphore.decrement() # wait for all workers
def worker_thread(i):
while True:
worker_semaphore[i].decrement() # wait to be woken
if workers_to_wake.len() > 0: # more pending wakeups
one_to_wake = workers_to_wake.pop()
worker_semaphore[one_to_wake].increment() # wake the next worker
...do some work...
if num_woken_workers.atomic_decrement() == 0: # see whether we're the last one
main_semaphore.increment() # report all done with step

Which boost multithreading design pattern should I use?

I'm getting started with boost for multi-threading to port my program to window ( from pthread of linux ) . Is there anyone familiar with it ? Any suggestion on which pattern should I use ?
Here are the requirements:
I have many threads most of the time running the same thing with different parameters,
All threads shared a memory location called "critical memory" (an array)
Synchronization has to be done with a "barrier" at certain iteration
requires highest parallelization if possible i.e good scheduling with same priority for all threads ( currently I let CPU does the job, but I find out that boost has threadpool with thread.schedule() not sure if i should use )
For pthread, every thing is function, so I'm not sure if I should convert it to object, what's the advantage then ?. A little bit of confusion after reading this tutorial http://antonym.org/2009/05/threading-with-boost---part-i-creating-threads.html so many options to use...
Thanks in advance

porting should be quite straightforward:
I have many threads most of the time
running the same thing with different
parameters,
create required number of threads with functor that binds your different parameters, like:
boost::thread thr1(boost::bind(your_thread_func, arg1, arg2));
All threads shared a memory location
called "critical memory" (an array)
nothing special here, just use boost::mutex to synchronize access (or another mutex type if you have special requirements)
Synchronization has to be done with a
"barrier" at certain iteration
use boost::barrier: http://www.boost.org/doc/libs/1_45_0/doc/html/thread/synchronization.html#thread.synchronization.barriers
requires highest parallelization if possible i.e good scheduling with same priority for all threads ( currently I let CPU does the job, but I find out that boost has threadpool with thread.schedule() not sure if i should use )
CPU? probably you meant OS scheduler. it's the simplest possible solution and in most cases satisfactory one. threadpool is not a part of boost and tbh I'm not familiar with it. boost thread doesn't have scheduler.
I don't know anything about your task and its parallelization potential, so I'll suppose it can be parallelized to more threads than you have cores. theoretically, to receive highest performance you need to smoothly distribute your work among thread number = number of your cores (including virtual ones). it's not the easiest task and you can use ready-for-use solutions. e.g. Intel Threading Building Blocks (GPL license) or even Boost Asio. Despite its main purpose is network communication, Asio has its dispatcher and you can use it as a thread pool. Just create optimal number of threads (number of cores?), boost::asio::io_service object and run it from all threads. post work to thread pool by io_service::post()

In my opinion, to use win32 port of pthread is more straightforward way to accomplish such tasks.
EDITED:
Last month I converted 1 year old project source code to pthread-w32 from Boost.Thread. Boost.Thread provides lots of good thing but if you are working on the old fashion thread routines Boost.Thread could be too mush hassle.

What are multi-threading DOs and DONTs? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am applying my new found knowledge of threading everywhere and getting lots of surprises
Example:
I used threads to add numbers in an
array. And outcome was different every
time. The problem was that all of my
threads were updating the same
variable and were not synchronized.
What are some known thread issues?
What care should be taken while using
threads?
What are good multithreading resources.
Please provide examples.
sidenote:(I renamed my program thread_add.java to thread_random_number_generator.java:-)

In a multithreading environment you have to take care of synchronization so two threads doesn't clobber the state by simultaneously performing modifications. Otherwise you can have race conditions in your code (for an example see the infamous Therac-25 accident.) You also have to schedule the threads to perform various tasks. You then have to make sure that your synchronization and scheduling doesn't cause a deadlock where multiple threads will wait for each other indefinitely.
Synchronization
Something as simple as increasing a counter requires synchronization:
counter += 1;
Assume this sequence of events:
counter is initialized to 0
thread A retrieves counter from memory to cpu (0)
context switch
thread B retrieves counter from memory to cpu (0)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (1)
context switch
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
At this point the counter is 1, but both threads did try to increase it. Access to the counter has to be synchronized by some kind of locking mechanism:
lock (myLock) {
counter += 1;
}
Only one thread is allowed to execute the code inside the locked block. Two threads executing this code might result in this sequence of events:
counter is initialized to 0
thread A acquires myLock
context switch
thread B tries to acquire myLock but has to wait
context switch
thread A retrieves counter from memory to cpu (0)
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
thread A releases myLock
context switch
thread B acquires myLock
thread B retrieves counter from memory to cpu (1)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (2)
thread B releases myLock
At this point counter is 2.
Scheduling
Scheduling is another form of synchronization and you have to you use thread synchronization mechanisms like events, semaphores, message passing etc. to start and stop threads. Here is a simplified example in C#:
AutoResetEvent taskEvent = new AutoResetEvent(false);
Task task;
// Called by the main thread.
public void StartTask(Task task) {
this.task = task;
// Signal the worker thread to perform the task.
this.taskEvent.Set();
// Return and let the task execute on another thread.
}
// Called by the worker thread.
void ThreadProc() {
while (true) {
// Wait for the event to become signaled.
this.taskEvent.WaitOne();
// Perform the task.
}
}
You will notice that access to this.task probably isn't synchronized correctly, that the worker thread isn't able to return results back to the main thread, and that there is no way to signal the worker thread to terminate. All this can be corrected in a more elaborate example.
Deadlock
A common example of deadlock is when you have two locks and you are not careful how you acquire them. At one point you acquire lock1 before lock2:
public void f() {
lock (lock1) {
lock (lock2) {
// Do something
}
}
}
At another point you acquire lock2 before lock1:
public void g() {
lock (lock2) {
lock (lock1) {
// Do something else
}
}
}
Let's see how this might deadlock:
thread A calls f
thread A acquires lock1
context switch
thread B calls g
thread B acquires lock2
thread B tries to acquire lock1 but has to wait
context switch
thread A tries to acquire lock2 but has to wait
context switch
At this point thread A and B are waiting for each other and are deadlocked.

There are two kinds of people that do not use multi threading.
1) Those that do not understand the concept and have no clue how to program it.
2) Those that completely understand the concept and know how difficult it is to get it right.

I'd make a very blatant statement:
DON'T use shared memory.
DO use message passing.
As a general advice, try to limit the amount of shared state and prefer more event-driven architectures.

I can't give you examples besides pointing you at Google. Search for threading basics, thread synchronisation and you'll get more hits than you know.
The basic problem with threading is that threads don't know about each other - so they will happily tread on each others toes, like 2 people trying to get through 1 door, sometimes they will pass though one after the other, but sometimes they will both try to get through at the same time and will get stuck. This is difficult to reproduce, difficult to debug, and sometimes causes problems. If you have threads and see "random" failures, this is probably the problem.
So care needs to be taken with shared resources. If you and your friend want a coffee, but there's only 1 spoon you cannot both use it at the same time, one of you will have to wait for the other. The technique used to 'synchronise' this access to the shared spoon is locking. You make sure you get a lock on the shared resource before you use it, and let go of it afterwards. If someone else has the lock, you wait until they release it.
Next problem comes with those locks, sometimes you can have a program that is complex, so much that you get a lock, do something else then access another resource and try to get a lock for that - but some other thread has that 2nd resource, so you sit and wait... but if that 2nd thread is waiting for the lock you hold for the 1st resource.. it's going to sit and wait. And your app just sits there. This is called deadlock, 2 threads both waiting for each other.
Those 2 are the vast majority of thread issues. The answer is generally to lock for as short a time as possible, and only hold 1 lock at a time.

I notice you are writing in java and that nobody else mentioned books so Java Concurrency In Practice should be your multi-threaded bible.

-- What are some known thread issues? --
Race conditions.
Deadlocks.
Livelocks.
Thread starvation.
-- What care should be taken while using threads? --
Using multi-threading on a single-processor machine to process multiple tasks where each task takes approximately the same time isn’t always very effective.For example, you might decide to spawn ten threads within your program in order to process ten separate tasks. If each task takes approximately 1 minute to process, and you use ten threads to do this processing, you won’t have access to any of the task results for the whole 10 minutes. If instead you processed the same tasks using just a single thread, you would see the first result in 1 minute, the next result 1 minute later, and so on. If you can make use of each result without having to rely on all of the results being ready simultaneously, the single
thread might be the better way of implementing the program.
If you launch a large number of threads within a process, the overhead of thread housekeeping and context switching can become significant. The processor will spend considerable time in switching between threads, and many of the threads won’t be able to make progress. In addition, a single process with a large number of threads means that threads in other processes will be scheduled less frequently and won’t receive a reasonable share of processor time.
If multiple threads have to share many of the same resources, you’re unlikely to see performance benefits from multi-threading your application. Many developers see multi-threading as some sort of magic wand that gives automatic performance benefits. Unfortunately multi-threading isn’t the magic wand that it’s sometimes perceived to be. If you’re using multi-threading for performance reasons, you should measure your application’s performance very closely in several different situations, rather than just relying on some non-existent magic.
Coordinating thread access to common data can be a big performance killer. Achieving good performance with multiple threads isn’t easy when using a coarse locking plan, because this leads to low concurrency and threads waiting for access. Alternatively, a fine-grained locking strategy increases the complexity and can also slow down performance unless you perform some sophisticated tuning.
Using multiple threads to exploit a machine with multiple processors sounds like a good idea in theory, but in practice you need to be careful. To gain any significant performance benefits, you might need to get to grips with thread balancing.
-- Please provide examples. --
For example, imagine an application that receives incoming price information from
the network, aggregates and sorts that information, and then displays the results
on the screen for the end user.
With a dual-core machine, it makes sense to split the task into, say, three threads. The first thread deals with storing the incoming price information, the second thread processes the prices, and the final thread handles the display of the results.
After implementing this solution, suppose you find that the price processing is by far the longest stage, so you decide to rewrite that thread’s code to improve its performance by a factor of three. Unfortunately, this performance benefit in a single thread may not be reflected across your whole application. This is because the other two threads may not be able to keep pace with the improved thread. If the user interface thread is unable to keep up with the faster flow of processed information, the other threads now have to wait around for the new bottleneck in the system.
And yes, this example comes directly from my own experience :-)

DONT use global variables
DONT use many locks (at best none at all - though practically impossible)
DONT try to be a hero, implementing sophisticated difficult MT protocols
DO use simple paradigms. I.e share the processing of an array to n slices of the same size - where n should be equal to the number of processors
DO test your code on different machines (using one, two, many processors)
DO use atomic operations (such as InterlockedIncrement() and the like)

YAGNI
The most important thing to remember is: do you really need multithreading?

I agree with pretty much all the answers so far.
A good coding strategy is to minimise or eliminate the amount of data that is shared between threads as much as humanly possible. You can do this by:
Using thread-static variables (although don't go overboard on this, it will eat more memory per thread, depending on your O/S).
Packaging up all state used by each thread into a class, then guaranteeing that each thread gets exactly one state class instance to itself. Think of this as "roll your own thread-static", but with more control over the process.
Marshalling data by value between threads instead of sharing the same data. Either make your data transfer classes immutable, or guarantee that all cross-thread calls are synchronous, or both.
Try not to have multiple threads competing for the exact same I/O "resource", whether it's a disk file, a database table, a web service call, or whatever. This will cause contention as multiple threads fight over the same resource.
Here's an extremely contrived OTT example. In a real app you would cap the number of threads to reduce scheduling overhead:
All UI - one thread.
Background calcs - one thread.
Logging errors to a disk file - one thread.
Calling a web service - one thread per unique physical host.
Querying the database - one thread per independent group of tables that need updating.
Rather than guessing how to do divvy up the tasks, profile your app and isolate those bits that are (a) very slow, and (b) could be done asynchronously. Those are good candidates for a separate thread.
And here's what you should avoid:
Calcs, database hits, service calls, etc - all in one thread, but spun up multiple times "to improve performance".

Don't start new threads unless you really need to. Starting threads is not cheap and for short running tasks starting the thread may actually take more time than executing the task itself. If you're on .NET take a look at the built in thread pool, which is useful in a lot of (but not all) cases. By reusing the threads the cost of starting threads is reduced.
EDIT: A few notes on creating threads vs. using thread pool (.NET specific)
Generally try to use the thread pool. Exceptions:
Long running CPU bound tasks and blocking tasks are not ideal run on the thread pool cause they will force the pool to create additional threads.
All thread pool threads are background threads, so if you need your thread to be foreground, you have to start it yourself.
If you need a thread with different priority.
If your thread needs more (or less) than the standard 1 MB stack space.
If you need to be able to control the life time of the thread.
If you need different behavior for creating threads than that offered by the thread pool (e.g. the pool will throttle creating of new threads, which may or may not be what you want).
There are probably more exceptions and I am not claiming that this is the definitive answer. It is just what I could think of atm.

I am applying my new found knowledge of threading everywhere
[Emphasis added]
DO remember that a little knowledge is dangerous. Knowing the threading API of your platform is the easy bit. Knowing why and when you need to use synchronisation is the hard part. Reading up on "deadlocks", "race-conditions", "priority inversion" will start you in understanding why.
The details of when to use synchronisation are both simple (shared data needs synchronisation) and complex (atomic data types used in the right way don't need synchronisation, which data is really shared): a lifetime of learning and very solution specific.

An important thing to take care of (with multiple cores and CPUs) is cache coherency.

I am surprised that no one has pointed out Herb Sutter's Effective Concurrency columns yet. In my opinion, this is a must read if you want to go anywhere near threads.

a) Always make only 1 thread responsible for a resource's lifetime. That way thread A won't delete a resource thread B needs - if B has ownership of the resource
b) Expect the unexpected

DO think about how you will test your code and set aside plenty of time for this. Unit tests become more complicated. You may not be able to manually test your code - at least not reliably.
DO think about thread lifetime and how threads will exit. Don't kill threads. Provide a mechanism so that they exit gracefully.
DO add some kind of debug logging to your code - so that you can see that your threads are behaving correctly both in development and in production when things break down.
DO use a good library for handling threading rather than rolling your own solution (if you can). E.g. java.util.concurrency
DON'T assume a shared resource is thread safe.
DON'T DO IT. E.g. use an application container that can take care of threading issues for you. Use messaging.

In .Net one thing that surprised me when I started trying to get into multi-threading is that you cannot straightforwardly update the UI controls from any thread other than the thread that the UI controls were created on.
There is a way around this, which is to use the Control.Invoke method to update the control on the other thread, but it is not 100% obvious the first time around!

Don't be fooled into thinking you understand the difficulties of concurrency until you've split your head into a real project.
All the examples of deadlocks, livelocks, synchronization, etc, seem simple, and they are. But they will mislead you, because the "difficulty" in implementing concurrency that everyone is talking about is when it is used in a real project, where you don't control everything.

While your initial differences in sums of numbers are, as several respondents have pointed out, likely to be the result of lack of synchronisation, if you get deeper into the topic, be aware that, in general, you will not be able to reproduce exactly the numeric results you get on a serial program with those from a parallel version of the same program. Floating-point arithmetic is not strictly commutative, associative, or distributive; heck, it's not even closed.
And I'd beg to differ with what, I think, is the majority opinion here. If you are writing multi-threaded programs for a desktop with one or more multi-core CPUs, then you are working on a shared-memory computer and should tackle shared-memory programming. Java has all the features to do this.
Without knowing a lot more about the type of problem you are tackling, I'd hesitate to write that 'you should do this' or 'you should not do that'.

Multithreading in XNA game

Where can I use multithreading in a simple 2D XNA game? Any suggestions would be appreciated

Well, there are many options -
Most games use mutlithreading for things such as:
Physics
Networking
Resource Loading
AI/Logical updates (if you have a lot of computation in the "update" phase of your game)
You really have to think about your specific game architecture, and decide where you'd benefit the most from using multithreading.

Some games use multithreaded renderers as a core design philosophy.
For instance... thread 1 calculates all of the game logic, then sends this information to thread 2. Thread 2 precalculates a display list and passes this to the GPU. Thread 1 ends up running 2 frames behind the GPU, thread 2 runs one frame behind the GPU.
The advantage is really that you can in theory do twice as much work in a frame. Skinning can be done on the CPU and can become "free" in terms of CPU and GPU time. It does require double buffering a large amount of data and careful construction of your engine flow so that all threads stall when (and only when) necessary.
Aside from this, a pretty common technique these days is to have a number of "worker threads" running. Tasks with a common interface can be added to a shared (threadsafe) queue and executed by the worker threads. The main game thread then adds these tasks to the queue before the results are needed and continues with other processing. When the results are eventually required, the main thread has the ability to stall until the worker threads have finished processing all of the required tasks.
For instance, an expensive for loop can be changed to used tasks.
// Single threaded method.
for (i = 0; i < numExpensiveThings; i++)
{
ProcessExpensiveThings (expensiveThings[i]);
}
// Accomplishes the same work, using N worker threads.
for (i = 0; i < numExpensiveThings; i++)
{
AddTask (ProcessExpensiveThingsTask, i);
}
WaitForAll (ProcessExpensiveThingsTask);
You can do this whenever you're guaranteed that ProcessExpensiveThings() is thread-safe with respect to other calls. If you have 80 things at 1ms each and 8 worker threads, you've saved yourself roughly 70ms. (Well, not really, but it's a good hand-wavy approximation.)

There is lots of place to apply to: AI, objects interaction, multiplayer gaming etc. This depends on your concrete game.

Why do you want to use multi-threading?
If it is for practice, a reasonable and easy module to put in its own thread would be the sound system, as communication is primarily one-way.

Multi-threading with GameComponents is meant to be quite straightforward
e.g.
http://roecode.wordpress.com/2008/02/01/xna-framework-gameengine-development-part-8-multi-threading-gamecomponents/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string