Difference between thread-isolated and semaphore-isolated calls - multithreading

I was going through the Netflix opensource feature Hystrix...
I saw a statement
"Today tens of billions of thread-isolated, and hundreds of billions of semaphore-isolated calls are executed via Hystrix every day at Netflix"
Would like to know the difference between these different type of calls..

First we need to see the different between thread and semaphore. In general, calling thread is more expensive than semaphore because of the overhead. So for a large number of requests/second, then semaphore will be something you can considere.
Secondly with semaphore, the command will be executed within the thread of the caller. It means that the concurrent calls are not fully isolated from other (not like when you use thread).
Lastly, with semaphore, when there is a timeout, it can't be terminated (unless you specifically set it up). If you don't know what will be the client's behaviour ,then this would be not a nice thing to have.

Related

Why do we need both threading and asynchronous calls

My understanding is that threads exist as a way of doing several things in parallel that share the same address space but each has its individual stack. Asynchronous programming is basically a way of using fewer threads. I don't understand why it's undesirable to just have blocking calls and a separate thread for each blocked command?
For example, suppose I want to scrape a large part of the web. A presumably uncontroversial implementation might be to have a large number of asynchronous loops. Each loop would ask for a webpage, wait to avoid overloading the remote server, then ask for another webpage from the same website until done. The loops would then be executed on a much smaller number of threads (which is fine because they mostly wait). So to restate the question, I don't see why it's any cheaper to e.g. maintain a threadpool within the language runtime than it would be to just have one (mostly blocked) OS thread per loop and let the operating system deal with the complexity of scheduling the work? After all, if piling two different schedulers on top of each other is so good, it can still be implemented that way in the OS :-)
It seems obvious the answer is something like "threads are expensive". However, a thread just needs to keep track of where it has got to before it was interrupted the last time. This is pretty much exactly what an asynchronous command needs to know before it blocks (perhaps representing what happens next by storing a callback). I suppose one difference is that a blocking asynchronous command does so at a well defined point whereas a thread can be interrupted anywhere. If there really is a substantial difference in terms of the cost of keeping the state, where does it come from? I doubt it's the cost of the stack since that wastes at most a 4KB page, so it's a non-issue even for 1000s of blocked threads.
Many thanks, and sorry if the question is very basic. It might be I just cannot figure out the right incantation to type into Google.
Threads consume memory, as they need to have their state preserved even if they aren't doing anything. If you make an asynchronous call on a single thread, it's literally (aside from registering that the call was made somewhere) not consuming any resources until it needs to be resolved, because if it isn't actively being processed you don't care about it.
If the architecture of your application is written in a way that the resources it needs scale linearly (or worse) with the number of users / traffic you receive, then that is going to be a problem. You can watch this talk about node.js if you want to watch someone talk at length about this.
https://www.youtube.com/watch?v=ztspvPYybIY

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (or maybe this is a just a dumb example; in such a case please point me to a sensible one)
Thanks in advance.
Consider this:
void update_shared_variable() {
sem_wait( &g_shared_variable_mutex );
g_shared_variable++;
sem_post( &g_shared_variable_mutex );
}
void thread1() {
do_thing_1a();
do_thing_1b();
do_thing_1c();
update_shared_variable(); // may block
}
void thread2() {
do_thing_2a();
do_thing_2b();
do_thing_2c();
update_shared_variable(); // may block
}
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)
When you are using multithreading not everycode that runs will be blocking. For example, if you had a queue, and two threads are reading from that queue, you would make sure that no thread reads at the same time from the queue, so that part would be blocking, but that's the part that will probably take the less time. Once you have retrieved the item to process from the queue, all the rest of the code can be run asynchronously.
The idea behind the threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not create multiple instances of those processes to allow them to finish faster? The bottleneck is just what you mentioned, when a process has to wait for I/O.
Being blocked while waiting for the shared resource is small when compared to the processing time, this is when you want to use multiple threads.
This is of course a SSCCE (Short, Self Contained, Correct Example)
Let's say you have 2 worker threads that do a lot of work and write the result to a file.
you only need to lock the file (shared resource) access.
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers like Intel's will convert suitable for loops to threads automatically for you. In that particular circumstances no semaphores are needed because of the iterations' data independence.
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data then doing A then B and then output the data before reading more input. Or you could have a thread reading and doing A, another thread doing B and output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data it would be sent a copy of that data down something a bit like a network socket, or a pipe. The advantage of CSP (which limits that communication channel to send-finishes-only-if-receiver-has-read) is that it stops you falling into the many many pitfalls that plague multithreaded do programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture, AMD's Hypertransport. And it means hat the 'channel' really could be a network connection; scalability built in by design.

Why are message queues used insted of mulithreading?

I have the following query which i need someone to please help me with.Im new to message queues and have recently started looking at the Kestrel message queue.
As i understand,both threads and message queues are used for concurrency in applications so what is the advantage of using message queues over multitreading ?
Please help
Thank you.
message queues allow you to communicate outside your program.
This allows you to decouple your producer from your consumer. You can spread the work to be done over several processes and machines, and you can manage/upgrade/move around those programs independently of each other.
A message queue also typically consists of one or more brokers that takes care of distributing your messages and making sure the messages are not lost in case something bad happens (e.g. your program crashes, you upgrade one of your programs etc.)
Message queues might also be used internally in a program, in which case it's often just a facility to exchange/queue data from a producer thread to a consumer thread to do async processing.
Actually, one facilitates the other. Message queue is a nice and simple multithreading pattern: when you have a control thread (usually, but not necessarily an application's main thread) and a pool of (usually looping) worker threads, message queues are the easiest way to facilitate control over the thread pool.
For example, to start processing a relatively heavy task, you submit a corresponding message into the queue. If you have more messages, than you can currently process, your queue grows, and if less, it goes vice versa. When your message queue is empty, your threads sleep (usually by staying locked under a mutex).
So, there is nothing to compare: message queues are part of multithreading and hence they're used in some more complicated cases of multithreading.
Creating threads is expensive, and every thread that is simultaneously "live" will add a certain amount of overhead, even if the thread is blocked waiting for something to happen. If program Foo has 1,000 tasks to be performed and doesn't really care in what order they get done, it might be possible to create 1,000 threads and have each thread perform one task, but such an approach would not be terribly efficient. An second alternative would be to have one thread perform all 1,000 tasks in sequence. If there were other processes in the system that could employ any CPU time that Foo didn't use, this latter approach would be efficient (and quite possibly optimal), but if there isn't enough work to keep all CPUs busy, CPUs would waste some time sitting idle. In most cases, leaving a CPU idle for a second is just as expensive as spending a second of CPU time (the main exception is when one is trying to minimize electrical energy consumption, since an idling CPU may consume far less power than a busy one).
In most cases, the best strategy is a compromise between those two approaches: have some number of threads (say 10) that start performing the first ten tasks. Each time a thread finishes a task, have it start work on another until all tasks have been completed. Using this approach, the overhead related to threading will be cut by 99%, and the only extra cost will be the queue of tasks that haven't yet been started. Since a queue entry is apt to be much cheaper than a thread (likely less than 1% of the cost, and perhaps less than 0.01%), this can represent a really huge savings.
The one major problem with using a job queue rather than threading is that if some jobs cannot complete until jobs later in the list have run, it's possible for the system to become deadlocked since the later tasks won't run until the earlier tasks have completed. If each task had been given a separate thread, that problem would not occur since the threads associated with the later tasks would eventually manage to complete and thus let the earlier ones proceed. Indeed, the more earlier tasks were blocked, the more CPU time would be available to run the later ones.
It makes more sense to contrast message queues and other concurrency primitives, such as semaphores, mutex, condition variables, etc. They can all be used in the presence of threads, though message-passing is also commonly used in non-threaded contexts, such as inter-process communication, whereas the others tend to be confined to inter-thread communication and synchronisation.
The short answer is that message-passing is easier on the brain. In detail...
Message-passing works by sending stuff from one agent to another. There is generally no need to coordinate access to the data. Once an agent receives a message it can usually assume that it has unqualified access to that data.
The "threading" style works by giving all agent open-slather access to shared data but requiring them to carefully coordinate their access via primitives. If one agent misbehaves, the process becomes corrupted and all hell breaks loose. Message passing tends to confine problems to the misbehaving agent and its cohort, and since agents are generally self-contained and often programmed in a sequential or state-machine style, they tend not to misbehave as often — or as mysteriously — as conventional threaded code.

Concurrent execution/Re-entrant /ThreadSafe/?

I read many answers given here for questions related to thread safety, re-entrancy, but when i think about them, some more questions came to mind, hence this question/s.
1.) I have one executable program say some *.exe. If i run this program on command prompt, and while it is executing, i run the same program on another command prompt, then in what conditions the results could be corrupted, i.e. should the code of this program be re-entrant or it should be thread safe alone?
2.) While defining re-entrancy, we say that the routine can be re-entered while it is already running, in what situations the function can be re-entered (apart from being recursive routine, i am not talking recursive execution here). There has to be some thread to execute the same code again, or how can that function be entered again?
3.) In a practical case, will two threads execute same code, i.e. perform same functionality. I thought the idea of multi-threading is to execute different functionality, concurrently(on different cores/processors).
Sorry if these queries seem different, but they all occured to me, same time when i read about the threadsafe Vs reentrant post on SO, hence i put them together.
Any pointers, reading material will be appreciated.
thanks,
-AD.
I'll try to explain these, in order:
Each program runs in its own process, and gets its own isolated memory space. You don't have to worry about thread safety in this situation. (However, if the processes are both accessing some other shared resource, such as a file, you may have different issues. For example, process 1 may "lock" the data file, preventing process 2 from being able to open it).
The idea here is that two threads may try to run the same routine at the same time. This is not always valid - it takes special care to define a class or a process in a way that multiple threads can use the same instance of the same class, or the same static function, without errors occurring. This typically requires synchronization in the class.
Two threads often execute the same code. There are two different conceptual ways to parition your work when threading. You can either think in terms of tasks - ie: one thread does task A while another does task B. Alternatively, you can think in terms of decomposing the the problem based on data. In this case, you work with a large collection, and each element is processed using the same routine, but the processing happens in parallel. For more info, you can read this blog post I wrote on Decomposition for Parallelism.
Two processes cannot share memory. So thread-safety is moot here.
Re-entrancy means that a method can be safely executed by two threads at the same time. This doesn't require recursion - threads are separate units of execution, and there is nothing keeping them both from attempting to run the same method simultaneously.
The benefits to threading can happen in two ways. One is when you perform different types of operations concurrently (like running cpu-intensive code and I/O-intensive code at the ame time). The other is when you can divide up a long-running operation among multiple processors. In this latter case, two threads may be executing the same function at the same time on different input data sets.
First of all, I strongly suggest you to look at some basic stuffs of computer system, especially how a process/thread is executing on CPU and scheduled by operating system. For example, virtual address, context switching, process/thread concepts(e.g., each thread has its own stack and register vectors while heap is shared by threads. A thread is an execution and scheduling unit, so it maintains control flow of code..) and so on. All of the questions are related to understanding how your program is actually working on CPU
1) and 2) are already answered.
3) Multithreading is just concurrent execution of any arbitrary thread. The same code can be executed by multiple threads. These threads can share some data, and even can make data races which are very hard to find. Of course, many times threads are executing separate code(we say it as thread-level parallelism).
In this context, I have used concurrent as two meaning: (a) in a single processor, multiple threads are sharing a single physical processor, but operating system gives a sort of illusion that threads are running concurrently. (b) In a multicore, yes, physically two or more threads can be executed concurrently.
Having concrete understanding of concurrent/parallel execution takes quite long time. But, you already have a solid understanding!

What are multi-threading DOs and DONTs? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am applying my new found knowledge of threading everywhere and getting lots of surprises
Example:
I used threads to add numbers in an
array. And outcome was different every
time. The problem was that all of my
threads were updating the same
variable and were not synchronized.
What are some known thread issues?
What care should be taken while using
threads?
What are good multithreading resources.
Please provide examples.
sidenote:(I renamed my program thread_add.java to thread_random_number_generator.java:-)
In a multithreading environment you have to take care of synchronization so two threads doesn't clobber the state by simultaneously performing modifications. Otherwise you can have race conditions in your code (for an example see the infamous Therac-25 accident.) You also have to schedule the threads to perform various tasks. You then have to make sure that your synchronization and scheduling doesn't cause a deadlock where multiple threads will wait for each other indefinitely.
Synchronization
Something as simple as increasing a counter requires synchronization:
counter += 1;
Assume this sequence of events:
counter is initialized to 0
thread A retrieves counter from memory to cpu (0)
context switch
thread B retrieves counter from memory to cpu (0)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (1)
context switch
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
At this point the counter is 1, but both threads did try to increase it. Access to the counter has to be synchronized by some kind of locking mechanism:
lock (myLock) {
counter += 1;
}
Only one thread is allowed to execute the code inside the locked block. Two threads executing this code might result in this sequence of events:
counter is initialized to 0
thread A acquires myLock
context switch
thread B tries to acquire myLock but has to wait
context switch
thread A retrieves counter from memory to cpu (0)
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
thread A releases myLock
context switch
thread B acquires myLock
thread B retrieves counter from memory to cpu (1)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (2)
thread B releases myLock
At this point counter is 2.
Scheduling
Scheduling is another form of synchronization and you have to you use thread synchronization mechanisms like events, semaphores, message passing etc. to start and stop threads. Here is a simplified example in C#:
AutoResetEvent taskEvent = new AutoResetEvent(false);
Task task;
// Called by the main thread.
public void StartTask(Task task) {
this.task = task;
// Signal the worker thread to perform the task.
this.taskEvent.Set();
// Return and let the task execute on another thread.
}
// Called by the worker thread.
void ThreadProc() {
while (true) {
// Wait for the event to become signaled.
this.taskEvent.WaitOne();
// Perform the task.
}
}
You will notice that access to this.task probably isn't synchronized correctly, that the worker thread isn't able to return results back to the main thread, and that there is no way to signal the worker thread to terminate. All this can be corrected in a more elaborate example.
Deadlock
A common example of deadlock is when you have two locks and you are not careful how you acquire them. At one point you acquire lock1 before lock2:
public void f() {
lock (lock1) {
lock (lock2) {
// Do something
}
}
}
At another point you acquire lock2 before lock1:
public void g() {
lock (lock2) {
lock (lock1) {
// Do something else
}
}
}
Let's see how this might deadlock:
thread A calls f
thread A acquires lock1
context switch
thread B calls g
thread B acquires lock2
thread B tries to acquire lock1 but has to wait
context switch
thread A tries to acquire lock2 but has to wait
context switch
At this point thread A and B are waiting for each other and are deadlocked.
There are two kinds of people that do not use multi threading.
1) Those that do not understand the concept and have no clue how to program it.
2) Those that completely understand the concept and know how difficult it is to get it right.
I'd make a very blatant statement:
DON'T use shared memory.
DO use message passing.
As a general advice, try to limit the amount of shared state and prefer more event-driven architectures.
I can't give you examples besides pointing you at Google. Search for threading basics, thread synchronisation and you'll get more hits than you know.
The basic problem with threading is that threads don't know about each other - so they will happily tread on each others toes, like 2 people trying to get through 1 door, sometimes they will pass though one after the other, but sometimes they will both try to get through at the same time and will get stuck. This is difficult to reproduce, difficult to debug, and sometimes causes problems. If you have threads and see "random" failures, this is probably the problem.
So care needs to be taken with shared resources. If you and your friend want a coffee, but there's only 1 spoon you cannot both use it at the same time, one of you will have to wait for the other. The technique used to 'synchronise' this access to the shared spoon is locking. You make sure you get a lock on the shared resource before you use it, and let go of it afterwards. If someone else has the lock, you wait until they release it.
Next problem comes with those locks, sometimes you can have a program that is complex, so much that you get a lock, do something else then access another resource and try to get a lock for that - but some other thread has that 2nd resource, so you sit and wait... but if that 2nd thread is waiting for the lock you hold for the 1st resource.. it's going to sit and wait. And your app just sits there. This is called deadlock, 2 threads both waiting for each other.
Those 2 are the vast majority of thread issues. The answer is generally to lock for as short a time as possible, and only hold 1 lock at a time.
I notice you are writing in java and that nobody else mentioned books so Java Concurrency In Practice should be your multi-threaded bible.
-- What are some known thread issues? --
Race conditions.
Deadlocks.
Livelocks.
Thread starvation.
-- What care should be taken while using threads? --
Using multi-threading on a single-processor machine to process multiple tasks where each task takes approximately the same time isn’t always very effective.For example, you might decide to spawn ten threads within your program in order to process ten separate tasks. If each task takes approximately 1 minute to process, and you use ten threads to do this processing, you won’t have access to any of the task results for the whole 10 minutes. If instead you processed the same tasks using just a single thread, you would see the first result in 1 minute, the next result 1 minute later, and so on. If you can make use of each result without having to rely on all of the results being ready simultaneously, the single
thread might be the better way of implementing the program.
If you launch a large number of threads within a process, the overhead of thread housekeeping and context switching can become significant. The processor will spend considerable time in switching between threads, and many of the threads won’t be able to make progress. In addition, a single process with a large number of threads means that threads in other processes will be scheduled less frequently and won’t receive a reasonable share of processor time.
If multiple threads have to share many of the same resources, you’re unlikely to see performance benefits from multi-threading your application. Many developers see multi-threading as some sort of magic wand that gives automatic performance benefits. Unfortunately multi-threading isn’t the magic wand that it’s sometimes perceived to be. If you’re using multi-threading for performance reasons, you should measure your application’s performance very closely in several different situations, rather than just relying on some non-existent magic.
Coordinating thread access to common data can be a big performance killer. Achieving good performance with multiple threads isn’t easy when using a coarse locking plan, because this leads to low concurrency and threads waiting for access. Alternatively, a fine-grained locking strategy increases the complexity and can also slow down performance unless you perform some sophisticated tuning.
Using multiple threads to exploit a machine with multiple processors sounds like a good idea in theory, but in practice you need to be careful. To gain any significant performance benefits, you might need to get to grips with thread balancing.
-- Please provide examples. --
For example, imagine an application that receives incoming price information from
the network, aggregates and sorts that information, and then displays the results
on the screen for the end user.
With a dual-core machine, it makes sense to split the task into, say, three threads. The first thread deals with storing the incoming price information, the second thread processes the prices, and the final thread handles the display of the results.
After implementing this solution, suppose you find that the price processing is by far the longest stage, so you decide to rewrite that thread’s code to improve its performance by a factor of three. Unfortunately, this performance benefit in a single thread may not be reflected across your whole application. This is because the other two threads may not be able to keep pace with the improved thread. If the user interface thread is unable to keep up with the faster flow of processed information, the other threads now have to wait around for the new bottleneck in the system.
And yes, this example comes directly from my own experience :-)
DONT use global variables
DONT use many locks (at best none at all - though practically impossible)
DONT try to be a hero, implementing sophisticated difficult MT protocols
DO use simple paradigms. I.e share the processing of an array to n slices of the same size - where n should be equal to the number of processors
DO test your code on different machines (using one, two, many processors)
DO use atomic operations (such as InterlockedIncrement() and the like)
YAGNI
The most important thing to remember is: do you really need multithreading?
I agree with pretty much all the answers so far.
A good coding strategy is to minimise or eliminate the amount of data that is shared between threads as much as humanly possible. You can do this by:
Using thread-static variables (although don't go overboard on this, it will eat more memory per thread, depending on your O/S).
Packaging up all state used by each thread into a class, then guaranteeing that each thread gets exactly one state class instance to itself. Think of this as "roll your own thread-static", but with more control over the process.
Marshalling data by value between threads instead of sharing the same data. Either make your data transfer classes immutable, or guarantee that all cross-thread calls are synchronous, or both.
Try not to have multiple threads competing for the exact same I/O "resource", whether it's a disk file, a database table, a web service call, or whatever. This will cause contention as multiple threads fight over the same resource.
Here's an extremely contrived OTT example. In a real app you would cap the number of threads to reduce scheduling overhead:
All UI - one thread.
Background calcs - one thread.
Logging errors to a disk file - one thread.
Calling a web service - one thread per unique physical host.
Querying the database - one thread per independent group of tables that need updating.
Rather than guessing how to do divvy up the tasks, profile your app and isolate those bits that are (a) very slow, and (b) could be done asynchronously. Those are good candidates for a separate thread.
And here's what you should avoid:
Calcs, database hits, service calls, etc - all in one thread, but spun up multiple times "to improve performance".
Don't start new threads unless you really need to. Starting threads is not cheap and for short running tasks starting the thread may actually take more time than executing the task itself. If you're on .NET take a look at the built in thread pool, which is useful in a lot of (but not all) cases. By reusing the threads the cost of starting threads is reduced.
EDIT: A few notes on creating threads vs. using thread pool (.NET specific)
Generally try to use the thread pool. Exceptions:
Long running CPU bound tasks and blocking tasks are not ideal run on the thread pool cause they will force the pool to create additional threads.
All thread pool threads are background threads, so if you need your thread to be foreground, you have to start it yourself.
If you need a thread with different priority.
If your thread needs more (or less) than the standard 1 MB stack space.
If you need to be able to control the life time of the thread.
If you need different behavior for creating threads than that offered by the thread pool (e.g. the pool will throttle creating of new threads, which may or may not be what you want).
There are probably more exceptions and I am not claiming that this is the definitive answer. It is just what I could think of atm.
I am applying my new found knowledge of threading everywhere
[Emphasis added]
DO remember that a little knowledge is dangerous. Knowing the threading API of your platform is the easy bit. Knowing why and when you need to use synchronisation is the hard part. Reading up on "deadlocks", "race-conditions", "priority inversion" will start you in understanding why.
The details of when to use synchronisation are both simple (shared data needs synchronisation) and complex (atomic data types used in the right way don't need synchronisation, which data is really shared): a lifetime of learning and very solution specific.
An important thing to take care of (with multiple cores and CPUs) is cache coherency.
I am surprised that no one has pointed out Herb Sutter's Effective Concurrency columns yet. In my opinion, this is a must read if you want to go anywhere near threads.
a) Always make only 1 thread responsible for a resource's lifetime. That way thread A won't delete a resource thread B needs - if B has ownership of the resource
b) Expect the unexpected
DO think about how you will test your code and set aside plenty of time for this. Unit tests become more complicated. You may not be able to manually test your code - at least not reliably.
DO think about thread lifetime and how threads will exit. Don't kill threads. Provide a mechanism so that they exit gracefully.
DO add some kind of debug logging to your code - so that you can see that your threads are behaving correctly both in development and in production when things break down.
DO use a good library for handling threading rather than rolling your own solution (if you can). E.g. java.util.concurrency
DON'T assume a shared resource is thread safe.
DON'T DO IT. E.g. use an application container that can take care of threading issues for you. Use messaging.
In .Net one thing that surprised me when I started trying to get into multi-threading is that you cannot straightforwardly update the UI controls from any thread other than the thread that the UI controls were created on.
There is a way around this, which is to use the Control.Invoke method to update the control on the other thread, but it is not 100% obvious the first time around!
Don't be fooled into thinking you understand the difficulties of concurrency until you've split your head into a real project.
All the examples of deadlocks, livelocks, synchronization, etc, seem simple, and they are. But they will mislead you, because the "difficulty" in implementing concurrency that everyone is talking about is when it is used in a real project, where you don't control everything.
While your initial differences in sums of numbers are, as several respondents have pointed out, likely to be the result of lack of synchronisation, if you get deeper into the topic, be aware that, in general, you will not be able to reproduce exactly the numeric results you get on a serial program with those from a parallel version of the same program. Floating-point arithmetic is not strictly commutative, associative, or distributive; heck, it's not even closed.
And I'd beg to differ with what, I think, is the majority opinion here. If you are writing multi-threaded programs for a desktop with one or more multi-core CPUs, then you are working on a shared-memory computer and should tackle shared-memory programming. Java has all the features to do this.
Without knowing a lot more about the type of problem you are tackling, I'd hesitate to write that 'you should do this' or 'you should not do that'.

Resources