pthreads - how to parallelize a job

pthreads - how to parallelize a job - multithreading

I need to parallelize a simple password cracker, for using it on a n-processor system. My idea is to create n threads and to feed them more and more job as they finish.
What is the best way to know when a thread has finished? A mutex? Isn't expensive checking this mutex constantly while other threads are running?

You can have a simple queue structure - use any data structure you like - and then just use a mutex when you add/remove items from it.
Provided your threads grab the work they need to do in big enough "chunks", then there will be very little contention on the mutex, so very little overhead.
For example, if each thread was to grab approximately 1 second of work at a time and work independently for 1 second, then there would be very few operations on the mutex.
The threads could exit when they had no more work; the main thread could then wait using pthread_join.

Use message queues between the threads :-
Master -> Process (saying go with this).
Process -> Master (saying I'm done - give me more, or, I've found the result!)
Using this, the thread only closes down when the system does - otherwise it's either processing data or waiting on a message queue.
This way, the MCP (I've always wanted to say that!) simply processes messages and hands jobs out to threads that are waiting for more work.
This may be more efficient that creating and destroying threads all the time.

Normally you use a "condition variable" for this kind of thing where you want to wait for an asynchronous job to finish.
Condition variables are basically mutex-protected, simple signals. Pthread has condition variables (see e.g. the pthread_cond_create(...) function).

Related

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (or maybe this is a just a dumb example; in such a case please point me to a sensible one)
Thanks in advance.

Consider this:
void update_shared_variable() {
sem_wait( &g_shared_variable_mutex );
g_shared_variable++;
sem_post( &g_shared_variable_mutex );
}
void thread1() {
do_thing_1a();
do_thing_1b();
do_thing_1c();
update_shared_variable(); // may block
}
void thread2() {
do_thing_2a();
do_thing_2b();
do_thing_2c();
update_shared_variable(); // may block
}
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)

When you are using multithreading not everycode that runs will be blocking. For example, if you had a queue, and two threads are reading from that queue, you would make sure that no thread reads at the same time from the queue, so that part would be blocking, but that's the part that will probably take the less time. Once you have retrieved the item to process from the queue, all the rest of the code can be run asynchronously.

The idea behind the threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not create multiple instances of those processes to allow them to finish faster? The bottleneck is just what you mentioned, when a process has to wait for I/O.
Being blocked while waiting for the shared resource is small when compared to the processing time, this is when you want to use multiple threads.

This is of course a SSCCE (Short, Self Contained, Correct Example)
Let's say you have 2 worker threads that do a lot of work and write the result to a file.
you only need to lock the file (shared resource) access.

The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers like Intel's will convert suitable for loops to threads automatically for you. In that particular circumstances no semaphores are needed because of the iterations' data independence.
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data then doing A then B and then output the data before reading more input. Or you could have a thread reading and doing A, another thread doing B and output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data it would be sent a copy of that data down something a bit like a network socket, or a pipe. The advantage of CSP (which limits that communication channel to send-finishes-only-if-receiver-has-read) is that it stops you falling into the many many pitfalls that plague multithreaded do programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture, AMD's Hypertransport. And it means hat the 'channel' really could be a network connection; scalability built in by design.

Win32 Uderstanding semaphore

I'm new to Multithread in Win32. And I have an assignment with Semaphore. But I cannot understand this.
Assume that we have 20 tasks (each task is the same with other tasks). We use semaphore then there's 2 circumstances:
First, there should be have 20 childthreads in order that each thread will handle 1 task.
Or:
Second, there would be have n childthreads. When a thread finishs a task, it will handle another task?
The second problem I counter that I cannot find any samples for Semaphore in Win32(API) but Consonle that I found in MSDN.
Can you help me with the "20 task" and tell me the instruction of writing a Semaphore in WinAPI application (Where should I place CreateSemaphore() function ...)?
Your suggestion will be appreciated.

You can start a thread for every task, which is a common approach, or you can use a "threadpool" where threads are reused. This is up to you. In both scenarios, you may or may not use a semaphore, the difference is only how you start the multiple threads.
Now, concerning your question where to place the CreateSemaphore() function, you should call that before starting any further threads. The reason is that these threads need to access the semaphore, but they can't do that if it doesn't exist yet. You could of course pass it to the other threads, but that again would give you the problem how to pass it safely without any race conditions, which is something that semaphores and other synchronization primitives are there to avoid. In other words, you would only complicate things by creating a chicken-and-egg problem.
Note that if this doesn't help you any further, you should perhaps provide more info. What are the goals? What have you done yourself so far? Any related questions here that you read but that didn't fully present answers to your problem?

Well, if you are contrained to using semaphores only, you could use two semaphores to create an unbounded producer-consumer queue class that you could use to implement a thread pool.
You need a 'SimpleQueue' class for task objects. I assume you either have one already, can easily build one or whatever.
In the ctor of your 'ProducerConsumerQueue' class, (or in main(), or in some factory function that returns a *ProducerConsumerQueue struct, whatever your language has), create a SimpleClass and two semaphores. A 'QueueCount' semaphore, initialized with a count of 0, and a 'QueueAccess' semaphore, initialized with a count of 1.
Add 'push(*task)' and ' *task pop()' methods/memberFunctions/methods to the ProducerConsumerQueue:
In 'push', first call 'WaitForSingleObject()' API on QueueAccess, then push the *task onto the SimpleQueue, then ReleaseSemaphore() API on QueueAccess. This pushes the *task in a thread-safe manner. Then ReleaseSemaphore() on QueueCount - this will signal any waiting threads.
In pop(), first call 'WaitForSingleObject()' API on QueueCount - this ensures that any calling consumer thread has to wait until there is a *task in the queue. Then call 'WaitForSingleObject()' API on QueueAccess, then pop task from the SimpleQueue, then ReleaseSemaphore() API on QueueAccess and return the task - this this thread-safely dequeues the *task.
Once you have created your ProducerConsumerQueue, create some threads to run the tasks. In CreateThread(), pass the same *ProducerConsumerQueue as the 'auxiliary' *void parameter.
In the thread function, cast the *void back to *ProducerConsumerQueue and then just loop around for ever, calling pop() and then running the returned task.
OK, your pool of threads is now ready to do stuff. If you want to run 20 tasks, create them in a loop and push them onto the ProducerConsumerQueue. The threads will then run them all.
You can create as many threads as you want to in the pool, (within reason). As many threads as cores is reasonable for tasks that are CPU-intensive. If the tasks make blocking calls, you may want to create many more threads for quickest overall throughput.
A useful enhancement is to check for 'null' in the thread function loop after each task is received and, if it is null, clean up an exit the thread, so terminating it. This allows the threads to be easily terminated by queueing up nulls, making it easier to shutdown your thread pool, (should you need to), and also to control the number of threads in the pool at runtime.

Advantages of using condition variables over mutex

I was wondering what is the performance benefit of using condition variables over mutex locks in pthreads.
What I found is : "Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met. This can be very resource consuming since the thread would be continuously busy in this activity. A condition variable is a way to achieve the same goal without polling." (https://computing.llnl.gov/tutorials/pthreads)
But it also seems that mutex calls are blocking (unlike spin-locks). Hence if a thread (T1) fails to get a lock because some other thread (T2) has the lock, T1 is put to sleep by the OS, and is woken up only when T2 releases the lock and the OS gives T1 the lock. The thread T1 does not really poll to get the lock. From this description, it seems that there is no performance benefit of using condition variables. In either case, there is no polling involved. The OS anyway provides the benefit that the condition-variable paradigm can provide.
Can you please explain what actually happens.

A condition variable allows a thread to be signaled when something of interest to that thread occurs.
By itself, a mutex doesn't do this.
If you just need mutual exclusion, then condition variables don't do anything for you. However, if you need to know when something happens, then condition variables can help.
For example, if you have a queue of items to work on, you'll have a mutex to ensure the queue's internals are consistent when accessed by the various producer and consumer threads. However, when the queue is empty, how will a consumer thread know when something is in there for it to work on? Without something like a condition variable it would need to poll the queue, taking and releasing the mutex on each poll (otherwise a producer thread could never put something on the queue).
Using a condition variable lets the consumer find that when the queue is empty it can just wait on the condition variable indicating that the queue has had something put into it. No polling - that thread does nothing until a producer puts something in the queue, then signals the condition that the queue has a new item.

You're looking for too much overlap in two separate but related things: a mutex and a condition variable.
A common implementation approach for a mutex is to use a flag and a queue. The flag indicates whether the mutex is held by anyone (a single-count semaphore would work too), and the queue tracks which threads are in line waiting to acquire the mutex exclusively.
A condition variable is then implemented as another queue bolted onto that mutex. Threads that got in line to wait to acquire the mutex can—usually once they have acquired it—volunteer to get out of the front of the line and get into the condition queue instead. At this point, you have two separate sets of waiters:
Those waiting to acquire the mutex exclusively
Those waiting for the condition variable to be signaled
When a thread holding the mutex exclusively signals the condition variable, for which we'll assume for now that it's a singular signal (unleashing no more than one waiting thread) and not a broadcast (unleashing all the waiting threads), the first thread in the condition variable queue gets shunted back over into the front (usually) of the mutex queue. Once the thread currently holding the mutex—usually the thread that signaled the condition variable—relinquishes the mutex, the next thread in the mutex queue can acquire it. That next thread in line will have been the one that was at the head of the condition variable queue.
There are many complicated details that come into play, but this sketch should give you a feel for the structures and operations in play.

If you are looking for performance, then start reading about "non blocking / non locking" thread synchronization algorithms. They are based upon atomic operations, which gcc is kind enough to provide. Lookup gcc atomic operations. Our tests showed we could increment a global value with multiple threads using atomic operation magnitudes faster than locking with a mutex. Here is some sample code that shows how to add items to and from a linked list from multiple threads at the same time without locking.
For sleeping and waking threads, signals are much faster than conditions. You use pthread_kill to send the signal, and sigwait to sleep the thread. We tested this too with the same kind of performance benefits. Here is some example code.

Mutithreading thread control

How do I control the number of threads that my program is working on?
I have a program that is now ready for mutithreading but one problem is that the program is extremely memory intensive and i have to limit the number of threads running so that i don't run out of ram. The main program goes through and creates a whole bunch of handles and associated threads in suspended state.
I want the program to activate a set number of threads and when one thread finishes, it will automatically unsuspended the next thread in line until all the work has been completed. How do i do this?
Someone has once mentioned something about using a thread handler, but I can't seem to find any information about how to write one or exactly how it would work.
If anyone can help, it would be greatly appreciated.
Using windows and visual c++.
Note: i don't need to worry about the traditional problems of access with the threads, each one is completely independent of each other, its more of like batch processing rather than true mutithreading of a program.
Thanks,
-Faken

Don't create threads explicitly. Create a thread pool, see Thread Pools and queue up your work using QueueUserWorkItem. The thread pool size should be determined by the number of hardware threads available (number of cores and ratio of hyperthreading) and the ratio of CPU vs. IO your work items do. By controlling the size of the thread pool you control the number of maximum concurrent threads.

A Suspended thread doesn't use CPU resources, but it still consumes memory, so you really shouldn't be creating more threads than you want to run simultaneously.
It is better to have only as many threads as your maximum number of simultaneous tasks, and to use a queue to pass units of work to the pool of worker threads.
You can give work to the standard pool of threads created by Windows using the Windows Thread Pool API.
Be aware that you will share these threads and the queue used to submit work to them with all of the code in your process. If, for some reason, you don't want to share your worker threads with other code in your process, then you can create a FIFO queue, create as many threads as you want to run simultaneously and have each of them pull work items out of the queue. If the queue is empty they will block until work items are added to the queue.

There is so much to say here.
There are a few ways
You should only create as many thread handles as you plan on running at the same time, then reuse them when they complete. (Look up thread pool).
This guarantees that you can never have too many running at the same time. This raises the question of funding out when a thread completes. You can have a callback be called just before a thread terminates where a parameter in that callback is the thread handle that just finished. Use Boost bind and boost signals for that. When the callback is called, look for another task for that thread handle and restart the thread. That way all you have to do is add to the "tasks to do" list and the callback will remove the tasks for you. No polling needed, and no worries about too many threads.

What are multi-threading DOs and DONTs? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am applying my new found knowledge of threading everywhere and getting lots of surprises
Example:
I used threads to add numbers in an
array. And outcome was different every
time. The problem was that all of my
threads were updating the same
variable and were not synchronized.
What are some known thread issues?
What care should be taken while using
threads?
What are good multithreading resources.
Please provide examples.
sidenote:(I renamed my program thread_add.java to thread_random_number_generator.java:-)

In a multithreading environment you have to take care of synchronization so two threads doesn't clobber the state by simultaneously performing modifications. Otherwise you can have race conditions in your code (for an example see the infamous Therac-25 accident.) You also have to schedule the threads to perform various tasks. You then have to make sure that your synchronization and scheduling doesn't cause a deadlock where multiple threads will wait for each other indefinitely.
Synchronization
Something as simple as increasing a counter requires synchronization:
counter += 1;
Assume this sequence of events:
counter is initialized to 0
thread A retrieves counter from memory to cpu (0)
context switch
thread B retrieves counter from memory to cpu (0)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (1)
context switch
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
At this point the counter is 1, but both threads did try to increase it. Access to the counter has to be synchronized by some kind of locking mechanism:
lock (myLock) {
counter += 1;
}
Only one thread is allowed to execute the code inside the locked block. Two threads executing this code might result in this sequence of events:
counter is initialized to 0
thread A acquires myLock
context switch
thread B tries to acquire myLock but has to wait
context switch
thread A retrieves counter from memory to cpu (0)
thread A increases counter on cpu
thread A writes back counter from cpu to memory (1)
thread A releases myLock
context switch
thread B acquires myLock
thread B retrieves counter from memory to cpu (1)
thread B increases counter on cpu
thread B writes back counter from cpu to memory (2)
thread B releases myLock
At this point counter is 2.
Scheduling
Scheduling is another form of synchronization and you have to you use thread synchronization mechanisms like events, semaphores, message passing etc. to start and stop threads. Here is a simplified example in C#:
AutoResetEvent taskEvent = new AutoResetEvent(false);
Task task;
// Called by the main thread.
public void StartTask(Task task) {
this.task = task;
// Signal the worker thread to perform the task.
this.taskEvent.Set();
// Return and let the task execute on another thread.
}
// Called by the worker thread.
void ThreadProc() {
while (true) {
// Wait for the event to become signaled.
this.taskEvent.WaitOne();
// Perform the task.
}
}
You will notice that access to this.task probably isn't synchronized correctly, that the worker thread isn't able to return results back to the main thread, and that there is no way to signal the worker thread to terminate. All this can be corrected in a more elaborate example.
Deadlock
A common example of deadlock is when you have two locks and you are not careful how you acquire them. At one point you acquire lock1 before lock2:
public void f() {
lock (lock1) {
lock (lock2) {
// Do something
}
}
}
At another point you acquire lock2 before lock1:
public void g() {
lock (lock2) {
lock (lock1) {
// Do something else
}
}
}
Let's see how this might deadlock:
thread A calls f
thread A acquires lock1
context switch
thread B calls g
thread B acquires lock2
thread B tries to acquire lock1 but has to wait
context switch
thread A tries to acquire lock2 but has to wait
context switch
At this point thread A and B are waiting for each other and are deadlocked.

There are two kinds of people that do not use multi threading.
1) Those that do not understand the concept and have no clue how to program it.
2) Those that completely understand the concept and know how difficult it is to get it right.

I'd make a very blatant statement:
DON'T use shared memory.
DO use message passing.
As a general advice, try to limit the amount of shared state and prefer more event-driven architectures.

I can't give you examples besides pointing you at Google. Search for threading basics, thread synchronisation and you'll get more hits than you know.
The basic problem with threading is that threads don't know about each other - so they will happily tread on each others toes, like 2 people trying to get through 1 door, sometimes they will pass though one after the other, but sometimes they will both try to get through at the same time and will get stuck. This is difficult to reproduce, difficult to debug, and sometimes causes problems. If you have threads and see "random" failures, this is probably the problem.
So care needs to be taken with shared resources. If you and your friend want a coffee, but there's only 1 spoon you cannot both use it at the same time, one of you will have to wait for the other. The technique used to 'synchronise' this access to the shared spoon is locking. You make sure you get a lock on the shared resource before you use it, and let go of it afterwards. If someone else has the lock, you wait until they release it.
Next problem comes with those locks, sometimes you can have a program that is complex, so much that you get a lock, do something else then access another resource and try to get a lock for that - but some other thread has that 2nd resource, so you sit and wait... but if that 2nd thread is waiting for the lock you hold for the 1st resource.. it's going to sit and wait. And your app just sits there. This is called deadlock, 2 threads both waiting for each other.
Those 2 are the vast majority of thread issues. The answer is generally to lock for as short a time as possible, and only hold 1 lock at a time.

I notice you are writing in java and that nobody else mentioned books so Java Concurrency In Practice should be your multi-threaded bible.

-- What are some known thread issues? --
Race conditions.
Deadlocks.
Livelocks.
Thread starvation.
-- What care should be taken while using threads? --
Using multi-threading on a single-processor machine to process multiple tasks where each task takes approximately the same time isn’t always very effective.For example, you might decide to spawn ten threads within your program in order to process ten separate tasks. If each task takes approximately 1 minute to process, and you use ten threads to do this processing, you won’t have access to any of the task results for the whole 10 minutes. If instead you processed the same tasks using just a single thread, you would see the first result in 1 minute, the next result 1 minute later, and so on. If you can make use of each result without having to rely on all of the results being ready simultaneously, the single
thread might be the better way of implementing the program.
If you launch a large number of threads within a process, the overhead of thread housekeeping and context switching can become significant. The processor will spend considerable time in switching between threads, and many of the threads won’t be able to make progress. In addition, a single process with a large number of threads means that threads in other processes will be scheduled less frequently and won’t receive a reasonable share of processor time.
If multiple threads have to share many of the same resources, you’re unlikely to see performance benefits from multi-threading your application. Many developers see multi-threading as some sort of magic wand that gives automatic performance benefits. Unfortunately multi-threading isn’t the magic wand that it’s sometimes perceived to be. If you’re using multi-threading for performance reasons, you should measure your application’s performance very closely in several different situations, rather than just relying on some non-existent magic.
Coordinating thread access to common data can be a big performance killer. Achieving good performance with multiple threads isn’t easy when using a coarse locking plan, because this leads to low concurrency and threads waiting for access. Alternatively, a fine-grained locking strategy increases the complexity and can also slow down performance unless you perform some sophisticated tuning.
Using multiple threads to exploit a machine with multiple processors sounds like a good idea in theory, but in practice you need to be careful. To gain any significant performance benefits, you might need to get to grips with thread balancing.
-- Please provide examples. --
For example, imagine an application that receives incoming price information from
the network, aggregates and sorts that information, and then displays the results
on the screen for the end user.
With a dual-core machine, it makes sense to split the task into, say, three threads. The first thread deals with storing the incoming price information, the second thread processes the prices, and the final thread handles the display of the results.
After implementing this solution, suppose you find that the price processing is by far the longest stage, so you decide to rewrite that thread’s code to improve its performance by a factor of three. Unfortunately, this performance benefit in a single thread may not be reflected across your whole application. This is because the other two threads may not be able to keep pace with the improved thread. If the user interface thread is unable to keep up with the faster flow of processed information, the other threads now have to wait around for the new bottleneck in the system.
And yes, this example comes directly from my own experience :-)

DONT use global variables
DONT use many locks (at best none at all - though practically impossible)
DONT try to be a hero, implementing sophisticated difficult MT protocols
DO use simple paradigms. I.e share the processing of an array to n slices of the same size - where n should be equal to the number of processors
DO test your code on different machines (using one, two, many processors)
DO use atomic operations (such as InterlockedIncrement() and the like)

YAGNI
The most important thing to remember is: do you really need multithreading?

I agree with pretty much all the answers so far.
A good coding strategy is to minimise or eliminate the amount of data that is shared between threads as much as humanly possible. You can do this by:
Using thread-static variables (although don't go overboard on this, it will eat more memory per thread, depending on your O/S).
Packaging up all state used by each thread into a class, then guaranteeing that each thread gets exactly one state class instance to itself. Think of this as "roll your own thread-static", but with more control over the process.
Marshalling data by value between threads instead of sharing the same data. Either make your data transfer classes immutable, or guarantee that all cross-thread calls are synchronous, or both.
Try not to have multiple threads competing for the exact same I/O "resource", whether it's a disk file, a database table, a web service call, or whatever. This will cause contention as multiple threads fight over the same resource.
Here's an extremely contrived OTT example. In a real app you would cap the number of threads to reduce scheduling overhead:
All UI - one thread.
Background calcs - one thread.
Logging errors to a disk file - one thread.
Calling a web service - one thread per unique physical host.
Querying the database - one thread per independent group of tables that need updating.
Rather than guessing how to do divvy up the tasks, profile your app and isolate those bits that are (a) very slow, and (b) could be done asynchronously. Those are good candidates for a separate thread.
And here's what you should avoid:
Calcs, database hits, service calls, etc - all in one thread, but spun up multiple times "to improve performance".

Don't start new threads unless you really need to. Starting threads is not cheap and for short running tasks starting the thread may actually take more time than executing the task itself. If you're on .NET take a look at the built in thread pool, which is useful in a lot of (but not all) cases. By reusing the threads the cost of starting threads is reduced.
EDIT: A few notes on creating threads vs. using thread pool (.NET specific)
Generally try to use the thread pool. Exceptions:
Long running CPU bound tasks and blocking tasks are not ideal run on the thread pool cause they will force the pool to create additional threads.
All thread pool threads are background threads, so if you need your thread to be foreground, you have to start it yourself.
If you need a thread with different priority.
If your thread needs more (or less) than the standard 1 MB stack space.
If you need to be able to control the life time of the thread.
If you need different behavior for creating threads than that offered by the thread pool (e.g. the pool will throttle creating of new threads, which may or may not be what you want).
There are probably more exceptions and I am not claiming that this is the definitive answer. It is just what I could think of atm.

I am applying my new found knowledge of threading everywhere
[Emphasis added]
DO remember that a little knowledge is dangerous. Knowing the threading API of your platform is the easy bit. Knowing why and when you need to use synchronisation is the hard part. Reading up on "deadlocks", "race-conditions", "priority inversion" will start you in understanding why.
The details of when to use synchronisation are both simple (shared data needs synchronisation) and complex (atomic data types used in the right way don't need synchronisation, which data is really shared): a lifetime of learning and very solution specific.

An important thing to take care of (with multiple cores and CPUs) is cache coherency.

I am surprised that no one has pointed out Herb Sutter's Effective Concurrency columns yet. In my opinion, this is a must read if you want to go anywhere near threads.

a) Always make only 1 thread responsible for a resource's lifetime. That way thread A won't delete a resource thread B needs - if B has ownership of the resource
b) Expect the unexpected

DO think about how you will test your code and set aside plenty of time for this. Unit tests become more complicated. You may not be able to manually test your code - at least not reliably.
DO think about thread lifetime and how threads will exit. Don't kill threads. Provide a mechanism so that they exit gracefully.
DO add some kind of debug logging to your code - so that you can see that your threads are behaving correctly both in development and in production when things break down.
DO use a good library for handling threading rather than rolling your own solution (if you can). E.g. java.util.concurrency
DON'T assume a shared resource is thread safe.
DON'T DO IT. E.g. use an application container that can take care of threading issues for you. Use messaging.

In .Net one thing that surprised me when I started trying to get into multi-threading is that you cannot straightforwardly update the UI controls from any thread other than the thread that the UI controls were created on.
There is a way around this, which is to use the Control.Invoke method to update the control on the other thread, but it is not 100% obvious the first time around!

Don't be fooled into thinking you understand the difficulties of concurrency until you've split your head into a real project.
All the examples of deadlocks, livelocks, synchronization, etc, seem simple, and they are. But they will mislead you, because the "difficulty" in implementing concurrency that everyone is talking about is when it is used in a real project, where you don't control everything.

While your initial differences in sums of numbers are, as several respondents have pointed out, likely to be the result of lack of synchronisation, if you get deeper into the topic, be aware that, in general, you will not be able to reproduce exactly the numeric results you get on a serial program with those from a parallel version of the same program. Floating-point arithmetic is not strictly commutative, associative, or distributive; heck, it's not even closed.
And I'd beg to differ with what, I think, is the majority opinion here. If you are writing multi-threaded programs for a desktop with one or more multi-core CPUs, then you are working on a shared-memory computer and should tackle shared-memory programming. Java has all the features to do this.
Without knowing a lot more about the type of problem you are tackling, I'd hesitate to write that 'you should do this' or 'you should not do that'.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string