(When) are parallel sorts practical and how do you write an efficient one? - multithreading

I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.
My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.
Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:
// I tweaked this number a bunch. Anything smaller than this and the
// overhead is larger than the parallelization gains.
const smallestToParallelize = 500;

void quickSort(T)(T[] array) {
    // someConstant: small cutoff below which insertion sort is used.
    if(array.length < someConstant) {
        insertionSort(array);
        return;
    }

    size_t pivotPosition = partition(array);

    if(array.length >= smallestToParallelize) {
        // Sort left subarray in a task pool thread.
        auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
        quickSort(array[pivotPosition + 1..$]);
        myTask.workWait();
    } else {
        // Regular serial quick sort.
        quickSort(array[0..pivotPosition]);
        quickSort(array[pivotPosition + 1..$]);
    }
}
Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?
Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait() steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.

Keep in mind I'm not an expert on parallel sort, and folks make research careers out of parallel sort but...
1) Are they useful in the real world?
Of course they are, if you need to sort something expensive to compare (like strings or worse) and you aren't already pegging all the cores.
Think of UI code where you need to sort a large dynamic list of strings based on context.
Think of something like a Barnes-Hut n-body simulation where you need to sort the particles.
2) Quicksort seems like it would give a linear speedup, but it doesn't. The partition step is a sequential bottleneck; you will see this if you profile, and the speedup will tend to cap out at 2-3x on a quad core.
If you want to get good speedups on a smaller system you need to ensure that your per task overheads are really small and ideally you will want to ensure that you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.
If you want to get good speedups on a larger system you'll need to look at the scan-based parallel sorts; there are papers on this. Bitonic sort is also quite easy to parallelize, as is merge sort. A parallel radix sort can also be useful; there is one in the PPL (if you aren't averse to Visual Studio 11).

I'm no expert but... here is what I'd look at:
First of all, I've heard that, as a rule of thumb, algorithms that look at small bits of a problem from the start tend to work better as parallel algorithms.
Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH if your thread pool is of fixed size and acts as a queue of short lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).
Other tweaks:
Skip the myTask.workWait() at the local level and instead have a wrapper function that waits on all the tasks.
Make a separate serial implementation of the function that avoids the depth check.
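To make the suggestion above concrete, here is a rough sketch in Python (not D, and not the asker's library). It partitions in the calling thread until there are about 2x as many segments as workers, hands each segment to a fixed-size pool, and waits for everything in one wrapper. The names partition, split_into_segments and parallel_quick_sort are made up for illustration, and the built-in sorted() stands in for the separate serial quick sort. In CPython the global interpreter lock will hide most of the speedup, so treat this as a picture of the structure rather than a benchmark.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partition(a, lo, hi):
    """Lomuto partition of a[lo:hi] (hi exclusive); returns the pivot's final index."""
    mid = (lo + hi - 1) // 2
    a[mid], a[hi - 1] = a[hi - 1], a[mid]  # move the middle element to the end as pivot
    pivot = a[hi - 1]
    store = lo
    for i in range(lo, hi - 1):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi - 1] = a[hi - 1], a[store]
    return store


def split_into_segments(a, target):
    """Serially partition the largest segment until there are `target` segments."""
    segments = [(0, len(a))]
    while len(segments) < target:
        lo, hi = max(segments, key=lambda s: s[1] - s[0])
        if hi - lo < 2:
            break  # nothing left that is worth splitting
        segments.remove((lo, hi))
        p = partition(a, lo, hi)
        segments += [(lo, p), (p + 1, hi)]  # the pivot at index p is already in place
    return segments


def sort_segment(a, lo, hi):
    """Stand-in for a purely serial sort of one segment (no depth check needed)."""
    a[lo:hi] = sorted(a[lo:hi])


def parallel_quick_sort(a, workers=None):
    """Sort list `a` in place and return it."""
    workers = workers or os.cpu_count() or 1
    segments = split_into_segments(a, 2 * workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(sort_segment, a, lo, hi) for lo, hi in segments]
        for future in futures:  # the single place where we wait on all tasks
            future.result()
    return a
```

Note that the splitting phase here is entirely serial, which is exactly the partition bottleneck mentioned in another answer; a real implementation would probably parallelize the later partitions as well.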

"My first question is, are parallelized versions of sorting algorithms useful in the real world" - depends on the size of the data set that you are working on in the real work. For small sets of data the answer is no. For larger data sets it depends not only on the size of the data set but also the specific architecture of the system.
One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.
The same reasoning applies to chips that have multiple L2 caches and to NUMA (non-uniform memory access) architectures. So the more cores you want to distribute the sorting across, the larger the smallestToParallelize constant needs to be.
Another limiting factor, which you identified, is shared memory access, or contention over the memory bus. The memory bus can only satisfy a certain number of memory accesses per second, so having additional cores that do essentially nothing but read and write main memory will put a lot of stress on the memory system.
The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.

I don't know if answers here are applicable any longer or if my suggestions are applicable to D.
Anyway ...
Assuming that D allows it, there is always the possibility of providing prefetch hints to the caches. The core in question requests that data it will soon (not immediately) need be loaded into a certain cache level. In the ideal case the data will have been fetched by the time the core starts working on it. More likely the prefetch will still be in flight, which at least results in fewer wait states than if the data were fetched "cold."
You'll still be constrained by the overall cache-to-RAM throughput, so you'll need to organize the data so that enough of it sits in the core's exclusive caches for the core to spend a fair amount of time working there before having to write updated data back.
The code and data need to be organized around cache lines (fetch units of typically 64 bytes each), the smallest unit a cache handles. The goal is that with two cores the memory system does half as much work per core (assuming 100% scalability) as it did when only one core was working and the work hadn't been organized; with four cores a quarter as much, and so on. It's quite a challenge but by no means impossible; it just depends on how imaginative you are in restructuring the work. As always, there are solutions that cannot be conceived ... until someone does just that!
I don't know how WYSIWYG D is compared to C, which I use, but in general I think developing scalable applications gets easier the more the developer can influence the machine code the compiler actually generates. For interpreted languages the interpreter itself does so much memory work that you risk not being able to discern improvements from the general "background noise."
I once wrote a multi-threaded shellsort which ran 70% faster on two cores compared to one and 100% on three cores compared to one. Four cores ran slower than three. So I know the dilemmas you face.

I would like to point you to External Sorting[1] which faces similar problems. Usually, this class of algorithms is used mostly to cope with large volumes of data, but their main point is that they split up large chunks into smaller and unrelated problems, which are therefore really great to run in parallel. You "only" need to stitch together the partial results afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).
An External Merge Sort would also work really well with an unknown number of threads. You just split the workload arbitrarily and give each chunk of n elements to a thread whenever one is idle, until all your work units are done, at which point you can start joining them up.
[1] http://en.wikipedia.org/wiki/External_sorting
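A minimal sketch of that split, sort, then stitch idea, written in Python purely for illustration. The function name chunked_parallel_sort and its parameters are my own; the chunks are sorted by whichever pool thread is idle, and the final merge is the serial "joining up" step, done here with the standard library's heapq.merge. With CPython threads the chunk sorts will not really overlap for plain Python objects, so this only shows the shape of the algorithm.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked_parallel_sort(data, chunk_size, workers=4):
    """Return a new sorted list built from independently sorted chunks."""
    # Split the input into unrelated work units of at most chunk_size elements.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Each chunk is handed to a pool thread as soon as one is free.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))

    # Serial k-way merge of the sorted runs.
    return list(heapq.merge(*sorted_chunks))
```

The chunk size plays the same role as smallestToParallelize in the question: too small and the per-task overhead dominates, too large and some threads run out of work early.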

Related

Multiprocessing: why doesn't a single thread just use more cpu?

I'm learning about multiprocessing and it seems to be applicable in one of two scenarios:
our program is waiting for some I/O, so it makes sense to go do something else while waiting;
we break our program up so that individual parts of it can run "in parallel", in an attempt to take full advantage of the cpu
My confusion is about the second case. I'm probably just lacking in my understanding of how cpus really work: but if our single thread process is only using 1% of the cpu and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more cpu and finishes faster?
but if our single thread process is only using 1% of the cpu and it therefore makes sense to get more threads going, then why wouldn't we just (somehow?) speed up that single process so that it uses more cpu and finishes faster?
We don't know how to. There seem to be fundamental limitations to how fast we can do things that we haven't quite figured out how to get around. So instead, we do more than one thing at a time.
It takes a woman 9 months to make a baby. So if you want lots of babies, you get lots of women. You don't try to get one woman to go faster.
Say you want to raise 7 to the twenty-millionth power and also raise 11 to the twenty-millionth power. The number of steps in each of these two operations can be reduced, but you will reach a limit. Say each operation takes N sequential steps (each requiring the output from the previous step as its input) and the fastest we can do a single step is Q nanoseconds. With one thread, it will take at least 2NQ nanoseconds to perform all the operations. With two threads, we can do one step from each of the two operations at the same time, reducing the minimum time to NQ nanoseconds.
That's a big win.
I might be wrong, but when we split things into threads, we want to make use of the multi-core architecture of our CPUs.
We mostly think of a CPU as a single unit, but you must have heard that, say, a particular i5 is a quad-core processor, meaning 4 cores make up the CPU, while a particular i3 is a dual-core processor, i.e. it only has two cores.
So the aggregate CPU utilization for a quad-core would be 100% split into 4 x 25%, and a single thread can keep at most one of those cores busy. There's a difference between concurrency and parallelism. Parallel means each thread runs on a separate core, making full use of it. Now you have 4 people doing one job, or, as a better analogy, there are 4 printers in the office and 4 people can go ahead and get the copies that they want. This is parallelism.
Using that same analogy, let's reduce it to just one copier/printer while 4 people want to make copies. What we do is make use of concurrency: we print only 25% of each requested copy, then switch to the next person, then the next, and then the next, so it takes 4 rounds for all the copies to get printed. Even though we utilized 100% of the copier's capability, our guys still had to wait. This waiting time also depends on the length of the document each of them wanted to print, so we use something like pre-emption: you can only execute/print for a certain amount of time before we start printing for the next guy.
Speeding up a single process by allocating it 100% of the CPU is not a problem [although we want to run a bunch of other stuff like the GUI, music, system services etc., 85% is doable]; the execution time becomes, at best, 1/4th when the work is distributed between the 4 cores. Imagine you have to print a book that is 400 pages long, and you use 4 copiers to print 100 pages each. It will be faster, right?
I hope I made some sense, Going to sleep.

Multithreading vs Shared Memory

I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in-memory database (tens of GB): the haystack.
This is divided into tasks where each task is to find each of a series of needles in the haystack
and each task is logically independent from the other tasks.
(This is already distributed across multiple machines where each machine
has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
3 possible architectures:
1) A process loads the haystack into POSIX shared memory. Subsequent processes use the shared memory segment instead (like a cache).
2) A process loads the haystack into memory and then forks. Each process uses the same memory because of copy-on-write semantics.
3) A process loads the haystack into memory and spawns multiple search threads.
The question is: is one method likely to be better, and why? Or rather, what are the trade-offs?
(For argument's sake assume performance trumps implementation complexity).
Implementing two or three and measuring is possible of course but hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent there is no benefit from inter-thread communication being
cheaper than inter-process communication (as per for example https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that, except for a monitoring thread/process, the others are never switched out.
General advice: be able to measure improvements! Without that, you may tweak all you like based on advice off the internet and still not get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself) but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the corresponding part is efficient is what gets you optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do log(n) comparisons per needle, where each comparison requires loading a certain element from the haystack. This load probably misses every cache level, because the size of the data makes cache hits very unlikely. However, you could work on multiple needles at the same time. If you then manage to trigger the memory loads for several needles first and only then perform the comparisons, you could reduce the time during which either CPU or RAM sits idle waiting for new operations to perform. The number of needles handled together is obviously (like other parameters) something you need to tweak for the system it runs on.
Even further, reconsider binary searching. Binary searching performs reliably with a good upper bound on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This is basically moving the work from the RAM bus to the CPU, so it again depends on which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have fewer than a certain number of elements left to consider.
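Here is roughly what I mean by switching algorithms, as a Python sketch. It assumes the haystack is a sorted sequence of numbers (so a position can be estimated by linear interpolation); the name hybrid_search and the cutoff value are arbitrary choices of mine. It narrows the range with interpolated guesses and falls back to the standard bisect module once fewer than cutoff elements remain. On badly skewed data the interpolation phase can degrade, so measure before adopting anything like it.

```python
import bisect


def hybrid_search(haystack, needle, cutoff=64):
    """Return the index of the leftmost occurrence of needle, or -1 if absent.

    Invariant: everything left of lo is < needle, everything from hi on is >= needle.
    """
    lo, hi = 0, len(haystack)
    while hi - lo > cutoff:
        first, last = haystack[lo], haystack[hi - 1]
        if needle <= first:
            hi = lo  # the leftmost candidate is lo itself
            break
        if needle > last:
            lo = hi  # nothing in this range can match
            break
        # Educated guess: here first < needle <= last, so last - first > 0.
        guess = lo + (needle - first) * (hi - 1 - lo) // (last - first)
        if haystack[guess] < needle:
            lo = guess + 1
        else:
            hi = guess
    # Few elements left (or range resolved): finish with a plain binary search.
    i = bisect.bisect_left(haystack, needle, lo, hi)
    if i < len(haystack) and haystack[i] == needle:
        return i
    return -1
```

Each loop iteration costs one or two extra haystack loads for the range endpoints, which is the trade between RAM traffic and CPU work mentioned above.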
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.
The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes as far as I can see is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in favour.
It might also be slightly easier to add processes than add threads. You would have to write some code to change the number of processing threads online (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.
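For what it's worth, the single-process, multi-threaded layout for this kind of workload can be sketched as follows, in Python for brevity. The haystack is loaded once, never modified, and shared by all search threads, so no locking is needed around it. The names equal_range, run_task and search_all are illustrative only; equal_range mimics the C++ function mentioned in the question using the standard bisect module. CPython threads will not give real CPU parallelism here, so this shows the structure only.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


def equal_range(haystack, needle):
    """Half-open index range [first, last) of all copies of needle."""
    return (bisect.bisect_left(haystack, needle),
            bisect.bisect_right(haystack, needle))


def run_task(haystack, needles):
    """One logically independent task: look up every needle in its list."""
    return {needle: equal_range(haystack, needle) for needle in needles}


def search_all(haystack, tasks, workers=4):
    """Run all tasks on a pool of threads that share the read-only haystack."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda needles: run_task(haystack, needles), tasks))


if __name__ == "__main__":
    haystack = (1, 2, 2, 2, 5, 8, 8)  # a tuple, so it cannot be modified by accident
    print(search_all(haystack, [[2, 8], [3]]))
```

Because the tasks never communicate, the same run_task function could be handed to a process pool instead with little change, which is one way to compare the architectures by measurement.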

How is processor speed distributed across threads?

Objective:
I am trying to estimate how fast my code will execute when run concurrently in multiple threads.
Question 1)
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?
Question 2)
What impact, if any, does the presence of other threads have on the execution speed of each other thread?
My Situation:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time. It takes half a second on 1 thread, and I was worried about how this will scale with multiple users performing the same query. Every user request is handled by a separate thread, so 100 simultaneous users will require 100 simultaneous threads. Each thread shares the same resource, but read-only; there is no writing. Is there any chance I could get each user to see roughly the same execution time?
Note: I know it will depend upon a number of factors but surely there must be some way of identifying whether or not your code will scale if you find it takes x amount of time for a single thread given x hardware. As final note I'd like to add I have limited experience with computer hardware architecture and how multi-threading works under the hood.
These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.
Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.
The way that most modern operating systems, including Windows, Unix/Linux, and MacOS, use the processor is by scheduling processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads that are waiting for processor time (this is a bit simplistic, as they all have some notions of process prioritization, so that high-criticality processes get pushed up the queue earlier than less important ones).
When a thread is using a processor core, it gets it all for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the thread uses up its time slice, or requests some resource that isn't immediately available, its turn at the processor core ends, and the next scheduled task begins. This tends to make pretty optimal use of the processor resources.
So what are the factors that determine how well a process will scale up?
What portion of its run time does a single process spend waiting for
I/O and user input?
Do multiple threads hit the same resources, or different ones?
How much communication has to happen between threads? Between individual threads and your process's main thread? This takes synchronization, and introduces waiting.
How "tight" are the hotspots of the active thread? Can the hot code and data fit into the processor's cache, or does the (much slower) main memory have to be accessed over the bus?
As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?
No, you should determine it empirically.
What impact, if any, does the presence of other threads have on the execution speed of each other thread?
Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features which can increase the clock of a lone busy CPU core to compensate for all the idle cores. This sort of feature might confound your measurements and expectations about scaling.
Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.
I know it will depend upon a number of factors
Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.
but surely there must be some way of identifying whether or not your code will scale
Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:
disk I/O
network I/O
memory I/O
computation
My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user, I can't assume it will always take 2 seconds if, say, 100 users are simultaneously running the same operation, correct?
If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.
If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.
EDIT given specifics:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time.
Your problem sounds memory+cache-bound. You should study the details of your target CPU/mem deployment or if you are designing it, opt for high memory throughput.
A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
Cache contention may be a factor in scaling as well.
Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
limited experience with computer hardware architecture and how multi-threading works under the hood.
Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).
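As a starting point for the "prototype it and measure it" advice, something along these lines could be used. It is a Python sketch with made-up names (time_concurrent, scaling_table); query stands for whatever single-request function you want to test. It runs n simultaneous copies of the query and reports how much longer the whole batch takes than one request on its own. A value near 1.0 means each user sees roughly the single-user time; a value near n means the requests are effectively serialized. Note that in CPython the global interpreter lock serializes pure-Python work, so for a real answer the harness needs to be written in the language and on the hardware you will deploy.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_concurrent(query, n_threads, repeats=3):
    """Best wall-clock time (seconds) for n_threads simultaneous calls of query()."""
    best = float("inf")
    for _ in range(repeats):
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            start = time.perf_counter()
            futures = [pool.submit(query) for _ in range(n_threads)]
            for future in futures:
                future.result()  # wait for every simulated user to finish
            best = min(best, time.perf_counter() - start)
    return best


def scaling_table(query, thread_counts):
    """Map each thread count to batch time divided by the single-request time."""
    baseline = time_concurrent(query, 1)
    return {n: time_concurrent(query, n) / baseline for n in thread_counts}


if __name__ == "__main__":
    # Hypothetical stand-in for the graph traversal: chase indices through a list.
    links = [(i * 7919 + 1) % 100_000 for i in range(100_000)]

    def traverse():
        node = 0
        for _ in range(100_000):
            node = links[node]
        return node

    for n, factor in scaling_table(traverse, [1, 2, 4, 8]).items():
        print(f"{n} threads: {factor:.2f}x the single-request time")
```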

Optimal Number of threads to use

Ok so, I'm solving a very parallel problem.
- generating primes (it's not quite embarrassingly parallel, since they are written to, and read back from for checking whether they are a factor, a common source).
for interest: http://pastebin.com/sQQLpMgB
In any case, the thing that inspired me to write this (in part) was the realisation that I have access to this:
dual Xeon E5520 CPUs (with IIRC 16GB RAM to go with it)
So I know that each CPU supports 8 active threads.
But then there are background processes (and likely other users) using up some of those (in fact probably more than all of those).
So what is a good rule of thumb as to how many threads make things go faster before they are held back by their overhead? (I guess this rule would need to take into account how many threads can be active at once.)
There is no such rule. It will depend on many factors, particularly on whether your app is I/O bound (it sounds like yours isn't). The thing to do is to parameterise the number of threads so that it can be specified from a config file or from the command line, and then play around with this number until you hit a sweet spot for your particular problem and configuration.
If the operation is mostly CPU bound (not waiting for I/O operations) then a good first guess is 1-to-1 with the number of logical CPU cores. Considering that generating prime numbers is mostly CPU bound and that you will have 16 logical cores at your disposal then I would start with 16 threads. Do a few tests and see what happens. I expect the performance to peak around 16 threads, but that really depends on how much I/O is occurring to store the primes that have been generated.
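Both answers boil down to: start at one thread per logical core and make the number adjustable. A small Python sketch of that, where the --threads flag and the function name parse_thread_count are just my choices for illustration:

```python
import argparse
import os


def parse_thread_count(argv=None):
    """Read the worker count from the command line, defaulting to the logical core count."""
    parser = argparse.ArgumentParser(description="Prime generator (thread-count demo)")
    parser.add_argument(
        "--threads",
        type=int,
        default=os.cpu_count() or 1,  # os.cpu_count() may return None
        help="number of worker threads (default: number of logical cores)",
    )
    args = parser.parse_args(argv)
    if args.threads < 1:
        parser.error("--threads must be at least 1")
    return args.threads


if __name__ == "__main__":
    print(f"using {parse_thread_count()} worker threads")
```

You can then run the program with, say, --threads 8, 12, 16 and 24, time each run, and keep whichever is fastest on that particular machine under its usual background load.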

Programming for Multi core Processors

As far as I know, the multi-core architecture in a processor does not affect the program. The actual instruction execution is handled in a lower layer.
my question is,
Given that you have a multicore environment, Can I use any programming practices to utilize the available resources more effectively? How should I change my code to gain more performance in multicore environments?
That is correct. Your program will not run any faster (except for the fact that the core is handling fewer other processes, because some of the processes are being run on the other core) unless you employ concurrency. If you do use concurrency, though, more cores improves the actual parallelism (with fewer cores, the concurrency is interleaved, whereas with more cores, you can get true parallelism between threads).
Making programs efficiently concurrent is no simple task. If done poorly, making your program concurrent can actually make it slower! For example, if you spend lots of time spawning threads (thread construction is really slow), and do work on a very small chunk size (so that the overhead of thread construction dominates the actual work), or if you frequently synchronize your data (which not only forces operations to run serially, but also has a very high overhead on top of it), or if you frequently write to data in the same cache line between multiple threads (which can lead to the entire cache line being invalidated on one of the cores), then you can seriously harm the performance with concurrent programming.
It is also important to note that if you have N cores, that DOES NOT mean that you will get a speedup of N. That is the theoretical limit to the speedup. In fact, maybe with two cores it is twice as fast, but with four cores it might be about three times as fast, and then with eight cores it is about three and a half times as fast, etc. How well your program is actually able to take advantage of these cores is called the parallel scalability. Often communication and synchronization overhead prevent a linear speedup, although, in the ideal, if you can avoid communication and synchronization as much as possible, you can hopefully get close to linear.
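One common back-of-the-envelope model for this flattening is Amdahl's law (my addition, the numbers above are not derived from it): if a fraction p of the run time can be parallelized and the rest is serial, the best possible speedup on n cores is 1 / ((1 - p) + p / n). It ignores communication and synchronization overhead, so real programs do worse. A small Python sketch, with amdahl_speedup being a name I made up:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only parallel_fraction of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With 90% of the work parallelizable the bound is roughly 3.08x on four cores, and it stays below 10x no matter how many cores are added, because the serial 10% never shrinks.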
It would not be possible to give a complete answer on how to write efficient parallel programs on StackOverflow. This is really the subject of at least one (probably several) computer science courses. I suggest that you sign up for such a course or buy a book. I'd recommend a book to you if I knew of a good one, but the parallel algorithms course I took did not have a textbook. You might also be interested in writing a handful of programs using a serial implementation, a parallel implementation with multithreading (regular threads, thread pools, etc.), and a parallel implementation with message passing (such as with Hadoop, Apache Spark, Cloud Dataflows, asynchronous RPCs, etc.), and then measuring their performance, varying the number of cores in the case of the parallel implementations. This was the bulk of the course work for my parallel algorithms course and can be quite insightful. Some computations you might try parallelizing include computing Pi using the Monte Carlo method (this is trivially parallelizable, assuming you can create a random number generator where the random numbers generated in different threads are independent), performing matrix multiplication, computing the row echelon form of a matrix, summing the squares of the numbers 1...N for some very large N, and I'm sure you can think of others.
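As an example of the first exercise, here is a minimal Monte Carlo estimate of Pi in Python. Each worker gets its own random.Random instance with its own seed, which is a simple (not rigorous) way to keep the streams from being shared between threads; the names count_hits and estimate_pi and the seed values are arbitrary. CPython threads will not make this faster, so a fair comparison against the serial version would need processes or another language.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between workers
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(total_samples, workers=4, base_seed=12345):
    """Split the samples evenly across workers and combine their hit counts."""
    per_worker = total_samples // workers
    seeds = [base_seed + i for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_hits = sum(pool.map(count_hits, seeds, [per_worker] * workers))
    # Area of the quarter circle over area of the square is pi / 4.
    return 4.0 * total_hits / (per_worker * workers)


if __name__ == "__main__":
    print(estimate_pi(1_000_000))
```

The only communication is the final sum of the hit counts, which is why this kind of problem is a good first test of how close to linear a given threading setup can get.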
I don't know if it's the best possible place to start, but I've subscribed to the article feed from Intel Software Network some time ago and have found a lot of interesting thing there, presented in pretty simple way. You can find some very basic articles on fundamental concepts of parallel computing, like this. Here you have a quick dive into openMP that is one possible approach to start parallelizing the slowest parts of your application, without changing the rest. (If those parts present parallelism, of course.) Also check Intel Guide for Developing Multithreaded Applications. Or just go and browse the article section, the articles are not too many, so you can quickly figure out what suits you best. They also have a forum and a weekly webcast called Parallel Programming Talk.
Yes, simply adding more cores to a system without altering the software would yield no gains for that software (with the exception that the operating system would be able to schedule multiple concurrent processes on separate cores).
To have your operating system utilise your multiple cores, you need to do one of two things: increase the thread count per process, or increase the number of processes running at the same time (or both!).
Utilising the cores effectively, however, is a beast of a different colour. If you spend too much time synchronising shared data access between threads/processes, your level of concurrency will take a hit as threads wait on each other. This also assumes that you have a problem/computation that can relatively easily be parallelised, since the parallel version of an algorithm is often much more complex than the sequential version thereof.
That said, especially for CPU-bound computations with work units that are independent of each other, you'll most likely see a linear speed-up as you throw more threads at the problem. As you add serial segments and synchronisation blocks, this speed-up will tend to decrease.
I/O heavy computations would typically fare the worst in a multi-threaded environment, since access to the physical storage (especially if it's on the same controller, or the same media) is also serial, in which case threading becomes more useful in the sense that it frees up your other threads to continue with user interaction or CPU-based operations.
You might consider using programming languages designed for concurrent programming. Erlang and Go come to mind.

Resources