Multithreading vs Shared Memory - multithreading

I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in memory database (10s of Gb) - the haystack.
This is divided into tasks where each task is to find each of a series of needles in the haystack
and each task is logically independent from the other tasks.
(This is already distributed across multiple machines where each machine
has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
3 possible architectures:
A process loads the haystack into Posix shared memory.
Subsequent processes use the shared memory segment instead (like a cache)
A process loads the haystack into memory and then forks.
Each process uses the same memory because of copy on write semantics.
A process loads the haystack into memory and spawns multiple search threads
The question is one method likely to be better and why? or rather what are the trade offs.
(For argument's sake assume performance trumps implementation complexity).
Implementing two or three and measuring is possible of course but hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent there is no benefit from inter-thread communication being
cheaper than inter-process communication (as per for example https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that except for a monitoring threads/process the others are never switched out.

General advise: Be able to measure improvements! Without that, you may tweak all you like based on advise off the internet but still don't get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself) but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the according part is efficient ensures optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do log(n) comparisons, where each comparison requires a load of a certain element from the haystack. This load probably goes through all caches, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you then manage to trigger the cache loads for the needles first and then perform the comparison, you could reduce the time where either CPU or RAM are idle because they wait for new operations to perform. This is obviously (like others) a parameter you need to tweak for the system it runs on.
Even further, reconsider binary searching. Binary searching performs reliably with a good upper bound on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This is basically moving the work from the RAM bus to the CPU, so it again depends which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have less than a certain amount of elements left to consider.
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.

The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes as far as I can see is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in favour.
It might also be slightly easier to add processes than add threads. You would have to write some code to change the number of processing threads online (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.

Related

Flaws in Shared Memory of Massively Multi-Threaded Designs

I am trying to create my first application of multi-threading, one that is scalable to multi-core technology. Its inspiration comes from the concept of a event-driven spiking neural network.
The design is a little like this: The data structure of the algorithm is stored in 1 location in memory, in the form of instances of classes. An example of a task that can be performed on this structure is a neuron spiking: it will modify several values in the neuron and connected neurons, and identify any future tasks that may need to be performed. The tasks to be performed are added a queue. There are several threads whose only function is to pull a task from the queue, perform the task, and lather rinse repeat. Any updates to values can be performed in any order, as long as they are performed. Small but rare errors that result from this parallelism would have a statistically insignificant effect on the performance of the system.
This design does not use any memory other than shared memory (except for possibly a small amount of dedicated memory used for calculations). I've recently watched a few lectures where the speaker implied that the use of shared memory in multi-core and GPU applications was very slow. Even though I have a few ideas as to why that might be the case, I'd like to find out from people who have experience with this sort of thing, and maybe be directed to a useful resource to help me out.
Accessing shared state from multiple threads in multicore system can be slow due to CPU cache coherency protocol. That is every change in the shared state must be reflected in the cache lines of all the cores.
http://msdn.microsoft.com/en-us/magazine/cc163715.aspx#S2 provides good explanation why accessing shared data from multiple threads can be slow and what can be done about it.

Concurrency: Processes vs Threads

What are the main advantages of using a model for concurrency based on processes over one
based on threads and in what contexts is the latter appropriate?
Fault-tolerance and scalability are the main advantages of using Processes vs. Threads.
A system that relies on shared memory or some other kind of technology that is only available when using threads, will be useless when you want to run the system on multiple machines. Sooner or later you will need to communicate between different processes.
When using processes you are forced to deal with communication via messages, for example, this is the way Erlang handles communication. Data is not shared, so there is no risk of data corruption.
Another advantage of processes is that they can crash and you can feel relatively safe in the knowledge that you can just restart them (even across network hosts). However, if a thread crashes, it may crash the entire process, which may bring down your entire application. To illustrate: If an Erlang process crashes, you will only lose that phone call, or that webrequest, etc. Not the whole application.
In saying all this, OS processes also have many drawbacks that can make them harder to use, like the fact that it takes forever to spawn a new process. However, Erlang has it's own notion of processes, which are extremely lightweight.
With that said, this discussion is really a topic of research. If you want to get into more of the details, you can give Joe Armstrong's paper on fault-tolerant systems]1 a read, it explains a lot about Erlang and the philosophy that drives it.
The disadvantage of using a process-based model is that it will be slower. You will have to copy data between the concurrent parts of your program.
The disadvantage of using a thread-based model is that you will probably get it wrong. It may sound mean, but it's true-- show me code based on threads and I'll show you a bug. I've found bugs in threaded code that has run "correctly" for 10 years.
The advantages of using a process-based model are numerous. The separation forces you to think in terms of protocols and formal communication patterns, which means its far more likely that you will get it right. Processes communicating with each other are easier to scale out across multiple machines. Multiple concurrent processes allows one process to crash without necessarily crashing the others.
The advantage of using a thread-based model is that it is fast.
It may be obvious which of the two I prefer, but in case it isn't: processes, every day of the week and twice on Sunday. Threads are too hard: I haven't ever met anybody who could write correct multi-threaded code; those that claim to be able to usually don't know enough about the space yet.
In this case Processes are more independent of eachother, while Threads shares some resources e.g. memory. But in a general case Threads are more light-weight than Processes.
Erlang Processes is not the same thing as OS Processes. Erlang Processes are very light-weight and Erlang can have many Erlang Processes within the same OS Thread. See Technically why is processes in Erlang more efficient than OS threads?
First and foremost, processes differ from threads mostly in the way their memory is handled:
Process = n*Thread + memory region (n>=1)
Processes have their own isolated memory.
Processes can have multiple threads.
Processes are isolated from each other on the operating system level.
Threads share their memory with their peers in the process.
(This is often undesirable. There are libraries and methods out there to remedy this, but that is usually an artificial layer over operating system threads.)
The memory thing is the most important discerning factor, as it has certain implications:
Exchanging data between processes is slower than between threads. Breaking the process isolation always requires some involvement of kernel calls and memory remapping.
Threads are more lightweight than processes. The operating system has to allocate resources and do memory management for each process.
Using processes gives you memory isolation and synchronization. Common problems with access to memory shared between threads do not concern you. Since you have to make a special effort to share data between processes, you will most likely sync automatically with that.
Using processes gives you good (or ultimate) encapsulation. Since inter process communication needs special effort, you will be forced to define a clean interface. It is a good idea to break certain parts of your application out of the main executable. Maybe you can split dependencies like that.
e.g. Process_RobotAi <-> Process_RobotControl
The AI will have vastly different dependencies compared to the control component. The interface might be simple: Process_RobotAI --DriveXY--> Process_RobotControl.
Maybe you change the robot platform. You only have to implement a new RobotControl executable with that simple interface. You don't have to touch or even recompile anything in your AI component.
It will also, for the same reasons, speed up compilation in most cases.
Edit: Just for completeness I will shamelessly add what the others have reminded me of :
A crashing process does not (necessarily) crash your whole application.
In General:
Want to create something highly concurrent or synchronuous, like an algorithm with n>>1 instances running in parallel and sharing data, use threads.
Have a system with multiple components that do not need to share data or algorithms, nor do they exchange data too often, use processes. If you use a RPC library for the inter process communication, you get a network-distributable solution at no extra cost.
1 and 2 are the extreme and no-brainer scenarios, everything in between must be decided individually.
For a good (or awesome) example of a system that uses IPC/RPC heavily, have a look at ros.

Programming for Multi core Processors

As far as I know, the multi-core architecture in a processor does not effect the program. The actual instruction execution is handled in a lower layer.
my question is,
Given that you have a multicore environment, Can I use any programming practices to utilize the available resources more effectively? How should I change my code to gain more performance in multicore environments?
That is correct. Your program will not run any faster (except for the fact that the core is handling fewer other processes, because some of the processes are being run on the other core) unless you employ concurrency. If you do use concurrency, though, more cores improves the actual parallelism (with fewer cores, the concurrency is interleaved, whereas with more cores, you can get true parallelism between threads).
Making programs efficiently concurrent is no simple task. If done poorly, making your program concurrent can actually make it slower! For example, if you spend lots of time spawning threads (thread construction is really slow), and do work on a very small chunk size (so that the overhead of thread construction dominates the actual work), or if you frequently synchronize your data (which not only forces operations to run serially, but also has a very high overhead on top of it), or if you frequently write to data in the same cache line between multiple threads (which can lead to the entire cache line being invalidated on one of the cores), then you can seriously harm the performance with concurrent programming.
It is also important to note that if you have N cores, that DOES NOT mean that you will get a speedup of N. That is the theoretical limit to the speedup. In fact, maybe with two cores it is twice as fast, but with four cores it might be about three times as fast, and then with eight cores it is about three and a half times as fast, etc. How well your program is actually able to take advantage of these cores is called the parallel scalability. Often communication and synchronization overhead prevent a linear speedup, although, in the ideal, if you can avoid communication and synchronization as much as possible, you can hopefully get close to linear.
It would not be possible to give a complete answer on how to write efficient parallel programs on StackOverflow. This is really the subject of at least one (probably several) computer science courses. I suggest that you sign up for such a course or buy a book. I'd recommend a book to you if I knew of a good one, but the paralell algorithms course I took did not have a textbook for the course. You might also be interested in writing a handful of programs using a serial implementation, a parallel implementation with multithreading (regular threads, thread pools, etc.), and a parallel implementation with message passing (such as with Hadoop, Apache Spark, Cloud Dataflows, asynchronous RPCs, etc.), and then measuring their performance, varying the number of cores in the case of the parallel implementations. This was the bulk of the course work for my parallel algorithms course and can be quite insightful. Some computations you might try parallelizing include computing Pi using the Monte Carlo method (this is trivially parallelizable, assuming you can create a random number generator where the random numbers generated in different threads are independent), performing matrix multiplication, computing the row echelon form of a matrix, summing the square of the number 1...N for some very large number of N, and I'm sure you can think of others.
I don't know if it's the best possible place to start, but I've subscribed to the article feed from Intel Software Network some time ago and have found a lot of interesting thing there, presented in pretty simple way. You can find some very basic articles on fundamental concepts of parallel computing, like this. Here you have a quick dive into openMP that is one possible approach to start parallelizing the slowest parts of your application, without changing the rest. (If those parts present parallelism, of course.) Also check Intel Guide for Developing Multithreaded Applications. Or just go and browse the article section, the articles are not too many, so you can quickly figure out what suits you best. They also have a forum and a weekly webcast called Parallel Programming Talk.
Yes, simply adding more cores to a system without altering the software would yield you no results (with exception of the operating system would be able to schedule multiple concurrent processes on separate cores).
To have your operating system utilise your multiple cores, you need to do one of two things: increase the thread count per process, or increase the number of processes running at the same time (or both!).
Utilising the cores effectively, however, is a beast of a different colour. If you spend too much time synchronising shared data access between threads/processes, your level of concurrency will take a hit as threads wait on each other. This also assumes that you have a problem/computation that can relatively easily be parallelised, since the parallel version of an algorithm is often much more complex than the sequential version thereof.
That said, especially for CPU-bound computations with work units that are independent of each other, you'll most likely see a linear speed-up as you throw more threads at the problem. As you add serial segments and synchronisation blocks, this speed-up will tend to decrease.
I/O heavy computations would typically fare the worst in a multi-threaded environment, since access to the physical storage (especially if it's on the same controller, or the same media) is also serial, in which case threading becomes more useful in the sense that it frees up your other threads to continue with user interaction or CPU-based operations.
You might consider using programming languages designed for concurrent programming. Erlang and Go come to mind.

(When) are parallel sorts practical and how do you write an efficient one?

I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.
My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.
Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:
// I tweaked this number a bunch. Anything smaller than this and the
// overhead is smaller than the parallelization gains.
const smallestToParallelize = 500;
void quickSort(T)(T[] array) {
if(array.length < someConstant) {
insertionSort(array);
return;
}
size_t pivotPosition = partition(array);
if(array.length >= smallestToParallelize) {
// Sort left subarray in a task pool thread.
auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
quickSort(array[pivotPosition + 1..$]);
myTask.workWait();
} else {
// Regular serial quick sort.
quickSort(array[0..pivotPosition]);
quickSort(array[pivotPosition + 1..$]);
}
}
Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?
Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait() steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.
Keep in mind I'm not an expert on parallel sort, and folks make research careers out of parallel sort but...
1) are they useful in the real world.
of course they are, if you need to sort something expensive (like strings or worse) and you aren't pegging all the cores.
think UI code where you need to sort a large dynamic list of strings based on context
think something like a barnes-hut n-bodies sim where you need to sort the particles
2) Quicksort seems like it would give a linear speedup, but it isn't. The partition step is a sequential bottleneck, you will see this if you profile and it will tend to cap out at 2-3x on a quad core.
If you want to get good speedups on a smaller system you need to ensure that your per task overheads are really small and ideally you will want to ensure that you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.
If you want to get good speedups on a larger system you'll need to look at the scan based parallel sorts, there are papers on this. bitonic sort is also quite easy parallelize as is merge sort. A parallel radix sort can also be useful, there is one in the PPL (if you aren't averse to Visual Studio 11).
I'm no expert but... here is what I'd look at:
First of all, I've heard that as a rule of thumb, algorithms that look at small bits of a problem from the start tends to work better as parallel algorithms.
Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH if your thread pool is of fixed size and acts as a queue of short lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).
Other tweaks:
skip the myTask.wait(); at the local level and rather have a wrapper function that waits on all the tasks.
Make a separate serial implementation of the function that avoids the depth check.
"My first question is, are parallelized versions of sorting algorithms useful in the real world" - depends on the size of the data set that you are working on in the real work. For small sets of data the answer is no. For larger data sets it depends not only on the size of the data set but also the specific architecture of the system.
One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.
The same reasoning applies to chips that have multiple L2 caches and NUMA (non-uniform memory access) architectures. So the more cores that you want to distribute the sorting across, the smallestToParallelize constant will need to be increased accordingly.
Another limiting factor which you identified is shared memory access, or contention over the memory bus. Since the memory bus can only satisfy a certain number of memory accesses per second; having additional cores that do essentially nothing but read and write to main memory will put a lot of stress on the memory system.
The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.
I don't know if answers here are applicable any longer or if my suggestions are applicable to D.
Anyway ...
Assuming that D allows it, there is always the possibility of providing prefetch hints to the caches. The core in question requests that data it will soon (not immediately) need be loaded into a certain cache level. In the ideal case the data will have been fetched by the time the core starts working on it. More likely the prefetch process will be more or less on the way which at least will result in less wait states than if the data were fetched "cold."
You'll still be constrained by the overall cache-to-RAM throughput capacity so you'll need to have organized the data such that so much data is in the core's exclusive caches that it can spend a fair amount of time there before having to write updated data.
The code and data need to be organized according to the concept of cache lines (fetch units of 64 bytes each) which is the smallest-sized unit in a cache. This should result in that for two cores the work needs to be organized such that the memory system works half as much per core (assuming 100% scalability) as before when only one core was working and the work hadn't been organized. For four cores a quarter as much and so on. It's quite a challenge but by no means impossible, it just depends on how imaginative you are in restructuring the work. As always, there are solutions that cannot be conceived ... until someone does just that!
I don't know how WYSIWYG D is compared to C - which I use - but in general I think the process of developing scaleable applications is ameliorated by how much the developer can influence the compiler in its actual machine code generation. For interpreted languages there will be so much memory work going on by the interpreter that you risk not being able to discern improvements from the general "background noise."
I once wrote a multi-threaded shellsort which ran 70% faster on two cores compared to one and 100% on three cores compared to one. Four cores ran slower than three. So I know the dilemmas you face.
I would like to point you to External Sorting[1] which faces similar problems. Usually, this class of algorithms is used mostly to cope with large volumes of data, but their main point is that they split up large chunks into smaller and unrelated problems, which are therefore really great to run in parallel. You "only" need to stitch together the partial results afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).
An External Merge Sort would also work really well with an unknown amount of threads. You just split the work-load arbitrarily, and give each chunk of n elements to a thread whenever there is one idle, until all your work units are done, at which point you can start joining them up.
[1] http://en.wikipedia.org/wiki/External_sorting

Multithreading in .NET 4.0 and performance

I've been toying around with the Parallel library in .NET 4.0. Recently, I developed a custom ORM for some unusual read/write operations one of our large systems has to use. This allows me to decorate an object with attributes and have reflection figure out what columns it has to pull from the database, as well as what XML it has to output on writes.
Since I envision this wrapper to be reused in many projects, I'd like to squeeze as much speed out of it as possible. This library will mostly be used in .NET web applications. I'm testing the framework using a throwaway console application to poke at the classes I've created.
I've now learned a lesson of the overhead that multithreading comes with. Multithreading causes it to run slower. From reading around, it seems like it's intuitive to people who've been doing it for a long time, but it's actually counter-intuitive to me: how can running a method 30 times at the same time be slower than running it 30 times sequentially?
I don't think I'm causing problems by multiple threads having to fight over the same shared object (though I'm not good enough at it yet to tell for sure or not), so I assume the slowdown is coming from the overhead of spawning all those threads and the runtime keeping them all straight. So:
Though I'm doing it mainly as a learning exercise, is this pessimization? For trivial, non-IO tasks, is multithreading overkill? My main goal is speed, not responsiveness of the UI or anything.
Would running the same multithreading code in IIS cause it to speed up because of already-created threads in the thread pool, whereas right now I'm using a console app, which I assume would be single-threaded until I told it otherwise? I'm about to run some tests, but I figure there's some base knowledge I'm missing to know why it would be one way or the other. My console app is also running on my desktop with two cores, whereas a server for a web app would have more, so I might have to use that as a variable as well.
Thread's don't actually all run concurrently.
On a desktop machine I'm presuming you have a dual core CPU, (maybe a quad at most). This means only 2/4 threads can be running at the same time.
If you have spawned 30 threads, the OS is going to have to context switch between those 30 threads to keep them all running. Context switches are quite costly, so hence the slowdown.
As a basic suggestion, I'd aim for 1 thread per CPU if you are trying to optimise calculations. Any more than this and you're not really doing any extra work, you are just swapping threads in an out on the same CPU. Try to think of your computer as having a limited number of workers inside, you can't do more work concurrently than the number of workers you have available.
Some of the new features in the .net 4.0 parallel task library allow you to do things that account for scalability in the number of threads. For example you can create a bunch of tasks and the task parallel library will internally figure out how many CPUs you have available, and optimise the number of threads is creates/uses so as not to overload the CPUs, so you could create 30 tasks, but on a dual core machine the TP library would still only create 2 threads, and queue the . Obviously, this will scale very nicely when you get to run it on a bigger machine. Or you can use something like ThreadPool.QueueUserWorkItem(...) to queue up a bunch of tasks, and the pool will automatically manage how many threads is uses to perform those tasks.
Yes there is a lot of overhead to thread creation, but if you are using the .net thread pool, (or the parallel task library in 4.0) .net will be managing your thread creation, and you may actually find it creates less threads than the number of tasks you have created. It will internally swap your tasks around on the available threads. If you actually want to control explicit creation of actual threads you would need to use the Thread class.
[Some cpu's can do clever stuff with threads and can have multiple Threads running per CPU - see hyperthreading - but check out your task manager, I'd be very surprised if you have more than 4-8 virtual CPUs on today's desktops]
There are so many issues with this that it pays to understand what is happening under the covers. I would highly recommend the "Concurrent Programming on Windows" book by Joe Duffy and the "Java Concurrency in Practice" book. The latter talks about processor architecture at the level you need to understand it when writing multithreaded code. One issue you are going to hit that's going to hurt your code is caching, or more likely the lack of it.
As has been stated there is an overhead to scheduling and running threads, but you may find that there is a larger overhead when you share data across threads. That data may be flushed from the processor cache into main memory, and that will cause serious slow downs to your code.
This is the sort of low-level stuff that managed environments are supposed to protect us from, however, when writing highly parallel code, this is exactly the sort of issue you have to deal with.
A colleague of mine recorded a screencast about the performance issue with Parallel.For and Parallel.ForEach which may help:
http://rocksolidknowledge.com/ScreenCasts.mvc/Watch?video=ParallelLoops.wmv
You're speaking of an ORM, so I presume some amount of I/O is going on. If this is the case, the overhead of thread creation and context switching is going to be comparatively non-existent.
Most likely, you're experiencing I/O contention: it can be slower (particularly on rotational hard drives, but also on other storage devices) to read the same set of data if you read it out of order than if you read it in-order. So, if you're executing 30 database queries, it's possible they'll run faster sequentially than in parallel if they're all backed by the same I/O device and the queries aren't in cache. Running them in parallel may cause the system to have a bunch of I/O read requests almost simultaneously, which may cause the OS to read little bits of each in turn - causing your drive head to jump back and forth, wasting precious milliseconds.
But that's just a guess; it's not possible to really determine what's causing your slowdown without knowing more.
Although thread creation is "extremely expensive" when compared to say adding two numbers, it's not usually something you'll easily overdo. If your operations are extremely short (say, a millisecond or less), using a thread-pool rather than new threads will noticeably save time. Generally though, if your operations are that short, you should reconsider the granularity of parallelism anyhow; perhaps you're better off splitting the computation into bigger chunks: for instance, by having a fairly low number of worker tasks which handle entire batches of smaller work-items at a time rather than each item separately.

Resources