Flaws in Shared Memory of Massively Multi-Threaded Designs - multithreading

I am trying to create my first application of multi-threading, one that is scalable to multi-core technology. Its inspiration comes from the concept of an event-driven spiking neural network.
The design is a little like this: The data structure of the algorithm is stored in one location in memory, in the form of instances of classes. An example of a task that can be performed on this structure is a neuron spiking: it will modify several values in the neuron and connected neurons, and identify any future tasks that may need to be performed. The tasks to be performed are added to a queue. There are several threads whose only function is to pull a task from the queue, perform the task, and lather, rinse, repeat. Any updates to values can be performed in any order, as long as they are performed. Small but rare errors that result from this parallelism would have a statistically insignificant effect on the performance of the system.
This design does not use any memory other than shared memory (except for possibly a small amount of dedicated memory used for calculations). I've recently watched a few lectures where the speaker implied that the use of shared memory in multi-core and GPU applications was very slow. Even though I have a few ideas as to why that might be the case, I'd like to find out from people who have experience with this sort of thing, and maybe be directed to a useful resource to help me out.
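The task-queue design described in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `run_workers`, `spike`, `potential` and `edges`, the threshold of 1.0 and the increment of 0.5 are all hypothetical and not taken from the question. The sketch also guards updates with a lock so that its result is deterministic, whereas the question is willing to tolerate small races.

```python
import queue
import threading

def run_workers(initial_tasks, handle, num_workers=4):
    """Drain a shared task queue with a fixed pool of worker threads.

    handle(task) performs one task and returns a list of follow-up tasks.
    """
    tasks = queue.Queue()
    for t in initial_tasks:
        tasks.put(t)

    def worker():
        while True:
            task = tasks.get()
            if task is None:              # sentinel: time to shut down
                tasks.task_done()
                return
            try:
                for follow_up in handle(task):
                    tasks.put(follow_up)  # queued before the parent is marked done
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    tasks.join()                          # returns once every task has been handled
    for _ in threads:
        tasks.put(None)
    for th in threads:
        th.join()

# Hypothetical toy network: neuron 0 spikes first and excites its neighbours.
potential = {0: 0.0, 1: 0.6, 2: 0.6, 3: 0.0}
edges = {0: [1, 2], 1: [3], 2: [3], 3: []}
lock = threading.Lock()
fired = []

def spike(n):
    next_tasks = []
    with lock:                            # assumption: serialise updates for clarity
        fired.append(n)
        for m in edges[n]:
            potential[m] += 0.5
            if potential[m] >= 1.0:       # neighbour crosses the threshold
                potential[m] = 0.0
                next_tasks.append(m)
    return next_tasks

run_workers([0], spike)
```

Note that every worker touches the same queue and the same `potential` table, which is exactly the kind of shared, frequently written state the answers below warn about.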

Accessing shared state from multiple threads in a multicore system can be slow because of the CPU cache coherency protocol. That is, every change to the shared state must be propagated to the caches of all the cores that hold the affected cache line.
http://msdn.microsoft.com/en-us/magazine/cc163715.aspx#S2 provides a good explanation of why accessing shared data from multiple threads can be slow and what can be done about it.


Multithreading vs Shared Memory

I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in-memory database (tens of GB): the haystack.
This is divided into tasks where each task is to find each of a series of needles in the haystack, and each task is logically independent from the other tasks.
(This is already distributed across multiple machines where each machine has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
Three possible architectures:
1. A process loads the haystack into POSIX shared memory. Subsequent processes use the shared memory segment instead (like a cache).
2. A process loads the haystack into memory and then forks. Each process uses the same memory because of copy-on-write semantics.
3. A process loads the haystack into memory and spawns multiple search threads.
The question is: is one method likely to be better, and why? Or rather, what are the trade-offs?
(For argument's sake assume performance trumps implementation complexity).
Implementing two or three and measuring is possible of course but hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent there is no benefit from inter-thread communication being
cheaper than inter-process communication (as per for example https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
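To make the thread-based architecture concrete, here is a minimal sketch in which several threads search one immutable, sorted haystack without any locking. The helper names `equal_range` and `run_task` and the tiny data set are hypothetical; `bisect_left` and `bisect_right` together stand in for C++'s `equal_range`. In CPython the global interpreter lock prevents CPU-bound threads from running truly in parallel, so this shows the structure only, not the performance.

```python
from bisect import bisect_left, bisect_right
from concurrent.futures import ThreadPoolExecutor

def equal_range(haystack, needle):
    """Return (lo, hi) such that haystack[lo:hi] holds every copy of needle."""
    return bisect_left(haystack, needle), bisect_right(haystack, needle)

def run_task(haystack, needles):
    """One logically independent task: locate each needle in the haystack."""
    return {n: equal_range(haystack, n) for n in needles}

# Immutable, sorted haystack shared read-only by all threads (no locks needed).
haystack = (1, 3, 3, 3, 7, 9, 9)
tasks = [[3, 9], [4], [1, 7]]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda needles: run_task(haystack, needles), tasks))
```

Because the data is never written, the threads need no synchronisation beyond handing out tasks and collecting results.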
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that, except for a monitoring thread/process, the others are never switched out.
General advice: Be able to measure improvements! Without that, you may tweak all you like based on advice off the internet and still not get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself) but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the corresponding part is efficient ensures optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do log(n) comparisons, where each comparison requires a load of a certain element from the haystack. This load probably goes through all caches, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you then manage to trigger the cache loads for the needles first and then perform the comparison, you could reduce the time where either CPU or RAM are idle because they wait for new operations to perform. This is obviously (like others) a parameter you need to tweak for the system it runs on.
Even further, reconsider binary searching. Binary searching performs reliably with a good upper bound on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This is basically moving the work from the RAM bus to the CPU, so it again depends on which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have fewer than a certain number of elements left to consider.
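The idea of starting with an educated guess and then falling back to binary search could look something like the sketch below. It assumes integer keys that are roughly evenly distributed; the function name `hybrid_search` and the parameters `cutoff` and `max_guesses` are hypothetical tuning knobs, not values from the question.

```python
from bisect import bisect_left

def hybrid_search(data, needle, cutoff=8, max_guesses=4):
    """Leftmost index of needle in the sorted integer list data, or -1.

    Narrows the window with a few interpolation guesses, then finishes
    with an ordinary binary search on whatever is left.
    """
    lo, hi = 0, len(data)                 # candidate window is data[lo:hi]
    for _ in range(max_guesses):
        if hi - lo <= cutoff:
            break                         # small enough: switch algorithms
        first, last = data[lo], data[hi - 1]
        if not (first < needle <= last):
            break                         # a guess cannot narrow the window
        # Educated guess: assume keys are spread evenly between first and last.
        guess = lo + (needle - first) * (hi - 1 - lo) // (last - first)
        if data[guess] < needle:
            lo = guess + 1                # everything up to guess is too small
        else:
            hi = guess                    # leftmost match is at or before guess
    i = bisect_left(data, needle, lo, hi)
    return i if i < len(data) and data[i] == needle else -1

evens = list(range(0, 2000, 2))
```

On evenly spaced data such as `evens`, a single guess lands on or next to the target, so most of the log(n) memory loads of a plain binary search are skipped. On skewed data the guesses help less, which is why the number of guesses is capped.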
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.
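A rough sketch of that partitioning scheme, assuming a sorted, non-empty haystack split into contiguous ranges: a small table of each shard's first key is enough to decide which node(s) must handle a needle. The names `make_shards`, `responsible_nodes` and `count_copies` are hypothetical. Because the question searches for multiple copies, a run of equal keys may straddle a shard boundary, so the lookup returns a range of nodes rather than a single one.

```python
from bisect import bisect_left, bisect_right

def make_shards(sorted_items, n_nodes):
    """Split a sorted, non-empty list into contiguous shards, one per node."""
    size = -(-len(sorted_items) // n_nodes)          # ceiling division
    shards = [sorted_items[i:i + size] for i in range(0, len(sorted_items), size)]
    first_keys = [shard[0] for shard in shards]      # small routing table
    return shards, first_keys

def responsible_nodes(first_keys, needle):
    """Indices of the shards that could contain a copy of needle."""
    start = max(bisect_left(first_keys, needle) - 1, 0)
    stop = bisect_right(first_keys, needle)
    return range(start, stop)

def count_copies(shards, first_keys, needle):
    """Dispatch the search only to the responsible shards."""
    total = 0
    for node in responsible_nodes(first_keys, needle):
        shard = shards[node]
        total += bisect_right(shard, needle) - bisect_left(shard, needle)
    return total

shards, first_keys = make_shards([1, 3, 3, 3, 7, 9, 9], 3)
```

Each node then only ever touches its own slice of the data, which is what gives its caches a better chance of helping.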
The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes as far as I can see is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in favour.
It might also be slightly easier to add processes than add threads. You would have to write some code to change the number of processing threads online (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.

How is processor speed distributed across threads?

I am trying to estimate how fast my code will execute when run concurrently in multiple threads.
Question 1)
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run among multiple threads?
Question 2)
What effect, if any, does the presence of other threads have on the execution speed of each thread?
My Situation:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time. It takes half a second on one thread, and I was worried about how this will scale with multiple users performing the same query. Every user request is handled by a separate thread, so 100 simultaneous users will require 100 simultaneous threads. Each thread shares the same resource, but read-only. No writing. Is there any chance I could get each user to see roughly the same execution time?
Note: I know it will depend upon a number of factors but surely there must be some way of identifying whether or not your code will scale if you find it takes x amount of time for a single thread given x hardware. As final note I'd like to add I have limited experience with computer hardware architecture and how multi-threading works under the hood.
These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.
Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.
The way that most modern operating systems, including Windows, Unix/Linux, and MacOS, use the processor is by scheduling processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads that are waiting for processor time (this is a bit simplistic, as they all have some notions of process prioritization, so that high-criticality processes get pushed up the queue earlier than less important ones).
When a thread is using a processor core, it gets it all for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the thread uses up its time slice, or requests some resource that isn't immediately available, its turn at the processor core is ended, and the next scheduled task will begin. This tends to make pretty optimal use of the processor resources.
So what are the factors that determine how well a process will scale up?
What portion of its run time does a single process spend waiting for I/O and user input?
Do multiple threads hit the same resources, or different ones?
How much communication has to happen between threads? Between individual threads and your process's main thread? This takes synchronization, and introduces waiting.
How "tight" are the hotspots of the active thread? Can the body of it fit into the processor's cache, or does (much slower) main memory have to be accessed over the bus?
As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run among multiple threads?
No, you should determine it empirically.
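A minimal harness for that kind of empirical measurement might look like the sketch below: it runs the same request at several concurrency levels and reports the mean latency each simulated user sees. The names `measure_latency` and `traverse` and the concurrency levels are hypothetical, and the toy workload stands in for the real graph traversal. In CPython the global interpreter lock serialises CPU-bound threads, so absolute numbers from this sketch say nothing about other runtimes; the point is the shape of the experiment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latency(request, concurrency_levels=(1, 2, 4, 8)):
    """Return {concurrency: mean seconds per request} for request()."""
    def timed():
        start = time.perf_counter()
        request()
        return time.perf_counter() - start

    results = {}
    for n in concurrency_levels:
        # One thread per simulated user, all started at roughly the same time.
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(timed) for _ in range(n)]
            latencies = [f.result() for f in futures]
        results[n] = sum(latencies) / n
    return results

# Stand-in for a read-only traversal of a shared in-memory structure.
nodes = list(range(100_000))

def traverse():
    return sum(nodes)

latency_by_users = measure_latency(traverse, concurrency_levels=(1, 2, 4))
```

If the mean latency stays flat as the number of users grows, the workload scales; if it grows roughly in proportion, the threads are contending for something, and a profiler can tell you what.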
What effect, if any, does the presence of other threads have on the execution speed of each thread?
Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features which can increase the clock of a lone busy CPU core to compensate for all the idle cores. This sort of feature might confound your measurements and expectations about scaling.
Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.
I know it will depend upon a number of factors
Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.
but surely there must be some way of identifying whether or not your code will scale
Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:
disk I/O
network I/O
memory I/O
My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user, I can't assume it will always take 2 seconds if, say, 100 users are simultaneously running the same operation, correct?
If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.
If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.
EDIT given specifics:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time.
Your problem sounds memory+cache-bound. You should study the details of your target CPU/mem deployment or if you are designing it, opt for high memory throughput.
A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
Cache contention may be a factor in scaling as well.
Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
limited experience with computer hardware architecture and how multi-threading works under the hood.
Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).

Why would multi threaded applications in general scale bad?

I am currently thinking of reasons why multi threaded applications may not scale well.
Two reasons I am aware of and that I have been fighting with are:
Communication between threads is not done well and slows down the speed
The number of cores on a chip and the memory bandwidth to the CPU do not increase proportionally. This leads to lower memory bandwidth per core the more cores on a chip are heavily used.
What else are problems?
For point 1), the communication is not necessarily 'not done well', but in most cases there are critical sections where processes/threads have to wait for each other, e.g. to update some critical data. This is described well by Amdahl's law.
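For reference, Amdahl's law says that if a fraction p of the work can be parallelised and the rest is serial, the speedup on n cores is 1 / ((1 - p) + p / n). A small sketch, with hypothetical function names and an illustrative p of 0.9:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the number of cores grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)

# Illustrative only: 90% of the run time is parallelisable.
table = {n: round(amdahl_speedup(0.9, n), 2) for n in (1, 2, 4, 8, 16)}
```

With 10% of the time spent in critical sections, the speedup can never exceed 10x no matter how many cores are added, and it is already well short of linear at 8 cores.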
Another point I'd like to add is the scalability of the task itself. If the task (the input) is not scalable, then increasing processing power (cores/threads) cannot improve the whole throughput. For example, an application is to handle data flows, but there is a constraint that data packets from same flow can not be handled in parallel (due to ordering consideration), then the scalability will be limited by the number of flows.
In addition, the scalability of the algorithm is even more fundamental, considering the difference between O(1) and O(n) algorithms. Of course, maybe the topic here focuses on scalability with processing power rather than with data size.
I think that, in (1), you've nailed one of the most important factors that can negatively influence the performance of multithreaded apps. In particular, search for 'false sharing'.
(2), however, only affects a subset of multithreaded apps: those that run CPU-bound threads in parallel. If an app uses many threads that are I/O bound, (2) does not matter too much.
Looking at my box here, it has 100 processes and 1403 threads, CPU use 3%. Only 7 out of the 100 processes are single-threaded. Most of the apps, therefore, are multithreaded but I/O waiting.
My box would work reasonably well, at the moment, if it had only one core. Sure, hitting a link that winds up my browser would probably be a bit slower to bring up a complex page, but not much.
In the commonest case then, where apps are multithreaded to take advantage of the high I/O performance of preemptive multitaskers, apps scale very well indeed, even on a single-core CPU.
Try not to fall into the trap of thinking that preemptive multitasking OS are all about 'doing CPU-bound tasks in parallel' - they actually make this difficult by forcing the need for locking, synchro, signalling etc. It's much more about high-performance I/O, something that a cooperative scheduler is spectacularly bad at.
Many multi-threaded applications are built around the "one user one thread" concept which means that once a user or chore needs to be handled a thread is allocated to the task. Every extra thread increases the load on the scheduler leading up to the point where all processing is done trying to determine which thread should be run at this moment. Call this "scheduler saturation."
Windows (the multi-threaded engine, not 95/98/Me etc.) has a mechanism called I/O Completion Ports (IOCP), for which one thread per processor is recommended for best performance. IOCP-based applications are usually tremendously fast though, as always, the bottlenecks instead appear in other places, such as running out of certain types of OS memory or waiting on the communications medium.
You can search for IOCP here at SO, it has its own tag.
I would add:
The more threads, the smaller their share of CPU cache. A typical modern CPU might have 3 levels of cache: L1, L2 and L3. L1 might be private to each core, but L2 and L3 might be shared between cores on the die. So a single thread can use the entire L2 and L3, but if you have many threads then you get many more cache misses, depending on the profile of your algorithm.
See also:
many-core CPU's: Programming techniques to avoid disappointing scalability
It could be limited by the fixed maximum bandwidth of main memory: once your program has used up the available memory bandwidth, adding more threads cannot create more. Whether this applies depends on your specific application, i.e. whether it is memory-bound or compute-bound; see the roofline model.
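In its simplest form the roofline model bounds attainable throughput by the smaller of the peak compute rate and memory bandwidth times arithmetic intensity (operations performed per byte moved). A sketch with made-up machine numbers; the function names and all figures here are hypothetical:

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: limited by either compute or memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity above which a kernel becomes compute-bound."""
    return peak_gflops / bandwidth_gb_s

# Hypothetical machine: 100 GFLOP/s peak, 20 GB/s memory bandwidth.
PEAK, BANDWIDTH = 100.0, 20.0
memory_bound = attainable_gflops(PEAK, BANDWIDTH, 0.5)    # streaming-style kernel
compute_bound = attainable_gflops(PEAK, BANDWIDTH, 10.0)  # dense-math-style kernel
```

A kernel sitting to the left of the ridge point is already limited by memory bandwidth, so extra threads on the same memory system cannot raise its throughput.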

Delphi 2010: Advantage of running multi threads if cannot allocate memory to create object for calculation in each thread

My Previous Question
From the above answer, it seems that if my threads create objects, I will face a memory allocation/deallocation bottleneck, so running with threads may be slower than, or show no obvious time difference from, running without threads. What are the advantages of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in my thread?
What are the advantages of running multiple threads in the application if I cannot allocate memory to create the objects for calculations in my thread?
It depends on where your bottlenecks are. If your bottleneck is the amount of memory available, then creating more threads won't help. Or, if I/O is a bottleneck, trying to parallelize will just slightly slow down everything because of context switching. It's like trying to make an underpowered car faster by putting wider tyres in it: fixing the wrong thing doesn't help.
Threads are useful when the bottleneck is the processor and there are several processors available.
Well, if you allocate chunks of memory in a loop, things will slow down.
If you can create your objects once at the beginning of TThread.execute, the overhead will be smaller.
Threads can also be beneficial if you have to wait for IO operations, or if you have expensive calculations to do on a machine with more than one physical core.
If you have memory-intensive threads (many memory allocations/deallocations), you had better use TopMM instead of FastMM:
FastMM uses a lock which blocks all other threads; TopMM does not, so it scales much better on multiple cores/CPUs!
When it comes to multithreading, shared-resource issues will always arise (with current technology). All resources that may need serialization (RAM, disk, etc.) are possible bottlenecks. Multithreading is not a magic solution that turns a slow app into a fast one, and it does not always result in better speed. Done the wrong way, it can actually result in worse speed. The application should be analyzed to find possible bottlenecks, and some parts may need to be rewritten to minimize them using different techniques (e.g. preallocating memory, using async I/O, etc.). Anyway, performance is only one of the reasons to use more than one thread. There are several other reasons, for example letting the user interact with the application while background threads perform operations (e.g. printing, checking data, etc.) without "locking" the user out. The application may then seem "faster" (the user can keep on using it without waiting) even if it is actually slower (it takes more time to finish the operations than if they were done serially).

Programming for Multi core Processors

As far as I know, the multi-core architecture in a processor does not affect the program. The actual instruction execution is handled in a lower layer.
my question is,
Given that you have a multicore environment, Can I use any programming practices to utilize the available resources more effectively? How should I change my code to gain more performance in multicore environments?
That is correct. Your program will not run any faster (except for the fact that the core is handling fewer other processes, because some of the processes are being run on the other core) unless you employ concurrency. If you do use concurrency, though, more cores improves the actual parallelism (with fewer cores, the concurrency is interleaved, whereas with more cores, you can get true parallelism between threads).
Making programs efficiently concurrent is no simple task. If done poorly, making your program concurrent can actually make it slower! For example, if you spend lots of time spawning threads (thread construction is really slow), and do work on a very small chunk size (so that the overhead of thread construction dominates the actual work), or if you frequently synchronize your data (which not only forces operations to run serially, but also has a very high overhead on top of it), or if you frequently write to data in the same cache line between multiple threads (which can lead to the entire cache line being invalidated on one of the cores), then you can seriously harm the performance with concurrent programming.
It is also important to note that if you have N cores, that DOES NOT mean that you will get a speedup of N. That is the theoretical limit to the speedup. In fact, maybe with two cores it is twice as fast, but with four cores it might be about three times as fast, and then with eight cores it is about three and a half times as fast, etc. How well your program is actually able to take advantage of these cores is called the parallel scalability. Often communication and synchronization overhead prevent a linear speedup, although, in the ideal, if you can avoid communication and synchronization as much as possible, you can hopefully get close to linear.
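To put numbers on the illustrative speedups mentioned above (2x on two cores, about 3x on four, about 3.5x on eight), one can compute the parallel efficiency (speedup divided by core count) and the Karp-Flatt metric, which estimates the serial fraction implied by a measured speedup. This is a sketch with hypothetical function names; the figures are the answer's examples, not measurements.

```python
def efficiency(speedup, cores):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / cores

def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction implied by a speedup."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)

# Illustrative speedups from the text, keyed by core count.
measured = {2: 2.0, 4: 3.0, 8: 3.5}
report = {n: (efficiency(s, n), karp_flatt(s, n)) for n, s in measured.items()}
```

Here efficiency falls from 100% to 75% to about 44%, and the implied serial fraction rises with the core count, which is the usual sign that communication or synchronisation overhead is growing as cores are added.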
It would not be possible to give a complete answer on how to write efficient parallel programs on StackOverflow. This is really the subject of at least one (probably several) computer science courses. I suggest that you sign up for such a course or buy a book. I'd recommend a book to you if I knew of a good one, but the parallel algorithms course I took did not have a textbook. You might also be interested in writing a handful of programs using a serial implementation, a parallel implementation with multithreading (regular threads, thread pools, etc.), and a parallel implementation with message passing (such as with Hadoop, Apache Spark, Cloud Dataflows, asynchronous RPCs, etc.), and then measuring their performance, varying the number of cores in the case of the parallel implementations. This was the bulk of the course work for my parallel algorithms course and can be quite insightful. Some computations you might try parallelizing include computing Pi using the Monte Carlo method (this is trivially parallelizable, assuming you can create a random number generator where the random numbers generated in different threads are independent), performing matrix multiplication, computing the row echelon form of a matrix, summing the squares of the numbers 1...N for some very large N, and I'm sure you can think of others.
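As a starting point for the Monte Carlo Pi exercise, the sketch below gives each worker its own seeded generator so that no random-number state is shared between threads. The names `count_hits` and `estimate_pi`, the seed values and the sample counts are all hypothetical. Using consecutive seeds is a common practical shortcut rather than a guarantee of statistical independence, and in CPython the global interpreter lock means these threads will not actually run faster than a serial loop; swapping in `ProcessPoolExecutor` is one way to get real parallelism.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def count_hits(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)             # private generator: no shared state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(workers=4, samples_per_worker=50_000, base_seed=12345):
    """Split the sampling across workers and combine their hit counts."""
    seeds = [base_seed + i for i in range(workers)]
    counts = [samples_per_worker] * workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(count_hits, seeds, counts))
    return 4.0 * hits / (workers * samples_per_worker)

pi_estimate = estimate_pi()
```

The only communication between workers is the final sum of four integers, which is why this problem is a good first test of how close to linear a given platform can get.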
I don't know if it's the best possible place to start, but I subscribed to the article feed from Intel Software Network some time ago and have found a lot of interesting things there, presented in a pretty simple way. You can find some very basic articles on fundamental concepts of parallel computing, like this. Here you have a quick dive into OpenMP, which is one possible approach to start parallelizing the slowest parts of your application without changing the rest. (If those parts present parallelism, of course.) Also check the Intel Guide for Developing Multithreaded Applications. Or just go and browse the article section; there are not too many articles, so you can quickly figure out what suits you best. They also have a forum and a weekly webcast called Parallel Programming Talk.
Yes, simply adding more cores to a system without altering the software would yield you no results (with the exception that the operating system would be able to schedule multiple concurrent processes on separate cores).
To have your operating system utilise your multiple cores, you need to do one of two things: increase the thread count per process, or increase the number of processes running at the same time (or both!).
Utilising the cores effectively, however, is a beast of a different colour. If you spend too much time synchronising shared data access between threads/processes, your level of concurrency will take a hit as threads wait on each other. This also assumes that you have a problem/computation that can relatively easily be parallelised, since the parallel version of an algorithm is often much more complex than the sequential version thereof.
That said, especially for CPU-bound computations with work units that are independent of each other, you'll most likely see a linear speed-up as you throw more threads at the problem. As you add serial segments and synchronisation blocks, this speed-up will tend to decrease.
I/O heavy computations would typically fare the worst in a multi-threaded environment, since access to the physical storage (especially if it's on the same controller, or the same media) is also serial, in which case threading becomes more useful in the sense that it frees up your other threads to continue with user interaction or CPU-based operations.
You might consider using programming languages designed for concurrent programming. Erlang and Go come to mind.
