Why is cache coherency important in a multi-processor system? - multithreading

Multiprocessor systems have some kind of cache coherency protocol built into them, e.g. MSI, MESI, etc. The only case where cache coherency matters is when instructions executing on two different processors try to write/read shared data. For the shared data to be practically valid, the programmer has to introduce memory barriers anyway. If there is no memory barrier, the shared data is going to be "wrong" regardless of whether the underlying processor implements cache coherence or not. Why, then, is there a need for cache coherence mechanisms at the hardware level?

Without cache coherency, instead of merely using barriers, you'd have to flush and invalidate caches when accessing shared data, which has a much higher overhead than cache coherency.
Historically, there have been a few shared-memory multiprocessor architectures without cache coherency, but they have all died out in favor of CC because they were very difficult to program correctly and efficiently.

Related

Protecting thread-local storage of a thread from other threads

Thread-local storage is a method to reduce synchronization overhead in multi-threaded applications where data is not shared between threads. That requires a protection mechanism around certain thread-local memory locations (like TLS and the stack) so that only one of the threads may access that memory. Since all threads within a process share the same virtual address space, how are the thread-local storage and stack of a thread protected from other threads of the same process?
I guess the OS should provide such a protection mechanism, and if so, how? ... The whole concept of thread-local storage is to reduce overhead, so involving the OS means adding overhead. Is there runtime library or hardware support? Or maybe it is not protected at all and is left to the programmer...
You are correct in thinking that a programmer could access the thread local storage space of another thread within the same process. It wouldn't be trivial since the programmer would have to either access the memory directly or use some undocumented APIs but it could theoretically be done. But, why would he (or she)?! The whole premise of the TLS is to make it easy for programmers to store data in a place that is not shared with other threads within the process.
The fact that the thread-local storage is managed by the OS means that the actual location of the thread-local storage in the process's memory is not advertised directly. Reading and writing to the TLS is "managed" by the operating system with relatively low overhead (a function call) by supplying a simple Get/Set API. The protection here is mostly convenience: it makes it difficult for somebody to accidentally access data that belongs to (or is also accessed by) a different thread.
It doesn't require a protection mechanism. In fact, a protection mechanism would just make things much more difficult. For example, say a thread wants to sort one of its objects, so it passes it to a sorting method. What happens if that sorting method uses multiple threads "under the hood" to do the sorting?
So your question is based on an entirely false premise. Such a protection mechanism would mean that every method that operated on an object would have to declare whether it was safe to operate on thread-specific data or not, which would be a nightmare.
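To make the point above concrete, here is a small sketch in Python (chosen only for brevity; the names `tls`, `worker`, `run_workers` and `share_private_list` are made up for this illustration). It assumes nothing beyond the standard `threading.local` class: each thread gets its own view of `tls`, but nothing stops a thread from handing one of its "private" objects to another thread, which is exactly the sorting scenario described above.

```python
import threading

# Each thread sees its own independent set of attributes on this object.
tls = threading.local()


def worker(name, results):
    tls.value = name            # stored per thread, no lock needed
    results[name] = tls.value   # reads back this thread's own copy


def run_workers(names):
    """Start one thread per name and collect what each thread saw in its TLS."""
    results = {}
    threads = [threading.Thread(target=worker, args=(n, results)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def main_thread_sees_value():
    # The calling thread never assigned tls.value, so the attribute is absent.
    return hasattr(tls, "value")


def share_private_list():
    """Hand a 'thread-local' object to a helper thread: nothing prevents it."""
    tls.items = [3, 1, 2]
    helper = threading.Thread(target=tls.items.sort)  # helper mutates our list
    helper.start()
    helper.join()
    return tls.items
```

The thread-local object only decides which slot a name resolves to in each thread; the objects stored there live in ordinary shared memory and are not protected in any way.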

performance - multithreaded or multiprocess applications

In order to develop a highly network-intensive server application on Linux, what sort of architecture is preferred? The idea is that this app would typically run on machines with multiple cores (either virtual or physical). Considering that performance is the key criterion, is it better to go for a multi-threaded application or one with a multi-process design? I do know that sharing resources and synchronizing access to such resources from multiple processes is a lot of programming overhead, but as mentioned earlier overall performance is the key requirement and so we can ignore those things. And the programming language would be C/C++.
I have heard that even multi-threaded applications (single process) can take advantage of multiple cores and run each thread on a different core independently (as long as there are no sync issues). And this scheduling is done by the kernel. If so, is there not much difference in performance between multi-threaded applications and multi-process applications? Nginx uses a multi-process architecture and is really quick, but can one get the same performance with multi-threaded applications?
Thanks.
Processes and threads on Linux are very similar to each other - the main difference is that threads share the whole virtual address space, and certain things such as signal handling differ.
This makes for cheaper context switches between threads (no need for costly MMU reloads etc.) but doesn't necessarily cause much difference in speed (especially outside of thread creation).
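A quick way to see the shared versus private address space difference is the sketch below. It is written in Python for brevity and assumes a Unix-like system where `os.fork` is available; `counter`, `bump`, `bump_in_thread` and `bump_in_child_process` are names invented for this illustration.

```python
import os
import threading

counter = {"value": 0}


def bump():
    counter["value"] += 1


def bump_in_thread():
    """A thread shares the parent's memory, so its update is visible here."""
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    return counter["value"]


def bump_in_child_process():
    """A forked child works on its own copy; the parent's value is unchanged."""
    pid = os.fork()
    if pid == 0:
        bump()          # modifies the child's private copy only
        os._exit(0)     # leave the child immediately, skipping cleanup handlers
    os.waitpid(pid, 0)
    return counter["value"]
```

Calling `bump_in_thread()` and then `bump_in_child_process()` on a fresh counter returns 1 both times: the thread's increment is seen by the caller, while the child's increment is lost when the child exits.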
For designing a highly network-intensive application, basically the only solution is to use an evented architecture (otherwise you'll bog down the system with a huge number of processes/threads and spend more time on their management than actually running work code), where you react to I/O on sockets and, based on which sockets exhibit activity, perform the appropriate operations.
A famous writeup about the problems faced in such situations is "The C10k problem", available from http://www.kegel.com/c10k.html - it describes different I/O approaches, so despite being a bit dated, it's a very good introduction.
Be careful before jumping deeply into reactor-like designs, though - they can get unwieldy and complex, so see if you can't use a library or language that provides a nicer abstraction over it (Erlang is my personal favourite here; languages with coroutines like Go can be useful too).
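For a feel of what "reacting to I/O on sockets" looks like, here is a deliberately tiny sketch of an event loop using Python's standard `selectors` module. It is an illustration under simplifying assumptions, not a server design: connections are created with `socket.socketpair()` instead of `listen`/`accept`, messages are small enough that a non-blocking `sendall` will not fill the socket buffer, and `echo_upper` is a made-up function name.

```python
import selectors
import socket


def echo_upper(messages):
    """Serve several connections from one thread; return each client's reply."""
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in messages]  # (server side, client side)
    for server_side, _ in pairs:
        server_side.setblocking(False)
        sel.register(server_side, selectors.EVENT_READ)

    # Clients send their request and half-close to signal "end of request".
    for (_, client_side), msg in zip(pairs, messages):
        client_side.sendall(msg)
        client_side.shutdown(socket.SHUT_WR)

    open_connections = len(pairs)
    while open_connections:
        # Block until at least one registered socket is readable.
        for key, _ in sel.select():
            conn = key.fileobj
            data = conn.recv(4096)
            if data:
                conn.sendall(data.upper())   # small payloads only (see lead-in)
            else:                            # empty read: peer finished sending
                sel.unregister(conn)
                conn.close()
                open_connections -= 1
    sel.close()

    replies = []
    for _, client_side in pairs:
        chunks = []
        while True:
            chunk = client_side.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        replies.append(b"".join(chunks))
        client_side.close()
    return replies
```

One thread handles every connection, and work happens only when the selector reports activity. A real server would also register for write readiness and buffer partial sends, which is where the complexity mentioned above starts to creep in.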
If your threads do their jobs independently of one another, then under Linux there is simply no reason not to go with multiple processes instead. Multiple processes would increase your memory usage, as each process has its own private memory space, but on the other hand sharing the memory space between independent threads is the worse decision. Context switching is usually handled better for processes than for threads, although it's somewhat architecture and code dependent. Processes do not have to be serialized with locks and mutexes. Processes are easier to manage and interact with in Linux. Here is a good document you might find interesting (http://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf).

OpenMP program on different hosts

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but development has stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be mistaken with Intel64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be mistaken with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and it was built into the Intel compiler suite, i.e. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions from 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems to be mature enough for production HPC environments (and it's priced accordingly). Seems like most efforts now go into development of co-array languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI as it is definitely not going away in the years to come.
Previously, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This advance had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. (source)
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality. (source)
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives to describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that consider data movement.

Which Linux IPC technique to use?

We are still in the design phase of our project, but we are thinking of having three separate processes on an embedded Linux kernel. One of the processes will be a communications module which handles all communications to and from the device through various media.
The other two processes will need to be able to send/receive messages through the communication process. I am trying to evaluate the IPC techniques that Linux provides; the messages the other processes will be sending will vary in size, from debug logs to streaming media at a ~5 Mbit rate. Also, the media could be streaming in and out simultaneously.
Which IPC technique would you suggest for this application?
http://en.wikipedia.org/wiki/Inter-process_communication
The processor is running at around 400-500 MHz, if that changes anything.
Does not need to be cross-platform, Linux only is fine.
Implementation in C or C++ is required.
When selecting your IPC you should consider causes for performance differences including transfer buffer sizes, data transfer mechanisms, memory allocation schemes, locking mechanism implementations, and even code complexity.
Of the available IPC mechanisms, the choice for performance often comes down to Unix domain sockets or named pipes (FIFOs). I read a paper on Performance Analysis of Various Mechanisms for Inter-process Communication that indicates Unix domain sockets for IPC may provide the best performance. I have seen conflicting results elsewhere which indicate pipes may be better.
When sending small amounts of data, I prefer named pipes (FIFOs) for their simplicity. This requires a pair of named pipes for bi-directional communication. Unix domain sockets take a bit more overhead to set up (socket creation, initialization and connection), but are more flexible and may offer better performance (higher throughput).
You may need to run some benchmarks for your specific application/environment to determine what will work best for you. From the description provided, it sounds like Unix domain sockets may be the best fit.
Beej's Guide to Unix IPC is good for getting started with Linux/Unix IPC.
I would go for Unix Domain Sockets: less overhead than IP sockets (i.e. no inter-machine comms) but same convenience otherwise.
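Since the messages in the question vary from short log lines to large media chunks, a stream-oriented Unix domain socket needs some framing on top. The sketch below shows one common option, a 4-byte length prefix; the framing scheme and the helper names (`send_msg`, `recv_exact`, `recv_msg`, `round_trip`) are my own assumptions, not something prescribed by the answers. It is written in Python for brevity, but `socketpair`, `send` and `recv` map directly onto the C calls of the same names. Two separate processes would normally `bind`/`connect` on a filesystem path instead of using `socketpair`.

```python
import socket
import struct
import threading


def send_msg(sock, payload):
    # 4-byte big-endian length header followed by the payload bytes.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may return fewer per recv call."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def round_trip(payloads):
    """Send payloads over a Unix domain socket pair and return what arrives."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    def send_all():
        for p in payloads:
            send_msg(a, p)

    # Send from a second thread so large payloads cannot deadlock on a full buffer.
    sender = threading.Thread(target=send_all)
    sender.start()
    received = [recv_msg(b) for _ in payloads]
    sender.join()
    a.close()
    b.close()
    return received
```

If message boundaries matter more than raw throughput, `SOCK_SEQPACKET` or `SOCK_DGRAM` Unix sockets preserve them without a length header, at the cost of a maximum message size.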
Can't believe nobody has mentioned dbus.
http://www.freedesktop.org/wiki/Software/dbus
http://en.wikipedia.org/wiki/D-Bus
Might be a bit over the top if your application is architecturally simple, in which case - in a controlled embedded environment where performance is crucial - you can't beat shared memory.
If performance really becomes a problem you can use shared memory - but it's a lot more complicated than the other methods - you'll need a signalling mechanism to signal that data is ready (semaphore etc) as well as locks to prevent concurrent access to structures while they're being modified.
The upside is that you can transfer a lot of data without having to copy it in memory, which will definitely improve performance in some cases.
Perhaps there are usable libraries which provide higher level primitives via shared memory.
Shared memory is generally obtained by mmapping the same file using MAP_SHARED (the file can be on a tmpfs if you don't want it persisted); a lot of apps also use System V shared memory (IMHO for stupid historical reasons; it's a much less nice interface to the same thing).
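As a rough illustration of the MAP_SHARED approach, the sketch below maps the same file twice and shows that a write through one mapping is visible through the other. It assumes a Unix-like system; for simplicity both mappings live in one process and stand in for two cooperating processes, and the names `SIZE`, `open_shared` and `demo` are invented for this example. It is in Python for brevity; the underlying calls are `open`, `ftruncate` and `mmap` in C.

```python
import mmap
import os
import tempfile

SIZE = 4096  # size of the shared region in bytes (arbitrary for this sketch)


def open_shared(path):
    """Map the file at path with MAP_SHARED so writes reach other mappings."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, SIZE, flags=mmap.MAP_SHARED)
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor is closed


def demo():
    # Create a backing file of the right size; on Linux it could live on tmpfs.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.truncate(SIZE)
        path = f.name
    try:
        writer = open_shared(path)
        reader = open_shared(path)   # stands in for a second process
        writer[0:5] = b"hello"       # no copy through the kernel, just a store
        data = bytes(reader[0:5])
        writer.close()
        reader.close()
        return data
    finally:
        os.unlink(path)
```

This deliberately leaves out the hard part mentioned above: a real design still needs a signalling mechanism and locking (for example a semaphore) so the reader knows when the data is ready and never sees a half-written structure.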
As of this writing (November 2014) Kdbus and Binder have left the staging branch of the Linux kernel. There is no guarantee at this point that either will make it in, but the outlook is somewhat positive for both. Binder is a lightweight IPC mechanism in Android; Kdbus is a dbus-like IPC mechanism in the kernel which reduces context switches, thus greatly speeding up messaging.
There is also "Transparent Inter-Process Communication" or TIPC, which is robust and useful for clustering and multi-node setups; http://tipc.sourceforge.net/
Unix domain sockets will address most of your IPC requirements. You don't really need a dedicated communication process in this case, since the kernel provides this IPC facility. Also, look at POSIX message queues, which in my opinion are one of the most under-utilized IPC mechanisms in Linux but come in very handy in many cases where n:1 communications are needed.

Are there examples of programming languages that support automatic management of resources besides memory?

The idea of automatic memory management has gained big support with new programming languages. I'm interested in whether concepts exist for automatic management of other resources such as files, network sockets, etc.
For single threaded applications, the pattern of a resource being available for the extent of a block of code, with clean-up at the end, exists in several languages. Examples are the use of RAII in C++, or with-open-file in Common Lisp (and equivalent in newer Lisp-influenced languages - the same in Dylan, C#, Python and in Ruby you can pass a block to a file object).
I'm not aware of anything better suited for the multithreaded environments where modern garbage collection shines, short of combining RAII and reference counting or auto_ptr in C++, which isn't always a trivial combination.
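The block-scoped pattern mentioned above looks like this in Python, one of the languages listed. This is only a sketch using the standard `contextlib.contextmanager` decorator; `scratch_file`, `demo` and `demo_with_error` are names made up for the illustration.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_file():
    """Yield an open temporary file; close and delete it when the block ends."""
    fd, path = tempfile.mkstemp()
    f = os.fdopen(fd, "w+")
    try:
        yield f, path
    finally:
        # Runs on normal exit and when an exception escapes the block.
        f.close()
        os.remove(path)


def demo():
    with scratch_file() as (f, path):
        f.write("data")
        existed_inside = os.path.exists(path)
    # After the block: the file handle is closed and the file is gone.
    return existed_inside, f.closed, os.path.exists(path)


def demo_with_error():
    try:
        with scratch_file() as (f, path):
            raise RuntimeError("boom")
    except RuntimeError:
        pass
    return f.closed, os.path.exists(path)
```

The resource lives exactly as long as the block, and cleanup is deterministic even on errors, which is the same guarantee a C++ destructor gives under RAII.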
One important distinction between automatic management of resources and automatic memory management is that memory management can often afford to be non-deterministic and only reclaimed when the process requires it, whereas often a resource is limited at an OS level, so should be reclaimed as soon as it is no longer used. Hence the choice of smart pointers rather than garbage collection as the management implementation. There's an intermediate level of resource - GDI objects, temporary file handles, threads - where an application wants to limit the total it uses, but doesn't care so much about releasing them to other processes - these are often pooled, which gets you some way towards automatic management.
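The pooling idea for that intermediate level of resource can be sketched as follows. This is a minimal illustration, assuming a fixed-size pool built on the standard `queue.Queue`; the `Pool` class and its `borrow` and `available` methods are hypothetical names, not an existing library API.

```python
import queue
from contextlib import contextmanager


class Pool:
    """Fixed-size pool: caps the total number of resources the app holds."""

    def __init__(self, factory, size):
        self._items = queue.Queue()
        for _ in range(size):
            self._items.put(factory())

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks (or raises queue.Empty after timeout) when the pool is exhausted.
        item = self._items.get(timeout=timeout)
        try:
            yield item
        finally:
            self._items.put(item)   # always hand the resource back

    def available(self):
        return self._items.qsize()
```

For example, `Pool(object, 2)` creates two placeholder resources; borrowing both inside nested `with pool.borrow()` blocks leaves `available()` at 0, and it returns to 2 once both blocks have exited. The resources are never released to the OS, only recycled, which matches the "limit the total, but don't bother returning them" case described above.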
One of the reasons we can automatically manage memory allocation now is we have so much of it.
Back in the days when memory was tight you had to squeeze the most out of every byte the system had.
Other resources such as file handles and sockets are far fewer, and still need to be handled by hand (pun intended).
Consider also the .NET Compact Framework: it's not uncommon for Windows Mobile devices to have 32 MB or 64 MB of volatile memory to play with, which - when you think about it - is still "lots".
I wonder what the footprint of the .NET Compact Framework is, and how it would perform on a Nokia phone with 4 MB of volatile memory.
Anyone any ideas?
(This is a wiki answer, feel free to correct or add more detail)
So, IMHO we can afford to be slow reclaiming memory, because we're not going to run out of it in a hurry, which isn't the case with other resources.
Object persistence and caching subsystems can be considered an automatic allocation of files and resources. If you apply a caching subsystem to a network connection you don't have to care about file opening, file deleting, and so on.
Automatic management of network connections can be seen in parallel computing environments (e.g. MPI), where you can programmatically set the shape of the processor interconnections. Then you just send a message from one process to another, almost ignoring the way it's implemented. Sometimes those messages are translated into sockets.
If you have a function that lets you get a page from its URL, would you consider it a sort of automatic socket management?

Resources