Thread-local storage is a method to reduce synchronization overhead in multi-threaded applications where data is not shared between threads. That seems to require a protection mechanism around certain thread-local memory locations (like TLS and the stack) so that only a single thread may access that memory. Since all threads within a process share the same virtual address space, how are the thread-local storage and stack of a thread protected from the other threads of the same process?
I guess the OS should provide such a protection mechanism, and if so, how? ... The whole concept of thread-local storage is to reduce overhead, so involving the OS means adding overhead. Is there runtime library or hardware support? Or maybe it is not protected at all and is left to the programmer...
You are correct in thinking that a programmer could access the thread-local storage space of another thread within the same process. It wouldn't be trivial, since the programmer would have to either access the memory directly or use some undocumented APIs, but it could theoretically be done. But why would they?! The whole premise of TLS is to make it easy for programmers to store data in a place that is not shared with other threads within the process.
The fact that the thread-local storage is managed by the OS means that the actual location of the thread-local storage in the process's memory is not advertised directly. Reading and writing the TLS is "managed" by the operating system with relatively low overhead (a function call) by supplying a simple Get/Set API. The protection here is mostly convenience: it makes it difficult for somebody to accidentally access data that belongs to (or is also accessed by) a different thread.
It doesn't require a protection mechanism. In fact, a protection mechanism would just make things much more difficult. For example, say a thread wants to sort one of its objects, so it passes it to a sorting method. What happens if that sorting method uses multiple threads "under the hood" to do the sorting?
So your question is based on an entirely false premise. Such a protection mechanism would mean that every method that operated on an object would have to declare whether it was safe to operate on thread-specific data or not, which would be a nightmare.
I have a working Java EE application that runs a multitude of threads. I want to move these threads off my application and simply have access to their data (Strings and ints).
How should I achieve this if I want to, say, call a method on my web application that accesses a thread's data on a different server/JVM?
Say you wanted to separate these layers (perhaps put them on different machines or for scalability) you would separate the data layer from your presentation layer and put them in different JVMs. You make the data layer provide a service to the presentation layer. How you do this depends on your preferred transport. e.g. it could be a web service, or you can use RMI, JMS, TCP, shared memory.
In any case, one JVM can only access the data of another process through services that JVM exposes. (Except in the case of shared memory, but it's not easy to get this working unless your data model is very simple)
"accesses a thread's data"
In Java almost all data is on the heap, which is shared by many threads. Very little of it is scoped completely to an individual thread. So the very idea of moving a thread and only "its data" does not make much sense.
And it's not just Java, pretty much any language with shared mutable state will face the same issues.
Of course your application can have a concept of thread-owned data, but that would be application logic and not part of java itself or its Thread class.
In order to develop a highly network-intensive server application on Linux, what sort of architecture is preferred? The idea is that this app would typically run on machines with multiple cores (either virtual or physical). Considering that performance is the key criterion, is it better to go for a multi-threaded application or one with a multi-process design? I do know that sharing resources, and synchronizing access to such resources from multiple processes, is a lot of programming overhead, but as mentioned earlier overall performance is the key requirement and so we can ignore those things. And the programming language would be C/C++.
I have heard that even multi-threaded applications (single process) can take advantage of multiple cores and run each thread on a different core independently (as long as there are no sync issues). And this scheduling is done by the kernel. If so, is there not much difference in performance between multi-threaded and multi-process applications? Nginx uses a multi-process architecture and is really quick, but can one get the same performance with a multi-threaded application?
Thanks.
Processes and threads on Linux are very similar to each other - the main difference is that threads share the whole virtual address space, and certain things like signal handling differ.
This makes for cheaper context switches between threads (no need for costly MMU reloads, etc.) but doesn't necessarily cause much difference in speed (especially outside of thread creation).
For designing a highly network-intensive application, basically the only solution is to use an evented architecture (otherwise you'll bog down the system with a huge number of processes/threads and spend more time on their management than actually running work code), where you react to I/O on sockets and perform the appropriate operations based on which sockets exhibit activity.
A famous writeup about the problems faced in such situations is "The C10k problem", available from http://www.kegel.com/c10k.html - it describes different I/O approaches, so despite being a bit dated, it's a very good introduction.
Be careful before jumping deeply into reactor-like designs, though - they can get unwieldy and complex, so see if you can't use library/language that provides a nicer abstraction over it (Erlang is my personal favourite in this, languages with coroutines like Go can be useful too).
If your threads do their jobs independently of one another, then under Linux there is simply no reason not to go with multiple processes instead. Multiple processes will increase your memory usage, as each process has its own private memory space, but on the other hand sharing the memory space between independent threads is the worse decision. Context switching between threads vs. processes usually works out better for processes, although this is somewhat architecture- and code-dependent. Processes don't have to serialize on locks and mutexes, and they are easier to manage and interact with in Linux. Here is a good document you might find interesting (http://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf).
My question is: why use the TLS mechanism instead of just local variables in a thread function? Can you please provide a fine example, or explain the advantage of TLS over local vars?
Thank you,
Mateusz
If you can use local variables then do so, and you almost invariably can. Only as a last resort should you use thread-local storage, which suffers from almost all the same disadvantages as global variables. Although you are looking for a reason to use thread-local storage, in fact best practice is to search for ways to avoid it!
Here is a good link from Intel on using thread-local storage to reduce synchronization:
https://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization
TLS is helpful for things like user session context information which is thread specific, but might be used in various unrelated methods. In such situations, TLS is more convenient than passing the information up and down the call stack.
I know one very good example of using TLS. When you are implementing LIBC, or porting one of the LIBC variants to a new platform, you somehow need the 'errno' variable (which on a single-threaded platform is just an extern int errno) to be unique for each thread. LIBC functions simply store it in the TLS of the current thread, and a reference to errno just reads it from TLS. TLS is the way to make any library thread-safe: you store any kind of 'static' or 'global' data in TLS, so the same function called from another thread will not corrupt the 'static' or 'global' variables seen by other threads. This makes your functions reentrant across threads.
Thread-local storage can be used to emulate global or static variables on a per-thread basis. "Normal" local variables can't.
We are still in the design phase of our project, but we are thinking of having three separate processes on an embedded Linux kernel. One of the processes will be a communications module which handles all communications to and from the device through various mediums.
The other two processes will need to be able to send/receive messages through the communication process. I am trying to evaluate the IPC techniques that Linux provides; the messages the other processes will be sending will vary in size, from debug logs to streaming media at a ~5 Mbit rate. Also, the media could be streaming in and out simultaneously.
Which IPC technique would you suggest for this application?
http://en.wikipedia.org/wiki/Inter-process_communication
Processor is running around 400-500 MHz, if that changes anything.
Does not need to be cross-platform, Linux only is fine.
Implementation in C or C++ is required.
When selecting your IPC you should consider the causes of performance differences, including transfer buffer sizes, data transfer mechanisms, memory allocation schemes, locking mechanism implementations, and even code complexity.
Of the available IPC mechanisms, the choice for performance often comes down to Unix domain sockets or named pipes (FIFOs). I read a paper, Performance Analysis of Various Mechanisms for Inter-process Communication, that indicates Unix domain sockets for IPC may provide the best performance. I have seen conflicting results elsewhere which indicate pipes may be better.
When sending small amounts of data, I prefer named pipes (FIFOs) for their simplicity. This requires a pair of named pipes for bi-directional communication. Unix domain sockets take a bit more overhead to set up (socket creation, initialization and connection), but are more flexible and may offer better performance (higher throughput).
You may need to run some benchmarks for your specific application/environment to determine what will work best for you. From the description provided, it sounds like Unix domain sockets may be the best fit.
Beej's Guide to Unix IPC is good for getting started with Linux/Unix IPC.
I would go for Unix Domain Sockets: less overhead than IP sockets (i.e. no inter-machine comms) but same convenience otherwise.
Can't believe nobody has mentioned dbus.
http://www.freedesktop.org/wiki/Software/dbus
http://en.wikipedia.org/wiki/D-Bus
Might be a bit over the top if your application is architecturally simple, in which case - in a controlled embedded environment where performance is crucial - you can't beat shared memory.
If performance really becomes a problem you can use shared memory - but it's a lot more complicated than the other methods: you'll need a signalling mechanism to indicate that data is ready (a semaphore, etc.) as well as locks to prevent concurrent access to structures while they're being modified.
The upside is that you can transfer a lot of data without having to copy it in memory, which will definitely improve performance in some cases.
Perhaps there are usable libraries which provide higher level primitives via shared memory.
Shared memory is generally obtained by mmapping the same file using MAP_SHARED (which can be on a tmpfs if you don't want it persisted); a lot of apps also use System V shared memory (IMHO for stupid historical reasons; it's a much less nice interface to the same thing).
As of this writing (November 2014), Kdbus and Binder have left the staging branch of the Linux kernel. There is no guarantee at this point that either will make it in, but the outlook is somewhat positive for both. Binder is a lightweight IPC mechanism in Android; Kdbus is a dbus-like IPC mechanism in the kernel which reduces context switches, thus greatly speeding up messaging.
There is also "Transparent Inter-Process Communication", or TIPC, which is robust and useful for clustering and multi-node setups; http://tipc.sourceforge.net/
Unix domain sockets will address most of your IPC requirements. You don't really need a dedicated communication process in this case, since the kernel provides this IPC facility. Also, look at POSIX message queues, which in my opinion are one of the most under-utilized IPC mechanisms in Linux but come in very handy in many cases where n:1 communication is needed.
The idea of automatic memory management has gained big support with new programming languages. I'm interested in whether concepts exist for the automatic management of other resources, like files, network sockets, etc.
For single-threaded applications, the pattern of a resource being available for the extent of a block of code, with clean-up at the end, exists in several languages. Examples are the use of RAII in C++, or with-open-file in Common Lisp (and its equivalents in newer Lisp-influenced languages - the same exists in Dylan, C# and Python, and in Ruby you can pass a block to a file object).
I'm not aware of anything better suited for the multithreaded environments where modern garbage collection shines, short of combining RAII and reference counting or auto_ptr in C++, which isn't always a trivial combination.
One important distinction between automatic management of resources and automatic memory management is that memory management can often afford to be non-deterministic, reclaiming memory only when the process requires it, whereas a resource is often limited at the OS level and so should be reclaimed as soon as it is no longer used. Hence the choice of smart pointers rather than garbage collection as the management implementation. There's an intermediate level of resource - GDI objects, temporary file handles, threads - where an application wants to limit the total it uses but doesn't care so much about releasing them to other processes; these are often pooled, which gets you some way towards automatic management.
One of the reasons we can automatically manage memory allocation now is we have so much of it.
Back in the days when memory was tight you had to squeeze the most out of every byte the system had.
Other resources such as file handles and sockets are far fewer, and still need to be handled by hand (pun intended).
Consider also the .NET Compact Framework: it's not uncommon for Windows Mobile devices to have 32MB or 64MB of volatile memory to play with, which - when you think about it - is still "lots".
I wonder what the footprint of the .NET Compact Framework is, and how it would perform on a Nokia phone with 4MB of volatile memory.
Anyone any ideas?
(This is a wiki answer, feel free to correct or add more detail)
So, IMHO, we can afford to be slow reclaiming memory because we're not going to run out of it in a hurry, which isn't the case with other resources.
Object persistence and caching subsystems can be considered an automatic allocation of files and resources. If you apply a caching subsystem to a network connection you don't have to care about opening files, deleting files, and so on.
A way to manage network connections automatically can be found in parallel computing environments (e.g. MPI), where you can programmatically set the shape of the processor interconnections. Then you just send a message from one process to another, almost ignoring the way it's implemented. Sometimes those messages are translated into sockets.
If you have a function that lets you get a page from its URL, would you consider it a sort of automatic socket management?