Thread Local Storage functioning - multithreading

I am trying to understand how the TLS works, but I think that the definitions provided by Wikipedia and MSDN are different.
By reading the Wikipedia page, my understanding is that TLS is a way to map data which normally would be global/static locally for each thread of a process. If this is true, different threads cannot access to the data of other threads, though.
According MSDN: "One thread allocates the index, which can be used by the other threads to retrieve the unique data associated with the index", so it looks like that a thread can have access to the data of other threads.
Which seems to be in contrast of what Wikipedia says, where's the catch?

Two confusions here:
Firstly, you allocate a (unique) TLS ID. Using that ID with the according function, every thread can access its associated TLS data. Note that this ID has to be allocated once but that the ID (not the data!) is used by all threads.
Every thread can access every other thread's data, whether it's TLS or not. The simple reason is that threads share a memory space (the memory space and the threads roughly make up a process). Getting at some other thread's TLS data is more difficult though, but a thread could e.g. pass a pointer.
In short, TLS works like a C++ map. The key is the pair of thread ID and TLS ID. The data is typically a pointer which can be used to indirectly reference some data. When accessing an element, you only supply the TLS ID, the implementation adds the calling thread's ID to form the key for lookup. Needless to say, access to that map is of course thread-safe.

Related

What happens to pthread_key_create() generated keys after a process fork?

From pthread_key_create FreeBSD man page:
/comment
... The pthread_key_create() function creates a thread-specific data key visible to all threads in the process. Key values provided by pthread_key_create() are opaque objects used to locate thread-specific data. Although the same key value may be used by different threads, the values bound to the key by pthread_setspecific() are maintained on a per-thread basis and persist for the life of the calling thread.
I am assuming that produced pthread_key_t instances are bound to a process (hence the opaque word in the man page), and that these become invalid after a fork(). I couldn't find any reference in existing man pages or documentation however.
Can anyone confirm/comment?
I am assuming that produced pthread_key_t instances are bound to a process
There is every reason to think that instances are process-specific, in that if you, say, record the representation of a key in shared memory then it is not meaningful to unrelated processes that happen to map the same shared memory segment.
(hence the opaque word in the man page),
But "opaque" has nothing in particular to do with that. It just means that there are no user-serviceable parts inside. You're meant to use it only as a whole.
and that these become invalid after a fork().
That seems a bit of a jump. I would not expect that at all. fork() creates a nearly exact duplicate of the calling process, subject to a few explicitly enumerated exceptions, which are listed in the manual. Probably the biggest relevant one is that the new process contains only one thread (the one that called fork()). Invalidation of thread-specific data keys is not on the list, nor is the loss of thread-specific data of the one initial thread of the child.
I couldn't find any reference in existing man pages or documentation however.
Exactly so.
HOWEVER, multi-threaded programs generally should not fork, as there is a high risk that the child process is left in an invalid state, such as with mutexes that are locked by threads that do not exist in the child. There are a few special cases, such as forking a child that immediately execs, but for the most part, you should not attempt to mix multithreading with multiprocessing.

Deciding the critical section of kernel code

Hi I am writing kernel code which intends to do process scheduling and multi-threaded execution. I've studied about locking mechanisms and their functionality. Is there a thumb rule regarding what sort of data structure in critical section should be protected by locking (mutex/semaphores/spinlocks)?
I know that where ever there is chance of concurrency in part of code, we require lock. But how do we decide, what if we miss and test cases don't catch them. Earlier I wrote code for system calls and file systems where I never cared about taking locks.
Is there a thumb rule regarding what sort of data structure in critical section should be protected by locking?
Any object (global variable, field of the structure object, etc.), accessed concurrently when one access is write access requires some locking discipline for access.
But how do we decide, what if we miss and test cases don't catch them?
Good practice is appropriate comment for every declaration of variable, structure, or structure field, which requires locking discipline for access. Anyone, who uses this variable, reads this comment and writes corresponded code for access. Kernel core and modules tend to follow this strategy.
As for testing, common testing rarely reveals concurrency issues because of their low probability. When testing kernel modules, I would advice to use Kernel Strider, which attempts to prove correctness of concurrent memory accesses or RaceHound, which increases probability of concurrent issues and checks them.
It is always safe to grab a lock for the duration of any code that accesses any shared data, but this is slow since it means only one thread at a time can run significant chunks of code.
Depending on the data in question though, there may be shortcuts that are safe and fast. If it is a simple integer ( and by integer I mean the native word size of the CPU, i.e. not a 64 bit on a 32 bit cpu ), then you may not need to do any locking: if one thread tries to write to the integer, and the other reads it at the same time, the reader will either get the old value, or the new value, never a mix of the two. If the reader doesn't care that he got the old value, then there is no need for a lock.
If however, you are updating two integers together, and it would be bad for the reader to get the new value for one and the old value for the other, then you need a lock. Another example is if the thread is incrementing the integer. That normally involves a read, add, and write. If one reads the old value, then the other manages to read, add, and write the new value, then the first thread adds and writes the new value, both believe they have incremented the variable, but instead of being incremented twice, it was only incremented once. This needs either a lock, or the use of an atomic increment primitive to ensure that the read/modify/write cycle can not be interrupted. There are also atomic test-and-set primitives so you can read a value, do some math on it, then try to write it back, but the write only succeeds if it still holds the original value. That is, if another thread changed it since the time you read it, the test-and-set will fail, then you can discard your new value and start over with a read of the value the other thread set and try to test-and-set it again.
Pointers are really just integers, so if you set up a data structure then store a pointer to it where another thread can find it, you don't need a lock as long as you set up the structure fully before you store its address in the pointer. Another thread reading the pointer ( it will need to make sure to read the pointer only once, i.e. by storing it in a local variable then using only that to refer to the structure from then on ) will either see the new structure, or the old one, but never an intermediate state. If most threads only read the structure via the pointer, and any that want to write do so either with a lock, or an atomic test-and-set of the pointer, this is sufficient. Any time you want to modify any member of the structure though, you have to copy it to a new one, change the new one, then update the pointer. This is essentially how the kernel's RCU ( read, copy, update ) mechanism works.
Ideally, you must enumerate all the resources available in your system , the related threads and communication, sharing mechanism during design. Determination of the following for every resource and maintaining a proper check list whenever change is made can be of great help :
The duration for which the resource will be busy (Utilization of resource) & type of lock
Amount of tasks queued upon that particular resource (Load) & priority
Type of communication, sharing mechanism related to resource
Error conditions related to resource
If possible, it is better to have a flow diagram depicting the resources, utilization, locks, load, communication/sharing mechanism and errors.
This process can help you in determining the missing scenarios/unknowns, critical sections and also in identification of bottlenecks.
On top of the above process, you may also need certain tools that can help you in testing / further analysis to rule out hidden problems if any :
Helgrind - a Valgrind tool for detecting synchronisation errors.
This can help in identifying data races/synchronization issues due
to improper locking, the lock ordering that can cause deadlocks and
also improper POSIX thread API usage that can have later impacts.
Refer : http://valgrind.org/docs/manual/hg-manual.html
Locksmith - For determining common lock errors that may arise during
runtime or that may cause deadlocks.
ThreadSanitizer - For detecting race condtion. Shall display all accesses & locks involved for all accesses.
Sparse can help to lists the locks acquired and released by a function and also identification of issues such as mixing of pointers to user address space and pointers to kernel address space.
Lockdep - For debugging of locks
iotop - For determining the current I/O usage by processes or threads on the system by monitoring the I/O usage information output by the kernel.
LTTng - For tracing race conditions and interrupt cascades possible. (A successor to LTT - Combination of kprobes, tracepoint and perf functionalities)
Ftrace - A Linux kernel internal tracer for analysing /debugging latency and performance related issues.
lsof and fuser can be handy in determining the processes having lock and the kind of locks.
Profiling can help in determining where exactly the time is being spent by the kernel. This can be done with tools like perf, Oprofile.
The strace can intercept/record system calls that are called by a process and also the signals that are received by a process. It shall show the order of events and all the return/resumption paths of calls.

storage and management of overlapped structure in multithreaded IOCP server

Is it good idea to use LINKED LIST to store overlapped structure?
my overlapped structure looks like this
typedef struct _PER_IO_CONTEXT
{
WSAOVERLAPPED Overlapped;
WSABUF wsabuf;
....
some other per operation data
....
struct _PER_IO_CONTEXT *pOverlappedForward;
struct _PER_IO_CONTEXT *pOverlappedBack;
} PER_IO_CONTEXT, *PPER_IO_CONTEXT;
When iocp server starts i allocate (for example) 5000 of them in Linked List. The start of this list is stored in global variable PPER_IO_CONTEXT OvlList. I must add that i use this linked list only in a case when i have to send data to all connected clients.
When Wsasend is posted or GQCS gets notification, linked list is updated( i use EnterCriticalSection for synchronization ).
Thanks in advance for yours tips, opinions and suggestions for better storage(caching) overlapped structure.
I assume the proposed use case is that you wish to cache the "per operation" overlapped structure to avoid repeated allocation and release of dynamic memory which could lead to both contention on the allocator and heap fragmentation.
Using a single 'pool' reduces the contention from 'contention between all threads using the allocator that is used for allocating and destroying the overlapped structures' to 'to contention between all threads issuing or handling I/O operations' which is usually a good thing to do. You're right that you need to synchronise around access, a critical section or perhaps a SRW lock in exclusive mode is probably best (the later is fractionally faster for uncontended access).
The reduction in heap fragmentation is also worth achieving in a long running system.
Using a 'standard' non invasive list, such as a std::deque looks like the obvious choice at first but the problem with non invasive collections is that they tend to allocate and deallocate memory for each operation (so you're back to your original contention). Far better, IMHO, to put a pointer in each overlapped structure and simply chain them together. This requires no additional memory to be allocated or released on pool access and means that your contention is back down to just the threads that use the pool.
Personally I find that I only need a singly linked list of per-operation structures for a pool (free list) which is actually just a stack, and a doubly linked list if I want to maintain a list of 'in use' 'per-operation' data which is sometimes useful (though not something that I now do).
The next step may then be to have several pools, but this will depend on the design of your system and how your I/O works.
If you can have multiple pending send and receive operations for a given connection it may be useful to have a small pool at the connection level. This can dramatically reduce contention for your single shared pool as each connection would first attempt to use the per-connection pool and if that is empty (or full) fall back to using the global pool. This tends to result in far less contention for the lock on the global pool.

Efficient multi-threaded server implementation in Qt

I'm planning a multithreaded server written in Qt. Each connection would be attended in a separate thread. Each of those threads would run its own event loop and use asynchronous sockets. I would like to dispatch a const value (for instance, a QString containing an event string) from the main thread to all the client threads in the most efficient possible way. The value should obviously be deleted when all the client threads have read it.
If I simply pass the data in a queued signal/slot connection, would this introduce a considerable overhead? Would it be more efficient to pass a QSharedPointer<QString>? What about passing a const QString* together with a QAtomicInt* for the reference counting and letting the thread decrease it and delete it when the reference counter reaches 0?
Somewhat off-topic, but please be aware that the one-thread-per-connection model could enable anyone able to connect to conduct a highly efficient denial of service attack against the system running the server, since the maximum number of threads that can be created on any system is limited. Also, if it's 32-bit, you can also starve address space since each thread gets its own stack. The default stack size varies acorss systems. On Win32 it's 1 MB, IIRC, so 2048 connections kept open and alive will eat 2 GB, i.e. the entire address space reserved for userspace (you can bump it up to 3 GB but that doesn't help much.)
For more details, check The C10K Problem, specifically the I/O Strategies -> Serve one client with each server thread chapter.
According to the documentation:
Behind the scenes, QString uses implicit sharing (copy-on-write) to reduce memory usage and to avoid the needless copying of data.
Based on this, you shouldn't have any more overhead sending copies of strings through the queued signal/slot connections than you would with your other proposed solutions. So I wouldn't worry about it until and unless it is a demonstrable performance problem.

Real World Examples of read-write in concurrent software

I'm looking for real world examples of needing read and write access to the same value in concurrent systems.
In my opinion, many semaphores or locks are present because there's no known alternative (to the implementer,) but do you know of any patterns where mutexes seem to be a requirement?
In a way I'm asking for candidates for the standard set of HARD problems for concurrent software in the real world.
What kind of locks are used depends on how the data is being accessed by multiple threads. If you can fine tune the use case, you can sometimes eliminate the need for exclusive locks completely.
An exclusive lock is needed only if your use case requires that the shared data must be 100% exact all the time. This is the default that most developers start with because that's how we think about data normally.
However, if what you are using the data for can tolerate some "looseness", there are several techniques to share data between threads without the use of exclusive locks on every access.
For example, if you have a linked list of data and if your use of that linked list would not be upset by seeing the same node multiple times in a list traversal and would not be upset if it did not see an insert immediately after the insert (or similar artifacts), you can perform list inserts and deletes using atomic pointer exchange without the need for a full-stop mutex lock around the insert or delete operation.
Another example: if you have an array or list object that is mostly read from by threads and only occasionally updated by a master thread, you could implement lock-free updates by maintaining two copies of the list: one that is "live" that other threads can read from and another that is "offline" that you can write to in the privacy of your own thread. To perform an update, you copy the contents of the "live" list into the "offline" list, perform the update to the offline list, and then swap the offline list pointer into the live list pointer using an atomic pointer exchange. You will then need some mechanism to let the readers "drain" from the now offline list. In a garbage collected system, you can just release the reference to the offline list - when the last consumer is finished with it, it will be GC'd. In a non-GC system, you could use reference counting to keep track of how many readers are still using the list. For this example, having only one thread designated as the list updater would be ideal. If multiple updaters are needed, you will need to put a lock around the update operation, but only to serialize updaters - no lock and no performance impact on readers of the list.
All the lock-free resource sharing techniques I'm aware of require the use of atomic swaps (aka InterlockedExchange). This usually translates into a specific instruction in the CPU and/or a hardware bus lock (lock prefix on a read or write opcode in x86 assembler) for a very brief period of time. On multiproc systems, atomic swaps may force a cache invalidation on the other processors (this was the case on dual proc Pentium II) but I don't think this is as much of a problem on current multicore chips. Even with these performance caveats, lock-free runs much faster than taking a full-stop kernel event object. Just making a call into a kernel API function takes several hundred clock cycles (to switch to kernel mode).
Examples of real-world scenarios:
producer/consumer workflows. Web service receives http requests for data, places the request into an internal queue, worker thread pulls the work item from the queue and performs the work. The queue is read/write and has to be thread safe.
Data shared between threads with change of ownership. Thread 1 allocates an object, tosses it to thread 2 for processing, and never wants to see it again. Thread 2 is responsible for disposing the object. The memory management system (malloc/free) must be thread safe.
File system. This is almost always an OS service and already fully thread safe, but it's worth including in the list.
Reference counting. Releases the resource when the number of references drops to zero. The increment/decrement/test operations must be thread safe. These can usually be implemented using atomic primitives instead of full-stop kernal mutex locks.
Most real world, concurrent software, has some form of requirement for synchronization at some level. Often, better written software will take great pains to reduce the amount of locking required, but it is still required at some point.
For example, I often do simulations where we have some form of aggregation operation occurring. Typically, there are ways to prevent locking during the simulation phase itself (ie: use of thread local state data, etc), but the actual aggregation portion typically requires some form of lock at the end.
Luckily, this becomes a lock per thread, not per unit of work. In my case, this is significant, since I'm typically doing operations on hundreds of thousands or millions of units of work, but most of the time, it's occuring on systems with 4-16 PEs, which means I'm usually restricting to a similar number of units of execution. By using this type of mechanism, you're still locking, but you're locking between tens of elements instead of potentially millions.

Resources