If multiple threads go through a function that contains a for loop with variable assignments inside it, how do the variables' values avoid getting mixed up across the threads?
Each thread has a block of memory named the "thread stack", of about 1 MB on 32-bit and 4 MB on 64-bit, where value types and references (pointers to the heap) that are parameters or intermediate results are stored; this space is not shared across threads. Everything else (i.e. shared value types and references, and the objects to which any references may be pointing) is stored in the heap, which is the rest of RAM and is shared across all threads.
Values can get messed up across threads, for example in race conditions. If you try to update a 64-bit value from multiple threads on a 32-bit CPU, you could get an inconsistency; to prevent that you have many synchronization primitives in .NET, such as the Interlocked and Monitor utilities. Also, non-thread-safe objects like List<> can get corrupted if used from multiple threads without synchronization, and even a simple integer increment can lose updates if the operation is not atomic.
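To make the lost-update point concrete, here is a minimal sketch (in C++, since the rest of this page uses C++; the .NET analogue would be Interlocked.Increment) showing why an unsynchronized increment can lose updates while an atomic one cannot. The counter names and iteration counts are made up for illustration.

#include <atomic>
#include <thread>
#include <iostream>

int plainCounter = 0;                 // unsynchronized: increments can be lost (this is the broken version)
std::atomic<int> atomicCounter{0};    // atomic: every increment is observed

void work() {
    for (int i = 0; i < 100000; ++i) {
        ++plainCounter;               // read-modify-write, not atomic
        ++atomicCounter;              // atomic read-modify-write
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    // plainCounter is often less than 200000; atomicCounter is always exactly 200000.
    std::cout << plainCounter << " vs " << atomicCounter << '\n';
}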
You should do some reading. I recommend the book CLR via C# by Jeffrey Richter; every .NET dev should read that book, actually.
Motive:
I am just learning the fundamentals of multithreading, and I'm not close to finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to the project I'm working on.
Main:
a. If a process has two threads, one that edits a set of variables and another that only reads those variables and never edits their values, do we need any sort of synchronization to guarantee that the values read by the reading thread are valid?
b. Is it possible for the OS scheduling of these two threads to cause the reading thread to read a variable in a memory location at the exact same moment the writing thread is writing into the same memory location, or is that a hardware/bus situation that will never be allowed to happen and that a software designer should never care about? What if the variable is a large struct instead of a little int or char?
a. If a process has two threads, one that edits a set of variables and another that only reads those variables and never edits their values, do we need any sort of synchronization to guarantee that the values read by the reading thread are valid?
In general, yes. Otherwise, the thread editing the value could change it only locally, so that the other thread never sees the change. This can happen because of compilers (which could use registers to read/store variables) but also because of the hardware (specifically the cache-coherence mechanism used on the target platform). Generally, locks, atomic variables and memory barriers are used to perform such synchronization.
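As a minimal illustration of the visibility problem (names chosen for this sketch only): a plain flag written by one thread might never be observed by another, whereas an atomic store with acquire/release ordering guarantees both visibility and the ordering of the data written before it.

#include <atomic>
#include <thread>

std::atomic<bool> ready{false};   // with a plain bool, the reader could spin forever
int payload = 0;

void writer() {
    payload = 42;                                    // written before the flag
    ready.store(true, std::memory_order_release);    // publish
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // the acquire/release pairing guarantees payload == 42 is visible here
}

int main() {
    std::thread w(writer), r(reader);
    w.join(); r.join();
}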
b. Is it possible for the OS scheduling of these two threads to cause the reading thread to read a variable in a memory location at the exact same moment the writing thread is writing into the same memory location, or is that a hardware/bus situation that will never be allowed to happen and that a software designer should never care about? What if the variable is a large struct instead of a little int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores, each executing one of the threads, can load/store the same variable at the same time (though often not in practice). It is very dependent on the target platform.
For processors having (coherent) caches (i.e. all modern mainstream processors), cache lines (i.e. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache coherence protocol prevents two loads/stores from being done at exactly the same time on the same cache line. If the variable crosses multiple cache lines, then there is no such protection.
On widespread x86/x86-64 platforms, variables of primitive types of <= 8 bytes can be modified atomically (because the bus supports that, as do the DRAM and the cache) assuming the address is correctly aligned (so it does not cross cache lines). However, this does not mean all such accesses are atomic: you need to tell the compiler/interpreter/etc. so it produces/executes the correct instructions. Note that there is also an extension for 16-byte atomics, as well as an instruction-set extension for transactional memory. For wider types (or possibly composite ones), you likely need a lock or an atomic state to control the atomicity of the access to the target variable.
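One way to check what a given platform can actually do atomically without a lock is std::atomic<T>::is_lock_free(). A rough sketch (the struct sizes here are arbitrary examples; wide types may require linking against libatomic on some toolchains):

#include <atomic>
#include <cstdint>
#include <iostream>

struct Wide  { std::int64_t a, b; };          // 16 bytes: may use a special instruction or a lock
struct Wider { std::int64_t a, b, c, d; };    // 32 bytes: almost certainly lock-based

int main() {
    std::cout << std::atomic<std::int64_t>{}.is_lock_free() << '\n';  // usually 1 on x86-64
    std::cout << std::atomic<Wide>{}.is_lock_free()         << '\n';  // platform-dependent
    std::cout << std::atomic<Wider>{}.is_lock_free()        << '\n';  // usually 0
}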
If a variable or constant-sized array is declared with __thread, can the backing virtual address range share a cache line across threads? (For example, if two copies of a thread-local integer land on the same cache line, performance will suffer because of cache line bouncing.) Does the answer depend on gcc/Linux version and hardware architecture?
According to Ulrich Drepper, an infamous expert and former glibc maintainer, thread-local variables are "not allocated in the normal data segment; instead each thread has its own separate area where such variables are stored. The variables can have static initializers. All thread-local variables are addressable by all other threads but, unless a thread passes a pointer to a thread-local variable to those other threads, there is no way the other threads can find that variable. Due to the variable being thread-local, false sharing is not a problem—unless the program artificially creates a problem."
If you study "Memory, part 6: More things programmers can do", in the section on thread-local variables, you'll see that if a pointer to a thread-local variable is passed to another thread, that thread could end up accessing the same cache line.
To avoid this issue, simply don't pass the addresses in question around. A specific technique is described there of grouping frequently read/written variables together in a struct so they share a cache line, with padding to AVOID multiple threads writing to the same cache line when NOT using TLS. As long as pointers to __thread variables are not used by other threads, cache line bouncing ought to be avoided.
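For the non-TLS case, the padding technique can be sketched roughly like this (the 64-byte line size and the names are assumptions for illustration):

#include <cstddef>

constexpr std::size_t kCacheLine = 64;   // typical x86 line size; check the target CPU

// Per-thread counters padded so that two threads' counters never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
    char pad[kCacheLine - sizeof(long)]; // keep the next array element on its own line
};

PaddedCounter counters[16];              // counters[i] is written only by thread i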
Hopefully the linker implementers know the CPU architecture. You can't choose where in the virtual and physical address space the per-thread copies of thread-local storage are allocated, and it can safely be assumed that they, along with the CPU designers, do a reasonable job of avoiding performance problems due to false sharing. If cache-associativity bouncing were a common problem, you'd have the same issues happening by accident between separate processes, never mind threads.
I have two similar situations in a multithreaded C++11 program:
an array that I'm using as a lookup table, declared inside a method
an array that I'm using as a lookup table, declared outside a method, that is used by several different methods by reference or through pointers.
Now, if we forget about these LUTs for a minute and just consider C++11 and a multithreaded approach for a generic method, the most appropriate qualifier for these methods in terms of storage duration is probably thread_local.
This way, if I feed a method foo() that is thread_local to 3 threads, I basically end up having 3 instances of foo(), one per thread. This move "solves" the problem of foo() being shared and accessed by 3 different threads, avoiding cache misses, but I basically have 3 possibly different behaviours for my foo(): for example, if I have the same PRNG implemented in foo() and I provide a seed that is time-dependent with a really high resolution, I will probably get 3 different results from the threads and a real mess in terms of consistency.
But let's say that I'm fine with how thread_local works; how can I express the fact that I need to keep a LUT always ready and cached for my methods?
I read something about relaxed and less relaxed memory models, but in C++11 I have never seen a keyword or a practical technique that can force the caching of an array/LUT.
I'm on x86 or ARM.
Basically, I probably need something that is the opposite of volatile.
If the LUTs are read-only, so that you can share them without locks, you should just use a single copy (i.e. declare them static).
Threads do not have their own caches. But even if they did (cores typically have their own L1 cache, and you might be able to lock a thread to a core), there would be no problem for two different threads to cache different parts of the same memory structure.
"Thread-local storage" does not mean that the memory is somehow physically tied to the thread. Rather, it's a way to let the same name refer to a different object in each thread. In no way does it restrict the ability of any thread to access the object, if given its address.
The CPU cache is not programmable. It uses its own internal logic to determine which memory regions to cache. Typically it will cache the memory that either has just been accessed by the CPU, or its prediction logic determines will shortly be accessed by the CPU. In a multiprocessor system, each CPU may have its own cache, or different CPUs may share a cache. If there are multiple caches, a memory region may be cached in more than one simultaneously.
If all threads must see the same values in the look-up tables, then a single table would be best. This could be achieved with a variable with static storage duration. If the data can be modified then you would probably also need a std::mutex to protect accesses to the table and avoid data races. Read-only data can be shared without additional synchronization; in this case it is best to declare it const to make the read-only nature explicit and avoid accidental modifications.
void foo(){
    static const int lut[]={...};
}
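If the table can be modified, the answer above suggests a std::mutex around every access; a small sketch of that variant (the table size and function names are invented for the example):

#include <mutex>

int lut[256];            // shared, modifiable table
std::mutex lutMutex;     // guards every read and write of lut

int readEntry(int i) {
    std::lock_guard<std::mutex> guard(lutMutex);
    return lut[i];
}

void writeEntry(int i, int value) {
    std::lock_guard<std::mutex> guard(lutMutex);
    lut[i] = value;
}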
You use thread_local where each thread must have its own copy of the data, usually because each copy will be modified independently. For example, you may choose to use thread_local for your random-number generator, so that each thread has its own RNG which is independent of the other threads, and does not require synchronization.
void bar(){
    thread_local RandomNumberGenerator rng; // one per thread
    auto val=rng.nextRandomNumber(); // use the instance for the current thread
}
I was implementing some simple Producer/Consumer program that had some semaphores and shared memory. To keep things simple, let's assume that there's just a block of shared memory and a semaphore in my program.
At first, I thought that I only had to treat as critical sections the bits of code that write to the shared memory block. But as the shared memory block consists of, let's say, 1024 bytes, I can't read all the data in a single step (it's not an atomic operation), so it is indeed possible that while I'm reading from it, the Producer comes along and starts writing into it, and the reader gets half old data and half new data. From this, I can only conclude that I also have to put the shared-memory reading logic inside a "semaphore" block.
Now, I have lots of code that looks like this:
if (sharedMemory[0] == '0') { ... }
In this case, I am just looking at a single char in memory. I guess I don't have to worry about putting a semaphore around this, do I?
And what if instead I have something like
if (sharedMemory[0] == '0' && sharedMemory[1] == '1') { ... }
From my perspective, I guess that as these are 2 operations, I'd have to consider this a critical section, and thus put a semaphore around it. Am I right?
Thanks!
Technically, on a multicore or multiprocessor system the only things that are atomic are assembly opcodes that are specifically documented as being atomic. Even reading a single byte presents a (quite small) chance that another processor will come along and modify it before you're done reading it, except in some cases that depend on the CPU cache and aligned chunks of memory (Fun thread: http://software.intel.com/en-us/forums/showthread.php?t=76744, Interesting read: http://www.corensic.com/CorensicBlog/tabid/101/EntryId/8/Memory-Consistency-Models.aspx)
You must either use types which internally guarantee atomicity or specifically protect accesses on multithreaded multicore systems.
(The answer may change slightly on IL platforms like .NET and JVMs since they make their own guarantees about what's atomic and what isn't).
Definitely lock around non-atomic operations, and checking two different values counts as a non-atomic operation, although there are tricks you can use to check up to four bytes or more, provided your processor doesn't cache the results. You have to consider how your data is used. But basically, any access to shared memory should have a semaphore around it.
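To sketch the "semaphore around every access" advice (names are hypothetical, and a std::mutex stands in for the semaphore for brevity), both the reader and the writer take the same lock before touching the shared block:

#include <mutex>
#include <cstring>
#include <cstddef>

char sharedMemory[1024];     // the shared block from the question
std::mutex sharedMemoryLock; // one lock guarding every access to it

void producerWrite(const char* data, std::size_t n) {
    std::lock_guard<std::mutex> guard(sharedMemoryLock);
    std::memcpy(sharedMemory, data, n);                     // whole write is one critical section
}

bool consumerCheck() {
    std::lock_guard<std::mutex> guard(sharedMemoryLock);
    return sharedMemory[0] == '0' && sharedMemory[1] == '1'; // both reads happen under the lock
}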
I'm looking for real world examples of needing read and write access to the same value in concurrent systems.
In my opinion, many semaphores or locks are present because there's no known alternative (to the implementer), but do you know of any patterns where mutexes seem to be a requirement?
In a way I'm asking for candidates for the standard set of HARD problems for concurrent software in the real world.
What kind of locks are used depends on how the data is being accessed by multiple threads. If you can fine tune the use case, you can sometimes eliminate the need for exclusive locks completely.
An exclusive lock is needed only if your use case requires that the shared data must be 100% exact all the time. This is the default that most developers start with because that's how we think about data normally.
However, if what you are using the data for can tolerate some "looseness", there are several techniques to share data between threads without the use of exclusive locks on every access.
For example, if you have a linked list of data and if your use of that linked list would not be upset by seeing the same node multiple times in a list traversal and would not be upset if it did not see an insert immediately after the insert (or similar artifacts), you can perform list inserts and deletes using atomic pointer exchange without the need for a full-stop mutex lock around the insert or delete operation.
Another example: if you have an array or list object that is mostly read from by threads and only occasionally updated by a master thread, you could implement lock-free updates by maintaining two copies of the list: one that is "live" that other threads can read from and another that is "offline" that you can write to in the privacy of your own thread. To perform an update, you copy the contents of the "live" list into the "offline" list, perform the update to the offline list, and then swap the offline list pointer into the live list pointer using an atomic pointer exchange. You will then need some mechanism to let the readers "drain" from the now offline list. In a garbage collected system, you can just release the reference to the offline list - when the last consumer is finished with it, it will be GC'd. In a non-GC system, you could use reference counting to keep track of how many readers are still using the list. For this example, having only one thread designated as the list updater would be ideal. If multiple updaters are needed, you will need to put a lock around the update operation, but only to serialize updaters - no lock and no performance impact on readers of the list.
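A rough C++ sketch of the live/offline swap described above, assuming a single writer thread and using shared_ptr reference counting as the "drain" mechanism (the class and member names are invented for this example; the free atomic_load/atomic_store overloads for shared_ptr are the C++11 way to do the atomic pointer exchange):

#include <atomic>
#include <memory>
#include <mutex>
#include <vector>
#include <cstddef>

class SnapshotTable {
public:
    // Readers grab the current snapshot without taking any lock.
    std::shared_ptr<const std::vector<int>> read() const {
        return std::atomic_load(&live_);                 // atomic shared_ptr load
    }

    // The writer copies the live table, edits the copy offline, then publishes it
    // with an atomic store. Readers still holding the old snapshot keep it alive
    // via the reference count until they drop it (the "drain" step).
    void update(std::size_t index, int value) {
        std::lock_guard<std::mutex> guard(writerLock_);  // serializes writers only
        auto next = std::make_shared<std::vector<int>>(*std::atomic_load(&live_));
        (*next)[index] = value;
        std::atomic_store(&live_,
                          std::shared_ptr<const std::vector<int>>(std::move(next)));
    }

private:
    std::shared_ptr<const std::vector<int>> live_ =
        std::make_shared<std::vector<int>>(16, 0);
    std::mutex writerLock_;
};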
All the lock-free resource sharing techniques I'm aware of require the use of atomic swaps (aka InterlockedExchange). This usually translates into a specific instruction in the CPU and/or a hardware bus lock (lock prefix on a read or write opcode in x86 assembler) for a very brief period of time. On multiproc systems, atomic swaps may force a cache invalidation on the other processors (this was the case on dual proc Pentium II) but I don't think this is as much of a problem on current multicore chips. Even with these performance caveats, lock-free runs much faster than taking a full-stop kernel event object. Just making a call into a kernel API function takes several hundred clock cycles (to switch to kernel mode).
Examples of real-world scenarios:
Producer/consumer workflows. A web service receives HTTP requests for data, places each request into an internal queue, and a worker thread pulls the work item from the queue and performs the work. The queue is read/write and has to be thread safe.
Data shared between threads with change of ownership. Thread 1 allocates an object, tosses it to thread 2 for processing, and never wants to see it again. Thread 2 is responsible for disposing the object. The memory management system (malloc/free) must be thread safe.
File system. This is almost always an OS service and already fully thread safe, but it's worth including in the list.
Reference counting. Releases the resource when the number of references drops to zero. The increment/decrement/test operations must be thread safe. These can usually be implemented using atomic primitives instead of full-stop kernel mutex locks.
Most real-world concurrent software has some requirement for synchronization at some level. Often, better-written software will take great pains to reduce the amount of locking required, but it is still required at some point.
For example, I often do simulations where we have some form of aggregation operation occurring. Typically, there are ways to prevent locking during the simulation phase itself (i.e. use of thread-local state data, etc.), but the actual aggregation portion typically requires some form of lock at the end.
Luckily, this becomes a lock per thread, not per unit of work. In my case, this is significant, since I'm typically doing operations on hundreds of thousands or millions of units of work, but most of the time it's occurring on systems with 4-16 PEs, which means I'm usually restricted to a similar number of units of execution. By using this type of mechanism, you're still locking, but you're locking across tens of elements instead of potentially millions.
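A minimal sketch of that pattern, with each thread accumulating into its own local total and taking the shared lock only once at the end (function and variable names are illustrative only):

#include <mutex>
#include <thread>
#include <vector>
#include <numeric>

double globalTotal = 0.0;
std::mutex totalLock;

void simulateChunk(const std::vector<double>& work) {
    // Lock-free phase: everything accumulates into thread-local state.
    double localTotal = std::accumulate(work.begin(), work.end(), 0.0);

    // Aggregation phase: one lock acquisition per thread, not per unit of work.
    std::lock_guard<std::mutex> guard(totalLock);
    globalTotal += localTotal;
}

int main() {
    std::vector<double> data(1000000, 1.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back(simulateChunk, std::cref(data));
    for (auto& th : pool) th.join();
}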