False Sharing and cache alignment on multiprocessor system - multithreading

I am trying to understand false sharing, cache alignment, and their impact on performance on multi-core systems. Here is my case, which I am trying to understand at a very high level.
Threads: 2
CPUs/Cores: 4
Locks: 1 per thread (T1, T2)
Data structures: each thread has a ~32 KB structure containing several nested arrays and structures
Language: C
I have 2 threads and 4 cores/CPUs, so both threads can be serviced at any given time. My threads continuously read and write their respective data structures, which are fairly large (close to 32 KB). The threads are independent of each other and never read or write each other's data structures. Each thread always holds its lock from the start of its time slice.
Given the above, is there any chance of false sharing, or any other negative impact that could hinder performance? I assume there wouldn't be any false sharing, since each thread works on its own data structure and takes a lock at the very beginning of its time slice.

I can think of two unlikely scenarios where false sharing can happen.
Suppose thread 1 is running on core 1. After a while it migrates to core 2 and resumes execution there. When running on core 2, it may access cache lines that are still cached on core 1, so the situation is similar to a cache line being shared between cores 1 and 2.
The per-thread data structures are allocated from shared memory. If you have not been careful to pad them so they align to cache line boundaries, the last element of one data structure and the first element of the next could be allocated in the same cache line.
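To make the second scenario concrete, here is a minimal C sketch of cache-line padding. The 64-byte line size and the per_thread_t layout are assumptions for illustration, not taken from the question's actual code.

```c
/* Assume a 64-byte cache line; the real value is platform-dependent
 * (e.g. sysconf(_SC_LEVEL1_DCACHE_LINESIZE) on Linux). */
#define CACHE_LINE 64

/* Stand-in for the ~32 KB per-thread structure from the question. */
typedef struct {
    /* Aligning the first member to a cache line forces the whole struct to
     * be cache-line aligned and its size to be rounded up to a multiple of
     * CACHE_LINE, so the tail of one element and the head of the next can
     * never sit in the same line. */
    _Alignas(CACHE_LINE) char payload[32 * 1024];
} per_thread_t;

/* One element per thread; elements 0 and 1 never share a cache line. */
static per_thread_t thread_data[2];
```

With this layout, even if the two structures are allocated back to back, each thread's writes stay in lines that the other thread's core never needs.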

Related

What exactly happens when two threads try to increment an atomic integer at the same time?

Consider the following scenario:
Thread 1 calls get and gets the value 1.
Thread 1 calculates next to be 2.
Thread 2 calls get and gets the value 1.
Thread 2 calculates next to be 2.
Both threads try to write the value.
Now, because of atomics, only one thread will succeed; the other will receive false from compareAndSet and go around again.
I got stuck at "because of atomics": what if two threads pass the compareAndSet method at the same time? I am looking for practical examples rather than theory.
Hardware interlocks will ensure that if two or more threads attempt a compareAndSet simultaneously, one will be selected as "winning" and all others will "lose". Typically this will be done by using a common clock for all cores, so that every core will see a discrete sequence of execution steps (called "cycles" at the hardware level) in which
various things happen. In a vastly over-simplified execution model where cores don't have caches but instead use a multi-port memory, each core could report to every other core on each cycle whether it is performing the "read" portion of a compareAndSet. Each core would then hold off on starting a compareAndSet on the cycle after it has seen another thread start one, and each core could defer and restart its own compareAndSet if a lower-numbered core starts one with the same address on the same cycle.
The net result is that it's impossible for two cores to "successfully" perform compareAndSet operations on the same storage at the same time. Instead, hardware will delay one of the actions so that they occur sequentially.
It is the hardware, specifically the cache coherence protocol (MESI, etc.) that's ensuring the consistency of atomic operations done concurrently from different CPU cores (which run respective concurrent threads). There is a good reading called "Memory Barriers: a Hardware View for Software Hackers" which I can highly recommend on the subject.
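The question uses Java-style compareAndSet, but the same read/compute/CAS retry loop can be sketched in C11 atomics; the names counter and increment are illustrative, not from the original code.

```c
#include <stdatomic.h>

/* Shared counter; a rough equivalent of an AtomicInteger for this sketch. */
static atomic_int counter = 0;

/* Increment modelled exactly as in the scenario above: read the value,
 * compute next, then try to publish it with compare-and-swap.  If another
 * thread won the race, the CAS fails, `current` is reloaded with the value
 * that beat us, and we try again. */
void increment(void)
{
    int current = atomic_load(&counter);
    for (;;) {
        int next = current + 1;
        if (atomic_compare_exchange_weak(&counter, &current, next))
            break;  /* we won: counter went from current to next atomically */
        /* we lost (or the weak CAS failed spuriously): current now holds
         * the freshly observed value, loop and recompute next */
    }
}
```

The hardware guarantees described above are what make the success/failure outcome of each CAS unambiguous: exactly one core's CAS observes the expected value and installs the new one; the others see the updated value and retry.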

Compare and swap - what if 2 processors execute the locking simultaneously?

I read about CAS in https://en.wikipedia.org/wiki/Compare-and-swap, and got some doubts:
Even though a single lock operation is implemented in a single instruction, if 2 threads run on 2 different processors, the 2 instructions could happen at the same time. Isn't that a race condition?
I saw the following sentence in Linux Kernel Development, 3rd edition, page 168.
because a process can execute on only one processor at a time
I doubt that, and I am not sure whether it means what it literally says. What if the process has multiple threads; can't they run on multiple processors at a time?
Can anyone help explain these doubts? Thanks.
The CPU has a cache for memory, typically organized in so-called cache lines of 64 bytes each, and it does its work in chunks of that size. In particular, when executing lock cmpxchg or similar instructions, the hardware thread you run it on negotiates exclusive access to that 64-byte portion of memory with the other hardware threads. And that's why it works.
In general, you want to read this book: https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
This particular bit is explained on page 21.
Regarding the LKD quote, there is no context provided. It is safe to assume they meant threads and were updating a thread-local counter.
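As an illustration of the "locking" the question asks about, here is a toy spinlock built on the same compare-and-swap primitive (lock cmpxchg or LL/SC under the hood). It is a sketch for understanding, not the kernel's actual lock implementation.

```c
#include <stdatomic.h>

/* 0 = unlocked, 1 = locked. */
typedef struct { atomic_int state; } spinlock_t;

void spin_lock(spinlock_t *l)
{
    int expected = 0;
    /* Only one core can gain exclusive ownership of the cache line and
     * flip 0 -> 1; every other core sees its CAS fail and retries. */
    while (!atomic_compare_exchange_weak(&l->state, &expected, 1))
        expected = 0;   /* reset, since the failed CAS overwrote it */
}

void spin_unlock(spinlock_t *l)
{
    atomic_store(&l->state, 0);
}
```

Even if two processors issue the CAS "at the same time", the coherence protocol serializes ownership of the line, so exactly one of them succeeds.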

Heap memory access by multiple threads

What happens when multiple threads on a multi-core or multi-CPU machine try to access the same region of heap memory (read only, no mutating) at the same time? For example, trying to invoke a static method (the method does not mutate anything). Could the mere act of invoking the static method create a race or deadlock condition?
EDIT: Can a read-only memory access by multiple threads at the same time cause a race condition (or any other issues)?
No, multi-threaded reading is fine.
Race conditions are possible only if some thread tries to write. And even in that case it can work fine; it depends on a lot of other things (CPU architecture, type of write, etc.).
Every platform that supports multiple cores that you are likely to use for the foreseeable future will support some version of MESI that keeps the core's views of memory coherent. Memory that is read on one core shortly after being read on another core will wind up being shared by all the cores that read it until either a core writes to it (at which point it will be exclusive on the core that wrote to it and invalid on the others) or it gets pushed out of cache.
You can't cause a race condition by reading memory that is not being modified. This is one of the reasons you can't have a race condition on the code itself unless the code is being modified.
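A minimal pthreads sketch of the read-only case; the table, thread count, and names are made up for illustration.

```c
#include <pthread.h>
#include <stdio.h>

/* Read-only data, fully initialised before any reader thread starts. */
static const int table[4] = { 2, 3, 5, 7 };

/* Each reader only loads from `table`; with no writers there is nothing
 * to race on, so no locking is needed. */
static void *reader(void *arg)
{
    (void)arg;
    long sum = 0;
    for (int i = 0; i < 4; i++)
        sum += table[i];   /* concurrent loads of the same cache lines are safe */
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, reader, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

In MESI terms, the lines holding `table` simply end up in the Shared state on every core that reads them, which costs nothing once they are cached.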

Multi Threading Non Atomic Operations With Atomic

I was wondering whether this scenario is possible, or whether the CPU makes any guarantees at such a low level that it won't happen:
Say there is a value that is misaligned and requires 2 fetches to read the whole thing (a 32-bit value misaligned on a 32-bit system). Both threads execute only one instruction each: thread 1 a mov that reads from memory, and thread 2 an atomic mov that writes to memory.
Thread 1 fetches first half of Value
Thread 2 atomically writes to Value
Thread 1 fetches second half of Value
So now thread 1 holds two halves of different values.
Is this scenario possible, or does the CPU make any guarantees that this won't happen?
Here's my non-expert answer...
It is rather complex to make misaligned accesses atomic, so I believe most architectures don't give any sort of guarantee of atomicity in this case. (I don't know of any architecture that can do atomic misaligned fetches, but they just might exist. I don't know enough to tell.)
It's even complex just to fetch misaligned data; so architectures that want to keep things really simple don't even allow misaligned access (For example the very "RISC-y" old Alpha architecture).
It might be possible on some architectures to do it by somehow locking (or protecting, see below) two cache lines simultaneously in a local cache, but such things are AFAIK usually not available in 'user-land', i.e. non-OS threads.
See http://en.wikipedia.org/wiki/Load-link/store-conditional for the modern way to achieve load-store (i.e. NOT tear-free read of two misaligned areas) atomicity for one (aligned) word. Now if a thread was somehow allowed to issue two connected (atomic) 'protect' instructions like that, I suppose it could be done, but then again, that would be complex. I don't know if that exists on any CPU.
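A short C sketch of how such a misaligned field can arise and how it is usually avoided; the packed attribute is GCC/Clang-specific and the struct names are hypothetical.

```c
#include <stdint.h>
#include <stdatomic.h>

/* With GCC/Clang, `packed` removes padding, so `value` starts at byte 1.
 * A plain 32-bit access to it may need two fetches and is not guaranteed
 * to be atomic, so a concurrent reader can observe half old / half new
 * bytes, exactly the torn read described above. */
struct torn {
    uint8_t  tag;
    uint32_t value;      /* misaligned: offset 1 */
} __attribute__((packed));

/* The usual fixes: keep the field naturally aligned (let the compiler pad),
 * or declare it _Atomic so the compiler must emit an access that is atomic
 * on the target (or fall back to a lock if the hardware cannot do it). */
struct safe {
    uint8_t          tag;      /* compiler inserts 3 bytes of padding */
    _Atomic uint32_t value;    /* aligned; lock-free on mainstream CPUs */
};
```

The practical takeaway matches the answer above: rather than hoping for atomic misaligned accesses, keep shared values naturally aligned so single loads and stores cannot tear.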

single file reader/multiple consumer model: good idea for multithreaded program?

I have a simple task that is easily parallelizable. Basically, the same operation must be performed repeatedly on each line of a (large, several GB) input file. While I've made a multithreaded version of this, I noticed my I/O was the bottleneck. I decided to build a utility class with a single "file reader" thread that simply reads straight ahead as fast as it can into a circular buffer. Multiple consumers can then call this class and get their 'next line'. Given n threads, each thread i's starting line is line i of the file, and each subsequent line for that thread is found by adding n. It turns out that locks are not needed for this; a couple of key atomic ops are enough to preserve the invariants.
I've tested the code and it seems faster, but on second thought, I'm not sure why. Wouldn't it be just as fast to divide the large file into n input files (you can 'seek' ahead into the same file to achieve the same thing, with minimal preprocessing), and then have each process simply call iostream::readLine on its own chunk (since iostream reads into its own buffer as well)? It doesn't seem that sharing a single buffer among multiple threads has any inherent advantage, since the workers are not actually operating on the same lines of data. Plus, I don't think there is a good way to parallelize so that they do work on the same lines. I just want to understand the performance gain I'm seeing, and to know whether it is a fluke or scalable/reproducible across platforms...
When you are I/O limited, you can get a good speedup by using two threads: one reading the file, the second doing the processing. This way the reading never waits for the processing (except for the very last line) and you keep the disk reading 100% of the time.
The buffer should be large enough to give the consumer thread enough work in one go, which most often means it should hold multiple lines (I would recommend at least 4000 characters, but probably even more). This keeps the cost of thread context switching from becoming impractically high.
Single threaded:
read 1
process 1
read 2
process 2
read 3
process 3
Double threaded:
read 1
process 1/read 2
process 2/read 3
process 3
On some platforms you can get the same speedup without threads by using overlapped I/O, but using threads is often clearer.
Using more than one consumer thread will bring no benefit as long as you are really I/O bound.
In your case, there are at least two resources that your program competes for, the CPU and the harddisk. In a single-threaded approach, you request data then wait with an idle CPU for the HD to deliver it. Then, you handle the data, while the HD is idle. This is bad, because one of the two resources is always idle. This changes a bit if you have multiple CPUs or multiple HDs. Also, in some cases the memory bandwidth (i.e. the RAM connection) is also a limiting resource.
Now, your solution is right: you use one thread to keep the HD busy. If this thread blocks waiting for the HD, the OS just switches to a different thread that handles some data. If it doesn't have any data, it will wait for some. That way, the CPU and HD work in parallel at least some of the time, increasing the overall throughput. Note that you can't increase throughput with more than two threads unless you also have multiple CPUs and the CPU, not the HD, is the limiting factor. If you are writing back some data too, you could improve performance with a third thread that writes to a second hard disk. Otherwise, you don't get any advantage from more threads.
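Here is a minimal sketch of that reader/worker split in C with pthreads. It replaces the question's lock-free circular buffer with a mutex/condvar bounded queue to stay short, and the file name and buffer sizes are assumptions.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define QSIZE    64      /* lines buffered between reader and worker */
#define LINE_LEN 1024    /* assumed maximum line length */

static char queue[QSIZE][LINE_LEN];
static int  head, tail, count;
static bool done;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *reader(void *arg)          /* keeps the disk busy */
{
    FILE *f = arg;
    char line[LINE_LEN];
    while (fgets(line, sizeof line, f)) {
        pthread_mutex_lock(&m);
        while (count == QSIZE)
            pthread_cond_wait(&not_full, &m);
        strcpy(queue[tail], line);
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&m);
    }
    pthread_mutex_lock(&m);
    done = true;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *worker(void *arg)          /* keeps the CPU busy */
{
    (void)arg;
    char line[LINE_LEN];
    for (;;) {
        pthread_mutex_lock(&m);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &m);
        if (count == 0 && done) {
            pthread_mutex_unlock(&m);
            return NULL;
        }
        strcpy(line, queue[head]);
        head = (head + 1) % QSIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&m);
        /* process(line);  -- placeholder for the per-line work */
    }
}

int main(void)
{
    FILE *f = fopen("input.txt", "r");   /* hypothetical input file */
    if (!f) return 1;
    pthread_t r, w;
    pthread_create(&r, NULL, reader, f);
    pthread_create(&w, NULL, worker, NULL);
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    fclose(f);
    return 0;
}
```

While the worker is busy processing one line, the reader is already blocked in fgets waiting on the disk, which is exactly the overlap shown in the "double threaded" timeline above.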
