Is a dirty read from memory possible when multi-threading? - multithreading

In this case, I define dirty read as reading from memory when it's currently being written to by another thread.
So if thread #1 writes 100 to a variable that thread #2 can also see, and thread #2 writes 50 to the same variable, with both threads doing so in a loop.
Is it possible to read the variable and get a number that is neither 50 nor 100?
No locks or other synchronization are used.
More detail about my setup: Intel i3 CPU, I'm programming in C#.
Below is an example of what I mean:
using System;
using System.Collections.Generic;
using System.Threading;

namespace Threading001
{
    class Program
    {
        static void Main(string[] args)
        {
            long min = long.MinValue;
            long max = long.MaxValue;
            object number = max;
            new Thread(() =>
            {
                long current2 = (long)number;
                if (current2 != min && current2 != max)
                {
                    Console.WriteLine("Unexpected number from thread 2: {0}.", current2);
                }
                number = min;
            }).Start();
            while (true)
            {
                long current = (long)number;
                if (current != min && current != max)
                {
                    Console.WriteLine("Unexpected number from thread 1: {0}.", current);
                }
                number = max;
            }
        }
    }
}
I made number an object so the memory is allocated on the heap and not the stack, to try to increase memory access latency. Although CPU caching will probably prevent that anyway.

You're actually trying to rely on several different things here.
Firstly, there's the matter of atomicity. ECMA-335 states:
A conforming CLI shall guarantee that read and write access to properly aligned memory
locations no larger than the native word size (the size of type native int) is atomic
(see §I.12.6.2) when all the write accesses to a location are the same size. Atomic writes shall
alter no bits other than those written. Unless explicit layout control (see
Partition II (Controlling Instance Layout)) is used to alter the default behavior, data elements no
larger than the natural word size (the size of a native int) shall be properly aligned. Object
references shall be treated as though they are stored in the native word size.
So for a 32-bit integer, you're always fine - and for a 64-bit integer, you're fine if you're running on a 64-bit CLR... assuming your variables are aligned, which they are normally.
However, you've also got boxing involved - I don't think that will actually cause any problems here, but it does mean you're dealing with the visibility of multiple writes: one for the number variable itself, and one for the data within the box. With the .NET implementation, I believe that's still safe due to the stronger memory guarantees it provides - I wouldn't like to absolutely guarantee that it's safe within the ECMA memory model.
Finally, there's the matter of whether a write is visible or not, which is beyond atomicity. If thread T1 changes an int value from 0 to 100, then thread T2 reads it, atomicity guarantees that T2 will see either 0 or 100, never some other bit pattern - but there usually has to be some kind of memory barrier involved to guarantee that T2 will actually see the new value instead of a stale value. This is a really complex area - if you want to know more, I suggest you start with Joe Duffy's 2007 blog post and work from there.
Note that min, max and number will be on the heap anyway, as they've been captured by your lambda expressions... although the stack/heap is an implementation detail.
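(Not C# at all, but the same atomicity-versus-visibility distinction can be sketched with C11 atomics; the names and the release/acquire pairing below are illustrative, not what the CLR does internally.)

#include <stdatomic.h>

static int         payload = 0;   /* plain data                      */
static _Atomic int ready   = 0;   /* atomic flag used to publish it  */

void writer(void)
{
    payload = 100;                                           /* plain store            */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish with a barrier */
}

int reader(void)
{
    if (atomic_load_explicit(&ready, memory_order_acquire))  /* pairs with the release */
        return payload;                                      /* guaranteed to see 100  */
    return -1;                                               /* not published yet      */
}

Atomicity alone only guarantees that ready is read as 0 or 1 (never a torn value); the release/acquire pair is what makes the write to payload visible to the reader.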

It depends on what the variable is (and on the processor, etc.), but generally: yes, dirty reads are possible.

I am not familiar with the details of your specific CPU, but in general it depends on whether or not the read/write is atomic, which in turn depends on the architecture and how the variable is stored.
If the variable is larger than the CPU word size, the access may not be atomic.
A modern CPU may guarantee atomic access to an aligned memory address; more care is needed if it has no hardware support for misaligned memory access. If misaligned access is handled in software, the read or write will not be atomic: one load/store may turn into two actual operations. One example is PowerPC/Linux, where the kernel handles misaligned memory access in its exception handler.

Whether this is thread-safe depends on the data type of the variable you want to read.
Reads and writes of some types, such as int and byte (and long on a 64-bit runtime), are atomic, so you can read and write them from multiple threads without tearing; note that the C# specification does not guarantee atomicity for long on 32-bit runtimes.
You can find more information here:
http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx
Are primitive data types in c# atomic (thread safe)?
What is thread safe (C#) ? (Strings, arrays, ... ?)

Related

How does a thread use the cache memory in a processor?

I have read the statement below somewhere, and I cannot really follow it:
There is a slight gain in performance for more than 16 and more than 32
cores. The seeds are integer values, i.e., they require 4 bytes of memory. A
cache line in our system has 64 bytes. Therefore 16 seeds fit into a single
cache line. When going to 17/33 threads, the additional seed is placed in its
own cache line so that the threads are not further obstructed.
The code referred for this question is provided below -
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    long long int tosses = atoll(argv[1]);
    long long int hits = 0;
    int threads = atoi(argv[2]);
    double start, end;
    int i;
    unsigned int seeds[threads];

    for (i = 0; i < threads; i++)
        seeds[i] = i + 1;

    start = omp_get_wtime();
    #pragma omp parallel reduction(+:hits) num_threads(threads)
    {
        int myrank = omp_get_thread_num();
        long long int local_hits = 0, toss;
        double x, y;
        #pragma omp for
        for (toss = 0; toss < tosses; toss++) {
            x = rand_r(&seeds[myrank]) / (double)RAND_MAX * 2 - 1;
            y = rand_r(&seeds[myrank]) / (double)RAND_MAX * 2 - 1;
            if (x*x + y*y < 1)
                local_hits++;
        }
        hits += local_hits;
    }
    end = omp_get_wtime();

    printf("Pi: %f\n", 4.0 * hits / tosses);
    printf("Duration: %f\n", end - start);
    return 0;
}
The actual question asked was: why does this code scale so badly over multiple cores?
My questions are as follows:
What is conveyed by the above statement? The cache line for the 17th/33rd core can also be invalidated, correct? So how is that different from cores 1 to 16?
Is the threads' own independent memory (stack memory/private memory) part of the cache or of main memory?
How do cache lines relate to blocks in terms of cache memories?
The seeds are integer values, i.e., they require 4 bytes of memory
This is not always true, though it is often the case on most platforms. The C and C++ languages do not prevent sizeof(int) from being 8, or even 2, for example.
Therefore 16 seeds fit into a single cache line.
While this is true, there is no alignment requirement on the seeds array beyond being aligned to sizeof(unsigned int). This means the array can theoretically cross two cache lines even with 16 seeds. The alignment depends on the target platform and, more specifically, on the target compiler. In C11, you can use alignas to specify alignment constraints and ensure the array is aligned to a 64-byte cache line.
This is an important point, since adding a new seed will not impact the result much if the array is not aligned in memory. Moreover, threads can work twice as fast if 8 seeds are on one cache line and the other 8 are on another (assuming there are no NUMA effects making things even more complex).
Let's assume the array alignment is 64 bytes.
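For illustration, one way to request that alignment in C11 (a sketch; the fixed-size array, the alignas spelling and the 64-byte line size are assumptions about the platform, and it replaces the VLA from the original code):

#include <stdalign.h>

#define CACHE_LINE  64   /* assumed cache-line size              */
#define MAX_THREADS 64   /* assumed fixed upper bound on threads */

/* Aligning the array to a cache-line boundary guarantees that the first
   16 four-byte seeds occupy exactly one 64-byte line instead of possibly
   straddling two of them. */
static alignas(CACHE_LINE) unsigned int seeds[MAX_THREADS];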
What is conveyed by the above statement? The cache line for the 17th/33rd core can also be invalidated, correct? So how is that different from cores 1 to 16?
With 17 seeds and 17 threads, the first 16 threads will continue to compete for the same cache line. This effect is called cache-line bouncing, and it makes those threads run much more slowly. Meanwhile, the new thread can operate on its own cache line, resulting in a much faster execution (only for this specific thread). AFAIK, there is no reason to believe that the other 16 threads will not still be obstructed, assuming the array alignment is the same.
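A common way to remove the bouncing altogether (a sketch, assuming a 64-byte line) is to pad each seed so that no two threads ever share a line, at the cost of some wasted memory:

#define CACHE_LINE 64

/* One seed per cache line: rand_r(&seeds[myrank].value) from different
   threads now touches different lines, so no line ping-pongs between cores. */
struct padded_seed {
    unsigned int value;
    char         pad[CACHE_LINE - sizeof(unsigned int)];
};

static struct padded_seed seeds[64];   /* indexed by omp_get_thread_num() */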
Is the threads' own independent memory (stack memory/private memory) part of the cache or of main memory?
On modern computers, memory is nearly always cached in the usual use-cases. One can bypass the cache with specific instructions (e.g. non-temporal instructions, though they are only a hint) or with a special kind of allocated memory (see write-combining memory), but standard accesses to normally allocated memory (e.g. obtained with malloc) are always cached on mainstream architectures. This is also true for the stack and for the threads' private memory.
How do cache lines relate to blocks in terms of cache memories?
I am not sure I understand your question. If you want to know which byte in RAM is associated with which cache line then... well... put shortly, you cannot do this precisely. Modern processors use multi-level N-way set-associative caches. Based on an address, you can determine the index of the cache set, but not easily which way of the set a line will occupy. The latter depends on the cache replacement policy, and this algorithm is not officially well documented for mainstream processors. For example, Intel processors appear to use an adaptive replacement policy rather than a pseudo-LRU strategy. One should also consider the mapping of virtual addresses to physical ones. On top of that, Intel processors use an undocumented hash-based strategy to uniformly distribute accesses (by physical address) across the L3 cache slices. Thus, if you really want to do this, you need to pick a specific processor micro-architecture and a specific cache level in the first place for the problem to be manageable. You will certainly need a lot of time to understand and simulate what the target processors actually do. Alternatively, you can assume an unrealistic LRU replacement policy to avoid the complexity (madness?) of real-world mainstream processors.
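For the part that is computable, here is a toy sketch of the set-index calculation (the 64-byte line and 64-set geometry are assumptions, roughly a 32 KiB 8-way L1):

#include <stdint.h>

#define LINE_SIZE 64u   /* assumed cache-line size               */
#define NUM_SETS  64u   /* assumed: 32 KiB / 64 B line / 8 ways  */

/* The set index is determined by the address; which way inside the set a
   line ends up in depends on the (undocumented) replacement policy. */
static inline unsigned cache_set_index(uintptr_t physical_addr)
{
    return (unsigned)((physical_addr / LINE_SIZE) % NUM_SETS);
}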
Related document/posts:
Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide -- Chapter 12
How does the CPU cache affect the performance of a C program

Why memory reordering is not a problem on single core/processor machines?

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know, single-core CPUs also reorder memory accesses (if their memory model is weak), so what makes sure that program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed to execute out of order, it only does so when it doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail; it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, the store may not have been committed yet.
(1) Something like Hyper-Threading, with two hardware threads per core, common in modern CPUs, wouldn't count as "single-CPU" with respect to your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. on a single-core machine or with green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know, single-core CPUs also reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure the store is globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 architectures, which had cache-control instructions designed in from the start, do require the OS to be aware of caching: i.e. to make sure the cache is invalidated before reading in new contents from disk, and to make sure data is at least written back somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
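The classic StoreLoad litmus test makes this concrete; a minimal C11 sketch (thread creation omitted, the names are illustrative):

#include <stdatomic.h>

static _Atomic int x = 0, y = 0;

int thread1(void)   /* run concurrently with thread2 */
{
    atomic_store_explicit(&x, 1, memory_order_release);
    return atomic_load_explicit(&y, memory_order_acquire);
}

int thread2(void)
{
    atomic_store_explicit(&y, 1, memory_order_release);
    return atomic_load_explicit(&x, memory_order_acquire);
}

/* With release/acquire (what plain x86 stores and loads give you), both
   functions may return 0: each store sits in its core's store buffer while
   the following load executes.  memory_order_seq_cst on the stores forces a
   full barrier (mfence or a locked instruction on x86) and forbids that. */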
so what makes sure that program order is preserved?
Hardware dependency tracking; loads snoop the store buffer to check for recent stores to the location being loaded. This makes sure loads take their data from the last program-order write to any given memory location (see footnote 1).
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

How can OpenMP's round robin scheduling hurt ccNUMA's performance?

I'm trying to understand ccNUMA systems, but I'm a little bit confused about how OpenMP's scheduling can hurt performance. Let's say we have the code below. What happens if c1 is smaller or bigger than c0? I understand the general idea that different chunk sizes lead to remote accesses, but I read somewhere that for small chunk sizes something happens with cache lines, and I got really confused.
#pragma omp parallel for schedule(static,c0)
for (int i = 0; i < N; i++)
    A[i] = 0;

#pragma omp parallel for schedule(static,c1)
for (int i = 0; i < N; i++)
    B[i] = A[i] * i;
When A[] has been allocated using malloc, the OS has only promised that you will get the memory the pointer points to. No actual memory allocation has been performed yet; that is, the physical memory pages have not been assigned. This happens during the first parallel region, where you touch the data for the first time (see also "first-touch policy"). When the first access happens, the OS creates the physical page on the same NUMA domain as the touching thread.
So, depending on how you choose c0 you get a certain distribution of the memory pages across the system. With a bit of math involved you can actually determine which value of c0 will lead to what distribution of the memory pages.
In the second loop, you're using a c1 that is potentially different from c0. For certain values of c1 (especially c1 equal to c0) you should see almost no NUMA traffic on the system, while for others you'll see a lot. Again, it's simple to determine those values mathematically.
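A hedged sketch of the "no NUMA traffic" case, keeping the chunk size identical in both loops (the concrete chunk size, the double element type and the 4 KiB page size are assumptions):

#include <omp.h>

#define N  (1 << 24)
#define C0 4096          /* 4096 doubles = 32 KiB per chunk, a multiple of the
                            assumed 4 KiB page size, so chunks do not split pages */

void init_and_use(double *A, double *B)
{
    /* First touch: thread t faults in the pages backing its own chunks of A,
       so those pages are placed on t's NUMA domain. */
    #pragma omp parallel for schedule(static, C0)
    for (int i = 0; i < N; i++)
        A[i] = 0;

    /* Same chunk size: the thread that placed each page is the one reading it,
       so this loop generates (almost) no remote NUMA accesses. */
    #pragma omp parallel for schedule(static, C0)
    for (int i = 0; i < N; i++)
        B[i] = A[i] * i;
}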
Another thing you might be facing is false sharing. If c0 and c1 are chosen such that the data processed by a chunk is smaller than a cache line, a cache line is shared across multiple threads and thus bounces between the different caches of the system. For example, with 8-byte elements and schedule(static,4), each chunk is only 32 bytes, so two threads' chunks share every 64-byte cache line.

How many ABA tag bits are needed in lock-free data structures?

One popular solution to the ABA problem in lock-free data structures is to tag pointers with an additional monotonically incrementing tag.
struct aba {
    void *ptr;
    uint32_t tag;
};
However, this approach has a problem: it is really slow and has huge cache problems. I can obtain a two-fold speed-up if I ditch the tag field, but is that unsafe?
So my next attempt, for 64-bit platforms, stuffs bits into the ptr field.
struct aba {
    uintptr_t __ptr;
};

uint32_t get_tag(struct aba aba) { return aba.__ptr >> 48U; }
But someone told me that only 16 bits for the tag is unsafe. My new plan is to use pointer alignment to cache lines to stuff more tag bits in, but I want to know if that will work.
If that fails to work, my next plan is to use Linux's MAP_32BIT mmap flag to allocate data so I only need 32 bits of pointer space.
How many bits do I need for the ABA tag in lock-free data structures?
The number of tag bits that is practically safe can be estimated based on the preemption time and the frequency of pointer modifications.
As a reminder, the ABA problem happens when a thread reads the value it wants to change with compare-and-swap, gets preempted, and when it resumes the actual value of the pointer happens to be equal to what the thread read before. Therefore the compare-and-swap operation may succeed despite data structure modifications possibly done by other threads during the preemption time.
The idea of adding the monotonically incremented tag is to make each modification of the pointer unique. For it to succeed, increments must produce unique tag values during the time when a modifying thread might be preempted; i.e. for guaranteed correctness the tag must not wrap around during the whole preemption time.
Let's assume that preemption lasts a single OS scheduling time slice, which is typically tens to hundreds of milliseconds. The latency of CAS on modern systems is tens to hundreds of nanoseconds. So a rough worst-case estimate is that there might be millions of pointer modifications while a thread is preempted (on the order of 100 ms / 100 ns = 10^6), and so there should be 20+ bits in the tag (2^20 ≈ 10^6) in order for it not to wrap around.
In practice it can be possible to make a better estimate for a particular real use case, based on the known frequency of CAS operations. One also needs to estimate the worst-case preemption time more accurately; for example, a low-priority thread preempted by a higher-priority job might end up with a much longer preemption time.
According to the paper
http://web.cecs.pdx.edu/~walpole/class/cs510/papers/11.pdf
Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 6, June 2004, p. 491) by Maged M. Michael,
the tag bits should be sized to make wraparound impossible in real lock-free scenarios (I read this as: if you may have N threads running and each may access the structure, you should have at least N+1 different tag states):
6.1.1 IBM ABA-Prevention Tags
The earliest and simplest lock-free method for node reuse is
the tag (update counter) method introduced with the
documentation of CAS on the IBM System 370 [11]. It
requires associating a tag with each location that is the
target of ABA-prone comparison operations. By incrementing
the tag when the value of the associated location is
written, comparison operations (e.g., CAS) can determine if
the location was written since it was last accessed by the
same thread, thus preventing the ABA problem.
The method requires that the tag contains enough bits to make
full wraparound impossible during the execution of any
single lock-free attempt. This method is very efficient and
allows the immediate reuse of retired nodes.
Depending on your data structure, you may be able to steal some extra bits from the pointers. For example, if the objects are 64 bytes and always aligned on 64-byte boundaries, the lower 6 bits of each pointer could be used for the tags (but that's probably what you already suggested for your new plan).
Another option would be to use an index into your objects instead of pointers.
In case of contiguous objects that would of course simply be an index into an array or vector. In case of lists or trees with objects allocated on the heap, you could use a custom allocator and use an index into your allocated block(s).
For, say, 16M objects you would only need 24 bits, leaving 40 bits for the tag.
This would need some (small and fast) extra calculation to get the address, but if the alignment is a power of two, only a shift and an addition are needed.
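A minimal sketch of that index-plus-tag packing (the 24/40 split, the 64-byte object size and the block_base allocator are the assumptions from above):

#include <stdint.h>

#define OBJ_SIZE   64u                               /* 64-byte, 64-byte-aligned objects */
#define INDEX_BITS 24u                               /* up to 2^24 (~16M) objects        */
#define INDEX_MASK ((UINT64_C(1) << INDEX_BITS) - 1)

static char *block_base;                             /* base of the allocator's block    */

static inline uint64_t aba_pack(uint32_t index, uint64_t tag)
{
    return ((uint64_t)index & INDEX_MASK) | (tag << INDEX_BITS);
}

static inline uint64_t aba_tag(uint64_t packed)
{
    return packed >> INDEX_BITS;                     /* 40 bits of tag */
}

static inline void *aba_ptr(uint64_t packed)
{
    /* "a shift and an addition": index * 64 is a shift by 6, plus the base. */
    return block_base + (packed & INDEX_MASK) * OBJ_SIZE;
}

The whole packed value fits in 64 bits, so it can still be updated with a single-word CAS.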

Hazards of not protecting shared variables in a threaded environment

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing back.
But what if you only have one writer (and lots of readers) and the write does not depend on the previous value? I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. A lot of readers use this offset to timestamp events, and taking a read lock each time is a little expensive. In this situation I don't care if a reader gets the value just before or just after the write, as long as the reader doesn't get garbage (that is, an offset that was never set).
Say that the variable is a 32-bit integer. Is it possible to get a garbage read of the variable in the middle of a write? Or is writing a 32-bit integer an atomic operation? Does it depend on the OS or hardware? What about a 64-bit integer on a 32-bit system?
What about shared memory instead of threading?
Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different than the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.
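Where C11 atomics are available, they make that certainty explicit for exactly this one-writer/many-readers offset case (a sketch; atomic_is_lock_free reports whether the platform needs a hidden lock, e.g. for 64-bit values on some 32-bit targets):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic int64_t offset;              /* the shared timestamp offset    */

void publish_offset(int64_t value)          /* single writer, once per second */
{
    atomic_store_explicit(&offset, value, memory_order_release);
}

int64_t read_offset(void)                   /* many readers, never a torn value */
{
    return atomic_load_explicit(&offset, memory_order_acquire);
}

bool offset_is_lock_free(void)              /* false if the implementation falls
                                               back to an internal lock         */
{
    return atomic_is_lock_free(&offset);
}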
In addition to the above comments, beware of the register bank in a slightly more general setting. You may end up updating only the CPU register and not actually writing it back to main memory right away, or the other way around, where you use a cached register copy while the original value in memory has been updated. Some languages have a volatile keyword to mark a variable as "always read from memory, never cache it locally in a register".
The memory model of your language is important. It describes exactly under what conditions a given value is shared among several threads. Either this is the rules of the CPU architecture you are executing on, or it is determined by a virtual machine in which the language is running. Java for instance has a separate memory model you can look at to figure out what exactly to expect.
An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to its size (on the 486 and later), or if it is unaligned but entirely within a cache line (on the P6 and later). Most compilers will guarantee that stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on the Pentium and later); however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly), and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)
It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.
The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32-bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
    new_index = ...;                 // find a free entry in the global_data array
    global_data[new_index] = blah;
    WriteBarrier();                  // write-release
    global_index = new_index;
}

read()
{
    read_index = global_index;
    ReadBarrier();                   // read-acquire
    return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.
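For reference, roughly the same pattern spelled with C11 atomics (a sketch: the single writer, the fixed ring of slots and the round-robin slot reuse are simplifying assumptions, and the preemption/ABA caveat above still applies):

#include <stdatomic.h>

#define SLOTS 4

static long long        global_data[SLOTS];
static _Atomic unsigned global_index;

void write_value(long long value)     /* single writer */
{
    unsigned next = (atomic_load_explicit(&global_index,
                                          memory_order_relaxed) + 1) % SLOTS;
    global_data[next] = value;                                   /* fill the free slot */
    atomic_store_explicit(&global_index, next,
                          memory_order_release);                 /* write-release      */
}

long long read_value(void)            /* any number of readers */
{
    unsigned idx = atomic_load_explicit(&global_index,
                                        memory_order_acquire);   /* read-acquire       */
    return global_data[idx];
}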
Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit, as in your example) - see the Interlocked* APIs on Windows.
This can avoid the use of a heavier-weight lock for thread-safe variable or member access, but it should not be mixed with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and use Interlocked* to modify or read it in another.
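A small Win32/C sketch of that rule, keeping every access on the Interlocked path (assuming a toolchain where the 64-bit Interlocked variants are available):

#include <windows.h>

static volatile LONG64 shared_value;

void set_value(LONG64 v)
{
    /* Writers go through Interlocked*, never through a plain assignment
       guarded by some unrelated Mutex elsewhere. */
    InterlockedExchange64(&shared_value, v);
}

LONG64 get_value(void)
{
    /* Compare-exchange with identical exchange/comparand values is a common
       idiom for an atomic 64-bit read, even on 32-bit Windows. */
    return InterlockedCompareExchange64(&shared_value, 0, 0);
}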
