How does a thread use the cache memory in a processor? - multithreading

I have read the statement below somewhere, and I cannot really follow it:
There is a slight gain in performance for more than 16 and more than 32
cores. The seeds are integer values, i.e., they require 4 bytes of memory. A
cache line in our system has 64 bytes. Therefore 16 seeds fit into a single
cache line. When going to 17/33 threads, the additional seed is placed in its
own cache line so that the threads are not further obstructed.
The code referred to in this question is provided below:
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    long long int tosses = atoll(argv[1]);
    long long int hits = 0;
    int threads = atoi(argv[2]);
    double start, end;
    int i;

    unsigned int seeds[threads];
    for (i = 0; i < threads; i++)
        seeds[i] = i + 1;

    start = omp_get_wtime();
    #pragma omp parallel reduction(+:hits) num_threads(threads)
    {
        int myrank = omp_get_thread_num();
        long long int local_hits = 0, toss;
        double x, y;
        #pragma omp for
        for (toss = 0; toss < tosses; toss++) {
            x = rand_r(&seeds[myrank]) / (double)RAND_MAX * 2 - 1;
            y = rand_r(&seeds[myrank]) / (double)RAND_MAX * 2 - 1;
            if (x*x + y*y < 1)
                local_hits++;
        }
        hits += local_hits;
    }
    end = omp_get_wtime();

    printf("Pi: %f\n", 4.0 * hits / tosses);
    printf("Duration: %f\n", end - start);
    return 0;
}
The actual question asked was: why does this code scale so badly over multiple cores?
My questions are as follows:
What is conveyed by the above statement? The cache line for the 17th/33rd core can also be invalidated, correct? So how is it different from cores 1 to 16?
Is the threads' own independent memory (stack memory/private memory) part of the cache memory or of the main memory?
How can I relate cache lines and blocks in terms of cache memories?

The seeds are integer values, i.e., they require 4 bytes of memory
This is not always true, though it is the case on most platforms. The C/C++ languages do not prevent sizeof(int) from being 8 or even 2, for example.
Therefore 16 seeds fit into a single cache line.
While this is true, there are no alignment requirements on the seeds array beyond being aligned to sizeof(unsigned int). This means the array can theoretically span two cache lines even with 16 seeds. The alignment depends on the target platform and, more specifically, on the target compiler. In C11, you can use alignas to specify an alignment constraint and ensure the array is aligned to a 64-byte cache line.
This is an important point, since adding a new seed will not change the result much if the array is not aligned in memory. Moreover, threads can work twice as fast if 8 seeds are on one cache line and the other 8 are on another cache line (assuming there are no NUMA effects making things even more complex).
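For reference, here is a minimal sketch of such an alignment constraint, assuming a 64-byte cache line; the MAX_THREADS bound and the helper name are only for the example:
#include <stdalign.h>
#include <stdlib.h>

#define CACHE_LINE  64
#define MAX_THREADS 64   /* hypothetical upper bound, just for the sketch */

/* Option 1: fixed-size array, aligned to the assumed 64-byte line. */
static alignas(CACHE_LINE) unsigned int seeds_static[MAX_THREADS];

/* Option 2: runtime-sized allocation; the size is rounded up to a
   multiple of the alignment, as C11 aligned_alloc expects. */
unsigned int *alloc_seeds(int threads) {
    size_t bytes = threads * sizeof(unsigned int);
    bytes = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, bytes);
}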
Let's assume the array alignment is 64 bytes.
What is conveyed by the above statement? The cache line for the 17th/33rd core can also be invalidated, correct? So how is it different from cores 1 to 16?
With 17 seeds and 17 threads, the first 16 threads will continue to compete for the same cache line. This effect is called cache-line bouncing and it makes the threads run much more slowly. Meanwhile, the new thread can operate on its own cache line, resulting in a much faster execution (only for this specific thread). AFAIK, there is no reason to believe that the other threads will no longer be obstructed, assuming the array alignment stays the same.
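To keep all threads (not only the 17th/33rd one) from bouncing the same line, the usual fix is to pad each seed so that every thread updates its own cache line. A minimal sketch, again assuming a 64-byte line:
#include <stdalign.h>

/* Each element occupies (and is aligned to) one full cache line, so the
   rand_r updates from different threads never touch the same line. */
typedef struct {
    alignas(64) unsigned int value;   /* sizeof(padded_seed) == 64 */
} padded_seed;

/* In the original code: declare padded_seed seeds[threads]; and call
   rand_r(&seeds[myrank].value) instead of rand_r(&seeds[myrank]). */
An even simpler alternative is to copy the seed into a local variable inside the parallel region, so the shared array is never written in the hot loop at all.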
Is the threads' own independent memory (stack memory/private memory) part of the cache memory or of the main memory?
On modern computers, memory is nearly always cached in usual use cases. One can bypass the cache using specific instructions (e.g. non-temporal instructions, though they are only a hint) or a special kind of allocated memory (see write-combining memory), but standard accesses to ordinarily allocated memory (e.g. obtained with malloc) are always cached on mainstream architectures. This is also true for the stack and for the threads' private memory.
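For illustration, this is roughly what a cache-bypassing (non-temporal) store looks like on x86 with SSE2; it is only a sketch, and the function name is made up for the example:
#include <emmintrin.h>   /* SSE2: _mm_stream_si32, _mm_sfence */
#include <stddef.h>

/* movnti stores go through write-combining buffers instead of filling cache
   lines. This is only a hint and only pays off for large buffers that will
   not be re-read soon. */
void zero_bypassing_cache(int *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(p + i, 0);
    _mm_sfence();   /* make the streaming stores globally visible */
}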
How can I relate cache lines and blocks in terms of cache memories?
I am not sure I understand your question. If you want to know which byte in RAM is associated with which cache line then... well... to put it shortly, you cannot. Modern processors use multi-level N-way set-associative caches. Based on an address, you can determine the index of the cache set, but not easily which way of that set the line ends up in. The latter depends on the cache replacement policy, and this algorithm is not officially well documented for mainstream processors. For example, Intel processors appear to use an adaptive replacement policy rather than a pseudo-LRU strategy. One should also consider the mapping of virtual addresses to physical ones. On top of that, Intel processors use an undocumented hash-based strategy to distribute accesses (physical addresses) uniformly across the L3 slices. Thus, if you really want to do that, you first need to pick a specific processor micro-architecture and a specific cache level for the problem to be manageable. You will certainly need a lot of time to understand and simulate what the target processors actually do. Alternatively, you can assume an unrealistic LRU replacement policy to avoid the complexity (madness?) of real-world mainstream processors.
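To make the set/way distinction concrete, here is a small sketch assuming a 32 KiB, 8-way set-associative cache with 64-byte lines (a common L1d geometry); the line offset and the set index follow directly from the address, while the way does not:
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64u                                  /* bytes per line  */
#define NUM_WAYS   8u                                   /* associativity   */
#define CACHE_SIZE (32u * 1024u)
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* 64 sets      */

int main(void) {
    uintptr_t addr = 0x12345678u;            /* hypothetical address */
    unsigned offset = addr % LINE_SIZE;      /* byte within the cache line */
    unsigned set    = (addr / LINE_SIZE) % NUM_SETS;
    printf("offset in line: %u, set index: %u (one of %u sets)\n",
           offset, set, NUM_SETS);
    /* which of the 8 ways the line occupies depends on the replacement
       policy and the access history, not on the address */
    return 0;
}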
Related document/posts:
Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide -- Chapter 12
How does the CPU cache affect the performance of a C program


Do the instruction cache and data cache sync with each other? [duplicate]

I like examples, so I wrote a bit of self-modifying code in c...
#include <stdio.h>
#include <sys/mman.h> // linux

int main(void) {
    unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC,
                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // get executable memory
    c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
    c[1] = 0b11000000; // to register rax (000) which holds the return value
                       // according to linux x86_64 calling convention
    c[6] = 0b11000011; // return
    for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
        // rest of immediate data (c[3:6]) are already set to 0 by MAP_ANONYMOUS
        printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call ptr
    }
    putchar('\n');
    return 0;
}
...which works, apparently:
>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
But honestly, I didn't expect it to work at all. I expected the instruction containing the immediate byte c[2] (initially 0) to be cached upon the first call to c, after which all subsequent calls to c would ignore the repeated changes made to c (unless I somehow explicitly invalidated the cache). Luckily, my CPU appears to be smarter than that.
I guess the CPU compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when it doesn't match (all of it?), but I'm hoping to get more precise information on that. In particular, I'd like to know whether this behavior can be considered predictable (barring any differences in hardware and OS) and relied on.
(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)
What you are doing is usually referred to as self-modifying code. Intel's platforms (and probably AMD's too) do the job of maintaining i/d-cache coherency for you, as the manual points out (Manual 3A, System Programming):
11.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the
processor causes the associated cache line (or lines) to be invalidated.
But this assertion is valid as long as the same linear address is used for modifying and fetching, which is not the case for debuggers and binary loaders since they don't run in the same address-space:
Applications that include self-modifying code use the same
linear address for modifying and fetching the instruction. Systems software, such as
a debugger, that might possibly modify an instruction using a different linear address
than that used to fetch the instruction, will execute a serializing operation, such as a
CPUID instruction, before the modified instruction is executed, which will automatically
resynchronize the instruction cache and prefetch queue.
By contrast, a serializing operation is always required on many other architectures, such as PowerPC, where it must be done explicitly (E500 Core Manual):
3.3.1.2.1 Self-Modifying Code
When a processor modifies any memory location that can contain an instruction, software must
ensure that the instruction cache is made consistent with data memory and that the modifications
are made visible to the instruction fetching mechanism. This must be done even if the cache is
disabled or if the page is marked caching-inhibited.
It is interesting to notice that PowerPC requires the issue of a context-synchronizing instruction even when caches are disabled; I suspect it enforces a flush of deeper data processing units such as the load/store buffers.
The code you proposed is unreliable on architectures without snooping or advanced cache-coherency facilities, and therefore likely to fail.
Hope this helps.
It's pretty simple; the write to an address that's in one of the cache lines in the instruction cache invalidates it from the instruction cache. No "synchronization" is involved.
The CPU handles cache invalidation automatically, you don't have to do anything manually. Software can't reasonably predict what will or will not be in CPU cache at any point in time, so it's up to the hardware to take care of this. When the CPU saw that you modified data, it updated its various caches accordingly.
By the way, many x86 processors (that I worked on) snoop not only the instruction cache but also the pipeline, instruction window - the instructions that are currently in flight. So self modifying code will take effect the very next instruction. But, you are encouraged to use a serializing instruction like CPUID to ensure that your newly written code will be executed.
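As an illustration of that advice, here is a hedged sketch (GCC/Clang assumed) of what you could run after writing the new bytes and before calling them. __builtin___clear_cache is what makes this portable to architectures such as ARM or PowerPC whose instruction caches are not snooped; the CPUID is the serializing instruction mentioned above (on x86 it is belt-and-braces rather than strictly required for same-address self-modification):
#include <stdint.h>

static inline void sync_code(void *begin, void *end) {
    /* Tell the toolchain the bytes in [begin, end) are about to be executed;
       a no-op on x86, but it emits the required cache maintenance elsewhere. */
    __builtin___clear_cache((char *)begin, (char *)end);
#if defined(__x86_64__) || defined(__i386__)
    /* CPUID is a serializing instruction: it drains the pipeline and the
       prefetch queue before the modified code runs. */
    uint32_t eax = 0, ebx, ecx, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx; (void)ecx; (void)edx;
#endif
}
In the example above you would call sync_code(c, c + 7); after each write to c and before the call through the function pointer (strictly needed only on non-x86 targets).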
I just reached this page in one of my searches and want to share my knowledge of this area of the Linux kernel!
Your code executes as expected and there are no surprises for me here. The mmap() syscall and the processor's cache-coherency protocol do this trick for you. The flags PROT_READ|PROT_WRITE|PROT_EXEC ask mmap() to set up the iTLB and dTLB of the L1 cache and the TLB of the L2 cache for this physical page correctly. This low-level, architecture-specific kernel code does it differently depending on the processor architecture (x86, AMD, ARM, SPARC, etc.). Any kernel bug here will mess up your program!
This is just for explanation purpose.
Assume that your system is not doing much and there are no process switches between "a[0]=0b01000000;" and the start of "printf("\n");"...
Also, assume that you have 1 KB of L1 iCache and 1 KB of dCache in your processor, plus some L2 cache in the core. (Nowadays these are on the order of a few MB.)
mmap() sets up your virtual address space and iTLB1, dTLB1 and TLB2s.
"a[0]=0b01000000;" will actually Trap(H/W magic) into kernel code and your physical address will be setup and all Processor TLBs will be loaded by the kernel. Then, You will be back into user mode and your processor will actually Load 16 bytes(H/W magic a[0] to a[3]) into L1 dCache and L2 Cache. Processor will really go into Memory again, only when you refer a[4] and and so on(Ignore the prediction loading for now!). By the time you complete "a[7]=0b11000011;", Your processor had done 2 burst READs of 16 bytes each on the eternal Bus. Still no actual WRITEs into physical memory. All WRITEs are happening within L1 dCache(H/W magic, Processor knows) and L2 cache so for and the DIRTY bit is set for the Cache-line.
"a[3]++;" will have STORE Instruction in the Assembly code, but the Processor will store that only in L1 dCache&L2 and it will not go to Physical Memory.
Let's come to the function call "a()". Again the processor do the Instruction Fetch from L2 Cache into L1 iCache and so on.
Result of this user mode program will be the same on any Linux under any processor, due to correct implementation of low level mmap() syscall and Cache coherency protocol!
If You are writing this code under any embedded processor environment without OS assistance of mmap() syscall, you will find the problem you are expecting. This is because your are not using either H/W mechanism(TLBs) or software mechanism(memory barrier instructions).

How can OpenMP's round robin scheduling hurt ccNUMA's performance?

I'm trying to understand ccNUMA systems, but I'm a little bit confused about how OpenMP's scheduling can hurt performance. Let's say we have the code below. What happens if c1 is smaller or bigger than c0? I understand the general idea that different chunk sizes lead to remote accesses, but I read somewhere that for small chunk sizes something happens with cache lines, and I got really confused.
#pragma omp parallel for schedule(static, c0)
for (int i = 0; i < N; i++)
    A[i] = 0;

#pragma omp parallel for schedule(static, c1)
for (int i = 0; i < N; i++)
    B[i] = A[i] * i;
When A[] has been allocated using malloc, the OS only promised that you will get the memory the pointer is pointing to. No actual memory allocation has been performed, that is, the physical memory pages have not been assigned yet. This happens when you execute the first parallel region where you touch the data for the first time (see also "first-touch policy"). When the first access happens, the OS creates the physical page on the same NUMA domain that executes the touching thread.
So, depending on how you choose c0 you get a certain distribution of the memory pages across the system. With a bit of math involved you can actually determine which value of c0 will lead to what distribution of the memory pages.
In the second loop, you're using a c1 that is potentially different from c0. For certain values of c1 (especially c1 equal to c0) you should see almost no NUMA traffic on the system, while for others you'll see a lot. Again, it's simple to determine those values mathematically.
Another thing you might be facing is false sharing. If c0 and c1 are chosen such that the data processed by a chunk is less than the size of a cache line, you'll see that a cache line is shared across multiple threads and thus is bouncing between the different caches of the system.
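For concreteness, here is a minimal sketch under common assumptions (8-byte doubles, 64-byte cache lines, 4 KiB pages). With schedule(static, c), a chunk covers c * 8 bytes, so c = 8 makes each chunk exactly one cache line (no false sharing if the array is line-aligned) and c = 512 makes it exactly one page; using the same value in both loops keeps each page on the NUMA node of the thread that first touched it. The value 512 below is only an illustration:
#include <stdlib.h>

int main(void) {
    enum { N = 1 << 20 };
    double *A = malloc(N * sizeof *A);
    double *B = malloc(N * sizeof *B);

    /* First touch: with c = 512 each chunk is 512 * 8 = 4096 bytes, i.e.
       exactly one page, so whole pages land on the touching thread's node. */
    #pragma omp parallel for schedule(static, 512)
    for (int i = 0; i < N; i++)
        A[i] = 0;

    /* Same chunk size: each thread reads the pages it first touched, and
       each 64-byte cache line is only ever written by one thread. */
    #pragma omp parallel for schedule(static, 512)
    for (int i = 0; i < N; i++)
        B[i] = A[i] * i;

    free(A);
    free(B);
    return 0;
}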

Threading and vectorisation optimisations

I'm really new to code optimisation techniques, and I'm currently trying to optimise a loop section of a piece of code, which should be trivially easy.
for (int i = 0; i < N; i++)
{
    array[i] = 0.0f;
    array2[i] = 0.0f;
    array3[i] = 0.0f;
}
I tried to implement vectorisation and threading as follows:
// note: requires <immintrin.h>, OpenMP, and 16-byte-aligned arrays for _mm_load_ps/_mm_store_ps
int i;
int loop_unroll = (int) (N/4)*4;
#pragma omp parallel for shared(array, array2, array3)
for (int i = 0; i < loop_unroll; i += 4)
{
    __m128 array_vector = _mm_load_ps(array + i); // loaded value is overwritten right below
    array_vector = _mm_set1_ps(0.0f);
    _mm_store_ps(array + i, array_vector);
    _mm_store_ps(array2 + i, array_vector);
    _mm_store_ps(array3 + i, array_vector);
}
// remainder loop: the parallel loop's counter is private, so start from loop_unroll
for (i = loop_unroll; i < N; i++)
{
    array[i] = 0.0f;
    array2[i] = 0.0f;
    array3[i] = 0.0f;
}
Regardless of the input size N I run this with, the 'optimised' version always takes longer.
I thought this was due to the overhead associated with setting up the threads and registers, but for the largest N before the program becomes too slow to use, the overhead still isn't mitigated by the faster code.
This makes me wonder if the optimisation techniques used are implemented incorrectly?
Compiling + benchmarking with optimization disabled is your main problem. Un-optimized code with intrinsics is typically very slow. It's not useful to compare intrinsics vs. scalar with optimization disabled, gcc -O0 usually hurts intrinsics more.
Once you stop wasting your time with unoptimized code, you'll want to let gcc -O3 optimize the scalar loop into 3 memset operations, rather than interleaving 3 streams of stores in a loop.
Compilers (and optimized libc memset implementations) are good at optimizing memory zeroing, and can probably do a better job than your simple manual vectorization. It's probably not too bad, though, especially if your arrays aren't already hot in L1d cache. (If they were, then using only 128-bit stores is much slower than 256-bit or 512-bit vectors on CPUs with wide data paths.)
I'm not sure what the best way to multi-thread with OpenMP while still letting the compiler optimize to memset. It might not be terrible to let omp parallel for parallelize code that stores 3 streams from each thread, as long as each thread is working on contiguous chunks in each array. Especially if code that later updates the same arrays will be distributed the same way, so each thread is working on a part of the arrays that it zeroed earlier, and is maybe still hot in L1d or L2 cache.
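One possibility (a sketch under those assumptions, not a benchmark-proven recipe) is to drop the worksharing loop and hand each thread one contiguous chunk of every array, letting memset do the actual zeroing; this also gives each thread a single sequential stream per array, in line with the points below. The helper name and chunking scheme are made up for the example:
#include <omp.h>
#include <string.h>

void zero_arrays(float *a, float *b, float *c, int n) {
    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();
        int begin = (int)((long long)n * t / nt);        /* contiguous chunk */
        int end   = (int)((long long)n * (t + 1) / nt);
        size_t bytes = (size_t)(end - begin) * sizeof(float);
        /* all-bits-zero is 0.0f for IEEE-754 floats */
        memset(a + begin, 0, bytes);   /* whole chunk of the 1st array ... */
        memset(b + begin, 0, bytes);   /* ... then the 2nd ...             */
        memset(c + begin, 0, bytes);   /* ... then the 3rd                 */
    }
}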
Of course, if you can avoid it, do the zeroing on the fly as part of another loop that has some useful computation. If your next step is going to be a[i] += something, then instead optimize that to a[i] = something for the first pass through the array so you don't need to zero it first.
See Enhanced REP MOVSB for memcpy for lots of x86 memory performance details, especially the "latency bound platforms" section which explains why single-threaded memory bandwidth (to L3/DRAM) is worse on a big Xeon than a quad-core desktop, even though aggregate bandwidth from multiple threads can be much higher when you have enough threads to saturate the quad-channel memory.
For store performance (ignoring cache locality for later work) I think it's best to have each thread working on (a chunk of) 1 array; a single sequential stream is probably the best way to max out single-threaded store bandwidth. But caches are associative, and 3 streams is few enough that it won't usually cause conflict misses before you write a full line. Desktop Skylake's L2 cache is only 4-way associative, though.
There are DRAM page-switching effects from storing multiple streams, and 1 stream per thread is probably better than 3 streams per thread. So if you do want each thread zeroing a chunk of 3 separate arrays, ideally you'd want each thread to memset its whole chunk of the first array, then its whole chunk of the 2nd array, rather than interleaving 3 arrays.

Is a dirty read from memory possible when multi-threading?

In this case, I define dirty read as reading from memory when it's currently being written to by another thread.
So say thread #1 writes 100 to a variable that thread #2 can also see, and thread #2 writes 50 to the same variable, with both threads doing so in a loop.
Is it possible to read the variable and get a number that is neither 50 nor 100?
No locks etc. are used for synchronization.
More detail about my setup: Intel i3 CPU, I'm programming in C#.
Below is an example of what I mean:
using System;
using System.Collections.Generic;
using System.Threading;

namespace Threading001
{
    class Program
    {
        static void Main(string[] args)
        {
            long min = long.MinValue;
            long max = long.MaxValue;
            object number = max;

            new Thread(() =>
            {
                long current2 = (long)number;
                if (current2 != min && current2 != max)
                {
                    Console.WriteLine("Unexpected number from thread 2: {0}.", current2);
                }
                number = min;
            }).Start();

            while (true)
            {
                long current = (long)number;
                if (current != min && current != max)
                {
                    Console.WriteLine("Unexpected number from thread 1: {0}.", current);
                }
                number = max;
            }
        }
    }
}
I made number an object so the memory is allocated on the heap and not the stack, to try and increase memory access latency times. Although CPU caching will probably prevent that anyway.
You're actually trying to rely on several different things here.
Firstly, there's the matter of atomicity. ECMA-335 states:
A conforming CLI shall guarantee that read and write access to properly aligned memory
locations no larger than the native word size (the size of type native int) is atomic
(see §I.12.6.2) when all the write accesses to a location are the same size. Atomic writes shall
alter no bits other than those written. Unless explicit layout control (see
Partition II (Controlling Instance Layout)) is used to alter the default behavior, data elements no
larger than the natural word size (the size of a native int) shall be properly aligned. Object
references shall be treated as though they are stored in the native word size.
So for a 32-bit integer, you're always fine - and for a 64-bit integer, you're fine if you're running on a 64-bit CLR... assuming your variables are aligned, which they are normally.
However, you've also got boxing involved - I don't think that will actually cause any problems here, but it does mean you're dealing with the visibility of multiple writes: one for the number variable itself, and one for the data within the box. With the .NET implementation, I believe that's still safe due to the stronger memory guarantees it provides - I wouldn't like to absolutely guarantee that it's safe within the ECMA memory model.
Finally, there's the matter of whether a write is visible or not, which is beyond atomicity. If thread T1 changes an int value from 0 to 100, then thread T2 reads it, atomicity guarantees that T2 will see either 0 or 100, never some other bit pattern - but there usually has to be some kind of memory barrier involved to guarantee that T2 will actually see the new value instead of a stale value. This is a really complex area - if you want to know more, I suggest you start with Joe Duffy's 2007 blog post and work from there.
Note that min, max and number will be on the heap anyway, as they've been captured by your lambda expressions... although the stack/heap is an implementation detail.
It depends on what the variable is (and on what processor, etc.), but generally: yes, dirty reads are possible.
I am not familiar with details of your specific CPU but in general it depends on whether or not the READ/WRITE is atomic, which in turn depends on the architecture and how the variable is stored.
If the variable has larger size than CPU word size, it may not be atomic.
A modern CPU may guarantee atomic access to an aligned memory address; it needs more consideration if there is no HW support for misaligned memory access. If a misaligned access is handled in software, the read or write is not atomic: one load/store may actually lead to two operations. One example is PowerPC/Linux, where the kernel handles misaligned memory accesses in the exception handler.
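For what it's worth, the split-access case can be demonstrated in C (not C#) by forcing a 64-bit value to straddle a cache-line boundary. The sketch below deliberately contains a data race and a misaligned access, so it is purely illustrative; whether a torn value actually shows up depends on the hardware (compile with something like gcc -O2 -pthread):
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static _Alignas(64) unsigned char buf[128];
static volatile uint64_t *shared;   /* will point 4 bytes before a line boundary */

static void *writer(void *arg) {
    (void)arg;
    for (;;) {                      /* alternate between two recognizable patterns */
        *shared = 0;
        *shared = UINT64_MAX;
    }
    return NULL;
}

int main(void) {
    /* bytes 60..67 straddle the boundary between two 64-byte lines */
    shared = (volatile uint64_t *)(void *)(buf + 60);
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    for (long i = 0; i < 100000000L; i++) {
        uint64_t v = *shared;
        if (v != 0 && v != UINT64_MAX) {
            printf("torn read: 0x%016" PRIx64 "\n", v);
            return 0;
        }
    }
    printf("no torn read observed this run\n");
    return 0;
}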
This is all about whether the access is thread-safe or not; it depends on the data type of the variable you want to read.
Some types, such as int and byte, are read and written atomically, so you can read and write them from multiple threads without tearing.
you can find more information here
http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx
Are primitive data types in c# atomic (thread safe)?
What is thread safe (C#) ? (Strings, arrays, ... ?)

memory allocation inside a CUDA kernel

I have the following (snippet) of a kernel.
__global__ void plain(int* geneVec, float* probs, int* nComponents, float* randomNumbers, int *nGenes)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);
    float* currentProbs = (float*)malloc(sizeof(float)*tmp);
    .....
    .....
    currentProbs[0] = probs[start];
    for (k = 1; k < nComponents[0]; k++)
    {
        currentProbs[k] = currentProbs[k-1] + prob;
    }
    ...
    ...
    free(currentProbs);
}
When it's static (even the same sizes) it's very fast, but when CurrentProbs is dynamically allocated (as above) performance is awful.
This question said I could do this inside a kernel: CUDA allocate memory in __device__ function
Here is a related question: Efficiency of Malloc function in CUDA
I was wondering if any other methods have solved this other than the one proposed in the paper?
It seems ridiculous that one cannot malloc/free inside a kernel without this sort of penalty.
I think the reason introducing malloc() slows your code down is that it allocates memory in global memory. When you use a fixed size array, the compiler is likely to put it in the register file, which is much faster.
Having to do a malloc inside your kernel may mean that you're trying to do too much work with a single kernel. If each thread allocates a different amount of memory, then each thread runs a different number of times in the for loop, and you get lots of warp divergence.
If each thread in a warp runs loops the same number of times, just allocate up front. Even if they run a different number of times, you can use a constant size. But instead, I think you should look at how you can refactor your code to entirely remove that loop from your kernel.
