Threading and vectorisation optimisations - multithreading

I'm really new to code optimisation techniques, and I'm currently trying to optimise a loop section of a piece of code, which should be trivially easy.
for (int i = 0; i < N; i++)
{
    array[i] = 0.0f;
    array2[i] = 0.0f;
    array3[i] = 0.0f;
}
I tried to implement vectorisation and threading as follows:
int i;
int loop_unroll = (int) (N/4)*4;

#pragma omp parallel for shared(array, array2, array3)
for (int i = 0; i < loop_unroll; i += 4)
{
    __m128 array_vector = _mm_load_ps(array+i);
    array_vector = _mm_set1_ps(0.0f);
    _mm_store_ps(array+i, array_vector);
    _mm_store_ps(array2+i, array_vector);
    _mm_store_ps(array3+i, array_vector);
}

for (; i < N; i++)
{
    array[i] = 0.0f;
    array2[i] = 0.0f;
    array3[i] = 0.0f;
}
Regardless of the input size N I run this with, the 'optimised' version always takes longer.
I thought this was due to the overhead of setting up the threads and registers, but even for the largest N I can use before the program becomes too slow, the overhead still isn't outweighed by the faster code.
This makes me wonder whether I have implemented these optimisation techniques incorrectly.

Compiling + benchmarking with optimization disabled is your main problem. Un-optimized code with intrinsics is typically very slow, so it's not useful to compare intrinsics vs. scalar with optimization disabled; gcc -O0 usually hurts intrinsics more.
Once you stop wasting your time with unoptimized code, you'll want to let gcc -O3 optimize the scalar loop into 3 memset operations, rather than interleaving 3 streams of stores in a loop.
Compilers (and optimized libc memset implementations) are good at optimizing memory zeroing, and can probably do a better job than your simple manual vectorization. It's probably not too bad, though, especially if your arrays aren't already hot in L1d cache. (If they were, then using only 128-bit stores is much slower than 256-bit or 512-bit vectors on CPUs with wide data paths.)
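For illustration only (the compiler will normally do this for you at -O3), a 256-bit version of the manual zeroing could look roughly like this; it assumes AVX is available, n is a multiple of 8, and the pointer is 32-byte aligned (use _mm256_storeu_ps otherwise):

#include <immintrin.h>

// Zero one array with 256-bit AVX stores.
// Assumes n is a multiple of 8 and that 'a' is 32-byte aligned.
static void zero_avx(float* a, int n)
{
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        _mm256_store_ps(a + i, zero);
}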
I'm not sure what the best way is to multi-thread with OpenMP while still letting the compiler optimize to memset. It might not be terrible to let omp parallel for parallelize code that stores 3 streams from each thread, as long as each thread is working on contiguous chunks of each array. Especially if code that later updates the same arrays will be distributed the same way, so each thread is working on a part of the arrays that it zeroed earlier, and that is maybe still hot in L1d or L2 cache.
Of course, if you can avoid it, do the zeroing on the fly as part of another loop that has some useful computation. If your next step is going to be a[i] += something, then instead optimize that to a[i] = something for the first pass through the array so you don't need to zero it first.
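As a minimal sketch of that idea (compute() here is a hypothetical stand-in for whatever produces the values you would otherwise accumulate onto zeros):

#include <cstddef>

// Hypothetical per-element value; stands in for the real computation.
static float compute(std::size_t i) { return (float)i; }

// First pass: plain assignment, so the array never needs to be zeroed.
static void first_pass(float* array, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        array[i] = compute(i);      // instead of array[i] += compute(i) onto zeros
}

// Later passes accumulate as usual.
static void later_pass(float* array, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        array[i] += compute(i);
}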
See Enhanced REP MOVSB for memcpy for lots of x86 memory performance details, especially the "latency bound platforms" section which explains why single-threaded memory bandwidth (to L3/DRAM) is worse on a big Xeon than a quad-core desktop, even though aggregate bandwidth from multiple threads can be much higher when you have enough threads to saturate the quad-channel memory.
For store performance (ignoring cache locality for later work) I think it's best to have each thread working on (a chunk of) 1 array; a single sequential stream is probably the best way to max out single-threaded store bandwidth. But caches are associative, and 3 streams is few enough that it won't usually cause conflict misses before you write a full line. Desktop Skylake's L2 cache is only 4-way associative, though.
There are DRAM page-switching effects from storing multiple streams, and 1 stream per thread is probably better than 3 streams per thread. So if you do want each thread zeroing a chunk of 3 separate arrays, ideally you'd want each thread to memset its whole chunk of the first array, then its whole chunk of the 2nd array, rather than interleaving 3 arrays.
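A sketch of that layout, assuming you do want all three arrays zeroed in one parallel region: each thread finishes its whole contiguous chunk of one array before touching the next array, so it keeps a single store stream at a time (all-zero bytes are 0.0f for IEEE-754 floats, so memset is safe here).

#include <cstring>
#include <omp.h>

static void zero_arrays(float* array, float* array2, float* array3, int N)
{
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid = omp_get_thread_num();

        // One contiguous chunk of [0, N) per thread.
        int chunk = (N + nthreads - 1) / nthreads;
        int begin = tid * chunk;
        int end = (begin + chunk < N) ? begin + chunk : N;

        if (begin < end)
        {
            std::size_t bytes = (std::size_t)(end - begin) * sizeof(float);
            // Whole chunk of the 1st array, then the 2nd, then the 3rd,
            // rather than interleaving three store streams.
            std::memset(array + begin, 0, bytes);
            std::memset(array2 + begin, 0, bytes);
            std::memset(array3 + begin, 0, bytes);
        }
    }
}

Whether this beats a single-threaded memset per array depends on whether the later work is distributed the same way, as discussed above.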

Related

How does a thread use the cache memory in a processor?

I have read the statement below somewhere, and I cannot really follow it:
There is a slight gain in performance for more than 16 and more than 32 cores. The seeds are integer values, i.e., they require 4 bytes of memory. A cache line in our system has 64 bytes. Therefore 16 seeds fit into a single cache line. When going to 17/33 threads, the additional seed is placed in its own cache line so that the threads are not further obstructed.
The code referred to in this question is provided below:
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    long long int tosses = atoll(argv[1]);
    long long int hits = 0;
    int threads = atoi(argv[2]);
    double start, end;
    int i;
    unsigned int seeds[threads];

    for (i = 0; i < threads; i++)
        seeds[i] = i + 1;

    start = omp_get_wtime();
    #pragma omp parallel reduction(+:hits) num_threads(threads)
    {
        int myrank = omp_get_thread_num();
        long long int local_hits = 0, toss;
        double x, y;
        #pragma omp for
        for (toss = 0; toss < tosses; toss++) {
            x = rand_r(&seeds[myrank])/(double)RAND_MAX * 2 - 1;
            y = rand_r(&seeds[myrank])/(double)RAND_MAX * 2 - 1;
            if (x*x + y*y < 1)
                local_hits++;
        }
        hits += local_hits;
    }
    end = omp_get_wtime();

    printf("Pi: %f\n", 4.0 * hits / tosses);
    printf("Duration: %f\n", end-start);
    return 0;
}
The question actually asked there was: why does this code scale so badly over multiple cores?
My questions are as follows:
What is conveyed by the statement above? The cache line for the 17th/33rd core can also be invalidated, correct? So how is that different from cores 1 to 16?
Is the threads' own independent memory (stack memory/private memory) part of the cache memory or of main memory?
How do cache lines relate to blocks in terms of cache memories?
The seeds are integer values, i.e., they require 4 bytes of memory
This is not always true, though it is the case on most platforms. The C and C++ languages do not prevent sizeof(int) from being 8, or even 2, for example.
Therefore 16 seeds fit into a single cache line.
While this is true, there are no alignment requirements on the seeds array beyond being aligned to sizeof(unsigned int). This means the array can theoretically span two cache lines even with 16 seeds. The actual alignment depends on the target platform and, more specifically, on the target compiler. In C11, you can use alignas to specify an alignment constraint and ensure the array starts on a 64-byte cache-line boundary.
This is an important point, since adding a new seed will not change the result much if the array is not aligned in memory. Moreover, threads can run twice as fast if 8 seeds are on one cache line and the other 8 are on another cache line (assuming there are no NUMA effects making things even more complex).
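For illustration, the alignment constraint might look like this (C++11 alignas; C11 spells it the same way via <stdalign.h>), with an assumed fixed upper bound on the thread count:

enum { kMaxThreads = 64 };   // assumed upper bound, just for illustration

// Force the seeds array to start on a 64-byte cache-line boundary,
// so the first 16 seeds are guaranteed to share exactly one line.
alignas(64) static unsigned int seeds[kMaxThreads];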
Let's assume the array alignment is 64 bytes.
What is conveyed by the statement above? The cache line for the 17th/33rd core can also be invalidated, correct? So how is that different from cores 1 to 16?
With 17 seeds and 17 threads, the first 16 threads will continue to compete for the same cache line. This effect is called cache-line bouncing, and it makes those threads run much more slowly. Meanwhile, the new thread can operate on its own cache line, resulting in much faster execution (but only for this specific thread). AFAIK, there is no reason to believe that the other threads will not still be obstructed, assuming the array alignment is the same.
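For what it's worth, the usual way to remove the bouncing entirely (not something the quoted statement does, just an illustration) is to give every seed its own cache line, for example:

// alignas(64) makes sizeof(PaddedSeed) a multiple of 64, so each array
// element lands on its own cache line (64 bytes is an assumed line size).
struct alignas(64) PaddedSeed
{
    unsigned int value;
};

static PaddedSeed padded_seeds[64];   // one per thread; 64 is an assumed maximum
// Each thread would then call rand_r(&padded_seeds[myrank].value).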
Is the threads' own independent memory (stack memory/private memory) part of the cache memory or of main memory?
On modern computers, memory is nearly always cached in usual use-cases. One can bypass the cache using specific instructions (e.g. non-temporal instructions, though they are only a hint) or a special kind of allocated memory (see write-combining memory), but standard accesses to normally allocated memory (e.g. using malloc) are always cached on mainstream architectures. This is also true for the stack and for private memory.
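Purely as an illustration of such a hint (not something the code above needs), an SSE non-temporal store looks like this; it assumes the destination is 16-byte aligned and n is a multiple of 4:

#include <immintrin.h>

// Streaming (non-temporal) stores: a hint to write around the caches.
static void stream_zero(float* dst, int n)
{
    __m128 zero = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, zero);
    _mm_sfence();   // make the streaming stores globally visible
}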
How do cache lines relate to blocks in terms of cache memories?
I am not sure I understand your question. If you want to know which byte in RAM is associated with which cache line, then, to put it shortly, you cannot. Modern processors use multi-level N-way set-associative caches. Based on an address, you can determine the index of the cache set, but not easily which way of the set will be used. The latter depends on the cache replacement policy, and that algorithm is not officially well-documented for mainstream processors. For example, Intel processors appear to use an adaptive replacement policy rather than a pseudo-LRU strategy. One should also consider the mapping of virtual addresses to physical ones. On top of that, Intel processors use an undocumented hash-based strategy to distribute accesses (physical addresses) uniformly across the L3 cache. Thus, if you really want to do this, you first need to pick a specific processor micro-architecture and a specific cache level for the problem to be manageable, and you will certainly need a lot of time to understand and simulate what the target processors actually do. Alternatively, you can assume an unrealistic LRU replacement method to avoid the complexity (madness?) of real-world mainstream processors.
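For intuition, the only part you can compute from an address alone is the set index, and even that needs assumed parameters. A toy sketch (64-byte lines and 1024 sets are assumptions, not any particular CPU):

#include <cstdint>

// Which set of a set-associative cache an address maps to, in a toy model.
// Which *way* inside the set gets used depends on the replacement policy
// and cannot be derived from the address.
static std::uint64_t cache_set_index(std::uint64_t phys_addr)
{
    const std::uint64_t line_size = 64;    // assumed
    const std::uint64_t num_sets  = 1024;  // assumed
    return (phys_addr / line_size) % num_sets;
}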
Related document/posts:
Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide -- Chapter 12
How does the CPU cache affect the performance of a C program

Low 'Average Physical Core Utilization' according to VTune when using OpenMP, not sure what the bigger picture is

I have been optimizing a ray tracer, and to get a nice speed-up I used OpenMP roughly as follows (C++):
Accelerator accelerator; // Has the data to make tracing way faster
Rays rays;               // Makes the rays so they're ready to go

#pragma omp parallel for
for (int y = 0; y < window->height; y++) {
    for (int x = 0; x < window->width; x++) {
        Ray& ray = rays.get(x, y);
        accelerator.trace(ray);
    }
}
I gained 4.85x performance on a 6 core/12 thread CPU. I thought I'd get more than that, maybe something like 6-8x... especially when this eats up >= 99% of the processing time of the application.
I want to find out where my performance bottleneck is, so I opened VTune and profiled. Note that I am new to profiling, so maybe this is normal, but this is the graph I got:
In particular, this is the 2nd biggest time consumer:
where the 58% is the microarchitecture usage.
Trying to solve this on my own, I went looking for information on this, but the most I could find was on Intel's VTune wiki pages:
Average Physical Core Utilization
Metric Description
The metric shows average physical cores utilization by computations of the application. Spin and Overhead time are not counted. Ideal average CPU utilization is equal to the number of physical CPU cores.
I'm not sure what this is trying to tell me, which leads me to my question:
Is a result like this normal, or is something going wrong somewhere? Is it okay to see only a 4.8x speedup (compared to a theoretical max of 12.0) for something that is embarrassingly parallel? While ray tracing itself can be unfriendly because the rays bounce everywhere, I have done what I can to compact the memory and be as cache friendly as possible, used libraries that utilize SIMD for calculations, implemented countless techniques from the literature to speed things up, avoided branching as much as possible, and used no recursion. I also distributed the rays so that there's no false sharing AFAIK: each row is done by one thread, so no two threads should be writing to the same cache line (especially since ray traversal is all const). Also, the framebuffer is row major, so I was hoping false sharing wouldn't be an issue from that either.
I do not know whether a profiler simply reports results like this for a main loop threaded with OpenMP and this is expected, or whether I have made some kind of newbie mistake and I'm not getting the throughput that I want. I also checked that it spawns 12 threads, and OpenMP does.
I guess tl;dr, am I screwing up using OpenMP? From what I gathered, the average physical core utilization is supposed to be up near the average logical core utilization, but I almost certainly have no idea what I'm talking about.
IMHO you're doing it right, and you're overestimating the efficiency of parallel execution. You did not give details about the architecture you're using (CPU, memory, etc.), nor the full code... but to put it simply, I suspect that beyond a 4.8x speed increase you're hitting the memory bandwidth limit, so RAM speed is your bottleneck.
Why?
As you said, ray tracing is not hard to run in parallel and you're doing it right, so if the CPU is not 100% busy my guess is your memory controller is.
Supposing you're tracing a model (triangles? voxels?) that is in RAM, your rays need to read bits of the model when checking for hits. Check your maximum RAM bandwidth, divide it by 12 (threads), then divide it by the number of rays per second... and you'll find that even 40 GB/s is "not so much" when you trace a lot of rays. That's why GPUs are a better option for ray tracing.
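As a purely illustrative calculation (these numbers are assumptions, not measurements from your machine): with 40 GB/s of DRAM bandwidth shared by 12 threads, each thread gets roughly 3.3 GB/s; if each thread traces 50 million rays per second, that leaves only about 66 bytes of DRAM traffic per ray before memory, not the cores, becomes the limit.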
Long story short, I suggest you try to profile memory usage.

Is a dirty read from memory possible when multi-threading?

In this case, I define dirty read as reading from memory when it's currently being written to by another thread.
Say thread #1 writes 100 to a variable that thread #2 can also see, and thread #2 writes 50 to the same variable, with both threads doing so in a loop.
Is it possible to read the variable and get a number that is neither 50 nor 100?
No locks etc. are used for sync.
More detail about my setup: Intel i3 CPU, I'm programming in C#.
Below is an example of what I mean:
using System;
using System.Collections.Generic;
using System.Threading;

namespace Threading001
{
    class Program
    {
        static void Main(string[] args)
        {
            long min = long.MinValue;
            long max = long.MaxValue;
            object number = max;

            new Thread(() =>
            {
                long current2 = (long)number;
                if (current2 != min && current2 != max)
                {
                    Console.WriteLine("Unexpected number from thread 2: {0}.", current2);
                }
                number = min;
            }).Start();

            while (true)
            {
                long current = (long)number;
                if (current != min && current != max)
                {
                    Console.WriteLine("Unexpected number from thread 1: {0}.", current);
                }
                number = max;
            }
        }
    }
}
I made number an object so the memory is allocated on the heap rather than the stack, to try to increase memory access latency. Although CPU caching will probably defeat that anyway.
You're actually trying to rely on several different things here.
Firstly, there's the matter of atomicity. ECMA-335 states:
A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §I.12.6.2) when all the write accesses to a location are the same size. Atomic writes shall alter no bits other than those written. Unless explicit layout control (see Partition II (Controlling Instance Layout)) is used to alter the default behavior, data elements no larger than the natural word size (the size of a native int) shall be properly aligned. Object references shall be treated as though they are stored in the native word size.
So for a 32-bit integer, you're always fine - and for a 64-bit integer, you're fine if you're running on a 64-bit CLR... assuming your variables are aligned, which they are normally.
However, you've also got boxing involved - I don't think that will actually cause any problems here, but it does mean you're dealing with the visibility of multiple writes: one for the number variable itself, and one for the data within the box. With the .NET implementation, I believe that's still safe due to the stronger memory guarantees it provides - I wouldn't like to absolutely guarantee that it's safe within the ECMA memory model.
Finally, there's the matter of whether a write is visible or not, which is beyond atomicity. If thread T1 changes an int value from 0 to 100, then thread T2 reads it, atomicity guarantees that T2 will see either 0 or 100, never some other bit pattern - but there usually has to be some kind of memory barrier involved to guarantee that T2 will actually see the new value instead of a stale value. This is a really complex area - if you want to know more, I suggest you start with Joe Duffy's 2007 blog post and work from there.
Note that min, max and number will be on the heap anyway, as they've been captured by your lambda expressions... although the stack/heap is an implementation detail.
It depends on what the variable is (and on the processor, etc.), but generally: yes, dirty reads are possible.
I am not familiar with the details of your specific CPU, but in general it depends on whether the read/write is atomic, which in turn depends on the architecture and on how the variable is stored.
If the variable is larger than the CPU word size, access to it may not be atomic.
A modern CPU may guarantee atomic access to an aligned memory address; it needs more consideration if there is no hardware support for misaligned memory access. If misaligned access is handled in software, the read or write will not be atomic: one load/store may actually turn into two operations. One example is PowerPC/Linux, where the kernel handles misaligned memory access in an exception handler.
Whether this is thread safe depends on the data type of the variable you want to read.
Reads and writes of some types, such as int and byte, are atomic, so you can read and write them from multiple threads; larger types such as long are not guaranteed to be atomic on a 32-bit runtime (see the answer above and the links below).
you can find more information here
http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx
Are primitive data types in c# atomic (thread safe)?
What is thread safe (C#) ? (Strings, arrays, ... ?)

memory allocation inside a CUDA kernel

I have the following snippet of a kernel:
__global__ void plain(int* geneVec, float* probs, int* nComponents, float* randomNumbers, int *nGenes)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);

    float* currentProbs = (float*)malloc(sizeof(float)*tmp);
    .....
    .....
    currentProbs[0] = probs[start];
    for (k = 1; k < nComponents[0]; k++)
    {
        currentProbs[k] = currentProbs[k-1] + prob;
    }
    ...
    ...
    free(currentProbs);
}
When it's static (even with the same sizes) it's very fast, but when currentProbs is dynamically allocated (as above) the performance is awful.
This question said I could do this inside a kernel: CUDA allocate memory in __device__ function
Here is a related question: Efficiency of Malloc function in CUDA
I was wondering whether any other methods have solved this, besides the one proposed in the paper.
It seems ridiculous that one cannot malloc/free inside a kernel without this sort of penalty.
I think the reason introducing malloc() slows your code down is that it allocates memory in global memory. When you use a fixed size array, the compiler is likely to put it in the register file, which is much faster.
Having to do a malloc inside your kernel may mean that you're trying to do too much work with a single kernel. If each thread allocates a different amount of memory, then each thread runs a different number of times in the for loop, and you get lots of warp divergence.
If each thread in a warp runs its loop the same number of times, just allocate up front. Even if they run a different number of times, you can use a constant size. But instead, I think you should look at how you can refactor your code to remove that loop from your kernel entirely.
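One common way to "allocate up front" is to give every thread a fixed-size slice of a single buffer allocated once from the host. A rough sketch (numThreads and maxComponents are assumed upper bounds, not names from your code):

#include <cuda_runtime.h>
#include <cstddef>

// Allocate one scratch buffer for all threads, sized for the worst case.
static float* allocate_scratch(std::size_t numThreads, std::size_t maxComponents)
{
    float* d_scratch = nullptr;
    cudaMalloc((void**)&d_scratch, numThreads * maxComponents * sizeof(float));
    return d_scratch;
}

// Inside the kernel, each thread would then use its own slice, e.g.
//     float* currentProbs = d_scratch + xid * maxComponents;
// instead of calling malloc()/free() per thread.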

What kind of optimizations can be achieved on a traditional single-threaded game engine like ioquake3 with OpenMP?

Can you think of ways to achieve a significant improvement on a traditional engine like id tech 3? When I attempted it on the audio subsystem, I noticed it caused a slowdown rather than a speedup. I suspect it would need big chunks of data to be calculated in loops that only rarely communicate with the core.
I don't know anything about ioquake3 or id tech 3, but I do know a fair bit about OpenMP, so I'll fire the question right back at you.
OpenMP was, initially, developed to distribute loop iterations over large arrays across processors with access to shared memory. This is a requirement in a large fraction of scientific and engineering programs so it will be no surprise that OpenMP is much used for such programs.
More recently, with OpenMP 3.0, it has good facilities for director/worker task decomposition which extend its range of application. I don't have a lot of experience with these new features, but they look promising.
So the question for you is: how well does your computational core fit the model of computation that OpenMP supports ?
OpenMP is very effective when operating on data that doesn't depend on other elements of the loop. For example:
std::vector<int> big_vector(1000, 0);
for (int i = 0; i < big_vector.size(); ++i)
{
    big_vector[i] = i;
}
would optimize well with OpenMP (a sketch with the pragma applied follows these two examples), but
std::vector<int> big_vector(1000, 0);
for (int i = 1; i < big_vector.size(); ++i)
{
    big_vector[i] = i * (big_vector[i - 1] + i);
}
would not.
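For concreteness, here is the first (independent) loop with the pragma actually applied; a minimal sketch, built with -fopenmp or your compiler's equivalent:

#include <omp.h>
#include <vector>

int main()
{
    std::vector<int> big_vector(1000, 0);

    // Each iteration writes only its own element, so the iterations can be
    // divided among threads without any synchronisation.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(big_vector.size()); ++i)
    {
        big_vector[i] = i;
    }
    return 0;
}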
You can also play around with the OpenMP settings to see if they improve your results. For more information, http://www.amazon.com/Multi-Threaded-Engine-Design-Jonathan-Harbour/dp/1435454170 has a whole chapter on OpenMP (as well as boost, posix-threads, and Windows Threads).
