I've got a C/C++ project that uses a static library built for the 'skylake' architecture. The project is a data processing module, i.e. it performs many arithmetic operations, memory copies, searches, comparisons, etc.
The CPU is a Xeon Gold 6130T, which supports AVX-512. I tried compiling the project with both -march=skylake and -march=skylake-avx512 and then linking with the library.
With -march=skylake-avx512, performance decreases significantly (by 30% on average) compared to the build with -march=skylake.
How can this be explained? What could be the reason?
Info:
Linux 3.10
gcc 9.2
Intel Xeon Gold 6130T
project performance is significantly decreased (by 30% on average)
In code that cannot be easily vectorized, sporadic AVX instructions here and there downclock your CPU but do not provide any benefit. You may want to turn off AVX-512 instructions completely in such scenarios.
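If the regression comes from 512-bit instructions, two GCC options are worth trying (a sketch, not a guaranteed fix; verify with your own benchmark): keep the AVX-512 feature set but prefer 256-bit vectors, or disable AVX-512 code generation outright.

```
# keep -march=skylake-avx512 but have GCC auto-vectorize with 256-bit vectors,
# avoiding the 512-bit "heavy" instructions that trigger the deepest downclock:
gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 ...

# or disable AVX-512 code generation entirely while keeping AVX2:
gcc -O3 -march=skylake-avx512 -mno-avx512f ...
```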
See Advanced Vector Extensions, Downclocking:
Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:
L0 (100%): The normal turbo boost limit.
L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.
The frequency transition can be soft or hard. Hard transition means the frequency is reduced as soon as such an instruction is spotted; soft transition means that the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.
Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty despite it being faster in a "pure" context. Avoiding the use of wide and heavy instructions helps minimize the impact in these cases. AVX-512VL is an example of using only 256-bit operands in AVX-512, making it a sensible default for mixed loads.
Also, see
On the dangers of Intel's frequency scaling.
Gathering Intel on Intel AVX-512 Transitions.
How to Fix Intel?
Related
I am very new to this field and my question might be too basic, but please help me understand the fundamentals here.
I want to know the instructions per cycle (IPC) or cycles per instruction (CPI) of recent Intel processors such as Skylake or Cascade Lake. I am also looking for these values when different numbers of physical cores are used and when hyper-threading is used.
I thought the SPEC CPU2017 benchmark results could help me here, but I could not find my answer there. They just compare the total execution time against the time taken by some reference machine and give the ratio. Am I missing something here?
I thought this would be one of the very first performance parameters to be calculated and published by some standard benchmark, but I could not find any.
Another related question that comes to mind (and which I think everybody might want answered) is: what is the best these processors can achieve using all cores and threads (lowest CPI and highest IPC)?
Please help me find the IPC / CPI value of Skylake (or any Intel processor) when, say, the maximum (28) cores are used and hyper-threading is also enabled.
The IPC cost of hyperthreading (or SMT in general on non-Intel CPUs) totally depends on the workload.
If you're already bottlenecked on branch mispredicts, cache misses, or long dependency chains (low ILP), having 2 threads running on the same core leads to minimal interference.
(Partitioning the ROB reduces the ability to find ILP in either thread, though, so again it depends on the details.)
Competitive sharing of uop cache and L1d/L1i / L2 caches also might or might not be a problem, depending on cache footprint.
There is no general answer independent of workload
Some workloads get a major speedup from using HT to double the number of logical cores. Some high-ILP workloads actually do worse because of cache conflicts. (Workloads that can already come close to saturating the front-end at 4 uops per clock on Intel before Icelake, for example).
Agner Fog's microarch guide says a bit about this for some microarchitectures that support hyperthreading. https://agner.org/optimize/
IIRC, some AMD CPUs have higher front-end throughput with hyperthreading, but I think only Bulldozer-family.
Max throughput is not affected by HT, and each core is independent. e.g. 4 uops per clock for a Skylake core. Doubling the number of physical cores always doubles theoretical uops / clock. Obviously not all workloads parallelize efficiently, so running more threads might need more total instructions/uops, and/or create more memory stalls for communication.
Hyperthreading just helps you come closer to that more of the time by letting 2 threads fill each other's "bubbles" from stalls.
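There is also no table of IPC numbers to look up, because achieved IPC is a property of a workload, not of the CPU alone; you measure it. A sketch using Linux perf (./my_app is a hypothetical binary; exact output formatting varies by perf version):

```
# count retired instructions and core cycles; perf reports an "insn per cycle" line
perf stat -e cycles,instructions ./my_app

# repeat with different thread counts / pinning to see how IPC scales,
# e.g. pin to one core, then to two logical CPUs (check lscpu for which
# logical CPUs are hyperthread siblings on your machine):
taskset -c 0 ./my_app
taskset -c 0,1 ./my_app
```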
I tried to compute the sum of all elements in a large matrix. Here are the test cases:
(1) MT and AVX: 37 s
(2) MT and no AVX: 40 s
(3) AVX and no MT: 49 s
(4) Neither AVX nor MT: 105 s
In all cases, the CPU clock is fixed at 3.0 GHz (as reported by cpufreq-info):
current policy: frequency should be within 1.60 GHz and 3.40 GHz.
The governor "userspace" may decide which speed to use
within this range.
current CPU frequency is 3.00 GHz.
The matrix has 25,000,000 elements of type double, all with value 1.0, and the sum is computed 4096 times in a loop. Without AVX, the speedup from MT is 2.6; with AVX it is only 1.3. When running MT, the matrix is divided into 4 blocks, one per thread. If I reduce the CPU frequency, the MT improvement is larger for the AVX build, so there might also be some cache-miss effect, but that cannot explain the difference between (4)/(2) and (3)/(1). Do AVX and MT compete with each other in some way? The chip is an i5-3570K.
It's quite possible that your baseline performance was bounded by execution latency, but either form of parallelization (MT or vectorization) allowed you to break that and reach the next bottleneck, which is the memory bandwidth of your CPU.
Check the peak bandwidth your CPU can reach and compare it with your data: it looks like you're simply saturating at about 20.5 GB/s (25,000,000 elements × 4096 loops × 8 bytes per double / ~40 seconds). That seems a little low, as this link says it should reach 25 GB/s, but it's in the same ballpark, so the gap could be due to other inefficiencies: the type of DDR, other apps / the OS working in the background, frequency scaling done by the CPU to save power / reduce heat, etc.
You could also try running some memory benchmarks (lmbench, sandra, ..) and see if they do better under the same environment.
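To illustrate the execution-latency point: with a single accumulator every add depends on the previous one, so the loop runs at FP-add latency rather than throughput; splitting the sum across independent accumulators (which is roughly what both vectorization and extra threads do) removes that limit and exposes the memory-bandwidth ceiling described above. A minimal sketch (not the asker's code, which isn't shown):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// One accumulator: each add depends on the previous one, so the loop is
// limited by FP-add latency (a few cycles per element).
double sum_single(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four independent accumulators: the adds can overlap in the pipeline, so the
// loop can approach load/FP-add throughput - until memory bandwidth becomes
// the next bottleneck.
double sum_unrolled(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0, n = v.size();
    for (; i + 4 <= n; i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}

int main() {
    std::vector<double> m(25000000, 1.0);
    std::printf("%f %f\n", sum_single(m), sum_unrolled(m));
}
```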
AVX should not compete with MT; they are two different things. Although the summation idea is simple, depending on your implementation you can get very different numbers. I suggest you use the STREAM benchmark to test performance, as it is the standard. I can't see your code, but there are some issues:
You are initializing the matrix with 1.0 for all the elements. I think that is not a good idea. You should use random numbers, or at least initialize based on the index (e.g. (i%10)/10.0).
How do you measure time? You should place your timers outside the repetition loop and take the average over the number of repetitions. Also, do you use accurate timers?
Did you make sure that your code is actually vectorized? Did you enable any compiler flags to display this information? Did you make sure that the AVX version of your code is used? Maybe the compiler chose the scalar version.
You mentioned that the frequency is fixed; are you sure that turbo mode is not enabled at any point?
What about thread affinity when measuring with MT?
According to the wikipedia article on memory segmentation, x86 processors do segmentation bounds-checking in hardware. Are there any systems that do the bounds-checking in software? If so, what kind of overhead does that incur? In the hardware implementations, is there any way to skip the bounds checking to avoid the penalty (if there is a penalty)?
All modern languages do bounds checking in software, on top of segment bounds checking and memory map lookups. One benchmark suggests the overhead is about 0.5%. This is a small price to pay for stability and security.
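As a minimal illustration of what software bounds checking looks like in C++ (the ~0.5% figure above comes from the answer's cited benchmark, not from this snippet): std::vector::at() performs a checked access, while operator[] does not.

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v(16, 7);

    // Unchecked access: compiles down to a plain load, no bounds test.
    int a = v[3];

    // Checked access: at() compares the index against size() and throws
    // std::out_of_range on failure - that compare-and-branch is the
    // per-access cost of software bounds checking.
    try {
        int b = v.at(3);
        int c = v.at(42);   // out of bounds: throws
        std::cout << a + b + c << '\n';
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << '\n';
    }
}
```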
A 486 could load a memory location in a single cycle, and CPUs have only gotten better at doing on-chip processing, so it's unlikely that segmentation bounds checking has any overhead at all.
Still, though, you can simply run in 64-bit mode: "The processor does not perform segment limit checks at runtime in 64-bit mode" (Intel's developer manual, 3.2.4).
I recently upgraded from a GTX 480 to a GTX 680 in the hope that the tripled number of cores would manifest as significant performance gains in my CUDA code. To my horror, I've discovered that my memory-intensive CUDA kernels run 30%-50% slower on the GTX 680.
I realize that this is not strictly a programming question, but it does directly impact the performance of CUDA kernels on different devices. Can anyone provide some insight into the specifications of CUDA devices and how they can be used to deduce their performance on CUDA C kernels?
Not exactly an answer to your question, but some information that might be of help in understanding the performance of the GK104 (Kepler, GTX680) vs. the GF110 (Fermi, GTX580):
On Fermi, the cores run on double the frequency of the rest of the logic. On Kepler, they run at the same frequency. That effectively halves the number of cores on Kepler if one wants to do more of an apples to apples comparison to Fermi. So that leaves the GK104 (Kepler) with 1536 / 2 = 768 "Fermi equivalent cores", which is only 50% more than the 512 cores on the GF110 (Fermi).
Looking at the transistor counts, the GF110 has 3 billion transistors while the GK104 has 3.5 billion. So, even though the Kepler has 3 times as many cores, it only has slightly more transistors. So now, not only does the Kepler have only 50% more "Fermi equivalent cores" than Fermi, but each of those cores must be much simpler than the ones of Fermi.
So, those two issues probably explain why many projects see a slowdown when porting to Kepler.
Further, the GK104, being a version of Kepler made for graphics cards, has been tuned in such a way that cooperation between threads is slower than on Fermi (as such cooperation is not as important for graphics). Any potential performance gain, after taking the above facts into account, may be negated by this.
There is also the issue of double precision floating point performance. The version of GF110 used in Tesla cards can do double precision floating point at 1/2 the performance of single precision. When the chip is used in graphics cards, the double precision performance is artificially limited to 1/8 of single precision performance, but this is still much better than the 1/24 double precision performance of GK104.
One of the advances of the new Kepler architecture is its 1536 cores grouped into 8 192-core SMXes, but at the same time this number of cores is a big problem, because shared memory is still limited to 48 KB. So if your application needs a lot of SMX resources, you can't execute 4 warps in parallel on a single SMX. You can profile your code to find the real occupancy of your GPU. Possible ways to improve your application:
use warp vote functions instead of shared-memory communication (see the sketch after this list);
increase the number of thread blocks and decrease the number of threads per block;
optimize global loads/stores; Kepler has 32 load/store units per SMX (twice as many as on Fermi).
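The first item mentions warp vote functions; the closely related warp shuffle intrinsics (also introduced with Kepler) are the usual way to replace shared-memory communication within a warp. A minimal sketch, using the CUDA 9+ _sync variants (older toolkits used __shfl_down without the mask argument); the kernel and its data layout are illustrative, not the asker's code:

```cpp
// Sum a value across the 32 lanes of a warp without touching shared memory.
__device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the active lanes into the lower half.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 ends up with the warp-wide sum
}

__global__ void blockSums(const float* in, float* out) {
    float v = in[blockIdx.x * blockDim.x + threadIdx.x];
    float s = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)        // one atomic per warp instead of per thread
        atomicAdd(&out[blockIdx.x], s);
}
```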
I installed nView and use Coolbits 2.0 to unlock the shader cores from default to max performance. Also, you must have both connectors of your device going to one display, which can be enabled in the nVidia control panel as screen 1/2 and screen 2/2. Now you must clone this screen with the other, and in the Windows resolution config set the screen mode to extended desktop.
With nVidia Inspector 1.9 (BIOS-level drivers), you can activate this mode by setting up a profile for the application (you need to add the application's exe file to the profile). Now you have almost double the performance (keep an eye on the temperature).
DX11 also features tessellation, so you want to override that and scale your native resolution.
Your native resolution can be achieved by rendering at a lower one like 960x540 and letting the 3D pipelines do the rest to scale up to full HD (in the nv control panel, desktop size and position). Now scale the lower resolution to full screen with the display, and you have full HD with double the amount of texture size rendering on the fly, and everything should be good for rendering 3D textures with extreme LOD bias (level of detail). Your display needs to be on auto zoom!
Also, you can beat SLI-configured computers. This way I get higher scores than 3-way SLI in TessMark. High AA settings like 32x mixed sample make it all look like HD in AAA quality (in TessMark and Heaven benchmarks). There is no resolution setting in the end score, so that shows it's not important that you render at your native resolution!
This should give you some real results, so please read it thoughtfully, not literally.
I think the problem may lie in the number of Streaming Multiprocessors: The GTX 480 has 15 SMs, the GTX 680 only 8.
The number of SMs is important, since at most 8/16 blocks or 1536/2048 threads (compute capability 2.0/3.0) can reside on a single SM. The resources they share, e.g. shared memory and registers, can further limit the number of blocks per SM. Also, the higher number of cores per SM on the GTX 680 can only reasonably be exploited using instruction-level parallelism, i.e. by pipelining several independent operations.
To find out the number of blocks you can run concurrently per SM, you can use nVidia's CUDA Occupancy Calculator spreadsheet. To see the amount of shared memory and registers required by your kernel, add -Xptxas -v to the nvcc command line when compiling.
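For reference, newer CUDA toolkits also expose the occupancy calculation as a runtime API, so you can query the resident-blocks-per-SM number directly instead of using the spreadsheet. A sketch with a placeholder kernel (myKernel and the block size are assumptions, not the asker's code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the asker's kernel.
__global__ void myKernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    int blockSize = 256;       // threads per block you intend to launch with
    int blocksPerSM = 0;
    // Reports how many blocks of myKernel can be resident on one SM,
    // given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, /*dynamicSMem=*/0);
    std::printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```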
I have implemented a few normal looping applications in OpenMP, TBB and OpenCL. In all these applications, OpenCL gives far better performance than the others, even when I am only running it on the CPU with no specific optimizations done in the kernels. OpenMP and TBB give good performance too, but far less than OpenCL. What could be the reason, given that both are CPU-specialized frameworks and should give performance at least equal to OpenCL's?
My second concern is that when it comes to OpenMP and TBB, OpenMP is always better in performance than TBB in my implementations, which I haven't tuned very well, as I am not an expert. Is there a reason that OpenMP is normally better in performance than TBB? Because I think both of them, and even OpenCL, use the same kind of thread pooling at a low level... Any expert opinions? Thanks
One advantage that OpenCL has over TBB and OpenMP is that it can take better advantage of SIMD parallelism in your hardware. Some OpenCL implementations will run your code such that each work item runs in a SIMD vector lane of the machine, as well as running on separate cores. Depending on the algorithm, this could provide lots of performance benefits.
C compilers can also take some advantage of SIMD parallelism using auto-vectorization, but the memory aliasing rules in C make it hard for this to work in some cases. Since OpenCL requires programmers to call out the work items and fence memory accesses explicitly, an OpenCL compiler can be more aggressive.
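To illustrate the aliasing point, a hypothetical kernel (the names are mine, not from the question): without a no-aliasing promise the compiler must assume the destination could overlap the inputs, which can block or pessimize auto-vectorization.

```cpp
#include <cstddef>

// Without the restrict qualifiers the compiler must assume dst could alias
// a or b and may refuse to vectorize. __restrict__ (a GCC/Clang extension;
// plain 'restrict' in C99) promises no aliasing, so it can emit SIMD code.
void add_noalias(float* __restrict__ dst,
                 const float* __restrict__ a,
                 const float* __restrict__ b,
                 std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```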
In the end, it depends on your code. One could find an algorithm for which any of OpenCL, OpenMP, or TBB is best.
The OpenCL runtime for CPU and MIC provided by Intel uses TBB under the hood. It's far from just 'thread pooling at a low level', since it takes advantage of sophisticated scheduling and partitioning algorithms provided by TBB for better load balance and thus better utilization of CPUs.
As for TBB vs. OpenMP: usually it comes down to incorrect measurements. For example, TBB has no implicit barrier like OpenMP does, so a warm-up loop is not enough; you have to make sure all the threads are created and that this overhead is not included in your measurements. Another example: sometimes compilers are not able to vectorize the same code with TBB that they can vectorize with OpenMP.
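A minimal sketch of the warm-up point (a hypothetical vector-add workload, not the asker's code; link with -ltbb): run the parallel region once before timing so that worker-thread creation is not charged to the measured pass.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include <tbb/parallel_for.h>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Warm-up: the first parallel_for also pays for creating the worker
    // threads, so run the region once before starting the timer.
    tbb::parallel_for(std::size_t(0), n, [&](std::size_t i) { c[i] = a[i] + b[i]; });

    auto t0 = std::chrono::steady_clock::now();
    tbb::parallel_for(std::size_t(0), n, [&](std::size_t i) { c[i] = a[i] + b[i]; });
    auto t1 = std::chrono::steady_clock::now();

    std::printf("timed pass: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```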
OpenCL kernels are compiled for the given hardware. The potential for vendor/hardware specific optimisations is huge.