AVX float4/double4 struct

I am looking for AVX (256-bit) or AVX-512 code for a float4 / double4 struct that overloads the basic operations (*, +, /, -, scaling by a scalar, etc.) to get a quick performance boost from vector operations in code written using float4/double4. OpenCL has these data types built in, but C++ code running on the Xeon Phi needs new implementations that take advantage of the 512-bit SIMD units.

What you are seeking is Agner Fog's Vector Class Library (VCL). I have used it mostly to replace the vector types in OpenCL.
With the VCL, float4 is Vec4f and double4 is Vec4d. As with OpenCL, you don't need to worry about AVX vs. AVX-512: if you use Vec8d and compile for AVX, it will emulate AVX-512 using two AVX registers.
The VCL has all the operations you want, such as *, +, /, -, +=, -=, /=, *=, multiplication and division by a scalar, and many more features.
The main difference between OpenCL and the VCL is that OpenCL essentially creates a CPU dispatcher for you, whereas with the VCL you have to write a CPU dispatcher yourself (it includes example code and documentation for doing this). The VCL has optimized functions for SSE2 through AVX-512, so you can target several different instruction sets. There is even a special version of the VCL for the Knights Corner Xeon Phi.
Another feature from OpenCL that I miss is the syntax for permuting. In OpenCL, to reverse the order of the components of a float4 you can write v.wzyx, whereas with the VCL you would write permute4f<3,2,1,0>(v). It might be possible to recreate this syntax in C++, but I am not sure.
Using the VCL, OpenMP, and a custom CPU dispatcher, I have largely replaced OpenCL on the CPU.
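For concreteness, here is a minimal sketch of how the VCL maps onto the OpenCL-style types (assuming the VCL version 1 headers, where the float permute is spelled permute4f as above; in VCL version 2 it is permute4):
#include "vectorclass.h"

typedef Vec4f float4;   // four floats in one SSE/AVX register
typedef Vec4d double4;  // four doubles in one AVX register

int main() {
    float4 a(1.0f, 2.0f, 3.0f, 4.0f);
    float4 b(5.0f, 6.0f, 7.0f, 8.0f);

    float4 c = a * b + b - a;             // element-wise arithmetic
    c *= 0.5f;                            // scale by a scalar
    float4 r = permute4f<3, 2, 1, 0>(c);  // reverse components, like v.wzyx

    float out[4];
    r.store(out);                         // write the result back to memory
    return 0;
}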

Related

Enabling AVX512 support on compilation significantly decreases performance

I've got a C/C++ project that uses a static library. The library is built for the 'skylake' architecture. The project is a data processing module, i.e. it performs many arithmetic operations, memory copies, searches, comparisons, etc.
The CPU is a Xeon Gold 6130T; it supports AVX-512. I tried to compile my project with both -march=skylake and -march=skylake-avx512 and then link it with the library.
When using -march=skylake-avx512, the project's performance is significantly decreased (by 30% on average) compared to the build with -march=skylake.
How can this be explained? What could be the reason?
Info:
Linux 3.10
gcc 9.2
Intel Xeon Gold 6130T
In code that cannot be easily vectorized, sporadic AVX instructions here and there downclock your CPU but do not provide any benefit. You may want to turn off AVX instructions completely in such scenarios.
See Advanced Vector Extensions, Downclocking:
Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:
L0 (100%): The normal turbo boost limit.
L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.
The frequency transition can be soft or hard. Hard transition means the frequency is reduced as soon as such an instruction is spotted; soft transition means that the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.
Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty despite it being faster in a "pure" context. Avoiding wide and heavy instructions helps minimize the impact in these cases. AVX-512VL is an example of using only 256-bit operands in AVX-512, making it a sensible default for mixed loads.
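One practical knob here (my addition; the original poster did not report testing this): GCC 8+ (including the gcc 9.2 listed above) and Clang accept -mprefer-vector-width=256, which keeps the AVX-512 instruction set available but tells the auto-vectorizer to prefer 256-bit operands, avoiding the heavy 512-bit instructions that trigger the L1/L2 limits:
# Keep AVX-512VL available but prefer 256-bit vectors (lighter license levels):
g++ -O3 -march=skylake-avx512 -mprefer-vector-width=256 -o app app.cpp

# Or drop AVX-512 entirely and stay on the plain skylake baseline:
g++ -O3 -march=skylake -o app app.cpp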
Also, see
On the dangers of Intel's frequency scaling.
Gathering Intel on Intel AVX-512 Transitions.
How to Fix Intel?

using tbb atomic operations in Xeon Phi

I'm using the TBB compare_and_swap operation on a Xeon Phi in a lock-free algorithm. Since the Xeon Phi is an in-order machine, it doesn't support the sfence instruction. So will the atomic operations work correctly on a Xeon Phi?
Yes, they definitely work correctly; most of TBB itself is built on atomic operations. And sfence is not required for atomic operations to work correctly: it's a standalone memory barrier, while atomic operations imply memory barriers themselves. TBB doesn't use sfence even on regular Xeons; it uses mfence for a full memory fence instead. On the Xeon Phi, that is substituted with a no-op atomic operation; e.g., TBB's mic_common.h contains the following definitions:
/** Intel(R) Many Integrated Core Architecture does not support mfence and pause instructions **/
#define __TBB_full_memory_fence() __asm__ __volatile__("lock; addl $0,(%%rsp)":::"memory")
#define __TBB_Pause(x) _mm_delay_32(16*(x))
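For reference, a minimal sketch of the compare_and_swap usage the question is about, using the classic (pre-oneTBB) tbb::atomic interface that Xeon Phi-era TBB shipped; the names here are illustrative:
#include "tbb/atomic.h"

tbb::atomic<int> head;

// compare_and_swap(value, comparand) stores 'value' only if the current
// contents equal 'comparand', and returns the value observed beforehand,
// so the CAS succeeded iff the return value equals 'expected'.
bool try_update(int expected, int desired) {
    return head.compare_and_swap(desired, expected) == expected;
}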

Using Xeon Phi with only threads

Is it possible to use a Xeon Phi by just launching many threads,
or is a special type of programming required to make use of it?
Intel has some fairly good math libraries, IPP and MKL. Reading between the lines of what the Xeon Phi seems to be, I imagine Intel has versions of those libraries that exploit the very wide SIMD unit that has become part of the architecture.
Intel's compiler can also spawn multiple threads to execute for loops in parallel instead of in sequence. That would be one way of exploiting the large number of cores the Phi has.
So it could be that with the right compiler and libraries, programming for the Phi could be fairly normal, until you start needing routines the libraries haven't got.
You can read these documents for more information on how to tap the many available threads on the Xeon Phi:
http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
http://software.intel.com/en-us/articles/choosing-the-right-threading-framework
and more on http://software.intel.com/en-us/mic-developer
To summarize, either manage threads manually (via TBB, pthreads, etc.), or use one of the supported parallel programming models (a minimal OpenMP sketch follows this list):
OpenMP
MPI
Cilk Plus
OpenCL
OpenACC
Or use libraries that can automatically offload to the device, such as MKL or ArrayFire.
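As a concrete illustration of the "just launch many threads" style (my sketch, not from the original answer), a single OpenMP loop is enough to fan work out across all hardware threads, whether the code runs natively on the card or on the host:
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // One parallel loop spreads iterations across every hardware thread
    // (240+ of them on a 60-core Knights Corner card).
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("max threads: %d\n", omp_get_max_threads());
    return 0;
}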

OpenCL, TBB, OpenMP

I have implemented a few normal looping applications in OpenMP, TBB and OpenCL. In all these applications OpenCL gives far better performance than the others, even when I am only running it on the CPU with no kernel-specific optimizations. OpenMP and TBB give good performance too, but far less than OpenCL. What could be the reason, given that both are CPU-specialized frameworks and should give performance at least equal to OpenCL's?
My second concern is that when it comes to OpenMP and TBB, OpenMP always performs better than TBB in my implementations, which I haven't tuned very carefully since I am not an expert. Is there a reason OpenMP is normally faster than TBB? I think both of them, and even OpenCL, use the same kind of thread pooling at a low level... Any expert opinions? Thanks.
One advantage that OpenCL has over TBB and OpenMP is that it can take better advantage of SIMD parallelism in your hardware. Some OpenCL implementations will run your code such that each work item runs in a SIMD vector lane of the machine, as well as running on separate cores. Depending on the algorithm, this could provide lots of performance benefits.
C compilers can take some advantage of SIMD parallelism too, using auto-vectorization, but the memory aliasing rules in C make it hard for this to work in some cases. Since OpenCL requires programmers to call out the work items and fence memory accesses explicitly, an OpenCL compiler can be more aggressive.
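To make the aliasing point concrete (my example, not the answerer's): without a no-overlap guarantee the compiler must assume the output may alias the inputs and can refuse to vectorize; the non-standard but widely supported __restrict qualifier hands it the same guarantee that OpenCL work items provide by construction:
// Without __restrict the compiler must assume 'out' may alias 'a' or 'b',
// forcing it to preserve element-by-element ordering instead of vectorizing.
void add(float* __restrict out,
         const float* __restrict a,
         const float* __restrict b, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];   // now freely auto-vectorizable
}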
In the end, it depends on your code. One could find an algorithm for which any of OpenCL, OpenMP, or TBB are best.
The OpenCL runtime for the CPU and MIC provided by Intel uses TBB under the hood. It's far from just 'thread pooling at a low level', since it takes advantage of the sophisticated scheduling and partitioning algorithms provided by TBB for better load balance and thus better utilization of the CPUs.
As for TBB vs. OpenMP: usually it comes down to incorrect measurements. For example, TBB has no implicit barrier like OpenMP's, so a warm-up loop is not enough; you have to make sure all the threads are created before timing starts and that this creation overhead is not included in your measurements. Another example: sometimes compilers are unable to vectorize code under TBB that they do vectorize under OpenMP.
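A sketch of the kind of warm-up this implies (my illustration; the workload and timing scheme are not from the original answer):
#include "tbb/parallel_for.h"
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 22;
    std::vector<float> v(n, 1.0f);
    auto work = [&] {
        tbb::parallel_for(0, n, [&](int i) { v[i] = v[i] * 2.0f + 1.0f; });
    };

    work();  // warm-up: give TBB a chance to create its worker threads,
             // since there is no implicit barrier to absorb that cost

    auto t0 = std::chrono::steady_clock::now();
    work();  // the run you actually measure
    auto t1 = std::chrono::steady_clock::now();
    std::printf("measured: %lld us\n", (long long)std::chrono::duration_cast<
        std::chrono::microseconds>(t1 - t0).count());
    return 0;
}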
OpenCL kernels are compiled for the given hardware. The potential for vendor/hardware specific optimisations is huge.

memory fences/barriers in C++: does boost or other libraries have them?

I have been reading these days about memory fences and barriers as a way to synchronize multithreaded code and avoid code reordering.
I usually develop in C++ under Linux and use the Boost libraries heavily, but I am not able to find any class related to this. Do you know whether memory barriers or fences are present in Boost, or whether there is a way to achieve the same concept? If not, what good library should I look at?
There are no low-level memory barriers in Boost yet, but there is a proposed Boost.Atomic library that provides them.
Compilers provide their own, either as intrinsics or as library functions, such as gcc's __sync_synchronize() or _mm_mfence() for Visual Studio.
The C++0x library provides atomic operations, including memory fences in the form of std::atomic_thread_fence. Though gcc has supplied various forms of the C++0x atomics since v4.4, neither v4.4 nor v4.5 includes this form of fence. My (commercial) just::thread library provides a full implementation of the C++0x atomics, including fences, for g++ 4.3 and 4.4 and Microsoft Visual Studio 2005, 2008 and 2010.
The place where memory barriers are required is when you avoid using kernel synchronisation mechanisms in an SMP environment, usually for performance reasons.
There is an implicit memory barrier in any kernel synchronisation operation (e.g. signalling semaphores, locking and unlocking mutexes) and in context switching, to guard against data coherence hazards.
I have just found myself needing (moderately) portable memory barrier implementations (ARM and x86), and found the Linux source tree to be the best reference for this. Linux has SMP variants of the mb(), rmb() and wmb() macros, which on some platforms result in more specific (and possibly less costly) barriers than the non-SMP variants.
This doesn't appear to be a concern on x86, though, and particularly not on ARM, where both are implemented the same way.
This is what I've cribbed together from the Linux header files (suitable for ARMv7 and non-ancient x86/x64 processors):
#if defined(__i386__) || defined(__x86_64__)
#define smp_mb()  asm volatile("mfence" ::: "memory")
#define smp_rmb() asm volatile("lfence" ::: "memory")
#define smp_wmb() asm volatile("sfence" ::: "memory")
#endif
#if defined(__arm__)
#define dmb() __asm__ __volatile__ ("dmb" : : : "memory")
#define smp_mb() dmb()
#define smp_rmb() dmb()
#define smp_wmb() dmb()
#endif
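A sketch of how these macros are typically paired (my illustration): the producer publishes data before setting a flag, and the consumer checks the flag before touching the data:
// Hypothetical flag-based handoff using the macros above.
int payload;             // data being handed off
volatile int ready = 0;  // publication flag

void producer(int value) {
    payload = value;     // 1. write the data
    smp_wmb();           // 2. order the data store before the flag store
    ready = 1;           // 3. publish
}

int consumer() {
    while (!ready) { }   // 1. spin until published
    smp_rmb();           // 2. order the flag load before the data load
    return payload;      // 3. safe to read
}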
Naturally, dabbling with memory barriers carries the attendant risk that the resulting code is practically impossible to test, and any resulting bugs will be obscure and difficult-to-reproduce race conditions :/
There is, incidentally, a very good description of memory barriers in the Linux kernel documentation.
There is a boost::barrier class/concept but it's a bit high level. Speaking of which, why do you need a low level barrier? Synchronization primitives should be enough, shouldn't they? And they should be using memory barriers where necessary, either directly or indirectly through other, lower level primitives.
If you still think you need a low-level implementation, I know of no classes or libraries that implement barriers, but there's some implementation-specific code in the Linux kernel. Search for mb(), rmb() or wmb() in include/asm-{arch}/system.h.
Now, in 2019, the C++11 fences should be available on nearly every implementation of the C++ standard library. The header is <atomic>.
You can issue a fence by calling std::atomic_thread_fence. In short:
std::atomic_thread_fence(std::memory_order_release); guarantees that no store operations are moved past the call. (All side effects will be made visible to other threads.)
std::atomic_thread_fence(std::memory_order_acquire); guarantees that no load operations are moved before the call. (All side effects of other threads will be made visible.)
There are more details in the documentation.
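A minimal sketch of the release/acquire pairing described above (the flag and payload names are mine):
#include <atomic>

int payload = 0;                 // plain, non-atomic data to hand off
std::atomic<bool> ready{false};  // publication flag

void producer() {
    payload = 42;                                         // write the data
    std::atomic_thread_fence(std::memory_order_release);  // stores above cannot sink below
    ready.store(true, std::memory_order_relaxed);         // publish
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { }    // wait for the flag
    std::atomic_thread_fence(std::memory_order_acquire);  // loads below cannot hoist above
    // payload == 42 is now guaranteed to be visible here
}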
