MSVC /arch:[instruction set] - SSE3, AVX, AVX2 - visual-c++

Here is an example of a class which shows supported instruction sets. https://msdn.microsoft.com/en-us/library/hskdteyh.aspx
I want to write three different implementations of a single function, each using a different instruction set. But with the /ARCH:AVX2 flag, for example, the app won't run anywhere except on 4th-generation or later Intel processors, which makes the check pointless.
So, the question is: what exactly does this flag do? Does it enable support for the given instruction sets, or does it enable compiler optimizations that use them?
In other words, can I remove this flag completely and keep using functions from immintrin.h, emmintrin.h, etc.?

The /ARCH:AVX2 option lets the compiler make the best use of YMM registers and the CPU's AVX2 instructions. But if the CPU does not support these instructions, the program will crash. If you use AVX2 instructions with the /ARCH:SSE2 compiler flag, performance will drop (by about 2x).
So the best approach is to compile each implementation of your function with the corresponding compiler option (/ARCH:AVX2, /ARCH:SSE2, and so on). The easiest way to do that is to put your implementations (scalar, SSE, AVX) in different files and compile each file with its own compiler options.
It is also a good idea to create a separate file that checks the CPU's capabilities and calls the corresponding implementation of your function.
Here is an example of a library that checks the CPU and calls one of the implemented functions.
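To make the "separate file that checks the CPU" idea concrete, here is a minimal sketch of such a dispatcher. The names sum_scalar, sum_sse2, sum_avx2, detect_simd_level and select_sum are hypothetical and not taken from any library. In a real project each kernel would sit in its own source file built with its own /arch option; here they all share one scalar body so that the sketch compiles anywhere. The detection uses __cpuid/__cpuidex/_xgetbv on MSVC and __builtin_cpu_supports on GCC/Clang, and falls back to scalar on anything else.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // __cpuid, __cpuidex, _xgetbv
#endif

// Hypothetical kernels. In a real build each one would live in its own
// file, use intrinsics, and be compiled with its own /arch option.
float sum_scalar(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}
float sum_sse2(const float* data, std::size_t n) { return sum_scalar(data, n); }
float sum_avx2(const float* data, std::size_t n) { return sum_scalar(data, n); }

enum class SimdLevel { Scalar, Sse2, Avx2 };

// This file must be compiled WITHOUT /arch:AVX2, otherwise the check
// itself could already contain instructions the CPU cannot run.
SimdLevel detect_simd_level() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 0);
    const int max_leaf = regs[0];
    __cpuid(regs, 1);
    const bool sse2    = (regs[3] & (1 << 26)) != 0;  // EDX bit 26
    const bool osxsave = (regs[2] & (1 << 27)) != 0;  // ECX bit 27
    const bool avx     = (regs[2] & (1 << 28)) != 0;  // ECX bit 28
    bool avx2 = false;
    if (osxsave && avx && max_leaf >= 7) {
        // The OS must also save XMM and YMM state (XCR0 bits 1 and 2).
        const bool ymm_enabled = (_xgetbv(0) & 0x6) == 0x6;
        __cpuidex(regs, 7, 0);
        avx2 = ymm_enabled && (regs[1] & (1 << 5)) != 0;  // EBX bit 5
    }
    if (avx2) return SimdLevel::Avx2;
    if (sse2) return SimdLevel::Sse2;
    return SimdLevel::Scalar;
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return SimdLevel::Avx2;
    if (__builtin_cpu_supports("sse2")) return SimdLevel::Sse2;
    return SimdLevel::Scalar;
#else
    return SimdLevel::Scalar;  // unknown compiler or non-x86 target
#endif
}

using SumFn = float (*)(const float*, std::size_t);

SumFn select_sum() {
    switch (detect_simd_level()) {
        case SimdLevel::Avx2: return sum_avx2;
        case SimdLevel::Sse2: return sum_sse2;
        default:              return sum_scalar;
    }
}

// Public entry point: callers never need to know which kernel runs.
float sum(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    static const SumFn impl = select_sum();  // resolved once, on first call
    return impl(data, n);
}
```

Callers just use sum(values, count). Resolving the function pointer once keeps the CPUID cost out of the hot path.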

Related

Build on AMD, run on Intel?

If I cargo build --release a Rust binary on an AMD CPU and then run it on an Intel (or vice versa), could that be a problem (compatibility issues and/or considerable performance sacrifice)? I know we can use a target-cpu=<cpu> flag and that should result in a potentially more optimized machine code for the target platform. My questions are:
Practically speaking, if we build for one platform but run on the other, should we expect a significant runtime performance penalty?
If we build on AMD with target-cpu=intel (or vice versa), could the compilation itself be:
slower?
restricted in how well it could optimize for the target platform?
Note: Linux will be the OS for both compilation and running.
In general, if you just do a cargo build --release without further configuration, then you will get a binary that runs on any machine of the relevant architecture. That is, it will run on either an Intel or AMD CPU that's x86-64. It will also be optimized for a general CPU of that architecture, regardless of what type of CPU you build it on. The specific settings are going to depend on whatever rustc and LLVM are configured for, but unless you've done a custom build, that's usually the case.
Usually that is sufficient for most people's needs and building for a target CPU is unnecessary. However, if you specify a particular CPU, then it will be optimized for that CPU and may contain instructions that don't run on other CPUs. For example, the architectural definition for x86-64 doesn't contain things like AVX, which is a later addition, so if you compile for a CPU providing those instructions then rustc may use them, which may cause it to perform worse or not at all elsewhere.
It's impossible to say more about your particular situation without knowing more about your code and performance needs. My recommendation is just to use cargo build --release and not to optimize for a specific CPU unless you have measured the code and determined that there is a particular section which is slow and which would benefit from that. Most people benefit greatly from the additional portability of the code and don't need CPU-specific optimizations.
Everything I've said here is also true of other sets of architectures. If you compile for aarch64-unknown-linux-gnu or riscv64gc-unknown-linux-gnu, it will build for a generic CPU of that type which works on all systems of that type unless you specify different options. The exception tends to be on systems like macOS, where it is specifically known that all CPUs that run on that OS have some specific set of features, and thus a compilation for e.g. x86-64 CPUs on macOS might optimize for Intel CPUs with given features since macOS only runs on hardware that uses those CPUs.

Is there a __yield() intrinsic on Arm?

I am trying to compile some code for arm (v7a), that had a
#if defined(__arm__)
__yield();
#endif
added in this pull-request
The other branches have YieldProcessor() for MSC and _mm_pause() or __builtin_ia32_pause() for x86 and x86-64.
The symbol __yield is not found by the compiler, an arm-v7a-linux-gnueabihf-gcc 7.3.1 with -mcpu=cortex-a9 -mtune=cortex-a9 -march=armv7-a options. Is such a symbol defined on some other ARM platforms, a later GCC, or Clang?
In the headers that come with the compiler all I could find is __gnu_parallel::__yield being an inline wrapper over sched_yield(), which I suppose is equivalent to the std::this_thread::yield() that the code calls after 100 iterations calling __yield(). So I think it's not that. But I didn't see __yield in gcc documentation either.
The __yield intrinsic is specified as part of the ARM C Language Extensions (see 8.4 "Hints"). It emits the yield instruction, which is the rough equivalent of x86 pause. It is intended precisely for situations like waiting on a spinlock; it keeps the CPU from hammering on the cache line excessively (which hurts performance), possibly saves some power, and, in case of a hyperthreading CPU, makes more computational units available to the other logical processor.
(Note that it is purely a CPU function, and not an OS or library call; it doesn't yield a CPU timeslice to the operating system like the similarly named pause() or sched_yield() or std::this_thread::yield() calls would do.)
Although GCC supports some of the ACLE intrinsics, it seems to be missing this one. You should be able to substitute with asm volatile("yield");. The yield instruction has no architectural effect (it executes like nop) so no register or memory clobbers are needed.
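As a rough sketch of how that substitution could be wrapped up, the helper below picks the spin-wait hint per compiler and architecture, using only the calls already mentioned in the question (_mm_pause, __builtin_ia32_pause) plus the asm volatile("yield") fallback. The names cpu_relax and SpinLock are made up for illustration. The lock spins with the hint up to a limit of 100 and then calls std::this_thread::yield(), similar to what the code in the question is described as doing. The ARM branch assumes a target that has the yield instruction, such as the ARMv7-A in the question.

```cpp
#include <atomic>
#include <thread>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // _mm_pause
#endif

// Hypothetical helper: emits the CPU's spin-wait hint, or nothing at all
// on targets this sketch does not know about.
inline void cpu_relax() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    _mm_pause();
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();
#elif defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__))
    // Assumes the target architecture has the yield instruction
    // (e.g. -march=armv7-a); no clobbers are needed.
    asm volatile("yield");
#endif
}

// Minimal test-and-set spinlock using the hint while it waits.
class SpinLock {
public:
    void lock() {
        int spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire)) {
            if (++spins < 100) {
                cpu_relax();                // stay on the CPU, just back off
            } else {
                std::this_thread::yield();  // hand the time slice to the OS
                spins = 0;
            }
        }
    }
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

The two waiting steps are deliberately different: cpu_relax() never leaves the core, while std::this_thread::yield() asks the scheduler to run something else.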
So in C++ you have std::this_thread::yield(),
and in C11 you have thrd_yield:
https://en.cppreference.com/w/c/thread
Both are available on ARM.

Dynamic cargo feature flags

Is there a way to change the feature flags of an included library at run time? Or, more specifically, to change the feature flags depending on certain CPU feature flags? Can it be done programmatically, or can it be done via CLI arguments?
We have this library (curve25519-dalek) that allows you to use default, or AVX2, or IFMA. But only if your cpu running the code can support it. Is there a way to create a binary that will pass the correct feature flag to the library so that your code will run the correct instruction set? You would want to run the fastest instruction set as the speed gains are significant.

Linux kernel assembly and logic

My question is somewhat weird but I will do my best to explain.
Looking at the languages the Linux kernel is written in, I found C and assembly, even though I read a text that said "Second iteration of Unix is written completely in C".
I thought that was misleading, but once I accepted that the kernel has assembly code, I had two questions right off the start:
What assembly files are in the kernel and what's their use?
Assembly is architecture dependent, so how can Linux be installed on more than one CPU architecture?
And if the Linux kernel is truly written completely in C, then how can it get the GCC needed for compiling?
I did a complete find / -name *.s
and got just one assembly file (asm-offset.s) somewhere in /usr/src/linux-headers-`uname -r`/
Somehow I don't think that helps GCC work, so how can Linux work without assembly? Or, if it uses assembly, where is it, and how can it be stable when it depends on the architecture?
Thanks in advance
1. Why is assembly used?
Because there are certain things that can be done only in assembly, and because assembly can result in faster code. For example, "you can get access to unusual programming modes of your processor (e.g. 16 bit mode to interface startup, firmware, or legacy code on Intel PCs)".
Read here for more reasons.
2. What assembly files are used?
From: https://www.kernel.org/doc/Documentation/arm/README
"The initial entry into the kernel is via head.S, which uses machine
independent code. The machine is selected by the value of 'r1' on
entry, which must be kept unique."
From https://www.ibm.com/developerworks/library/l-linuxboot/
"When the bzImage (for an i386 image) is invoked, you begin at ./arch/i386/boot/head.S in the start assembly routine (see Figure 3 for the major flow). This routine does some basic hardware setup and invokes the startup_32 routine in ./arch/i386/boot/compressed/head.S. This routine sets up a basic environment (stack, etc.) and clears the Block Started by Symbol (BSS). The kernel is then decompressed through a call to a C function called decompress_kernel (located in ./arch/i386/boot/compressed/misc.c). When the kernel is decompressed into memory, it is called. This is yet another startup_32 function, but this function is in ./arch/i386/kernel/head.S."
Apart from these assembly files, a lot of Linux kernel code uses inline assembly.
3. Architecture dependence?
And you are right about it being architecture dependent; that's why the Linux kernel code has to be ported to each architecture.
Linux porting guide
List of supported arch
Things written mainly in assembly in Linux:
Boot code: boots up the machine and sets it up in a state in which it can start executing C code (e.g: on some processors you may need to manually initialize caches and TLBs, on x86 you have to switch to protected mode, ...)
Interrupts/Exceptions/Traps entry points/returns: there you need to do very processor-specific things, e.g: saving registers and reenabling interrupts, and eventually restoring registers and properly returning to user mode. Some exceptions may be handled entirely in assembly.
Instruction emulation: some CPU models may not support certain instructions, may not support unaligned data access, or may not have an FPU. An option is using emulation when getting the corresponding exception.
VDSO: the VDSO is a virtual library that the kernel maps into userspace. It allows e.g: selecting the optimal syscall sequence for the current CPU (on x86 use sysenter/syscall instead of int 0x80 if available), and implementing certain system calls without requiring a context switch (e.g: gettimeofday()).
Atomic operations and locks: Maybe in the future some of these could be written using C11 support for atomic operations.
Copying memory from/to user mode: Besides using an optimized copy, these check for out-of-bounds access.
Optimized routines: the kernel has optimized versions of some routines, e.g: crypto routines, memset, clear_page, csum_copy (checksum and copy IP data to another place in one pass), ...
Support for suspend/resume and other ACPI/EFI/firmware thingies
BPF JIT: newer kernels include a JIT compiler for BPF expressions (used for example by tcpdump, seccomp mode 2, ...)
...
To support different architectures, Linux has assembly code (re-)written for each architecture it supports (and sometimes, there are several implementations of some code for different platforms using the same CPU architecture). Just look at all the subdirectories under arch/
Assembly is needed for a couple of reasons.
There are many instructions that are needed for the operation of an operating system that have no C equivalent, at least on most processors. A good example on Intel x86/64 processors is the iret instruction, which returns from hardware/software interrupts. These interrupts are key to handling hardware events (like a keyboard press) and system calls from programs on older processors.
A computer does not start up in a state that is immediately ready for execution of C code. For an Intel example, when execution gets to the startup routine the processor may not be in 32-bit mode (or 64-bit mode), and the stack required by C also may not be ready. There are some other features present in some processors (like paging) which need to be turned on from assembly as well.
However, most of the Linux kernel is written in C, which interfaces with some platform specific C/assembly code through standardized interfaces. By separating the parts in this way, most of the logic of the Linux kernel can be shared between platforms. The build system simply compiles the platform independent and dependent parts together for specific platforms, which results in different executable kernel files for different platforms (and kernel configurations for that matter).
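A user-space analogy of that split, as a sketch only (this is not kernel code, and arch_read_counter, arch_name and ticks_elapsed are invented names): the platform-specific layer provides one function per architecture, written with inline assembly where needed, and the shared code calls it without caring which version was compiled in. It assumes GCC or Clang for the inline assembly branches and that reading the counter is permitted in user mode, which is normally the case on Linux.

```cpp
#include <chrono>
#include <cstdint>

// Platform-specific layer: the same interface, one body per architecture.
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
inline std::uint64_t arch_read_counter() {
    std::uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));  // time-stamp counter
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
inline const char* arch_name() { return "x86"; }
#elif defined(__GNUC__) && defined(__aarch64__)
inline std::uint64_t arch_read_counter() {
    std::uint64_t value;
    asm volatile("mrs %0, cntvct_el0" : "=r"(value));  // virtual counter
    return value;
}
inline const char* arch_name() { return "aarch64"; }
#else
// Generic fallback written in plain C++, no assembly required.
inline std::uint64_t arch_read_counter() {
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
inline const char* arch_name() { return "generic"; }
#endif

// Platform-independent layer: identical source on every architecture.
template <typename Work>
std::uint64_t ticks_elapsed(Work&& work) {
    const std::uint64_t start = arch_read_counter();
    work();
    return arch_read_counter() - start;
}
```

Only the block selected by the preprocessor ends up in the binary, which mirrors how a kernel build pulls in just the code for the chosen architecture.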
Assembly code in the kernel is generally used for low-level hardware interaction that can't be done directly from C. It acts as a platform-specific foundation that's used by higher-level parts of the kernel that are written in C.
The kernel source tree contains assembly code for a variety of systems. When you compile a kernel for a particular type of system (such as an x86 PC), only the appropriate assembly code for that platform is included in the build process.
Linux is not the second version of Unix (or Unix in general). It is Unix compatible, but Unix and Linux have separate histories and, in terms of code base (of their kernels), are completely separate. Linus Torvalds's idea was to write an open source Unix.
Some of the lower level things like some of the architecture dependent parts of memory management are done in assembly. The old (but still available) Linux kernel API for x86, int 0x80, is implemented in assembly. There are probably other places in the kernel that are implemented in assembly, but I don't know any others.
When you compile the kernel, you select an architecture to target. Depending on the target, the right assembly files for that architecture are included in the build.
The reason you don't find anything is because you're searching the headers, not the sources. Download a tar ball from kernel.org and search that.

Is multithreaded FFTW deterministic

I am getting slightly different results between runs in my program. It uses multi-threaded FFTW planned with FFTW_ESTIMATE flag. Is multi-threaded FFTW deterministic:
For fixed number of threads?
Between different numbers of threads used at different runs?
The FFTW FAQ says that the FFTW_ESTIMATE flag results in the same algorithm being used between runs, but it does not explicitly say that the result is deterministic in the multi-threaded case.
The fftw documentation:
http://www.fftw.org/fftw3_doc/Thread-safety.html#Thread-safety
stipulates that only fftw_execute is reentrant. So it's hard to say without more info about your usage. Also:
"If you are configured FFTW with the --enable-debug or --enable-debug-malloc flags (see Installation on Unix), then fftw_execute is not thread-safe. These flags are not documented because they are intended only for developing and debugging FFTW, but if you must use --enable-debug then you should also specifically pass --disable-debug-malloc for fftw_execute to be thread-safe."
