Build on AMD, run on Intel? - rust

If I cargo build --release a Rust binary on an AMD CPU and then run it on an Intel (or vice versa), could that be a problem (compatibility issues and/or considerable performance sacrifice)? I know we can use a target-cpu=<cpu> flag and that should result in a potentially more optimized machine code for the target platform. My questions are:
Practically speaking, if we build for one platform but run on the other, should we expect a significant runtime performance penalty?
If we build on AMD with target-cpu=intel (or vice versa), could the compilation itself be:
slower?
restricted in how well it could optimize for the target platform?
Note: Linux will be the OS for both compilation and running.

In general, if you just do a cargo build --release without further configuration, then you will get a binary that runs on any machine of the relevant architecture. That is, it will run on either an Intel or AMD CPU that's x86-64. It will also be optimized for a general CPU of that architecture, regardless of what type of CPU you build it on. The specific settings are going to depend on whatever rustc and LLVM are configured for, but unless you've done a custom build, that's usually the case.
Usually that is sufficient for most people's needs and building for a target CPU is unnecessary. However, if you specify a particular CPU, then it will be optimized for that CPU and may contain instructions that don't run on other CPUs. For example, the architectural definition for x86-64 doesn't contain things like AVX, which is a later addition, so if you compile for a CPU providing those instructions then rustc may use them, which may cause it to perform worse or not at all elsewhere.
It's impossible to say more about your particular situation without knowing more about your code and performance needs. My recommendation is just to ues cargo build --release and not to optimize for a specific CPU unless you have measured the code and determined that there is a particular section which is slow and which would benefit from that. Most people benefit greatly from the additional portability of the code and don't need CPU-specific optimizations.
Everything I've said here is also true of other sets of architectures. If you compile for aarch64-unknown-linux-gnu or riscv64gc-unknown-linux-gnu, it will build for a generic CPU of that type which works on all systems of that type unless you specify different options. The exception tends to be on systems like macOS, where it is specifically known that all CPUs that run on that OS have some specific set of features, and thus a compilation for e.g. x86-64 CPUs on macOS might optimize for Intel CPUs with given features since macOS only runs on hardware that uses those CPUs.

Related

Which linux OS supports AVX-512 VNNI (Vector Neural Network Instruction)?

I need to deploy an EC2 instance where VNNI (Vector Neural Network Instruction) is supported. There are some EC2 instance types that can support the same.
From AWS:
Intel Deep Learning Boost (Intel DL Boost): A new set of built-in processor technologies designed to accelerate AI deep learning use cases. The 2nd Gen Intel Xeon Scalable processors extend Intel AVX-512 with a new Vector Neural Network Instruction (VNNI/INT8) that significantly increases deep learning inference performance over previous generation Intel Xeon Scalable processors (with FP32), for image recognition/segmentation, object detection, speech recognition, language translation, recommendation systems, reinforcement learning and others. VNNI may not be compatible with all Linux distributions. Please check documentation before using.
It is mentioned that VNNI may not be compatible with all Linux distributions. So, which Linux distribution supports VNNI? I am also not sure as to which documentation this statement refers to.
No kernel support is needed beyond that for AVX-512 (i.e. context switch handling of the new AVX-512 zmm and k registers). AVX-512VNNI instructions just operate on those registers, so there's no new architectural state to save/restore on context switch. https://en.wikichip.org/wiki/x86/avx512_vnni / https://en.wikipedia.org/wiki/AVX-512#VNNI
(Unlike AMX (Advanced Matrix Extensions), new in Sapphire Rapids; that does introduce large new "2D tile" registers, 8x 1KiB, that context-switches need to handle1.)
The other relevant thing for distros are compilers versions, like GCC or clang. https://godbolt.org/z/668rvhWPx shows GCC 8.1 and clang 7.0 (both released in 2018) compiling AVX-512VNNI _mm512_dpbusd_epi32 with -march=icelake-server or -march=icelake-client. Versions before that fail, so those are the minimum versions. (Or clang6.0 for -mavx512vnni, but that doesn't enable other things an IceLake CPU supports, or set tuning options.)
So if you want to use the latest hotness, you need a compiler that's at least somewhat up to date. It's generally a good idea to use a compiler newer than the CPU you're using, so compiler devs have had a chance to tweak tuning settings for it. And code-gen from intrinsics, especially newish instruction-sets like AVX-512, has generally improved over compiler versions, so if you care about performance of the generated code, you typically want a newer compiler version. (Regressions happen for some releases for some loops/functions, and thus for some programs, but on average newer compilers make faster code than old ones. That's a big part of what compiler devs spend time improving.)
You can install a new compiler on an old distro via backport packages or manually. Or you can just use a distro release that isn't old and crusty.
Footnote 1: See also a phoronix article re: non-empty AMX register state keeping the CPU from doing a deep sleep. Normally CPUs fully power down the core in deeper sleep states, stashing registers somewhere that stays powered. I'm guessing that they didn't provide space for AMX tiles to do that, so having state there prevents sleep. So if you're using AMX, you'll want Linux kernel at least 5.19.
In AWS, the instance type and OS combination that worked for me:
EC2 instance type: m5n.large (m5n instance family supports AVX-512 VNNI)
OS: Amazon Linux 2 (other Linux distributions should work as well, as explained by #BasileStarynkevitch and #PeterCordes).
For curious minds: What Linux distribution is the Amazon Linux AMI based on?

"Illegal instruction" when running ARM code targeting my CPU

I'm compiling a rather large project for ARM. I'm using an AT91SAM9G25-EK as a devboard running a Debian ARM image. All libraries and executables in the image seem to be compiled for the armv4t instruction set.
My CPU is an ARM926EJ-S, which should run armv5tej code.
I'm using GCC to cross compile for my board. My CXX flags look like the following:
set(CMAKE_CXX_FLAGS "--signed-char --sysroot=${SYSROOT} -mcpu=arm926ej-s -mtune=arm926ej-s -mfloat-abi=softfp" CACHE STRING "" FORCE)
If I try to run this on my board, I get an Illegal Instruction signal (SIGILL) during initialization of one of my dependencies (using armv4t).
If I enable thumb mode (-mthumb -mthumb-interwork) it works, but uses Thumb for all the code, which in my case runs slower (I'm doing some serious number crunching).
In this case, if I specify one function to be compiled for ARM mode (using __attribute__((target("arm")))) it will run fine until that function is called, then exit with SIGILL.
I'm lost. Is it that bad I'm linking against libraries using armv4t? Am I misunderstanding the way ARM modes work? Is it something in the linux kernel?
What softfp means is to use using the soft-float calling convention between functions, but still use the hardware FPU within them. Assuming your cross-compiler is configured with a default -mfpu option other than "none" (run arm-whatever-gcc -v and look for --with-fpu= to check), then you've got a problem, because as far as I can see from the Atmel datasheet, SAM9G25 doesn't have an FPU.
My first instinct would be to slap GDB on there, catch the signal and disassemble the offending instruction to make sure, but the fact that Thumb code works OK is already a giveaway (Thumb before ARMv6T2 doesn't include any coprocessor instructions, and thus can't make use of an FPU).
In short, use -mfloat-abi=soft to ensure the ARM code actually uses software floating-point and avoids poking a non-existent FPU. And if the "serious number crunching" involves a lot of floating-point, perhaps consider getting a different MCU...

Linux kernel assembly and logic

My question is somewhat weird but I will do my best to explain.
Looking at the languages the linux kernel has, I got C and assembly even though I read a text that said [quote] Second iteration of Unix is written completely in C [/quote]
I thought that was misleading but when I said that kernel has assembly code I got 2 questions of the start
What assembly files are in the kernel and what's their use?
Assembly is architecture dependant so how can linux be installed on more than one CPU architecture
And if linux kernel is truly written completely in C than how can it get GCC needed for compiling?
I did a complete find / -name *.s
and just got one assembly file (asm-offset.s) somewhere in the /usr/src/linux-headers-`uname -r/
Somehow I don't think that is helping with the GCC working, so how can linux work without assembly or if it uses assembly where is it and how can it be stable when it depends on the arch.
Thanks in advance
1. Why assembly is used?
Because there are certain things then can be done only in assembly and because assembly results in a faster code. For eg, "you can get access to unusual programming modes of your processor (e.g. 16 bit mode to interface startup, firmware, or legacy code on Intel PCs)".
Read here for more reasons.
2. What assembly file are used?
From: https://www.kernel.org/doc/Documentation/arm/README
"The initial entry into the kernel is via head.S, which uses machine
independent code. The machine is selected by the value of 'r1' on
entry, which must be kept unique."
From https://www.ibm.com/developerworks/library/l-linuxboot/
"When the bzImage (for an i386 image) is invoked, you begin at ./arch/i386/boot/head.S in the start assembly routine (see Figure 3 for the major flow). This routine does some basic hardware setup and invokes the startup_32 routine in ./arch/i386/boot/compressed/head.S. This routine sets up a basic environment (stack, etc.) and clears the Block Started by Symbol (BSS). The kernel is then decompressed through a call to a C function called decompress_kernel (located in ./arch/i386/boot/compressed/misc.c). When the kernel is decompressed into memory, it is called. This is yet another startup_32 function, but this function is in ./arch/i386/kernel/head.S."
Apart from these assembly files, lot of linux kernel code has usage of inline assembly.
3. Architecture dependence?
And you are right about it being architecture dependent, that's why the linux kernel code is ported to different architecture.
Linux porting guide
List of supported arch
Things written mainly in assembly in Linux:
Boot code: boots up the machine and sets it up in a state in which it can start executing C code (e.g: on some processors you may need to manually initialize caches and TLBs, on x86 you have to switch to protected mode, ...)
Interrupts/Exceptions/Traps entry points/returns: there you need to do very processor-specific things, e.g: saving registers and reenabling interrupts, and eventually restoring registers and properly returning to user mode. Some exceptions may be handled entirely in assembly.
Instruction emulation: some CPU models may not support certain instructions, may not support unaligned data access, or may not have an FPU. An option is using emulation when getting the corresponding exception.
VDSO: the VDSO is a virtual library that the kernel maps into userspace. It allows e.g: selecting the optimal syscall sequence for the current CPU (on x86 use sysenter/syscall instead of int 0x80 if available), and implementing certain system calls without requiring a context switch (e.g: gettimeofday()).
Atomic operations and locks: Maybe in a future some of these could be written using C11 support for atomic operations.
Copying memory from/to user mode: Besides using an optimized copy, these check for out-of-bounds access.
Optimized routines: the kernel has optimized version of some routines, e.g: crypto routines, memset, clear_page, csum_copy (checksum and copy to another place IP data in one pass), ...
Support for suspend/resume and other ACPI/EFI/firmware thingies
BPF JIT: newer kernels include a JIT compiler for BPF expressions (used for example by tcpdump, secmode mode 2, ...)
...
To support different architectures, Linux has assembly code (re-)written for each architecture it supports (and sometimes, there are several implementations of some code for different platforms using the same CPU architecture). Just look at all the subdirectories under arch/
Assembly is needed for a couple of reasons.
There are many instructions that are needed for the operation of an operating system that have no C equivalent, at least on most processors. A good example on Intel x86/64 processors is the iret instruciton, which returns from hardware/software interrupts. These interrupts are key to handling hardware events (like a keyboard press) and system calls from programs on older processors.
A computer does not start up in a state that is immediately ready for execution of C code. For an Intel example, when execution gets to the startup routine the processor may not be in 32-bit mode (or 64-bit mode), and the stack required by C also may not be ready. There are some other features present in some processors (like paging) which need to be turned on from assembly as well.
However, most of the Linux kernel is written in C, which interfaces with some platform specific C/assembly code through standardized interfaces. By separating the parts in this way, most of the logic of the Linux kernel can be shared between platforms. The build system simply compiles the platform independent and dependent parts together for specific platforms, which results in different executable kernel files for different platforms (and kernel configurations for that matter).
Assembly code in the kernel is generally used for low-level hardware interaction that can't be done directly from C. They're like a platform- specific foundation that's used by higher-level parts of the kernel that are written in C.
The kernel source tree contains assembly code for a variety of systems. When you compile a kernel for a particular type of system (such as an x86 PC), only the appropriate assembly code for that platform is included in the build process.
Linux is not the second version of Unix (or Unix in general). It is Unix compatible, but Unix and Linux have separate histories and, in terms of code base (of their kernels), are completely separate. Linus Torvald's idea was to write an open source Unix.
Some of the lower level things like some of the architecture dependent parts of memory management are done in assembly. The old (but still available) Linux kernel API for x86, int 0x80, is implemented in assembly. There are probably other places in the kernel that are implemented in assembly, but I don't know any others.
When you compile the kernel, you select an architecture to target. Depending on the target, the right assembly files for that architecture are included in the build.
The reason you don't find anything is because you're searching the headers, not the sources. Download a tar ball from kernel.org and search that.

How to test the kernel for kernel panics?

I am testing the Linux Kernel on an embedded device and would like to find situations / scenarios in which Linux Kernel would issue panics.
Can you suggest some test steps (manual or code automated) to create Kernel panics?
There's a variety of tools that you can use to try to crash your machine:
crashme tries to execute random code; this is good for testing process lifecycle code.
fsx is a tool to try to exercise the filesystem code extensively; it's good for testing drivers, block io and filesystem code.
The Linux Test Project aims to create a large repository of kernel test cases; it might not be designed with crashing systems in particular, but it may go a long way towards helping you and your team keep everything working as planned. (Note that the LTP isn't proscriptive -- the kernel community doesn't treat their tests as anything important -- but the LTP team tries very hard to be descriptive about what the kernel does and doesn't do.)
If your device is network-connected, you can run nmap against it, using a variety of scanning options: -sV --version-all will try to find versions of all services running (this can be stressful), -O --osscan-guess will try to determine the operating system by throwing strange network packets at the machine and guessing by responses what the output is.
The nessus scanning tool also does version identification of running services; it may or may not offer any improvements over nmap, though.
You can also hand your device to users; they figure out the craziest things to do with software, they'll spot bugs you'd never even think to look for. :)
You can try following key combination
SysRq + c
or
echo c >/proc/sysrq-trigger
Crashme has been known to find unknown kernel panic situations, but it must be run in a potent way that creates a variety of signal exceptions handled within the process and a variety of process exit conditions.
The main purpose of the messages generated by Crashme is to determine if sufficiently interesting things are happening to indicate possible potency. For example, if the mprotect call is needed to allow memory allocated with malloc to be executed as instructions, and if you don't have the mprotect enabled in the source code crashme.c for your platform, then Crashme is impotent.
It seems that operating systems on x64 architectures tend to have execution turned off for data segments. Recently I have updated the crashme.c on http://crashme.codeplex.com/ to use mprotect in case of __APPLE__ and tested it on a MacBook Pro running MAC OS X Lion. This is the first serious update to Crashme since 1994. Expect to see updated Centos and Freebsd support soon.

Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained. I.e. a faster algorithm will always run faster in both debug and optimization builds.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta, but that it's rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.
What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
You are talking about two different things here. Debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the consequence of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols, its running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag that you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of losing compiler optimization. However, once again, this is not 'overhead', its just the absence of compiler optimization.
Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.
Multithreaded code execution time is not always measured as expected by gprof.
You should time your code with an other timer in addition to gprof to see the difference.
My example: Running LULESH CORAL benchmark on a 2NUMA nodes INTEL sandy bridge (8 cores + 8 cores) with size -s 50 and 20 iterations -i, compile with gcc 6.3.0, -O3, I have:
With 1 thread running: ~3,7 without -pg and ~3,8 with it, but according to gprof analysis the code has ran only for 3,5.
WIth 16 threads running: ~0,6 without -pg and ~0,8 with it, but according to gprof analysis the code has ran for ~4,5 ...
The time in bold has been measured gettimeofday, outside the parallel region (start and end of main function).
Therefore, maybe if you would have measure your application time the same way, you would have seen the same speeduo with and without -pg. It is just the gprof measure which is wrong in parallel. In LULESH openmp version either way.

Resources