Is multithreaded FFTW deterministic?

I am getting slightly different results between runs of my program. It uses multi-threaded FFTW, planned with the FFTW_ESTIMATE flag. Is multi-threaded FFTW deterministic:
For a fixed number of threads?
Across runs that use different numbers of threads?
The FFTW FAQ says that the FFTW_ESTIMATE flag results in the same algorithm being used between runs, but it does not explicitly say whether that makes the multi-threaded case deterministic.

The fftw documentation:
http://www.fftw.org/fftw3_doc/Thread-safety.html#Thread-safety
stipulates that only fftw_execute is reentrant. So it's hard to say without more info about your usage. Also:
"If you are configured FFTW with the --enable-debug or --enable-debug-malloc flags (see Installation on Unix), then fftw_execute is not thread-safe. These flags are not documented because they are intended only for developing and debugging FFTW, but if you must use --enable-debug then you should also specifically pass --disable-debug-malloc for fftw_execute to be thread-safe."

Related

Build on AMD, run on Intel?

If I cargo build --release a Rust binary on an AMD CPU and then run it on an Intel (or vice versa), could that be a problem (compatibility issues and/or considerable performance sacrifice)? I know we can use a target-cpu=<cpu> flag and that should result in a potentially more optimized machine code for the target platform. My questions are:
Practically speaking, if we build for one platform but run on the other, should we expect a significant runtime performance penalty?
If we build on AMD with target-cpu=intel (or vice versa), could the compilation itself be:
slower?
restricted in how well it could optimize for the target platform?
Note: Linux will be the OS for both compilation and running.
In general, if you just do a cargo build --release without further configuration, you will get a binary that runs on any machine of the relevant architecture. That is, it will run on either an Intel or an AMD CPU that's x86-64. It will also be optimized for a generic CPU of that architecture, regardless of what type of CPU you build it on. The exact defaults depend on how rustc and LLVM are configured, but unless you've done a custom build, that's usually the case.
Usually that is sufficient for most people's needs, and building for a specific target CPU is unnecessary. However, if you specify a particular CPU, the binary will be optimized for that CPU and may contain instructions that don't run on other CPUs. For example, the architectural definition of x86-64 doesn't include later additions like AVX, so if you compile for a CPU that provides those instructions, rustc may use them, and the resulting binary may run worse, or not at all, on other CPUs.
It's impossible to say more about your particular situation without knowing more about your code and performance needs. My recommendation is just to use cargo build --release and not to optimize for a specific CPU unless you have measured the code and determined that there is a particular section which is slow and which would benefit from that. Most people benefit more from the additional portability of the code and don't need CPU-specific optimizations.
Everything I've said here is also true of other sets of architectures. If you compile for aarch64-unknown-linux-gnu or riscv64gc-unknown-linux-gnu, it will build for a generic CPU of that type which works on all systems of that type unless you specify different options. The exception tends to be on systems like macOS, where it is specifically known that all CPUs that run on that OS have some specific set of features, and thus a compilation for e.g. x86-64 CPUs on macOS might optimize for Intel CPUs with given features since macOS only runs on hardware that uses those CPUs.
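If you do decide to opt in to CPU-specific code generation, one common way is via RUSTFLAGS (the particular CPU names below are only illustrations):

RUSTFLAGS="-C target-cpu=native" cargo build --release      # tune for the machine you build on
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo build --release   # tune for a newer generic x86-64 level

Note that target-cpu expects an LLVM CPU name (e.g. skylake, znver2) or a generic microarchitecture level such as x86-64-v3 rather than a vendor name.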

Is there a __yield() intrinsic on Arm?

I am trying to compile some code for arm (v7a), that had a
#if defined(__arm__)
__yield();
#endif
added in this pull-request
The other branches have YieldProcessor() for MSC and _mm_pause() or __builtin_ia32_pause() for x86 and x86-64.
The symbol __yield is not found by the compiler, an arm-v7a-linux-gnueabihf-gcc 7.3.1 with the -mcpu=cortex-a9 -mtune=cortex-a9 -march=armv7-a options. Is such a symbol defined on some other ARM platform, in a later GCC, or in Clang?
In the headers that come with the compiler, all I could find is __gnu_parallel::__yield, an inline wrapper over sched_yield(), which I suppose is equivalent to the std::this_thread::yield() that the code calls after 100 iterations of calling __yield(). So I think it's not that. But I didn't see __yield in the GCC documentation either.
The __yield intrinsic is specified as part of the ARM C Language Extensions (see 8.4 "Hints"). It emits the yield instruction, which is the rough equivalent of x86 pause. It is intended precisely for situations like waiting on a spinlock; it keeps the CPU from hammering on the cache line excessively (which hurts performance), possibly saves some power, and, in case of a hyperthreading CPU, makes more computational units available to the other logical processor.
(Note that it is purely a CPU function, and not an OS or library call; it doesn't yield a CPU timeslice to the operating system like the similarly named pause() or sched_yield() or std::this_thread::yield() calls would do.)
Although GCC supports some of the ACLE intrinsics, it seems to be missing this one. You should be able to substitute with asm volatile("yield");. The yield instruction has no architectural effect (it executes like nop) so no register or memory clobbers are needed.
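A minimal sketch of such a substitute (GCC/Clang only; the helper name cpu_relax is made up here), mirroring the per-architecture branches mentioned in the question:

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>                   // _mm_pause
#endif

static inline void cpu_relax()
{
#if defined(__arm__) || defined(__aarch64__)
    asm volatile("yield");               // CPU hint only; no clobbers needed
#elif defined(__x86_64__) || defined(__i386__)
    _mm_pause();                         // x86 equivalent of yield
#else
    /* no portable CPU hint available; do nothing */
#endif
}

In a spin-wait loop you would call cpu_relax() on every iteration and only fall back to an OS-level yield such as std::this_thread::yield() after some number of spins, as the code in the question already does.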
So in C++ you have std::this_thread::yield(),
in C11 you have thrd_yield(),
https://en.cppreference.com/w/c/thread
and both are available on ARM.

What are the default build options in Stack's package.yaml needed for?

When creating a new Stack project, the executable gets the following ghc-options
- -threaded
- -rtsopts
- -with-rtsopts=-N
Looking at the man ghc, I found the following:
-threaded
Use the threaded runtime
-rtsopts[=⟨none|some|all⟩]
Control whether the RTS behaviour can be tweaked via command-line flags and the GHCRTS environment variable. Using none means no RTS flags can be given; some means only a minimum of safe options can be given (the default), and all (or no argument at all) means that all RTS flags are permitted.
-with-rtsopts=⟨opts⟩
Set the default RTS options to ⟨opts⟩.
This hardly tells me anything. Does the threaded runtime refer to the runtime of the compiler or the runtime of the resulting executable? What are these rtsopts for? Are these options somehow beneficial? If yes, why are they not the default? And why does the executable get them but the library code does not:
library:
source-dirs: src
# notice no ghc-options field here!
It refers to the runtime of the resulting executable. These options allow the final program to run multithreaded, i.e. to exploit parallel or concurrent execution on a multicore machine. This is mostly beneficial for performance, and particularly for the responsiveness of interactive applications (some are virtually or literally unusable when not run with the threaded runtime).
The runtime is only ever linked into an executable; you may think of it as similar to the Java virtual machine that executes a bytecode program, except the Haskell runtime is much more stripped down and doesn't actually need to do much code interpreting: it mostly just handles the memory requirements and assigns Haskell threads to OS threads.
It is thus not linked into libraries, except at the very end when you use the library in an executable. That's why the option doesn't appear in the library section. (Although it would actually be quite handy if libraries had a way to indicate "when you use this in an executable, it should be linked with the threaded runtime", this would have ramifications, and I don't think anything allows something like that.)
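As a concrete illustration (the executable name my-exe and the thread count are made up): with -threaded and -rtsopts compiled in, the runtime can be tuned when the program is started, e.g.

./my-exe +RTS -N4 -s -RTS    # use 4 capabilities and print runtime statistics

while -with-rtsopts=-N bakes in -N (one capability per processor core) as the default, so the program uses all cores even when no +RTS flags are passed on the command line.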

MSVC /arch:[instruction set] - SSE3, AVX, AVX2

Here is an example of a class which shows supported instruction sets. https://msdn.microsoft.com/en-us/library/hskdteyh.aspx
I want to write three different implementations of a single function, each of them using a different instruction set. But with the /ARCH:AVX2 flag, for example, the app won't ever run anywhere except on 4th-generation-or-later Intel processors, so the whole point of checking becomes moot.
So the question is: what exactly does this flag do? Does it enable support for the given instruction sets, or does it enable compiler optimizations that use them?
In other words, can I completely remove this flag and keep using functions from immintrin.h, emmintrin.h, etc.?
Using the /ARCH:AVX2 option lets the compiler make the best use of the CPU's YMM registers and AVX2 instructions. But if the CPU does not support these instructions, the program will crash. And if you use AVX2 instructions with the /ARCH:SSE2 compiler flag, you will see a drop in performance (roughly 2x).
So the best approach is to compile every implementation of your function with the corresponding compiler option (/ARCH:AVX2, /ARCH:SSE2, and so on). The easiest way to do this is to put your implementations (scalar, SSE, AVX) in different files and compile each file with its own compiler options.
It is also a good idea to create a separate file where you check the CPU's capabilities and call the corresponding implementation of your function, as sketched below.
There is an example of a library which does this CPU checking and calls one of the implemented functions.
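A minimal sketch of such a dispatch file (MSVC-specific; the function names Sum_AVX2, Sum_SSE2 and Sum are placeholders for your own implementations, each of which lives in its own .cpp built with the matching /arch option):

#include <intrin.h>
#include <cstddef>

float Sum_AVX2(const float* p, std::size_t n);   // in sum_avx2.cpp, built with /arch:AVX2
float Sum_SSE2(const float* p, std::size_t n);   // in sum_sse2.cpp, built with /arch:SSE2 (or no /arch)

static bool CpuHasAvx2()
{
    int info[4];
    __cpuid(info, 0);
    if (info[0] < 7)
        return false;
    __cpuidex(info, 7, 0);
    // Note: a production check would also verify OSXSAVE/XGETBV so the OS saves YMM state.
    return (info[1] & (1 << 5)) != 0;            // CPUID.(EAX=7,ECX=0):EBX bit 5 = AVX2
}

float Sum(const float* p, std::size_t n)
{
    static const bool hasAvx2 = CpuHasAvx2();    // checked once, on first call
    return hasAvx2 ? Sum_AVX2(p, n) : Sum_SSE2(p, n);
}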

Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained, i.e. a faster algorithm will always run faster in both debug and optimized builds.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta, but that it's rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.
What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the cost of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols; it's running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of compiler optimization. But again, this is not 'overhead'; it's just the absence of compiler optimization.
Is the profiling code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly-language level, you'll find out pretty quickly.
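Another quick way to check with the GNU toolchain (the binary name here is a placeholder): -pg instruments functions with calls to mcount (or __fentry__ on toolchains using -mfentry), so counting those calls in the disassembly shows how much instrumentation was inserted, e.g.

objdump -d ./raytracer | grep -c 'mcount\|fentry'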
Multithreaded code execution time is not always measured as expected by gprof.
You should time your code with another timer in addition to gprof to see the difference.
My example: running the LULESH CORAL benchmark on a 2-NUMA-node Intel Sandy Bridge machine (8 cores + 8 cores) with size -s 50 and 20 iterations (-i 20), compiled with gcc 6.3.0 at -O3, I get:
With 1 thread: ~3.7 s without -pg and ~3.8 s with it, but according to the gprof analysis the code ran for only 3.5 s.
With 16 threads: ~0.6 s without -pg and ~0.8 s with it, but according to the gprof analysis the code ran for ~4.5 s...
The times above were measured with gettimeofday, outside the parallel region (at the start and end of the main function).
Therefore, if you had measured your application's run time the same way, you would probably have seen the same speedup with and without -pg. It is just the gprof measurement that is wrong in the parallel case, at least for the OpenMP version of LULESH.
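A minimal sketch of that kind of independent timing (the workload placeholder is hypothetical; std::chrono would do just as well as gettimeofday):

#include <cstdio>
#include <sys/time.h>

static double now_seconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main()
{
    double t0 = now_seconds();
    /* ... run the threaded workload here (e.g. the render or benchmark loop) ... */
    double t1 = now_seconds();
    std::printf("wall-clock time: %.3f s\n", t1 - t0);
    return 0;
}

Comparing this wall-clock number between the builds with and without -pg avoids relying on gprof's per-thread accounting.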
