What do the optimization levels `-Os` and `-Oz` do in rustc? - rust

Executing rustc -C help shows (among other things):
-C opt-level=val -- optimize with possible levels 0-3, s, or z
The levels 0 to 3 are fairly intuitive, I think: the higher the level, the more aggressive optimizations will be performed. However, I have no clue what the s and z options are doing and I couldn't find Rust-related information about them.

It seems you are not the only one confused, as described in a Rust issue. The levels appear to follow the same pattern as Clang's:
Os For optimising the size when compiling.
Oz For even more size optimisation.

Looking at these and these lines in Rust's source code, I can say that s means optimize for size, and z means optimize for size some more.
All optimizations seem to be performed by the LLVM code-generation engine.

Within LLVM, these two pass sequences, Os and Oz, are pretty similar. Oz invokes 260 passes (I am using LLVM 12.0), whereas Os invokes 264. Oz's sequence of analyses and optimizations is almost a strict subsequence of Os's, except for one pass (opt -loops), which appears in a different place within Os. That said, the effects of the optimizations can still differ, because the two levels use different cost models, i.e., constants that determine the behavior of the optimizations. Thus, optimizations that affect size, like loop unrolling and inlining, can behave differently in these two sequences.
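If you want to try these levels in a Cargo project, the usual way is a profile setting rather than passing -C opt-level to rustc by hand. A minimal sketch for the release profile:

[profile.release]
opt-level = "z"   # optimize for size; use "s" for the less aggressive size level

Other profiles, e.g. [profile.dev], accept the same key.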

Related

Are floating point operations deterministic when running in multiple threads?

Suppose I have a function that runs calculations, an example being something like a dot product - I pass in arrays A and B of vectors and a float array C, and the function assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed to give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed that the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, or the compiler somehow figuring out it will run in multiple threads and compiling a separate copy for the second thread?).
But I'm a bit more worried that CPU cores might not have the same instruction sets, even on consumer-level PCs.
Side question - what about GPUs in a similar scenario?
I'm assuming x86_64, Windows, C++, and dot is a.x * b.x + a.y * b.y. I can't give more info than that - I'm using Unity IL2CPP and don't know how it compiles or with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
Floating-point (FP) operations are not associative (though they are commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely-used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant, and mainstream compilers (e.g. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default since it assumes FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make such assumptions in order to speed up code. These are generally combined with other assumptions, like treating all FP values as non-NaN (e.g. -ffast-math). Such flags break IEEE-754 compliance, but they are often used in the 3D and video-game industries to speed up code. IEEE-754 is not required by the C++ standard, but you can check for it with std::numeric_limits<T>::is_iec559.
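You can check that concrete example yourself; a minimal sketch (written in Rust here purely for consistency with the rest of this page - both C++ double and Rust f64 are IEEE-754 binary64):

fn main() {
    let lhs = 1e-13 + (1.0 - 1e-13);
    let rhs = (1e-13 + 1.0) - 1e-13;
    // Prints "false": the two groupings round differently and end up one ULP apart.
    println!("lhs = {lhs:.17}, rhs = {rhs:.17}, equal = {}", lhs == rhs);
}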
Threads can have different rounding modes, since the rounding mode is per-thread state. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very high overhead (see this for more information).
Assuming IEEE-754 compliance is not broken, the rounding mode is the same, and the threads do the operations in the same order, then the results should be identical up to at least 1 ULP. In practice, if both threads run code compiled by the same mainstream compiler, the result should be exactly the same.
The thing is that using multiple threads often results in a non-deterministic order of the applied FP operations, which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use static partitioning and avoid atomic operations on FP variables, or more generally any atomic operations that could result in a different ordering. The same applies to locks and any other synchronization mechanism.
The same thing is true for GPUs. In fact, this problem is very frequent when developers use atomic FP operations, for example to sum values. They often do that because implementing fast reductions is complex (though a reduction is more deterministic) and atomic operations are pretty fast on modern GPUs (since they use dedicated, efficient units).
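To see why the order alone is enough to change the result, here is a small sequential sketch (Rust, for consistency with the rest of the page); with atomics or dynamic scheduling it is exactly this summation order that becomes non-deterministic:

fn main() {
    let xs = [1.0e16_f64, 1.0, -1.0e16, 1.0];
    let forward: f64 = xs.iter().sum();
    let backward: f64 = xs.iter().rev().sum();
    // Same values, different addition order: prints 1 and 0.
    println!("forward = {forward}, backward = {backward}");
}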
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure, a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.

How to do modulo with less memory in ARM embedded Rust

I have an embedded project in Rust on the STM32F446 MCU. Consider the next line:
leds::set_g(self.next_update_time % 2000 == 0)
The modulo operator is used and, from reading online, it appears that the Cortex-M4 doesn't have a modulo instruction. Instead, a function that does this in software gets added to the binary. Using cargo bloat (based on Google's Bloaty), it can be found:
File .text Size Crate Name
...
0.1% 6.9% 990B compiler_builtins __udivmoddi4
...
Much to my surprise, it takes just under a kilobyte of memory. I think that's a lot. The code behind it is quite long as well, see this link. I assume this implementation is made to be fast. Luckily I have the memory to spare.
Using opt-level = 'z' doesn't change this.
But what if I couldn't afford this, how could I let it take up less memory?
Of course resorting to a solution like this would work, but then I'd lose the ability to use the % operator.
Not sure how clever the Rust linker is, but in many embedded linker implementations you would be able to swap in your own implementation of __udivmoddi4 that uses a smaller (but slower) method in preference to the version provided by the compiler.
In general generic division and modulo are expensive on embedded platforms, but division by a constant can often be specialized with a "fixed" implementation by a smart compiler (often with special cases for common divisors - 3, 5, 7, 10, etc).
If you can control the application then changing the code to divide or modulo by 2^N is obviously preferable (it collapses to either a "right shift" instruction for divide, or an "and" instruction for modulo). E.g. in this case 2048 might be acceptably close to 2000, and turns 1 KB of code into 4 bytes of code.
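A minimal sketch of that rewrite (the u64 parameter and the 2048 constant are assumptions standing in for the original field and the 2000 constant; __udivmoddi4 is the 64-bit unsigned divide/remainder helper, so a 64-bit counter is what pulls it in):

// A power-of-two modulus needs no division: the compiler lowers it to a
// single AND with 2047, so no __udivmoddi4 call ends up in the binary.
fn is_update_tick(next_update_time: u64) -> bool {
    next_update_time % 2048 == 0 // same as (next_update_time & 2047) == 0
}

If the counter can be kept in a u32 instead, the Cortex-M4's hardware UDIV instruction handles the division and the software helper is not needed either.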
FWIW the Rust version of this does seem a little on the fat side - the GCC implementation for example is much smaller.

Parallelism in functional languages

One of the advertised features of functional programming is that a program is "parallel by default" and naturally fits modern multi-core processors. Indeed, reducing a tree is parallel by its nature. However, I don't understand how this maps to multi-threading. Consider the following fragment (pseudo code):
let y = read-and-parse-a-number-from-console
let x = get-integer-from-web-service-call
let r = 5 * x - y * 4
write-r-to-file
How can a translator determine which of the tree's branches should be run on a thread? After you have obtained x or y, it would be stupid to reduce the 5 * x or y * 4 expression on a separate thread (even if we grab one from a thread pool), wouldn't it? So how do different functional languages handle this?
We're not quite there yet.
Programs in pure declarative style (functional style is included in this category, but so are some other styles) tend to be much more amenable to parallelisation, because all data dependencies are explicit. This makes it very easy for the programmer to manually use primitives the language provides for specifying that two independent computations should be done in parallel, regardless of whether they share access to any data; if everything's immutable and there are no side effects, then changing the order in which things are done can't affect the result.
If purity is enforced by the language (as in Haskell, Mercury, etc, but unlike in Scala, F#, etc where purity is encouraged but unenforced), then it is possible for the compiler to attempt to automatically parallelise the program, but no existing language that I know of does this by default. If the language allows unchecked impure operations then it's generally impossible for the compiler to do the analysis necessary to prove that a given attempt to auto-parallelise the program is valid. So I do not expect any such language to ever support auto-parallelisation very effectively.
Note that the pseudo program you wrote is probably not pure declarative code. let y = read-and-parse-a-number-from-console and let x = get-integer-from-web-service-call are calculating x and y from impure external operations, and there's nothing in the program that fixes the order in which they should run. It's possible in general that executing two impure operations in either order will produce different results, and running those two operations in different threads gives up control of the order in which they run. So if a language like that was to auto-parallelise your program, it would almost certainly either introduce horrible concurrency bugs, or refuse to significantly parallelise anything.
However, the functional style still makes it easy to manually parallelise such programs. A human programmer can tell that it almost certainly doesn't matter in which order you read from the console and the network. Knowing that there's no shared mutable state, they can decide to run those two operations in parallel without digging into their implementations (which you'd have to do with imperative algorithms, where there might be mutable shared state even if it doesn't look like there is from the interface).
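As an illustration of that manual step, here is a sketch in Rust (not a pure functional language, and the two fetch functions are hypothetical stand-ins for the console read and the web-service call); because the programmer knows the two loads are independent, they can be spawned on separate threads and joined only where the combining expression needs them:

use std::thread;

fn read_number_from_console() -> i64 { 7 }        // hypothetical stand-in
fn get_integer_from_web_service() -> i64 { 11 }   // hypothetical stand-in

fn main() {
    let (x, y) = thread::scope(|s| {
        let hx = s.spawn(get_integer_from_web_service);
        let hy = s.spawn(read_number_from_console);
        (hx.join().unwrap(), hy.join().unwrap())
    });
    let r = 5 * x - y * 4;
    println!("r = {r}");
}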
But the big complication that's in the way of auto-parallelising compilers for enforced-purity languages is knowing how much parallelisation to do. Running every computation possible in parallel vastly overwhelms any possible benefit with all the startup cost of spawning new threads (not to mention the context switches), as you try to run huge numbers of very short-lived threads on a small number of processors. The compiler needs to identify a much smaller number of "chunks" of computation that are reasonably large, and run the chunks in parallel while running the sub-computations of each chunk sequentially.
But only "embarrassingly parallel" programs decompose nicely into very large completely independent computations. Most programs are much more interdependent. So unless you only want to be able to auto-parallelise programs that are very easy to manually parallelise, your auto-parallelisation probably needs to be able to identify and run "chunks" in parallel which are partially dependent on each other, having them wait when they get to points that really need a result that's supposed to be computed by another "chunk". This introduces extra overhead of synchronisation between the threads, so the logic that chooses what to run in parallel needs to be even better in order to beat the trivial strategy of just running everything sequentially.
The developers of Mercury (a pure logic programming language) are working on various methods of tackling these problems, from static analysis to using profiling data. If you're interested, their research papers have a lot more information. I presume other researchers are working on this area in other languages, but I don't know much about any other projects.
In that specific example, the third statement depends on the first and the second, but there is no interdependency between the first and the second. Therefore, a runtime environment could execute read-and-parse-a-number-from-console on a different thread from get-integer-from-web-service-call, but the execution of the third statement would have to wait until the first two are finished.
Some languages or runtime environments may be able to calculate a partial result (such as y * 4) before obtaining an actual value of x. As a high level programmer though, you would be unlikely to be able to detect this.

Does 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether I can assume that the same operations on the same 64-bit floating-point numbers give exactly the same results on any modern PC and in most common programming languages (C++, Java, C#, etc.). We can assume that we are operating on numbers and the result is also a number (no NaNs, INFs and so on).
I know there are two very similar standards of computation using floating-point numbers (IEEE 854-1987 and IEEE 754-2008). However, I don't know how it is in practice.
Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = b*c + d" in C, where all variables are declared double, the compiler is free to compute "b*c" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "b*c + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes b*c and then adds d.
Your compiler might have switches to control behavior like this.
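To make the fused-versus-separate difference concrete, here is a small sketch (Rust for consistency with the rest of this page; C gets the same effect with fma() from <math.h>):

fn main() {
    let (b, c, d) = (0.1_f64, 10.0_f64, -1.0_f64);
    let separate = b * c + d;     // b*c is rounded to a double, then the add is rounded
    let fused = b.mul_add(c, d);  // fused multiply-add: a single rounding at the end
    // separate evaluates to 0.0, fused to about 5.55e-17 (exactly 2^-54).
    println!("separate = {separate:e}, fused = {fused:e}");
}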
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.
If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen, e.g., on a computer with an Intel processor, if the optimizer keeps a variable for an intermediate result in a register that would be stored in memory in the unoptimized build. Since Intel FPU registers are 80 bits and double variables are 64 bits, the intermediate result will be stored with greater precision in the optimized build, causing different values in later results.)
In practice, however, you may often get the same results, but you shouldn't rely on it.
Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.
Assuming you are talking about applying multiple operations, I do not think you will get exactly the same numbers. CPU architecture, the compiler used, and optimization settings will change the results of your computations.
If you mean the exact same order of operations (at the assembly level), I think you will still get variations. For example, Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated.)
The same C# program can produce different numerical results on the same PC when compiled once in debug mode without optimization and a second time in release mode with optimization enabled. That's my personal experience. We did not account for this when we set up an automatic regression test suite for one of our programs for the first time, and were completely surprised that a lot of our tests failed without any apparent reason.
For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes
For the 64-bit data type, I only know of "double precision" / "binary64" from IEEE 754 (the 1985 and 2008 revisions don't differ much here for common cases) being used.
Note: the radix-independent formats defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyway.

Why doesn't a swap / exchange operator exist in imperative or OO languages like C/C++/C#/Java...?

I was always wondering why such a simple and basic operation like swapping the contents of two variables is not built-in for many languages.
It is one of the most basic programming exercises in computer science classes; it is heavily used in many algorithms (e.g. sorting); every now and then one needs it and one must use a temporary variable or use a template/generic function.
It is even a basic machine instruction on many processors, so that the standard scheme with a temporary variable will get optimized.
Many less obvious operators have been created, like the assignment operators (e.g. +=, which was probably created for reflecting the cumulative machine instructions, e.g. add ax,bx), or the ?? operator in C#.
So, what is the reason? Or does it actually exist, and I always missed it?
In my experience, it isn't that commonly needed in actual applications, apart from the already-mentioned sort algorithms and occasionally in low level hardware poking, so in my view it's a bit too special-purpose to have in a general-purpose language.
As has also been mentioned, not all processors support it as an instruction (and many do not support it for objects bigger than a word). So if it were supported with some useful additional semantics (e.g. being an atomic operation) it would be difficult to support on some processors, and if it didn't have the additional semantics then it's just (seldom-used) syntactic sugar.
The assignment operators (+= etc.) were supported because these are much more common in real-world programs - so the syntactic sugar they provide was more useful, and it also served as an optimisation - remember, C dates from the late 60s/early 70s, and compiler optimisation wasn't as advanced (and the machines were less capable, so you didn't want lengthy optimisation passes anyway).
C++ does have swapping.
#include <algorithm>
#include <cassert>
int main()
{
    using std::swap;
    int a(3), b(5);
    swap(a, b);
    assert(a == 5 && b == 3);
}
Furthermore, you can specialise swap for custom types too!
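As an aside for the Rust questions elsewhere on this page: Rust ships the same facility as a plain library function, std::mem::swap, which works for any type without an explicit temporary at the call site:

fn main() {
    let (mut a, mut b) = (3, 5);
    std::mem::swap(&mut a, &mut b);
    assert_eq!((a, b), (5, 3));
}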
It's a widely used example in computer science courses, but I almost never find myself needing it in real code - whereas I use += very frequently.
Yes, in sorting it would be handy - but you don't tend to need to implement sorting yourself, so the number of actual uses in source code would still be pretty low.
You do have the XOR trick, which swaps two variables of a primitive integer type without a temporary...
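For what that trick looks like (shown in Rust; it only works for integer types, and a temporary or the standard swap is clearer and at least as fast, so treat this as a curiosity):

fn xor_swap(a: &mut u32, b: &mut u32) {
    // Classic caveat: in C this breaks if a and b alias the same object;
    // Rust's two &mut references cannot alias, so that failure mode is ruled out here.
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

fn main() {
    let (mut x, mut y) = (3_u32, 5_u32);
    xor_swap(&mut x, &mut y);
    assert_eq!((x, y), (5, 3));
}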
I think they just forgot to add it :-) Yes, not all CPUs have this kind of instruction, so what? We have a bunch of other things that most CPUs don't have instructions to compute. It would be much easier/clearer, and also faster (via an intrinsic), if we had it!

Resources