How to do modulo with less memory in ARM embedded Rust

How to do modulo with less memory in ARM embedded Rust - rust

I have an embedded project in Rust on the STM32F446 MCU. Consider the next line:
leds::set_g(self.next_update_time % 2000 == 0)
The modulo is used and reading online, it appears that the Cortex M4 doesn't have a modulo instruction. Instead, a function gets added to the binary that does this in software. Using cargo bloat (based on Google's Bloaty), it can be found.
File .text Size Crate Name
...
0.1% 6.9% 990B compiler_builtins __udivmoddi4
...
Much to my surprise, it takes just under a kilobyte of memory. I think that's a lot. The code behind it is quite long as well, see this link. I assume this implementation is made to be fast. Luckily I have the memory to spare.
Using opt-level = 'z' doesn't change this.
But what if I couldn't afford this, how could I let it take up less memory?
Of course resorting to a solution like this would work, but then I'd lose the ability to use the % operator.

Not sure how clever the Rust linker is, but in many embedded linker implementations you would be able to swap in your own implementation of __udivmodi4 which used a smaller (but slower) method in preference to the version provided by the compiler.
In general generic division and modulo are expensive on embedded platforms, but division by a constant can often be specialized with a "fixed" implementation by a smart compiler (often with special cases for common divisors - 3, 5, 7, 10, etc).
If you can control the application then changing the code to divide or modulo by 2^N is obviously preferable (it collapses to either a "right shift" instruction for divide, or an "and" instruction for modulo). E.g. in this case 2048 might be acceptably close to 2000, and turns 1 KB of code into 4 bytes of code.
FWIW the Rust version of this does seem a little on the fat side - the GCC implementation for example is much smaller.

Related

Fastest way to flip the value of a boolean

I am trying to find the fastest way to flip the value of a Boolean in rust? i.e.
false => true
true => false
For my application I do not care about the current value of the Boolean only that it is flipped. For my application (Sieve of Atkin - an improved version of the Sieve of Eratosthenes) this will need to be performed a large number of times so would be good to have it run as fast as possible. Currently my code is:
item[i] = !item[i]
Because, (as mentioned) the current value of item[i] is irrelevant I am sure there is a faster (possibly bitwise) way to do this. However, I'm a bit of a rust nubie and haven't been able to find it, can anyone advise me on a better way?
Thanks,

It doesn't matter. The compiler will outsmart you on this and pick the fastest method it knows. The exact syntax you use is not important. However this is only the case when you remember to turn on optimizations. You may think this is obvious, but it is an extremely common mistake. This is done using --release when building or running your project with cargo. If you forget this step, the compiler won't even attempt to speed up your code and timing code execution becomes meaningless.
What matters more is how you access the memory where the boolean resides. Try to keep memory you are working with on the stack if you are doing a lot of work on one value or small region at a time. Cache locality also means that it will be faster to read adjacent cache lines than jumping between places in memory. Memory is more likely to be in the cache if you accessed it recently or the CPU guesses you are about to access it.
There are also crates like bit-vec and bitvec which reduce each boolean to using a single bit. This is great for improving memory usage (8x improvement to be exact), but comes at a very small cost to performance. I would avoid the bitvec crate though. About a month ago I did some benchmarks and the performance was absolutely abysmal.
Do you need to work on a single boolean at a time? Try to work with entire words of memory if possible. Bitwise operations on a u64 will likely take the exact same amount of time, but you get 64x the productivity.

Since in Rust a boolean variable is represented as an 8 bit unsigned integer with 0 for false and 1 for true, the compiler can implement negation without a branch by computing the XOR of the value with 1.
That being said, while I'm not familiar with the Sieve of Atkin, at least for the Sieve of Eratosthenes, you really want to use bitfields rather than booleans. But the same trick can be used there to avoid a branch.

Are floating point operations deterministic when running in multiple threads?

Suppose I have a function that runs calculations, example being something like a dot product - I pass in an arrays A, B of vectors and a float array C, and the functions assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both the threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a a bit more worried that CPU cores might not have the same instruction sets, even on consumer level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, c++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.

Floating point (FP) operations are not associative (but it is commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely-used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant and mainstream compilers (eg. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default since it assumes the FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make such assumption so to speed up codes. It is generally combined with other ones like the assumption that all FP values are not NaN (eg. -ffast-math). Such flags break the IEEE-754 compliance, but they are often used in the 3D or video game industry so to speed up codes. IEEE-754 is not required by the C++ standard, but you can check this with std::numeric_limits<T>::is_iec559.
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very-high overhead (see this for more information).
Assuming the IEEE-754 compliance is not broken, the rounding mode is the same and the threads does the operations in the same order, then the result should be identical up to at least 1 ULP. In practice, if they are compiled using a same mainstream compiler, the result should be exactly the same.
The thing is using multiple threads often result in a non-deterministic order of the applied FP operations which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use a static partitioning, avoid atomic operations on FP variables or more generally atomic operations that could result in a different ordering. The same thing applies for locks or any synchronization mechanisms.
The same thing is true for GPUs. In fact, such problem is very frequent when developers use atomic FP operations for example to sum values. They often do that because implementing fast reductions is complex (though it is more deterministic) and atomic operations as pretty fast on modern GPUs (since they use dedicated efficient units).

According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.

What do the optimization levels `-Os` and `-Oz` do in rustc?

Executing rustc -C help shows (among other things):
-C opt-level=val -- optimize with possible levels 0-3, s, or z
The levels 0 to 3 are fairly intuitive, I think: the higher the level, the more aggressive optimizations will be performed. However, I have no clue what the s and z options are doing and I couldn't find Rust-related information about them.

It seems like you are not the only one confused, as described in a Rust issue. It seems to follow the same pattern as Clang:
Os For optimising the size when compiling.
Oz For even more size optimisation.

Looking at these and these lines in Rust's source code, I can say that s means optimize for size, and z means optimize for size some more.
All optimizations seem to be performed by the LLVM code-generation engine.

These two sequences, Os and Oz, within LLVM, are pretty similar. Oz invokes 260 passes (I am using LLVM 12.0), whereas Os invokes 264. Oz' sequence of analyses and optimizations is almost a strict subsequence of Os', except for one pass (opt -loops), which appears in a different place within Os. This said, notice that the effects of the optimizations can still be different, because they use different cost models, e.g., constants that determine the behavior of optimizations. Thus, optimizations that have impact on size, like loop unrolling and inlining can behave differently in these two sequences.

Why is bounds checking not implemented in some of the languages?

According to the Wikipedia (http://en.wikipedia.org/wiki/Buffer_overflow)
Programming languages commonly associated with buffer overflows include C and C++, which provide no built-in protection against accessing or overwriting data in any part of memory and do not automatically check that data written to an array (the built-in buffer type) is within the boundaries of that array. Bounds checking can prevent buffer overflows.
So, why are 'Bounds Checking' not implemented in some of the languages like C and C++?

Basically, it's because it means every time you change an index, you have to do an if statement.
Let's consider a simple C for loop:
int ary[X] = {...}; // Purposefully leaving size and initializer unknown
for(int ix=0; ix< 23; ix++){
printf("ary[%d]=%d\n", ix, ary[ix]);
}
if we have bounds checking, the generated code for ary[ix] has to be something like
LOOP:
INC IX ; add `1 to ix
CMP IX, 23 ; while test
CMP IX, X ; compare IX and X
JGE ERROR ; if IX >= X jump to ERROR
LD R1, IX ; put the value of IX into register 1
LD R2, ARY+IX ; put the array value in R2
LA R3, Str42 ; STR42 is the format string
JSR PRINTF ; now we call the printf routine
J LOOP ; go back to the top of the loop
;;; somewhere else in the code
ERROR:
HCF ; halt and catch fire
If we don't have that bounds check, then we can write instead:
LD R1, IX
LOOP:
CMP IX, 23
JGE END
LD R2, ARY+R1
JSR PRINTF
INC R1
J LOOP
This saves 3-4 instructions in the loop, which (especially in the old days) meant a lot.
In fact, in the PDP-11 machines, it was even better, because there was something called "auto-increment addressing". On a PDP, all of the register stuff etc turned into something like
CZ -(IX), END ; compare IX to zero, then decrement; jump to END if zero
(And anyone who happens to remember the PDP better than I do, don't give me trouble about the precise syntax etc; you're an old fart like me, you know how these things slip away.)

It's all about the performance. However, the assertion that C and C++ have no bounds checking is not entirely correct. It is quite common to have "debug" and "optimized" versions of each library, and it is not uncommon to find bounds-checking enabled in the debugging versions of various libraries.
This has the advantage of quickly and painlessly finding out-of-bounds errors when developing the application, while at the same time eliminating the performance hit when running the program for realz.
I should also add that the performance hit is non-negigible, and many languages other than C++ will provide various high-level functions operating on buffers that are implemented directly in C and C++ specifically to avoid the bounds checking. For example, in Java, if you compare the speed of copying one array into another using pure Java vs. using System.arrayCopy (which does bounds checking once, but then straight-up copies the array without bounds-checking each individual element), you will see a decently large difference in the performance of those two operations.

It is easier to implement and faster both to compile and at run-time. It also simplifies the language definition (as quite a few things can be left out if this is skipped).
Currently, when you do:
int *p = (int*)malloc(sizeof(int));
*p = 50;
C (and C++) just says, "Okey dokey! I'll put something in that spot in memory".
If bounds checking were required, C would have to say, "Ok, first let's see if I can put something there? Has it been allocated? Yes? Good. I'll insert now." By skipping the test to see whether there is something which can be written there, you are saving a very costly step. On the other hand, (she wore a glove), we now live in an era where "optimization is for those who cannot afford RAM," so the arguments about the speed are getting much weaker.

The primary reason is the performance overhead of adding bounds checking to C or C++. While this overhead can be reduced substantially with state-of-the-art techniques (to 20-100% overhead, depending upon the application), it is still large enough to make many folks hesitate. I'm not sure whether that reaction is rational -- I sometimes suspect that people focus too much on performance, simply because performance is quantifiable and measurable -- but regardless, it is a fact of life. This fact reduces the incentive for major compilers to put effort into integrating the latest work on bounds checking into their compilers.
A secondary reason involves concerns that bounds checking might break your app. Particularly if you do funky stuff with pointer arithmetic and casting that violate the standard, bounds checking might block something your application is currently doing. Large applications sometimes do amazingly crufty and ugly things. If the compiler breaks the application, then there's no point in pointing blaming the crufty code for the problem; people aren't going to keep using a compiler that breaks their application.
Another major reason is that bounds checking competes with ASLR + DEP. ASLR + DEP are perceived as solving, oh, 80% of the problem or so. That reduces the perceived need for full-fledged bounds checking.

Because it would cripple those general purpose languages for HPC requirements. There are plenty of applications where buffer overflows really do not matter one iota, simply because they do not happen. Such features are much better off in a library (where in fact you can already find examples for C/C++).
For domain specific languages it may make sense to bake such features into the language definition and trade the resulting performance hit for increased security.

Does 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether i can assume that same operations on same 64-bit floating point numbers gives exactly the same results on any modern PC and in most common programming languages? (C++, Java, C#, etc.). We can assume, that we are operating on numbers and result is also a number (no NaNs, INFs and so on).
I know there are two very simmilar standards of computation using floating point numbers (IEEE 854-1987 and IEEE 754-2008). However I don't know how it is in practice.

Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = bc + d" in C, where all variables are declared double, the compiler is free to compute "bc" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "bc + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes bc and then adds d.
Your compiler might have switches to control behavior like this.
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.

If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen e.g. on a computer with an Intel processor, if the optimizer keeps a variable for an intermediate result in a register, that is stored in memory in the unoptimized build. Since Intel FPU registers are 80 bit, and double variables are 64 bit, the intermediate result will be stored with greater precision in the optimized build, causing different values in later results.).
In practice, however, you may often get the same results, but you shouldn't rely on it.

Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.

assuming you are talking about applying multiple operations, I do not think you will get exact numbers. CPU architecture, compiler use, optimization settings will change the results of your computations.
if you mean the exact order of operations (at the assembly level), I think you will still get variations.for example Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated)

The same C# program can bring out different numerical results on the same PC, once compiled in debug mode without optimization, second time compiled in release mode with optimization enabled. That's my personal experience. We did not regard this when we set up an automatic regression test suite for one of our programs for the first time, and were completely surprised that a lot of our tests failed without any apparent reason.

For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes

For the 64-bit data type, I only know of "double precision" / "binary64" from the IEEE 754 (1985 and 2008 don't differ much here for common cases) being used.
Note: The radix types defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyways.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string