What's the term for "default rounding mode" for arithmetic ops in IEEE 754?

I know that IEEE 754 is the specification for floating-point number data types, their representation and the semantics of operations on them; but - I'm not knowledgeable in the standard itself.
I also know that IEEE 754 defines the following four rounding modes: to nearest, up, down, to zero; and that in typical environments supporting IEEE 754 you can say "please round x using rounding mode m" (e.g. here's how it's done with glibc). That's all well and good.
However, rounding also happens when you simply perform arithmetic: If you have a single-precision 0.9999999 (or something close to that) and you add this to 100000, you will either get 100001, or something close to 100000.99 .
My question: Is this rounding supposed to be an application of a "default rounding mode"? Is that the IEEE 754 term, or is there another, specific term for what I've described? And are the four rounding modes supposed to be supported for implicit rounding arithmetic, as well?

IEEE 754-2008 does not define “modes” for rounding; it changed the terminology from the 1985 standard to imply more flexibility than a global mode. In clause 4.3.3, it specifies round-to-nearest-ties-to-even as the default method for binary formats. For decimal formats, it says a default is not defined but should be round-to-nearest-ties-to-even.
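The ties-to-even default can be observed through ordinary addition. The following is a minimal sketch, assuming Python floats are IEEE 754 binary64 and that the default rounding direction is in effect; the names `BASE` and `add_small` are illustrative only.

```python
# Round-to-nearest, ties-to-even, observed through ordinary float addition.
# A binary64 value has a 53-bit significand, so above 2**53 only even
# integers are representable and every odd integer is an exact tie.
BASE = 2.0 ** 53  # 9007199254740992.0

def add_small(k: float) -> float:
    """Return BASE + k, rounded by the default rounding direction."""
    return BASE + k

tie_low = add_small(1.0)   # exact tie between BASE and BASE + 2
tie_high = add_small(3.0)  # exact tie between BASE + 2 and BASE + 4

# Both ties resolve to the neighbour whose significand is even.
print(tie_low == BASE)
print(tie_high == BASE + 4.0)
```

Under round-half-up, the first tie would have gone to BASE + 2 instead.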
IEEE 754-2008 subclause 4.3 specifies several rounding-direction attributes. It does not say there has to be a mode that provides a rounding-direction setting. A computing environment could, for example, include the rounding-direction attribute as a parameter on each individual operation. For example, a processor architecture could have separate instructions for add with rounding-to-nearest-ties-to-even, add with rounding toward +∞, add with rounding toward zero, and so on. Or it could include the rounding-direction attribute as an operand to the instruction. Or the instruction could be affected by a global mode set in some special processor register.
Many processor architectures provide rounding mode as a setting in a floating-point control register.
The IEEE 754-1985 standard described rounding modes and defined mode as “A variable that a user may set, sense, save, and restore to control the execution of subsequent arithmetic operations.” This may have influenced the development of rounding modes in control registers, but that has a detrimental effect on performance, as global registers cause dependencies between instructions: Every floating-point instruction depends on that global register, so any change to that register interferes with parallel execution of instructions.
However, it is desirable that the rounding direction be flexible, as one might want to alternately use different modes when implementing interval arithmetic or evaluating complicated routines such as implementations of sine or exponentiation. So the IEEE 754-2008 committee changed the standard not to define global modes. Clause 4 specifies some semantics for attributes, including that languages should provide ways to specify the attributes for all standard operations in a “block.” A block may be an entire program or a single operation; it is language-defined. Subclause 4.2 says languages should provide dynamic ways of specifying the attributes, so they can be determined at run-time.
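As a rough illustration of a rounding attribute scoped to a "block", here is a sketch using Python's `decimal` module, which is a decimal arithmetic library and not an IEEE 754 binary implementation. The 8-digit precision is an assumption chosen to sit near single precision, and `add_with_rounding` is a hypothetical helper. It reproduces the addition from the question under four rounding directions.

```python
from decimal import (Decimal, localcontext,
                     ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

def add_with_rounding(a: str, b: str, rounding: str, digits: int = 8) -> Decimal:
    """Add two decimals, rounding the sum to `digits` significant digits."""
    with localcontext() as ctx:   # the attribute applies only inside this block
        ctx.prec = digits         # assumed precision, for illustration
        ctx.rounding = rounding   # rounding direction for this block
        return Decimal(a) + Decimal(b)

modes = (ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)
results = {mode: add_with_rounding("100000", "0.9999999", mode) for mode in modes}

for mode, value in results.items():
    print(mode, value)
```

Rounding to nearest and rounding up give 100001, while rounding toward zero and rounding down give 100000.99, which are the two outcomes the question describes.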

Related

Are floating point operations deterministic when running in multiple threads?

Suppose I have a function that runs calculations, for example something like a dot product: I pass in arrays A and B of vectors and a float array C, and the function assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass the same three arrays to both threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed to give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a bit more worried that CPU cores might not have the same instruction sets, even on consumer-level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, c++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
Floating-point (FP) addition and multiplication are not associative (although they are commutative). As a result, (x+y)+z can give a different result than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit operations, including rounding modes. x86-64 processors are IEEE-754 compliant, and mainstream compilers (e.g. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default, since it assumes that FP operations are associative for the sake of performance. Mainstream compilers have flags that enable this assumption in order to speed up code. It is generally combined with other assumptions, such as that no FP value is NaN (e.g. -ffast-math). Such flags break IEEE-754 compliance, but they are often used in the 3D and video game industry to speed up code. IEEE-754 is not required by the C++ standard, but you can check for it with std::numeric_limits<T>::is_iec559.
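The effect of grouping can be seen with a small sketch. It assumes Python floats are IEEE 754 binary64 with the default rounding; the helper names and the values are illustrative and are not taken from the answer above.

```python
def left_grouped(a: float, b: float, c: float) -> float:
    """Evaluate (a + b) + c."""
    return (a + b) + c

def right_grouped(a: float, b: float, c: float) -> float:
    """Evaluate a + (b + c)."""
    return a + (b + c)

big, small = 1e16, 1.0

# (1e16 - 1e16) + 1: the large terms cancel first, so the 1 survives.
lhs = left_grouped(big, -big, small)
# 1e16 + (-1e16 + 1): the 1 is absorbed when added to -1e16, then cancels.
rhs = right_grouped(big, -big, small)
print(lhs, rhs)

# Swapping the operands of a single addition does not change the result.
print(big + small == small + big)

# A less extreme case with everyday values.
print(left_grouped(0.1, 0.2, 0.3) == right_grouped(0.1, 0.2, 0.3))
```

The same three values summed in a different grouping give 1.0 in one case and 0.0 in the other, which is why a compiler that reassociates operations can change results.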
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very-high overhead (see this for more information).
Assuming IEEE-754 compliance is not broken, the rounding mode is the same, and the threads do the operations in the same order, the result should be identical to within 1 ULP. In practice, if the code is compiled with the same mainstream compiler, the result should be exactly the same.
The thing is that using multiple threads often results in a non-deterministic order of FP operations, which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use static partitioning and avoid atomic operations on FP variables or, more generally, any atomic operations that could result in a different ordering. The same applies to locks and any other synchronization mechanism.
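A minimal sketch of static partitioning, assuming a simple sum as the reduction: each worker gets a fixed slice, and the partial results are combined in slice order no matter which thread finishes first. The function names and the data are hypothetical. An explicit loop is used instead of the built-in `sum`, since recent Python versions may apply compensated summation to floats.

```python
from concurrent.futures import ThreadPoolExecutor

def naive_sum(values):
    """Plain left-to-right accumulation, one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

def static_partition_sum(values, parts):
    """Sum fixed-size chunks in worker threads, then combine in chunk order."""
    if not values:
        return 0.0
    size = (len(values) + parts - 1) // parts
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # map() yields results in input order, not in completion order.
        partials = list(pool.map(naive_sum, chunks))
    return naive_sum(partials)

data = [1e16, 1.0, -1e16, 1.0]

# The partitioned sum is the same on every run...
runs = {static_partition_sum(data, 2) for _ in range(20)}
print(runs)

# ...whereas summing the same values in a different order changes the result.
in_order = naive_sum(data)
reordered = naive_sum([1e16, -1e16, 1.0, 1.0])
print(in_order, reordered)
```

The partitioned result need not equal the sequential one, since the grouping differs, but it is reproducible. An accumulation whose order depends on thread timing behaves like the reordered sum.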
The same is true for GPUs. In fact, this problem is very frequent when developers use atomic FP operations, for example to sum values. They often do that because implementing fast reductions is complex (though it is more deterministic) and atomic operations are pretty fast on modern GPUs (since they use dedicated efficient units).
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure, a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.

State space size of state of the art model checkers

What is the approximate maximum state space size of modern model checkers, like NuSMV?
I do not need an exact number but some state size value, where the run time is still acceptable (say a few weeks).
What kind of improvements, beyond symbolic model checking, are used to raise that limit?
The answer varies wildly, depending, among other factors, on:
what model checking algorithm is used
how the system is represented
how the model checker (or other tool) is implemented
what hardware the software is running on (and parallelization etc).
Instead of mentioning some specific number of states, I will instead note a few relevant factors (I use "specification" below as a synonym to "model"):
Symbolic or enumerative: Symbolic algorithms scale differently than enumerative ones. Also, for the same problem, there are typically differences in the computational complexity of known symbolic and enumerative algorithms.
Enumeration is relatively predictable in behavior, in that a state space with N states will most likely take a shorter time to enumerate than a state space with 1000000 * N states.
Symbolic approaches based on binary-decision diagrams (BDDs) can behave in ways (nearly) unrelated to how many states are reachable based on the specification. The main factor is what kind of Boolean functions arise in the encoding of the specification.
For example, a specification that involves a multiplier would result in BDDs that are exponentially large in the number of bits that represent the state, so of size linear in the number of states (assuming that the reachable states are exponentially more than the bits used to represent those states). In this case, a state space with 2^50 states that may otherwise be amenable to symbolic analysis becomes prohibitive.
In other words, it is not only the number of states, but the kind of Boolean function that the system action corresponds to that matters (an action in TLA+ corresponds to what a transition relation represents in other formalisms). In addition, choosing a different encoding (of integers by bits) can have an impact on BDD size.
Symmetry (e.g., partial order reduction), and abstraction are some improvements to approach the analysis of more complex systems.
Acceptable runtime is a relative notion. Whatever the model checking approach, there is always a limit where the model's fidelity reaches the available time.
Another approach is to write a specification that has unspecified parameters, then use a model checker to find errors in instances of the specification that correspond to small parameter values, and after correcting these, then use a theorem prover to ensure correctness of the specification. This approach is supported by tools for TLA+, namely the model checker TLC and the theorem prover TLAPS.
Regarding terminology ("specification" above), please see "What good is temporal logic?" by Leslie Lamport.
Also worth noting that, depending on the approach, the number of states, and the number of reachable states can be different notions. Usually, this matters in typed formalisms: we can specify a system with 1 reachable state, but declare variable types that result in many more states, most of which are unreachable from the initial conditions. In symbolic approaches, this affects the encoding, thus the size of BDDs.
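The gap between declared and reachable states can be sketched with a small enumerative search. The toy system below is an assumption made up for illustration (two variables declared with 256 values each, where only one of them ever changes), and `reachable_states` and `step` are hypothetical names.

```python
from collections import deque

def reachable_states(initial, successors):
    """Breadth-first enumeration of every state reachable from `initial`."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

WIDTH = 256  # each variable is declared to range over 0..255

def step(state):
    """Toy transition relation: x advances by 2 modulo WIDTH, y never changes."""
    x, y = state
    return [((x + 2) % WIDTH, y)]

declared = WIDTH * WIDTH                   # states implied by the declared types
reached = reachable_states((0, 0), step)   # states reachable from (0, 0)

print(declared, len(reached))
```

Enumeration only ever touches the 128 reachable states, whereas a symbolic encoding has to represent both variables over their full declared ranges.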
References relevant to state space size:
Bwolen Yang, Randal E. Bryant, David R. O’Hallaron, Armin Biere, Olivier Coudert, Geert Janssen, Rajeev K. Ranjan, Fabio Somenzi, A performance study of BDD-based model checking, FMCAD, 1998, DOI: 10.1007/3-540-49519-3_18
Radek Pelánek, Properties of state spaces and their applications, STTT, 2008, DOI: 10.1007/s10009-008-0070-5 (and a relevant website)
Radek Pelánek, Typical structural properties of state spaces, SPIN, 2004, DOI: 10.1007/978-3-540-24732-6_2
Yunja Choi, From NuSMV to SPIN: Experiences with model checking flight guidance systems, FMSD, 2007, DOI: 10.1007/s10703-006-0027-9
J.R. Burch, E.M. Clarke, K.L. McMillan, D.L. Dill, L.J. Hwang, Symbolic model checking: 10^20 states and beyond, LICS, 1990, DOI: 10.1109/LICS.1990.113767

Why is a simple RGB to Lab conversion much slower using compute capability 1.3 than 1.0 even with -use_fast_math flag?

I am using GT 740M (CC 3.5) and I have a RGB to Lab conversion kernel. Using compute capability 1.0 - 1.2 the whole kernel is executed in 924 microseconds however using the compute capability of 1.3 or higher (up to 3.5) the kernel is executed in around 3 ms. According to the table from wikipedia http://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications I found out that it could be caused by the double-precision floating-point operations so I used the -use_fast_math flag, but it did not help.
What can be the reason of the performance hit?
The whole source code can be seen in http://pastebin.com/JjhH101y
cc 1.0 - 1.2 devices do not support double-precision floating point operations. Those operations will be "demoted" to single precision floating point operations on those devices.
At first glance, all of your variables are float not double, but your constants are all double-precision constants.
Therefore arithmetic like this:
a=(x-y)*500.0;
will involve a double-precision floating point multiply on compile targets that support it (which will be subsequently reduced to a float). On compile targets that don't support it, the above operation will be handled entirely via single-precision math.
The -use_fast_math option does not affect conversion between double and float as discussed above.
I would suggest that you start by decorating all your constants as floating-point constants:
a=(x-y)*500.0f;
You might also want to carefully review the CUDA math api to be sure that you get what you want from operations like this:
exp(log(x)/3.0)
in terms of single or double-precision arithmetic.

Does decimal math use the FPU

Does decimal math use the FPU?
I would think that the answer is yes, but I'm not sure, since a decimal is not a floating point, but a fixed precision number.
I'm looking mostly for .NET, but a general answer would be useful too.
With regards to .NET and more specifically C#, no, System.Decimal does not use the FPU because the type is emulated in software.
Also, System.Decimal is a floating point number, not a fixed precision number like commonly found in a database. The type is actually a decimal floating point that uses 10 for its base as opposed to a binary floating point (i.e. System.Single or System.Double) which uses 2 as its base. It still has the same precision problems if you attempt to store a fraction that cannot be exactly represented, for example, 1/3.
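The same behaviour can be sketched with Python's `decimal` module, used here only as a stand-in for a base-10 floating-point type; it is not System.Decimal, and the 28-digit precision is an assumption chosen to be roughly comparable.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 28  # assumed precision, for illustration only

    # 1/3 has no finite base-10 expansion, so it is rounded to 28 digits.
    third = Decimal(1) / Decimal(3)
    round_trip = third * 3

    # Tenths are exact in base 10, so this sum is exact.
    decimal_tenths_exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# The same tenths are not exact in binary floating point.
binary_tenths_exact = (0.1 + 0.2 == 0.3)

print(third)
print(round_trip)
print(decimal_tenths_exact, binary_tenths_exact)
```

A decimal base moves the set of exactly representable fractions but does not remove rounding: multiplying the rounded third by 3 does not give back 1.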
Yes, modern languages in general support floating point math and integer math, and that's it; there's no direct support for fixed point, BCD, etc. Floating-point math is going to be done using floating-point processor instructions; modern architectures don't include a separate FPU.

Do 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether I can assume that the same operations on the same 64-bit floating point numbers give exactly the same results on any modern PC and in the most common programming languages (C++, Java, C#, etc.). We can assume that we are operating on numbers and that the result is also a number (no NaNs, INFs and so on).
I know there are two very similar standards of computation using floating point numbers (IEEE 854-1987 and IEEE 754-2008). However, I don't know how it is in practice.
Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = b*c + d" in C, where all variables are declared double, the compiler is free to compute "b*c" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "b*c + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes b*c and then adds d.
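The difference between one rounding and two can be sketched without a fused instruction by computing the exact result with rationals and rounding once. This is a model of the behaviour, not a call to hardware; the function names and the operand values are illustrative assumptions.

```python
from fractions import Fraction

def separate_multiply_add(b: float, c: float, d: float) -> float:
    """Round the product to double, then round the sum: two roundings."""
    return b * c + d

def fused_multiply_add(b: float, c: float, d: float) -> float:
    """Model of a fused operation: exact b*c + d, rounded to double once."""
    exact = Fraction(b) * Fraction(c) + Fraction(d)
    return float(exact)

# b*c is exactly 1 + 2**-26 + 2**-54; the last term is lost when the
# product is rounded to a double on its own.
b = c = 1.0 + 2.0 ** -27
d = -(1.0 + 2.0 ** -26)

two_roundings = separate_multiply_add(b, c, d)
one_rounding = fused_multiply_add(b, c, d)
print(two_roundings, one_rounding)
```

With two roundings the small term is discarded before the subtraction and the result is 0.0; with a single rounding it survives as 2**-54.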
Your compiler might have switches to control behavior like this.
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
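A small sketch of the difference, assuming Python floats are binary64 and that the interpreter runs with gradual underflow enabled (the usual case). The helper `flush_to_zero` is a hypothetical model of the non-compliant mode and not a real processor or library setting.

```python
import math
import sys

smallest_normal = sys.float_info.min  # smallest positive normal double

# Under gradual underflow, halving the smallest normal number gives a
# subnormal value rather than zero.
gradual = smallest_normal / 2

def flush_to_zero(value: float) -> float:
    """Hypothetical model of a flush-to-zero mode: subnormals become zero."""
    if value != 0.0 and abs(value) < sys.float_info.min:
        return math.copysign(0.0, value)
    return value

flushed = flush_to_zero(gradual)
print(gradual, flushed)
```

Any later computation that divides by such a value, or compares it with zero, then takes a different path in the two modes.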
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.
If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen e.g. on a computer with an Intel processor, if the optimizer keeps a variable for an intermediate result in a register, that is stored in memory in the unoptimized build. Since Intel FPU registers are 80 bit, and double variables are 64 bit, the intermediate result will be stored with greater precision in the optimized build, causing different values in later results.).
In practice, however, you may often get the same results, but you shouldn't rely on it.
Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.
Assuming you are talking about applying multiple operations, I do not think you will get exactly the same numbers. CPU architecture, compiler choice, and optimization settings will change the results of your computations.
If you mean the exact order of operations (at the assembly level), I think you will still get variations. For example, Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated.)
The same C# program can produce different numerical results on the same PC when compiled once in debug mode without optimization and once in release mode with optimization enabled. That's my personal experience. We did not take this into account when we first set up an automatic regression test suite for one of our programs, and were completely surprised that a lot of our tests failed without any apparent reason.
For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes
For the 64-bit data type, I only know of "double precision" / "binary64" from the IEEE 754 (1985 and 2008 don't differ much here for common cases) being used.
Note: The radix types defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyways.

Resources