I would like to know whether i can assume that same operations on same 64-bit floating point numbers gives exactly the same results on any modern PC and in most common programming languages? (C++, Java, C#, etc.). We can assume, that we are operating on numbers and result is also a number (no NaNs, INFs and so on).
I know there are two very simmilar standards of computation using floating point numbers (IEEE 854-1987 and IEEE 754-2008). However I don't know how it is in practice.
Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = bc + d" in C, where all variables are declared double, the compiler is free to compute "bc" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "bc + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes bc and then adds d.
Your compiler might have switches to control behavior like this.
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.
If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen e.g. on a computer with an Intel processor, if the optimizer keeps a variable for an intermediate result in a register, that is stored in memory in the unoptimized build. Since Intel FPU registers are 80 bit, and double variables are 64 bit, the intermediate result will be stored with greater precision in the optimized build, causing different values in later results.).
In practice, however, you may often get the same results, but you shouldn't rely on it.
Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.
assuming you are talking about applying multiple operations, I do not think you will get exact numbers. CPU architecture, compiler use, optimization settings will change the results of your computations.
if you mean the exact order of operations (at the assembly level), I think you will still get variations.for example Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated)
The same C# program can bring out different numerical results on the same PC, once compiled in debug mode without optimization, second time compiled in release mode with optimization enabled. That's my personal experience. We did not regard this when we set up an automatic regression test suite for one of our programs for the first time, and were completely surprised that a lot of our tests failed without any apparent reason.
For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes
For the 64-bit data type, I only know of "double precision" / "binary64" from the IEEE 754 (1985 and 2008 don't differ much here for common cases) being used.
Note: The radix types defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyways.
Related
Suppose I have a function that runs calculations, example being something like a dot product - I pass in an arrays A, B of vectors and a float array C, and the functions assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both the threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a a bit more worried that CPU cores might not have the same instruction sets, even on consumer level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, c++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
Floating point (FP) operations are not associative (but it is commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely-used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant and mainstream compilers (eg. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default since it assumes the FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make such assumption so to speed up codes. It is generally combined with other ones like the assumption that all FP values are not NaN (eg. -ffast-math). Such flags break the IEEE-754 compliance, but they are often used in the 3D or video game industry so to speed up codes. IEEE-754 is not required by the C++ standard, but you can check this with std::numeric_limits<T>::is_iec559.
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very-high overhead (see this for more information).
Assuming the IEEE-754 compliance is not broken, the rounding mode is the same and the threads does the operations in the same order, then the result should be identical up to at least 1 ULP. In practice, if they are compiled using a same mainstream compiler, the result should be exactly the same.
The thing is using multiple threads often result in a non-deterministic order of the applied FP operations which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use a static partitioning, avoid atomic operations on FP variables or more generally atomic operations that could result in a different ordering. The same thing applies for locks or any synchronization mechanisms.
The same thing is true for GPUs. In fact, such problem is very frequent when developers use atomic FP operations for example to sum values. They often do that because implementing fast reductions is complex (though it is more deterministic) and atomic operations as pretty fast on modern GPUs (since they use dedicated efficient units).
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.
I know that IEEE 754 is the specification for floating-point number data types, their representation and the semantics of operations on them; but - I'm not knowledgeable in the standard itself.
I also know that IEEE 754 defines the following four rounding modes: to nearest, up, down, to zero; and that in typical environments supporting IEEE 754 you can say "please round x using rounding mode m" (e.g. here's how it's done with glibc). That's all well and good.
However, rounding also happens when you simply perform arithmetic: If you have a single-precision 0.9999999 (or something close to that) and you add this to 100000, you will either get 100001, or something close to 100000.99 .
My question: Is this rounding supposed to be an application of a "default rounding mode"? Is that the IEEE 754 term, or is there another, specific term for what I've described? And are the four rounding modes supposed to be supported for implicit rounding arithmetic, as well?
IEEE 754-2008 does not define “modes” for rounding; it changed the terminology from the 1985 standard to imply more flexibility than a global mode. In clause 4.3.3, it specifies round-to-nearest-ties-to-even as the default method for binary formats. For decimal formats, it says a default is not defined but should be round-to-nearest-ties-to-even.
IEEE 754-2008 subclause 4.3 specifies several rounding-direction attributes. It does not say there has to be a mode that provides a rounding-direction setting. A computing environment could, for example, including the rounding-direction attribute as a parameter on each individual operation. For example, a processor architecture could have separate instructions for add with rounding-to-nearest-ties-to-even, add with rounding toward +∞, add with rounding toward zero, and so on. Or it could include the rounding-direction attribute as a operand to the instruction. Or the instruction could be affected by a global mode set in some special processor register.
Many processor architectures provide rounding mode as a setting in a floating-point control register.
The IEEE 754-1985 standard described rounding modes and defined mode as “A variable that a user may set, sense, save, and restore to control the execution of subsequent arithmetic operations.” This may have influenced the development of rounding modes in control registers, but that has a detrimental effect on performance, as global registers cause dependencies between instructions: Every floating-point instruction depends on that global register, so any change to that register interferes with parallel execution of instructions.
However, it is desirable that the rounding direction be flexible, as one might want to alternately use different modes when implementing interval arithmetic or evaluating complicated routines such as implementations of sine or exponentiation. So the IEEE 754-2008 committee changed the standard not to define global modes. Clause 4 specifies some semantics for attributes, including that languages should provide ways to specify the attributes for all standard operations in a “block.” A block may be an entire program or a single operation; it is language-defined. Subclause 4.2 says languages should provide dynamic ways of specifying the attributes, so they can be determined at run-time.
I've been working with System.Numerics.Complex recently, and I've started to notice the typical floating-point "drift" where the value stored gets calculated a tenth of a millionth off or something like that, which is well-known and common with the float type and even the double type. I looked into the Complex struct, and sure enough, it used double variables. Why does it use double values to store its data and not decimal values, which are designed to prevent this? How do I work around this?
To answer your question:
doubles are several orders of magnitude faster, as operations are done at the hardware level
base-2 floats can actually be more accurate for large computations, as there is less "wobble" when shifting up and down exponents: 1 bit of precision is less than 1 decimal digit. Moreover, base-2 can use an implicit leading bit, which means they can represent more numbers than other bases.
complex numbers are typically used for scientific/engineering applications, where small relative errors of approx 10-16 are outweighed by other sources of error (e.g. due to measurement or the model).
decimals on the other hand are typically used for "accounting" type operations, where round-off error is typically negligible (i.e. addition of small numbers, multiplication by integers, etc.)
Lua by default uses a double precision floating point (double) type as its only numeric type. That's nice and useful. However, I'm working on software that expects to see 64bit integers, for which I don't get around using actual 64bit integers one way or another.
The place where the integer type becomes relevant is for file sizes. Although I don't truly expect to see file sizes beyond what Lua can represent with full "integer" precision using a double, I want to be prepared.
What strategies can you recommend when using a 64bit integer type in parallel with the default numeric type of Lua? I don't really want to throw the default implementation overboard (and I'm not worried of its performance compared to integer arithmetics), but I need some way of representing 64bit integers up to their full precision without too much of a performance penalty.
My problem is that I'm unsure where to modify the behavior. Should I modify the syntax and extend the parser (numbers with appended LL or ULL come to mind, which to my knowledge doesn't exist in default Lua) or should I instead write my own C module and define a userdata type that represents the 64bit integer, along with library functions able to manipulate the values? ...
Note: yes, I am embedding Lua, so I am free to extend it whichever way I please.
As part of LuaJIT's port to ARM CPUs (which often have poor floating-point), LuaJIT implemented a "Dual-number VM", which allows it to switch between integers and floats dynamically as needed. You could use this yourself, just switch between 64-bit integers and doubles instead of 32-bit integers and floats.
It's currently live in builds, so you may want to consider using LuaJIT as your Lua "interpreter." Or you could use it as a way to learn how to do this sort of thing.
However, I do agree with Marcelo; the 53-bit mantissa should be plenty. You shouldn't really need this for a good 10 years or so.
I'd suggest storing your data outside of Lua and use some type of reference to retrieve it when calling your other libraries. You can then push various results onto the Lua stack for the user the see, you can even retrieve the value as a string to be precise, but I would avoid modifying them in Lua and relying on the Lua values when calling your external library.
If you're not going to need floating-point precision at any point in the program, you can just redefine LUA_NUMBER to __int64 (or whatever 64-bit int may be in your environment) in luaconf.h.
Otherwise, you can just bring in another library to handle your integers- for infinite precision, you can use a bignum library such as lhf's lbn.
I have a project where I have to perform some mathematics calculations with double variables.
The problem is that I get different results on SUN Solaris 9 and Linux.
There are a lot of ways (explained here and other forums) how to make Linux work as Sun, but not the other way around.
I cannot touch the Linux code, so it is only SUN I can change.
Is there any way to make SUN to behave as Linux?
The code I run(compile with gcc on both systems):
int hash_func(char *long_id)
{
double product, lnum, gold;
while (*long_id)
lnum = lnum * 10.0 + (*long_id++ - '0');
printf("lnum => %20.20f\n", lnum);
lnum = lnum * 10.0E-8;
printf("lnum => %20.20f\n", lnum);
gold = 0.6125423371582974;
product = lnum * gold;
printf("product => %20.20f\n", product);
...
}
if the input is 339886769243483
the output in Linux:
lnum => 339886769243**483**.00000000000000000000
lnum => 33988676.9243**4829473495483398**
product => 20819503.600158**59827399253845**
When on SUN:
lnum => 339886769243483.00000000000000000000
lnum => 33988676.92434830218553543091
product = 20819503.600158**60199928283691**
Note: The result is not always different, moreover most of the times it is the same. Just 10 15-digit numbers out of 60000 have this problem.
Please help!!!
The real answer here is another question: why do you think you need this? There may be a better way to accomplish what you're trying to do that doesn't depend on intricate details of platform floating-point. Having said that...
It's unfortunate that you can't change the Linux code, since it's really the Linux results that are deficient here. The SUN results are as good as they could possibly be: they're correctly rounded; each multiplication gives the unique (in this case) C double that's closest to the result. In contrast, the first Linux multiplication does not give a correctly rounded result.
Your Linux results come from a 32-bit system on x86 hardware, right? The results you show are consistent with, and likely caused by, the phenomenon of 'double rounding': the result of the first multiplication is first rounded to 64-bit precision (the precision used internally by the Intel x87 FPU), and then re-rounded to the usual 53-bit precision of a double. Most of the time (around 1999 times out of 2000 or so on average) this double round has the same effect as a single round to 53-bit precision would have had, but occasionally it can produce a different result, and that's what you're seeing here.
As you say, there are ways to fix the Linux results to match the Solaris ones: one of these is to use appropriate compiler flags to force the use of SSE2 instructions for floating-point operations if possible. The recent 4.5 release of gcc also fixes the difference by means of a new -fexcess-precision flag, though the fix may impact performance when not using SSE2.
[Edit: after several rereads of the gcc manuals, the gcc-patches mailing list thread at http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00105.html, and the related gcc bug report, it's still not clear to me whether use of -fexcess-precision=standard does in fact eliminate double rounding on x87 systems; I think the answer depends on the value of FLT_EVAL_METHOD. I don't have a 32-bit Linux/x86 machine handy to test this on.]
But I don't know how you'd fix the Solaris results to match the Linux ones, and I'm not sure why you'd want to: you'd be making the Solaris results less accurate instead of making the Linux results more accurate.
[Edit: caf has a good suggestion here. On Solaris, try deliberately using long double for intermediate results, then forcing back to double. If done right, this should reproduce the double rounding effect that you're seeing in Linux.]
See David Monniaux's excellent paper The pitfalls of verifying floating-point computations for a good explanation of double rounding. It's essential reading after the Goldberg article mentioned in an earlier answer.
Two things:
you need to read this: http://docs.sun.com/source/806-3568/ncg_goldberg.html
you need to decide what your requirements are for numerical precision in your application
These values differ by less than one part in 252. But a double-precision number has just 52 bits behind the radix point. So they differ by just the last bit. It may be that one machine isn't always rounding correctly, but you're doing two multiplications, so the answer is going to have that much error anyway.