Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2^-49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
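To check these figures yourself, here is a minimal Python sketch that prints the exact value a float holds. It assumes a platform whose Python floats are IEEE 754 doubles; `stored_as_single` is a hypothetical helper name, not a library function.

```python
import struct
from decimal import Decimal

def stored_as_single(x):
    """Round x to 32-bit single precision and return it as a Python float."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Decimal(float) converts the stored binary value exactly, with no rounding.
print(Decimal(stored_as_single(9.2)))  # the 32-bit value
print(Decimal(9.2))                    # the 64-bit value

# The stored double is exactly an integer times a power of two.
print(9.2 == 5179139571476070 * 2.0 ** -49)
```

The two `Decimal` lines should reproduce the long expansions quoted in the question.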
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
import struct
from itertools import islice

def float_to_bin_parts(number, bits=64):
    if bits == 32:          # single precision
        int_pack = 'I'
        float_pack = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:        # double precision. all python floats are this
        int_pack = 'Q'
        float_pack = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError('bits argument must be 32 or 64')
    # reinterpret the float's bytes as an unsigned integer, then as a zero-padded bit string
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    # slice off the sign (1 bit), the exponent and the mantissa in turn
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^((# of bits) - 1) - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413 x 10^23
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2^((# of bits) - 1) - 1 to get the true exponent
Mantissa (last component): Divide by 2^(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 10^11
Which we can then convert from binary to decimal:
1.1499999999999999 x 2^3 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
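The recap and the multiplication above can be sketched in a few lines of Python. `parts_to_fraction` is a hypothetical helper; it assumes a normal number (exponent bits neither all zeros nor all ones) and takes the three bit strings in the order shown earlier. Using `Fraction` keeps every step exact, so the only rounding is the one that happened when 9.2 was first stored.

```python
from fractions import Fraction

def parts_to_fraction(sign, exponent, mantissa):
    """Rebuild the exact value of a normal float from its three bit strings."""
    bias = 2 ** (len(exponent) - 1) - 1                 # 1023 for 11 exponent bits
    true_exponent = int(exponent, 2) - bias
    # Stored bits are the fractional part; add back the implicit leading 1.
    true_mantissa = 1 + Fraction(int(mantissa, 2), 2 ** len(mantissa))
    value = true_mantissa * Fraction(2) ** true_exponent
    return -value if sign == '1' else value

parts = ['0', '10000000010',
         '0010011001100110011001100110011001100110011001100110']
value = parts_to_fraction(*parts)
print(value)                         # exact ratio, in lowest terms
print(format(float(value), '.17g'))  # 9.1999999999999993
```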
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 10^11
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 10^(11-110100)
Convert to decimal:
5179139571476070 x 2^(3-52)
Subtract the exponent:
5179139571476070 x 2^-49
Turn negative exponent into division:
5179139571476070 / 2^49
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 10^11
Shift the decimal point:
10011 x 10^(11-100)
Subtract the exponent:
10011 x 10^-1
Binary to decimal:
19 x 2^-1
Negative exponent to division:
19 / 2^1
Multiply exponent:
19 / 2
Equals:
9.5
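Python can produce the same fractions directly, which makes a handy cross-check of the two walk-throughs. This is only a sketch using the built-in `float.as_integer_ratio()`; note that it returns the ratio in lowest terms, so 9.2 comes back over 2^48 rather than the unreduced 2^49 used above.

```python
ratios = {number: number.as_integer_ratio() for number in (9.2, 9.5)}

for number, (numerator, denominator) in ratios.items():
    # Every float denominator is a power of two.
    print(number, '=', numerator, '/', denominator)
```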
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)_10 = 0.2_3
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)_10 = 0.5_10 = 0.1_2 = 0.1111..._3
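To experiment with other fractions and bases, here is a rough long-division sketch. The names `to_base` and `expand` are made up for this example, and it assumes a non-negative fraction and a base of at most 10. A repeating block is shown in parentheses.

```python
from fractions import Fraction

def to_base(n, base):
    """Digits of a non-negative integer n in the given base."""
    digits = ''
    while True:
        n, d = divmod(n, base)
        digits = str(d) + digits
        if n == 0:
            return digits

def expand(frac, base):
    """Positional expansion of frac in `base`; repeating digits in parentheses."""
    whole, rem = divmod(frac.numerator, frac.denominator)
    seen = {}      # remainder -> position where it first appeared
    digits = []
    while rem and rem not in seen:
        seen[rem] = len(digits)
        digit, rem = divmod(rem * base, frac.denominator)
        digits.append(str(digit))
    if rem:        # a remainder came back, so the digits repeat from there
        start = seen[rem]
        tail = ''.join(digits[:start]) + '(' + ''.join(digits[start:]) + ')'
    else:
        tail = ''.join(digits)
    return to_base(whole, base) + '.' + tail

print(expand(Fraction(2, 3), 10))   # 0.(6)
print(expand(Fraction(2, 3), 3))    # 0.2
print(expand(Fraction(1, 2), 3))    # 0.(1)
print(expand(Fraction(92, 10), 2))  # 1001.(0011)
```

The last line shows the repeating 0011 pattern that 9.2 has in binary, which is the same pattern visible in the stored mantissa.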
Why are floating point numbers inaccurate?
Because oftentimes they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
They are called irrational because they cannot be written as a ratio of two integers, which also means their digits never terminate or repeat in any integer base. No amount of bit storage in the world would be enough to hold all the digits of even one of them. Only symbolic arithmetic is able to preserve their precision.
If you limit your math needs to rational numbers only, though, the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in high school math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
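As a small illustration, Python's standard `fractions` module stores exactly such a pair of integers. The sketch below is just one way to see the difference; the particular values are arbitrary.

```python
import math
from fractions import Fraction

a, b = Fraction(1, 10), Fraction(2, 10)
print(a + b == Fraction(3, 10))          # exact rational arithmetic
print(0.1 + 0.2 == 0.3)                  # the same sum in binary floating point
print(Fraction(2, 3) * Fraction(3, 4))   # a/b * c/d = ac/bd, reduced

# Square roots leave the rationals: math.sqrt falls back to a float.
root = math.sqrt(Fraction(2))
print(type(root).__name__, root)
```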
TL;DR
For hardware accelerated arithmetic only a limited number of rational numbers can be represented. Every non-representable number is approximated. Some numbers (i.e. the irrational ones) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer), so unavoidably a great many numbers are impossible to represent. In particular, 64 bits allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can be represented are of the form m·2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
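Python's `decimal` module is one such base-10 system, so it can serve as a quick illustration. The precision of 28 digits is set explicitly here; it is simply the module's default, not anything special.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 28                       # significant decimal digits to keep
    exact = Decimal('9.2') * 10 == Decimal(92)
    third = Decimal(1) / Decimal(3)     # has to be cut off somewhere
    round_trip = third * 3

print(exact)            # 9.2 is held exactly in base 10
print(third)
print(round_trip == 1)  # the error in 1/3 shows up again
```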
Also note that double-precision floating-point numbers are extremely accurate. They can approximate any number in a very wide range to about 15 significant decimal digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in its lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2^b * 5^c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2^b).
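The rule can be turned into a short check. This is a sketch only; `terminates` is a name invented for the example. It strips from the denominator every factor it shares with the base and sees whether anything is left.

```python
from fractions import Fraction
from math import gcd

def terminates(frac, base):
    """True if frac has a finite positional expansion in the given base."""
    d = frac.denominator          # Fraction keeps itself in lowest terms
    g = gcd(d, base)
    while g > 1:
        while d % g == 0:         # remove every factor shared with the base
            d //= g
        g = gcd(d, base)
    return d == 1                 # nothing left means the expansion is finite

print(terminates(Fraction(92, 10), 10))  # True:  9.2
print(terminates(Fraction(92, 10), 2))   # False: the 5 in the denominator remains
print(terminates(Fraction(19, 2), 2))    # True:  9.5
```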
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However, the big problem is that such a format is a pain to do calculations on, for two reasons.
If you want to have exactly one representation of each number, then after each calculation you need to reduce the fraction to its lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator is too large, you need to find the closest representable result. This is non-trivial.
Some languages do offer fraction types, but usually they do it in combination with arbitrary precision. This avoids needing to worry about approximating fractions, but it creates its own problem: when a number passes through a large number of calculation steps, the size of the denominator, and hence the storage needed for the fraction, can explode.
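To get a feel for how quickly the storage grows, here is a small experiment with Python's arbitrary-precision `Fraction`. The recurrence is an arbitrary choice made up for the demonstration; other calculations will grow at different rates.

```python
from fractions import Fraction

x = Fraction(1, 3)
sizes = []
for step in range(6):
    x = x * x + Fraction(1, 7)                 # an arbitrary repeated calculation
    sizes.append(x.denominator.bit_length())   # bits needed for the denominator

print(sizes)
```

After only six steps the denominator no longer fits in a 64-bit integer, let alone the 32 bits of the fixed-size format discussed above.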
Some languages also offer decimal floating point types. These are mainly used in scenarios where it is important that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). They are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
Related
It is possible to print to several hundred decimal places a square root in bc, as it is in C. However in C it is only accurate to 15. I have checked the square root of 2 to 50 decimal places and it is accurate but what is the limit in bc? I can't find any reference to this.
To how many decimal places is bc accurate?
bc is an arbitrary precision calculator. Arbitrary precision just tells us how many digits it can represent (as many as will fit in memory), but doesn't tell us anything about accuracy.
However in C it is only accurate to 15
C uses your processor's built-in floating point hardware. This is fast, but has a fixed number of bits to represent each number, so is obviously fixed rather than arbitrary precision.
Any arbitrary precision system will have more ... precision than this, but could of course still be inaccurate. Knowing how many digits can be stored doesn't tell us whether they're correct.
However, the GNU implementation of bc is open source, so we can just see what it does.
The bc_sqrt function uses an iterative approximation (Newton's method, although the same technique was apparently known to the Babylonians by at least 1000 BC).
This approximation is just run, improving each time, until two consecutive guesses differ by less than the precision requested. That is, if you ask for 1,000 digits, it'll keep going until the difference is at most in the 1,001st digit.
The only exception is when you ask for an N-digit result and the original number has more than N digits. It'll use the larger of the two as its target precision.
Since the convergence rate of this algorithm is faster than one digit per iteration, there seems little risk of two consecutive iterations agreeing to some N digits without also being correct to N digits.
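The stopping rule is easy to try out. The sketch below is not bc's code: it is a Python approximation of the idea using the `decimal` module, with `newton_sqrt` as a made-up name, a positive integer input, and ten guard digits chosen arbitrarily for the illustration.

```python
from decimal import Decimal, ROUND_DOWN, localcontext

def newton_sqrt(value, digits):
    """Approximate sqrt(value) to `digits` decimal places (value: positive int)."""
    with localcontext() as ctx:
        # Assumption: ten guard digits are plenty for this illustration.
        ctx.prec = digits + len(str(value)) + 10
        target = Decimal(value)
        tolerance = Decimal(1).scaleb(-(digits + 1))   # 10^-(digits + 1)
        guess = target                                 # any positive start converges
        while True:
            better = (guess + target / guess) / 2      # one Newton step
            if abs(better - guess) < tolerance:
                # Drop the guard digits, keeping `digits` decimal places.
                return better.quantize(Decimal(1).scaleb(-digits),
                                       rounding=ROUND_DOWN)
            guess = better

print(newton_sqrt(2, 50))
```

Because each step roughly doubles the number of correct digits, two guesses that agree past the requested digit are, in practice, both correct to that digit.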
Consider the following terminating decimal numbers.
3.1^2 = 9.61
3.1^4 = 92.3521
3.1^8 = 8528.91037441
The following shows how Mathematica treats these expressions
In[1]:= 3.1^2
Out[1]= 9.61
In[2]:= 3.1^4
Out[2]= 92.352
So far so good, but
In[3]:= 3.1^8
Out[3]= 8528.91
doesn't provide enough precision.
So let's try N[], NumberForm[], and DecimalForm[] with a precision of 12
In[4]:= N[3.1^8,12]
Out[4]= 8528.91
In[5]:= NumberForm[3.1^8,12]
Out[5]= 8528.91037441
In[6]:= DecimalForm[3.1^8,12]
Out[6]= 8528.91037441
In this case DecimalForm[] and NumberForm[] work as expected, but N[] only provided the default precision of 6, even though I asked for 12. So DecimalForm[] or NumberForm[] seem to be the way to go if you want exact results when the inputs are terminating decimals.
Next consider rational numbers with infinite repeating decimals like 1/3.
In[7]:= N[1/3,20]
Out[7]= 0.33333333333333333333
In[8]:= NumberForm[1/3, 20]
Out[8]=
1/3
In[9]:= DecimalForm[1/3, 20]
Out[9]=
1/3
Unlike the previous case, N[] seems to be the proper way to go here, whereas NumberForm[] and DecimalForm[] do not respect precisions.
Finally consider irrational numbers like Sqrt[2] and Pi.
In[10]:= N[Sqrt[2],20]
Out[10]= 1.4142135623730950488
In[11]:= NumberForm[Sqrt[2], 20]
Out[11]=
sqrt(2)
In[12]:= DecimalForm[Sqrt[2], 20]
Out[12]=
sqrt(2)
In[13]:= N[π^12,30]
Out[13]= 924269.181523374186222579170358
In[14]:= NumberForm[Pi^12,30]
Out[14]=
π^12
In[15]:= DecimalForm[Pi^12,30]
Out[15]=
π^12
In these cases N[] works, but NumberForm[] and DecimalForm[] do not. However, note that N[] switches to scientific notation at π^13, even with a larger precision. Is there a way to avoid this switch?
In[16]:= N[π^13,40]
Out[16]= 2.903677270613283404988596199487803130470*10^6
So there doesn't seem to be a consistent way of formulating how to get decimal numbers with requested precisions and at the same time avoiding scientific notation. Sometimes N[] works, other times DecimalForm[] or NumberForm[] works, and at other times nothing seems to work.
Have I missed something or are there bugs in the system?
It isn't a bug because it is designed purposefully to behave this way. Precision is limited by the precision of your machine, your configuration of Mathematica, and the algorithm and performance constraints of the calculation.
The documentation for N[expr, n] states it attempts to give a result with n‐digit precision. When it cannot give the requested precision it gets as close as it can. DecimalForm and NumberForm work the same way.
https://reference.wolfram.com/language/ref/N.html explains the various cases behind this:
Unless numbers in expr are exact, or of sufficiently high precision, N[expr,n] may not be able to give results with n‐digit precision.
N[expr,n] may internally do computations to more than n digits of precision.
$MaxExtraPrecision specifies the maximum number of extra digits of precision that will ever be used internally.
The precision n is given in decimal digits; it need not be an integer.
n must lie between $MinPrecision and $MaxPrecision. $MaxPrecision can be set to Infinity.
n can be smaller than $MachinePrecision.
N[expr] gives a machine‐precision number, so long as its magnitude is between $MinMachineNumber and $MaxMachineNumber.
N[expr] is equivalent to N[expr,MachinePrecision].
N[0] gives the number 0. with machine precision.
N converts all nonzero numbers to Real or Complex form.
N converts each successive argument of any function it encounters to numerical form, unless the head of the function has an attribute such as NHoldAll.
You can define numerical values of functions using N[f[args]]:=value and N[f[args],n]:=value.
N[expr,{p,a}] attempts to generate a result with precision at most p and accuracy at most a.
N[expr,{Infinity,a}] attempts to generate a result with accuracy a.
N[expr,{Infinity,1}] attempts to find a numerical approximation to the integer part of expr.
Having a file test2.py with the following contents:
print(2.0000000000000003)
print(2.0000000000000002)
I get this output:
$ python3 test2.py
2.0000000000000004
2.0
I thought a lack of memory allocated for the float might be causing this, but 2.0000000000000003 and 2.0000000000000002 need the same amount of memory.
IEEE 754 64-bit binary floating point always uses 64 bits to store a number. It can exactly represent a finite subset of the binary fractions. Looking only at the normal numbers, if N is a power of two in its range, it can represent a number of the form, in binary, 1.s*N where s is a string of 52 zeros and ones.
All the 32 bit binary integers, including 2, are exactly representable.
The smallest exactly representable number greater than 2 is 2.000000000000000444089209850062616169452667236328125. It is twice the binary fraction 1.0000000000000000000000000000000000000000000000000001.
2.0000000000000003 is closer to 2.000000000000000444089209850062616169452667236328125 than to 2, so it rounds up and prints as 2.0000000000000004.
2.0000000000000002 is closer to 2.0, so it rounds down to 2.0.
To store numbers between 2.0 and 2.000000000000000444089209850062616169452667236328125 would require a different floating point format likely to take more than 64 bits for each number.
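You can confirm the neighbouring value from Python itself. This sketch assumes IEEE 754 doubles; `next_up` is a hypothetical helper that only handles positive finite inputs, by adding 1 to the float's bit pattern.

```python
import struct
from decimal import Decimal

def next_up(x):
    """Smallest double greater than x (sketch: positive finite x only)."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return struct.unpack('<d', struct.pack('<Q', bits + 1))[0]

neighbor = next_up(2.0)
print(Decimal(neighbor))               # exact value of the next double after 2
print(2.0000000000000003 == neighbor)  # the literal rounds up to it
print(2.0000000000000002 == 2.0)       # this literal rounds down to 2.0
```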
Floats are not stored the way integers are, with each bit standing for a yes/no term of value 1, 2, 4, 8, 16, 32, ... that you add up to get the complete number. They are stored as sign + mantissa + exponent in base 2. Several combinations have special meaning (NaN, +-inf, -0, ...). Positive and negative numbers are identical in mantissa and exponent; only the sign differs.
They always occupy a fixed bit length, so the storage itself cannot grow or overflow.
They do, however, have limited precision: if you try to store a number that needs more precision than the format offers, you get a rounding error. That's what you see in your example.
More on floats and storage (with example):
http://stupidpythonideas.blogspot.de/2015/01/ieee-floats-and-python.html
(which links to a more technical https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)
More on accuracy of floats:
- Floating Point Arithmetic: Issues and Limitations
In COBOL what is the result of
COMPUTE RESULT1 = 97 / 100
COMPUTE RESULT2 = 50 / 100
COMPUTE RESULT3 = 32 / 100
COMPUTE RESULT4 = -97 / 100
COMPUTE RESULT5 = -50 / 100
COMPUTE RESULT6 = -32 / 100
When RESULT1/2/3 are:
PIC S9(4)
PIC S9(4)V9
PIC S9(4)V99
Or, in other words, what is the default rounding mode for COBOL divisions?
Edit: What happens with negative values?
Even "discard" is sort of a rounding mode, is it equivalent to rounding towards negative infinity or towards zero?
COBOL does no rounding, unless you tell it to.
What it does do, if you don't tell it to do rounding, is low-order truncation. Some people may prefer to term that something else, it doesn't really matter, the effect is the same. Truncation.
Negative values are dealt with in the same way as positive values (retain a significant digit beyond what is required for the final result, and add one to that; also see later explanation): -0.009 would, to two decimal places, round to -0.01; -0.004 would round to -0.00.
If you specify no decimal places for a field, any fractional part will be simply discarded.
So, when all the targets of your COMPUTEs are 9(4), they will all contain zero, including the negative values.
When all the targets of your COMPUTEs are 9(4)V9, without rounding, they will contain 0.9, 0.5 and 0.3 respectively with the low-order (from the second decimal digit) decimal part being truncated.
And when all the targets of your COMPUTEs are 9(4)V99, they will contain 0.97, 0.50 and 0.32 with the low-order decimal part beyond that being truncated.
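For readers without a COBOL compiler at hand, the truncation described above can be imitated with Python's `decimal` module. This is only an analogy under the assumption that truncation means rounding toward zero; `truncate` is a made-up helper and `places` stands in for the number of digits after the V in the PICture.

```python
from decimal import Decimal, ROUND_DOWN

def truncate(value, places):
    """Drop digits beyond `places` decimal places (rounding toward zero)."""
    return value.quantize(Decimal(1).scaleb(-places), rounding=ROUND_DOWN)

quotients = [Decimal(n) / Decimal(100) for n in (97, 50, 32, -97, -50, -32)]

for places in (0, 1, 2):   # stand-ins for S9(4), S9(4)V9 and S9(4)V99
    print(places, [str(truncate(q, places)) for q in quotients])
```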
You do rounding in the language by using the ROUNDED phrase for the result of any arithmetic verb (ADD, SUBTRACT, MULTIPLY, DIVIDE, COMPUTE).
ADD some-name some-other-name GIVING some-result
    ROUNDED
COMPUTE some-result ROUNDED = some-name
    + some-other-name
The above are equivalent to each other.
To the 1985 Standard, ROUNDED takes the final result with one extra decimal place and adjusts the actual field with the defined decimal places by adding "one" at the smallest possible unit (for V99, it will add one hundredth, at V999 it will add one thousandth, at no decimal places it will add one, with any scaling amount (see PICture character P) it will add one).
You can consider the addition of one to be made to an absolute value, with the result retaining the original sign. Or you can consider it as done in any other way which achieves the same result. The Standard describes the rounding, the implementation meets the Standard in whatever way it likes. My description is a description for human understanding. No compiler need implement it in the way I have described, but logically the results will be the same.
Don't get hung up on how it is implemented.
The 2014 Standard, superseding the 2002 Standard, has further options for rounding, which for the 85 Standard you'd have to code for (very easy use of REDEFINES).
`ROUNDED` `MODE IS` `AWAY-FROM-ZERO` `NEAREST-AWAY-FROM-ZERO` `NEAREST-EVEN` `NEAREST-TOWARD-ZERO` `PROHIBITED` `TOWARD-GREATER` `TOWARD-LESSER` `TRUNCATION`
Only one "mode" may be specified at a time, and the default if MODE IS is not specified is TRUNCATION, establishing backward-compatibility (and satisfying those who feel that everything is rounding of some type).
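If it helps to see the modes side by side, the sketch below pairs each one with what I assume is the closest rounding mode in Python's `decimal` module. The pairing is my own reading of the mode names, not something taken from the COBOL Standard, and PROHIBITED is left out because it forbids rounding rather than performing it.

```python
from decimal import (Decimal, ROUND_UP, ROUND_HALF_UP, ROUND_HALF_EVEN,
                     ROUND_HALF_DOWN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN)

# Assumed correspondence, for illustration only.
ASSUMED_MODES = {
    'AWAY-FROM-ZERO': ROUND_UP,
    'NEAREST-AWAY-FROM-ZERO': ROUND_HALF_UP,
    'NEAREST-EVEN': ROUND_HALF_EVEN,
    'NEAREST-TOWARD-ZERO': ROUND_HALF_DOWN,
    'TOWARD-GREATER': ROUND_CEILING,
    'TOWARD-LESSER': ROUND_FLOOR,
    'TRUNCATION': ROUND_DOWN,
}

value = Decimal('-0.325')   # a negative tie case, to two decimal places
results = {name: value.quantize(Decimal('0.01'), rounding=mode)
           for name, mode in ASSUMED_MODES.items()}

for name, result in results.items():
    print(f'{name:24} {result}')
```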
The PROHIBITED option is interesting. If the result field has, for instance, two decimal places, then PROHIBITED requires that the calculated result has only two high-order decimal places and all lower-order values are zero.
It is important to note with a COMPUTE that only the final result is rounded, intermediate results are not. If you need intermediate rounding, you need one COMPUTE (or other arithmetic verb) per rounded result.
I'm a bit confused about the local space coordinate system. Suppose I have a complex object in local space. I know that when I want to put it in world space I have to multiply it by the Scale, Rotate, Translate matrix. But the problem is that local coordinates only range from -1.0f to 1.0f; when I want to have vertex like (1/500,1/100,1/100) things will not work, everything will become 0 due to the float accuracy problem.
The only solution I can see now is to separate them into lots of local space systems and ProjectView each individually to put them together. That does not seem like the correct way of solving the problem. I've checked lots of books but none of them mention this issue. I really want to know how to solve it.
when I want to have vertex like (1/500,1/100,1/100) things will not work
What makes you think that? The float accuracy problem does not mean something will coerce to 0 if it can't be accurately represented. It just means, it will coerce to the floating point number closest to the intended figure.
It's the very same as writing down, e.g., 3/9 with at most 6 significant decimal digits: 0.333333 – it didn't coerce to 0. And the very same goes for floating point.
Now you may be familiar with scientific notation: x·10^y – this is essentially decimal floating point, a mantissa x and an exponent y which essentially specifies the order of magnitude. In binary floating point it becomes x·2^y. In either case the significant digits are in the mantissa. Your typical floating point number (in OpenGL) stores a mantissa of 23 bits, which together with the implicit leading 1 gives 24 significant binary digits (which are about 7 decimal digits).
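A quick way to convince yourself is to round the vertex coordinates to 32-bit floats and look at what is stored. The sketch uses Python's `struct` module for the rounding; `as_float32` is a hypothetical helper, and the coordinates are the ones from the question.

```python
import struct

def as_float32(x):
    """Round x to the nearest 32-bit float, returned as a Python float."""
    return struct.unpack('f', struct.pack('f', x))[0]

wanted = (1 / 500, 1 / 100, 1 / 100)
vertex = tuple(as_float32(c) for c in wanted)
print(vertex)   # close to (0.002, 0.01, 0.01), not zeros

# Relative error of each stored coordinate against the intended value.
relative_errors = [abs(s - w) / w for s, w in zip(vertex, wanted)]
print(max(relative_errors))
```

The relative error stays at or below 2^-24 regardless of how small the coordinate is, which is the "about 7 decimal digits" mentioned above.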
I really want to know how to solve it.
The real trouble with floating point numbers arises if you have to mix and merge numbers over a large range of orders of magnitude. As long as the numbers are of similar order of magnitude, everything happens with just the mantissa. And that one last change in order of magnitude to the [-1, 1] range will not hurt you; heck this can be done by "normalizing" the floating point value and then simply dropping the exponent.
Recommended read: http://floating-point-gui.de/
Update
One further thing: If you're writing 1/500 in a language like C, then you're performing an integer division and that will of course round down to 0. If you want this to be a floating point operation you either have to write floating point literals or cast to float, i.e.
1./500.
or
(float)1/(float)500
Note that casting one of the operands to float suffices to make this a floating point division.