I don't understand how integer_decode in num_traits works. For instance: we have
use num_traits::Float;
let num = 2.0f32;
// (8388608, -22, 1)
let (mantissa, exponent, sign) = Float::integer_decode(num);
But how do we get those integers?
The binary representation of 2.0f32 has a sign bit of 0, an exponent field whose only set bit is the leading one, and a mantissa consisting of zeros. How do we get from that to the integer_decode result, and why is this particular decomposition chosen rather than 8388608*2 as the mantissa and -23 as the exponent?
I didn't write the function, so take this answer with a grain of salt, as it's more of a gut feeling than knowledge. The rationale behind it is not explained in the comments in the function implementation, so unless the author of the code speaks up, we can't deliver more than educated guesses.
f32 is based on IEEE-754, which specifies that a 2.0 shall be represented as the following three parts:
the sign bit 0
indicates that 2.0 is positive
the exponent 128
it's one byte that indicates the exponent, with 127 representing 0. 128 means 1
the mantissa 0
the mantissa consists of 23 bits and has an implicit 1. in front of it. So 0 means 1.0.
To get the actual number, you need to compute (-1) ^ 0 * 2 ^ (128 - 127) * 1.0, which is 1 * 2 ^ 1 * 1.0 = 2.
Now this is not the only way to compute that. You could also do:
map the sign bit to 1 and -1
instead of prefixing the mantissa with 1., put a 1 bit in front of it, making it an integer. (This avoids having to use a float to decode a float, which would defeat the purpose.)
subtract 127 from the exponent, making it signed. Then subtract another 23 from it to compensate for the mantissa now being shifted by 23 bits (because the mantissa is 23 bits long and we moved the binary point all the way to the right to make it an integer).
This would, for 2.0 give us:
sign 1
mantissa 0b100000000000000000000000 = 8388608
exponent 128 - 127 - 23 = -22
Now we can do sign * mantissa * 2 ^ exponent, as specified in the documentation to get our value back.
Note how cheap calculating those integers is: a binary decision for the sign, a bitwise OR for the mantissa and a single integer subtraction for the exponent (a single one because - 127 - 23 can be combined into - 150 beforehand).
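The steps above can be sketched in Python (used here only for illustration). This is not the num_traits source; the helper name integer_decode_f32 is made up, and the sketch assumes the standard IEEE-754 binary32 layout and treats a stored exponent of 0 (zeros and subnormals) as having no implicit leading 1.

```python
import struct

def integer_decode_f32(x):
    """Return (mantissa, exponent, sign) so that x == sign * mantissa * 2**exponent."""
    # Reinterpret the 32-bit float as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = -1 if bits >> 31 else 1
    stored_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if stored_exponent == 0:
        # Zero or subnormal: no implicit leading 1; shift so the same bias works.
        mantissa = fraction << 1
    else:
        # Normal number: put the implicit 1 in front of the 23 stored bits.
        mantissa = fraction | 0x800000
    # Remove the bias (127) and the 23-bit shift in one subtraction.
    return mantissa, stored_exponent - 150, sign

print(integer_decode_f32(2.0))  # (8388608, -22, 1)
```

The sign mapping, the OR with 0x800000 and the single subtraction of 150 are the three operations described above.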
why we choose this particular decomposition and not 8388608*2 as mantissa and -23 as exponent
The short version is that this guarantees that all possible mantissas can be treated the same way. The stored mantissa is 23 bits long, and a 1 with the entire mantissa attached to it is always a valid 24-bit integer. When the stored mantissa is 0, this is a 1 followed by 23 zeros, 0b100000000000000000000000, which is 8388608.
integer_decode() documentation is quite clear:
Returns the mantissa, base 2 exponent, and sign as integers, respectively. The original number can be recovered by sign * mantissa * 2 ^ exponent.
1 * 8388608 * 2^-22 == 2
Mostly IEEE 754 uses this description of finite floating-point numbers: A floating-point number has the form (−1)^s × b^e × m, where:
s is 0 or 1.
e is any integer emin ≤ e ≤ emax.
m is a number represented by a digit string of the form d_0.d_1 d_2 … d_(p−1) where d_i is an integer digit 0 ≤ d_i < b.
This form is described in IEEE-754 2019 clause 3.3. b is the base, p is the precision (number of digits in base b), and emin and emax are bounds on the exponent. This form is useful for certain things, such as describing the normalized form as starting with a leading “1.” or “0.” when the base is two. However, the standard also says, in the same clause:
It is also convenient for some purposes to view the significand as an integer; in which case the finite floating-point numbers are described thus:
Signed zero and non-zero floating-point numbers of the form (−1)^s × b^q × c, where
s is 0 or 1.
q is any integer emin ≤ q+p−1 ≤ emax.
c is a number represented by a digit string of the form d_0 d_1 d_2 … d_(p−1) where d_i is an integer digit 0 ≤ d_i < b (c is therefore an integer with 0 ≤ c < b^p).
(My reproduction of the IEEE-754 text changes some of the typography slightly.) Note two things. First, this matches the results Float::integer_decode gives you. In the f32 format, p is 24, so the integer significand (the mantissa that integer_decode returns) has at most 24 bits, not 25; it can be 8,388,608 (2^23) and cannot be 16,777,216 (2^24).
Second, what makes this form useful is that the significand is always an integer and can be any integer in that range. Its low digit is immediately to the left of the radix point, so the consecutive values representable in this range of the floating-point format have consecutive integer significands, and we can analyze them and write proofs using number theory.
You could use alternate forms that are mathematically equivalent but let the significand be 2^24, but then the low digit in such a form would have to be zero (because there is no bit in the format to represent it having any other value), so it is not particularly useful.
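A small Python check of the two points above, as a sketch only. The helper significand_and_exponent is hypothetical, it handles only normal positive f32 bit patterns, and exact arithmetic is done with fractions.Fraction.

```python
import struct
from fractions import Fraction

def significand_and_exponent(bits):
    """Integer significand c and exponent q of a normal, positive f32 bit pattern."""
    c = (bits & 0x7FFFFF) | 0x800000   # 24-bit integer significand
    q = ((bits >> 23) & 0xFF) - 150    # exponent that goes with an integer significand
    return c, q

two = struct.unpack(">I", struct.pack(">f", 2.0))[0]
c, q = significand_and_exponent(two)
c_next, q_next = significand_and_exponent(two + 1)  # next representable f32

# Both decompositions denote the same value...
same_value = Fraction(c) * Fraction(2) ** q == Fraction(2 * c) * Fraction(2) ** (q - 1)
# ...but with c < 2**24 neighbouring floats have consecutive significands.
print(c, q, c_next - c, same_value)
```

With the doubled significand, the neighbouring float would differ by 2 rather than 1, which is the "low digit must be zero" observation.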
I have a homework assignment, part of which is to compute the bitwise negation of an integer value. It says 512 turns into -513.
I have a solution that takes x = 512 and computes y = x*(-1) + (-1).
Is that the correct way?
In two's complement, you negate a number by taking its bitwise complement and adding 1:
-x = ~x + 1
consequently
~x = -x - 1
This property is based on the way negative numbers are represented in two's complement. To represent a negative number A on n bits, one uses the complement of |A| to 2^n, i.e. the number 2^n - |A|.
It is easy to see that A + ~A = 111...11: in every bit position the addition combines a 0 and a 1, and 111...11 is the number just before 2^n, i.e. 2^n - 1.
As -|A| is coded by 2^n - |A|, and A + ~A = 2^n - 1, we can say that -A = ~A + 1 or equivalently ~A = -A - 1.
This is true for any number, positive or negative. And ~512 = -512 - 1 = -513.
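Both identities are easy to check in Python. This is a minimal sketch; complement_n_bits is a made-up helper and the 16-bit width is an arbitrary choice for the demonstration.

```python
def complement_n_bits(a, n):
    """Bitwise complement of a within n bits, as an unsigned value."""
    mask = (1 << n) - 1          # n ones, i.e. 2**n - 1
    return ~a & mask

n = 16
for a in (0, 1, 512, 40000):
    # A + ~A is all ones, i.e. 2**n - 1
    assert a + complement_n_bits(a, n) == 2**n - 1

# Python integers behave like two's complement numbers of unlimited width:
for x in (512, -513, 0, -1, 7):
    assert ~x == -x - 1

print(~512)  # -513
```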
val = 512
print (~val)
output:
-513
~ bitwise complement
Sets the 1 bits to 0 and the 0 bits to 1.
For example ~2 would result in -3.
This is because the number is stored in two's complement: 2 is 0000 0010 (shown with 8 bits), where the MSB is the sign bit.
Flipping every bit gives 1111 1101.
The MSB is now 1, so the result is negative.
To read its magnitude, flip the bits again and add 1 to the LSB (0000 0010 + 1), which gives you 0000 0011, i.e. 3.
So the result is -3.
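A quick way to look at the bit patterns involved, assuming an 8-bit two's complement view of the value (Python integers are not fixed-width, so the sketch masks to 8 bits; to_signed_8bit is a made-up helper):

```python
def to_signed_8bit(byte):
    """Interpret an 8-bit pattern (0..255) as a two's complement value."""
    return byte - 256 if byte & 0x80 else byte

pattern = ~2 & 0xFF               # keep only the low 8 bits
print(format(2, "08b"))           # 00000010
print(format(pattern, "08b"))     # 11111101
print(to_signed_8bit(pattern))    # -3
```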
Otherwise:
y = -(512+1)
print (y)
output:
-513
To my great surprise, I found that rounding a NaN value in Haskell returns a gigantic negative number:
round (0/0)
-269653970229347386159395778618353710042696546841345985910145121736599013708251444699062715983611304031680170819807090036488184653221624933739271145959211186566651840137298227914453329401869141179179624428127508653257226023513694322210869665811240855745025766026879447359920868907719574457253034494436336205824
The same thing happens with floor and ceiling.
What is happening here? Is this behavior intended? Of course, I understand that anyone who doesn't want this behavior can always write another function that checks isNaN - but are there existing alternative standard library functions that handle NaN more sanely (for some definition of "more sanely")?
TL;DR: a NaN's bit pattern, read as if it were an ordinary number, has an absolute value strictly between 2 ^ 1024 and 2 ^ 1025, and -1.5 * 2 ^ 1024 (one possible NaN) happens to be the one you hit.
Why any reasoning is off
What is happening here?
You're entering the region of undefined behaviour. Or at least that is what you would call it in some other languages. The report defines round as follows:
6.4.6 Coercions and Component Extraction
The ceiling, floor, truncate, and round functions each take a real fractional argument and return an integral result. … round x returns the nearest integer to x, the even integer if x is equidistant between two integers.
In our case x does not represent a number to begin with. According to 6.4.6, y = round x should fulfil that any other z from round's codomain has an equal or greater distance:
y = round x ⇒ ∀z : dist(z,x) >= dist(y,x)
However, the distance (aka the subtraction) of numbers is defined only for, well, numbers. If we used
dist n d = fromIntegral n - d
we get in trouble soon: any operation that includes NaN will return NaN again, and comparisons on NaN fail, so our property above does not hold for any z if x was a NaN to begin with. If we check for NaN, we can return any value, but then our property holds for all pairs:
dist n d = if isNaN d then constant else fromIntegral n - d
So we're completely arbitrary in what round x shall return if x was not a number.
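The two facts this argument relies on (NaN propagates through arithmetic, and comparisons against NaN fail) can be checked quickly. The sketch is in Python rather than Haskell, purely for illustration; dist mirrors the helper above. As a point of comparison, Python's built-in round chooses to raise an error for NaN instead of returning an arbitrary integer.

```python
import math

nan = float("nan")

def dist(n, d):
    # Same shape as the dist sketched above: integer minus float.
    return n - d

print(math.isnan(dist(3, nan)))        # True: NaN propagates
print(dist(3, nan) >= dist(4, nan))    # False: every ordered comparison fails
print(nan == nan)                      # False

# Python's own round refuses to pick a value for NaN.
try:
    round(nan)
except ValueError as err:
    print("round(nan) raised:", err)
```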
Why do we get that large number regardless?
"OK", I hear you say, "that's all fine and dandy, but why do I get that number?" That's a good question.
Is this behavior intended?
Somewhat. It isn't really intended, but to be expected. First of all, we have to know how Double works.
IEEE 754 double precision floating point numbers
A Double in Haskell is usually an IEEE 754 compliant double precision floating point number, that is a number that has 64 bits and is represented with
x = s * m * (b ^ e)
where s is the sign (a single bit), m is the mantissa (52 bits) and e is the exponent (11 bits, floatRange). b is the base, and it's usually 2 (you can check with floatRadix). Since the value of m is normalized, every well-formed Double has a unique representation.
IEEE 754 NaN
Except NaN. NaN is represented with the exponent emax+1 together with a non-zero mantissa. So if the bitfield
SEEEEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
represents a Double, what's a valid way to represent NaN?
?111111111111000000000000000000000000000000000000000000000000000
            ^
That is, a single M bit is set to 1; the others do not need to be set to mark the value as NaN. The sign is arbitrary. Why only a single bit? Because it's sufficient.
Interpret NaN as Double
Now, when we ignore the fact that this is a malformed Double (a NaN) and really, really, really want to interpret it as a number, what number would we get?
m = 1.5
e = 1024
x = 1.5 * 2 ^ 1024
= 3 * 2 ^ 1024 / 2
= 3 * 2 ^ 1023
And lo and behold, that's exactly the number you get for round (0/0):
ghci> round $ 0 / 0
-269653970229347386159395778618353710042696546841345985910145121736599013708251444699062715983611304031680170819807090036488184653221624933739271145959211186566651840137298227914453329401869141179179624428127508653257226023513694322210869665811240855745025766026879447359920868907719574457253034494436336205824
ghci> negate $ 3 * 2 ^ 1023
-269653970229347386159395778618353710042696546841345985910145121736599013708251444699062715983611304031680170819807090036488184653221624933739271145959211186566651840137298227914453329401869141179179624428127508653257226023513694322210869665811240855745025766026879447359920868907719574457253034494436336205824
Which brings our small adventure to a halt. We have a NaN, whose exponent field yields a factor of 2 ^ 1024, and we have some non-zero mantissa, so the result x has an absolute value with 2 ^ 1024 < |x| < 2 ^ 1025.
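The arithmetic above can be replayed on the raw bits. This is a sketch in Python rather than Haskell, and it does not claim to be what GHC does internally: it only decodes a 64-bit pattern as if it were an ordinary normal number. The bit pattern used (sign bit set, exponent all ones, top mantissa bit set) is an assumption chosen to match the negative result shown above, and read_double_as_number is a made-up name.

```python
from fractions import Fraction

def read_double_as_number(bits):
    """Decode a 64-bit pattern as sign * significand * 2**exponent,
    ignoring the fact that an all-ones exponent field means NaN or infinity."""
    sign = -1 if bits >> 63 else 1
    stored_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    significand = fraction | (1 << 52)       # put the implicit leading 1 in front
    exponent = stored_exponent - 1023 - 52   # remove bias and the 52-bit shift
    return sign * significand * Fraction(2) ** exponent

# Assumed bit pattern: sign set, exponent all ones, top mantissa bit set.
quiet_nan_bits = 0xFFF8000000000000
value = read_double_as_number(quiet_nan_bits)
print(value == -3 * 2**1023)  # True
```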
Note that this isn't the only way NaN can get represented:
In IEEE 754, NaNs are often represented as floating-point numbers with the exponent emax + 1 and nonzero significands. Implementations are free to put system-dependent information into the significand. Thus there is not a unique NaN, but rather a whole family of NaNs.
For more information, see the classic paper on floating point numbers by Goldberg.
This has long been observed as a problem. Here're a few tickets filed against GHC on this very topic:
https://ghc.haskell.org/trac/ghc/ticket/3070
https://ghc.haskell.org/trac/ghc/ticket/11553
https://ghc.haskell.org/trac/ghc/ticket/3676
Unfortunately, this is a thorny issue with lots of ramifications. My personal belief is that this is a genuine bug and it should be fixed properly by throwing an error. But you can read the comments on these tickets to get an understanding of the tricky issues preventing GHC from implementing a proper solution. Essentially, it comes down to speed vs. correctness, and this is one point where (i) the Haskell report is woefully underspecified, and (ii) GHC compromises the latter for the former.
Right at the beginning of the page "The OpenType Font File" you'll find a table with examples of the F2DOT14 format, a 16-bit signed fixed-point number whose low 14 bits hold the fraction.
I couldn't obtain the hex value 0xffff for the decimal -0.000061. By the way the mantissa -1 seems to be wrong and the value for the fraction should be 1/16384, instead of 16383/16384, unless I'm missing something related to the two's complement notation used to express a negative value in code.
The mantissa and fraction values listed are entirely correct: the F2DOT14 field encodes numbers as the arithmetic computation mantissa + fraction, not as "signed mantissa with unsigned concatenated fraction remainder".
As such, if you want -0.000061, you have to start with the signed integer -1 in the first two bits (11) and then add the positive value 16383/16384 in the last 14 bits (11111111111111), such that mantissa + fraction = -1 + 16383/16384 = -1/16384, which in turn is encoded using the 16-bit code 0xFFFF.
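The "mantissa + fraction" reading can be sketched as a small decoder. This is an illustration in Python, not code from the OpenType specification; decode_f2dot14 is a made-up name and fractions.Fraction is used to keep the arithmetic exact.

```python
from fractions import Fraction

def decode_f2dot14(raw):
    """Decode a 16-bit F2DOT14 code as mantissa + fraction."""
    top = (raw >> 14) & 0x3
    mantissa = top - 4 if top & 0x2 else top     # signed 2-bit integer: -2..1
    fraction = Fraction(raw & 0x3FFF, 16384)     # unsigned 14-bit fraction
    return mantissa, fraction, mantissa + fraction

print(decode_f2dot14(0xFFFF))  # (-1, Fraction(16383, 16384), Fraction(-1, 16384))
```

Reading the whole 16 bits as one signed integer and dividing by 16384 gives the same value, which is why the two's complement intuition and the mantissa + fraction table agree.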
Now, floating-point and double-precision numbers, although they can approximate any sort of number (the same could be said of integers, floats are just more precise), are represented internally as binary fractions. For example, one tenth would be approximated as
0.00011001100110011... (... only goes to the computer's precision, not infinity)
Now, any number in binary with finitely many bits has something called a dyadic fraction representation in mathematics (this has nothing to do with p-adic numbers). This means you represent it as a fraction whose denominator is a power of 2. For example, let's say our computer approximates one tenth as 0.00011. The dyadic fraction for that is 3/32 or 3/(2^5), which is close to one tenth. Now for my technical question: what would be the simplest way to extract the dyadic fraction from a floating-point number?
Irrelevant Note: If you are wondering why I would want to do this, it is because I am creating a surreal number library in Haskell. Dyadic fractions are easily translated into Surreal numbers, which is why it is convenient that binary is easily translated into dyadic, (I'll sure have trouble with the rational numbers though.)
The decodeFloat function seems useful for this. Technically, you should also check that floatRadix is 2, but as far as I can see this is always the case in GHC.
Just be careful since it does not simplify mantissa and exponent. Here, if I evaluate decodeFloat (1.0 :: Double) I get an exponent of -52 and a mantissa of 2^52 which is not what I expected.
Also, toRational seems to generate a dyadic fraction. I am not sure this is always the case, though.
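For comparison, here is the same extraction sketched in Python, where float.as_integer_ratio plays roughly the role that toRational plays above. The helper name dyadic is made up; the sketch assumes a finite float.

```python
import math

def dyadic(x):
    """Return (numerator, k) with x == numerator / 2**k exactly, in lowest terms."""
    numerator, denominator = x.as_integer_ratio()
    # For a finite binary float the denominator is always a power of two.
    assert denominator & (denominator - 1) == 0
    return numerator, denominator.bit_length() - 1

print(dyadic(0.09375))   # (3, 5), i.e. 3/32
print(dyadic(1.0))       # (1, 0)
print(math.frexp(1.0))   # (0.5, 1): an unsimplified mantissa/exponent pair
```

The ratio comes back already reduced, so no extra simplification of mantissa and exponent is needed in this version.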
Hold your numbers in binary and convert to decimal only for display.
Binary numbers are all dyadic. The number of digits after the binary point gives the power of two in the denominator, and the digits read without the binary point give the numerator. That's binary numbers for you.
There is an ideal representation for surreal numbers in binary. I call them "sinary". It's this:
0s is Not a number
1s is zero
10s is neg one
11s is one
100s is neg two
101s is neg half
110s is half
111s is two
... etc...
so you see that the standard binary count matches the surreal birth order of numeric values when evaluated in sinary. The way to determine the numeric value of sinary is that the 1's are rights and the 0's are lefts. We start with +/-1's and then 1/2, 1/4, 1/8, etc. With sign equal to + for 1 and - for 0.
ex: evaluating sinary
1011011s
-> is the 91st surreal number (because 64+16+8+2+1 = 91)
-> with a value of −0.28125, because...
1011011
NLRRLRR
+-++-++
+ 0 − 1 + 1/2 + 1/4 − 1/8 + 1/16 + 1/32
= 0 − 32/32 + 16/32 + 8/32 − 4/32 + 2/32 + 1/32
= − 9/32
The surreal numbers form a binary tree, so there is an ideal binary format matching their location on the tree according to the Left/Right pattern to reach the number. Assign 1 to right and 0 to left. Then the birth order of surreal number is equal to the binary count of this representation. ie: the 15th surreal number value represented in sinary is the 15th number representation in the standard binary count. The value of a sinary is the surreal label value. Strip the leading bit from the representation, and start adding +1's or -1's depending on if the number starts with 1 or 0 after the first one. Then once the bit flips, begin adding and subtracting halved values (1/2, 1/4, 1/8, etc) using + or - values according to the bit value 1/0.
I have tested this format and it seems to work well. And there are some other secrets... such as the left and right of any sinary representation being the same binary format with the tail clipped to the last 0 and last 1 respectively. Conversion to decimal or into a dyadic is NOT required in order to perform the recursive functions requested by Conway.
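The evaluation rule described above can be written down as a short function. This is only a sketch of my reading of that rule, in Python; sinary_value is a made-up name and a string of '0'/'1' characters is an arbitrary choice of input format.

```python
from fractions import Fraction

def sinary_value(bits):
    """Evaluate a sinary string such as '1011011'; None means not a number."""
    if not bits or bits[0] != "1":
        return None
    rest = bits[1:]              # strip the leading bit
    value = Fraction(0)
    step = Fraction(1)
    flipped = False
    for i, b in enumerate(rest):
        # Once a bit differs from the first one, start halving the step.
        if not flipped and i > 0 and b != rest[0]:
            flipped = True
        if flipped:
            step /= 2
        value += step if b == "1" else -step
    return value

print(int("1011011", 2), sinary_value("1011011"))  # 91 -9/32
```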
I've written a small function in C which does almost the same work as the standard function `fcvt'. As you may know, this function takes a float/double and makes a string representing this number in ANSI characters. Everything works ;-)
For example, for the number 1.33334, my function gives me the string "133334" and sets a special integer variable `decimal_part', which in this example is 1, meaning that only 1 digit belongs to the integer part and everything else is the fraction.
Now I'm curious about what the standard C function `printf' does. It can take %a or %e in the format string. Let me cite the description of %e (link junked):
"double" argument is output in scientific notation
[-]m.nnnnnne+xx
... The exponent always contains two digits.
It says: "The exponent always contains two digits". But what is an exponent? This is the main question. And also, how do I get this 'exponent' from my function above or from `fcvt'?
The notation might be better explained if we expand the e:
[-]m.nnnnnn * (10^xx)
So you have one digit of m (from 0 to 9, but it will only ever be 0 if the entire value is 0), and several digits of n. I guess it might be best to show with examples:
1 = 1.0000 * 10^0 = 1e0
10 = 1.0000 * 10^1 = 1e1
10000 = 1.0000 * 10^4 = 1e4
0.1 = 1.0000 * 10^-1 = 1e-1
1,419 = 1.419 * 10^3 = 1.419e3
0.00000123 = 1.23 * 10^-6 = 1.23e-6
You can look up scientific notation on Google, but it is useful for expressing very large or very small numbers; for example, 1232100000000000000 would be 1.2321e18.
In C, I think you can actually extract the exponent from the top 12 bits of a double (the first being the sign, which you will have to ignore). Note that this is the base-2 exponent, not the decimal exponent that %e prints. See: IEEE 754-1985 Floating Point
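To illustrate what the top 12 bits hold, here is a sketch in Python (the question is about C, but the bit layout is the same for an IEEE-754 double). binary_exponent is a made-up helper and only handles normal numbers.

```python
import struct

def binary_exponent(x):
    """Unbiased base-2 exponent from the top 12 bits of an IEEE-754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    top12 = bits >> 52             # sign bit followed by 11 exponent bits
    return (top12 & 0x7FF) - 1023  # drop the sign, remove the bias

print(binary_exponent(300.0))  # 8, since 2**8 <= 300 < 2**9
print("%e" % 300.0)            # 3.000000e+02: the decimal exponent is 2
```

The stored field gives 8 for 300, while %e reports 2, so the bit field alone does not give the exponent that printf prints.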
The exponent is the power that 10 is raised to before being multiplied by the significand (the m.nnnnnn part).
Scientific notation is explained on Wikipedia: http://en.wikipedia.org/wiki/Scientific_notation
m.nnnnnne+xx is logically equal to m.nnnnnn * 10 ^ +xx
In scientific notation, the exponent is the XX in ten to the XX power, so 1234.5678 can be represented as 1.2345678E03, where the normalized form is multiplied by 10^3 to get the "real" answer.
400 = 4 * 10 ^ 2
2 is the exponent.
If you write a number in scientific notation then the exponent is part of that notation.
You can see a full description here http://en.wikipedia.org/wiki/Scientific_notation, but basically it's just another way to write a number, typically used for very large or very small numbers.
Say you have the number 300, that is equal to 3 * 100, or 3 * 10^2 in scientific notation.
If you use %e it will be printed as 3.000000e+02 (six digits after the decimal point by default; %.1e gives 3.0e+02).
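One simple way to get at the decimal exponent is to let the formatter do the normalization and read the exponent back from the string. The sketch below uses Python's % formatting, whose %e output follows the same [-]m.nnnnnne+xx shape; decimal_exponent is a made-up helper, and the same idea would apply to a buffer filled by sprintf in C.

```python
def decimal_exponent(x):
    """Decimal exponent that %e prints for x."""
    return int(("%e" % x).split("e")[1])

for x in (300.0, 1234.5678, 0.00000123):
    print("%e" % x, decimal_exponent(x))
```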