How to obtain the hex value 0xffff corresponding to the decimal value -0.000061 in the table below? - opentype

Right at the beginning of the page The OpenType Font File you'll find a table with examples of the F2DOT14 format, a 16-bit signed fixed-point number with the low 14 bits representing the fraction.
I couldn't obtain the hex value 0xffff for the decimal -0.000061. Also, the mantissa -1 seems to be wrong, and the value for the fraction should be 1/16384 instead of 16383/16384, unless I'm missing something related to the two's complement notation used to express a negative value in code.

The mantissa and fraction values listed are entirely correct: the F2DOT14 field encodes numbers as the arithmetic computation mantissa + fraction, not as "signed mantissa with unsigned concatenated fraction remainder".
As such, if you want -0.000061, you have to start with the signed integer -1 in the first two bits (11) and then add the positive value 16383/16384 in the last 14 bits (11111111111111), such that mantissa + fraction = -1 + 16383/16384 = -1/16384, which in turn is encoded using the 16-bit code 0xFFFF.
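To see the encoding in code form, here is a minimal Python sketch (the function name is mine) that splits a 16-bit value into the signed 2-bit mantissa and the unsigned 14-bit fraction and adds them:

def f2dot14_to_float(raw):
    # Top 2 bits: signed integer part (two's complement).
    mantissa = raw >> 14
    if mantissa >= 2:
        mantissa -= 4
    # Low 14 bits: unsigned fraction over 16384.
    fraction = raw & 0x3FFF
    return mantissa + fraction / 16384

print(f2dot14_to_float(0xFFFF))  # -6.103515625e-05, i.e. -1/16384 ~ -0.000061
print(f2dot14_to_float(0x8000))  # -2.0
print(f2dot14_to_float(0x7FFF))  # ~1.999939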

Related

Why does integer_decode in Rust work that way?

I don't understand how integer_decode in num_traits works. For instance: we have
use num_traits::Float;
let num = 2.0f32;
// (8388608, -22, 1)
let (mantissa, exponent, sign) = Float::integer_decode(num);
But how do we get those integers?
The binary representation of 2.0f32 has a 0 sign bit, an exponent whose leading bit is 1, and a mantissa consisting of zeros. How do we get the integer decode, and why do we choose this particular decomposition and not 8388608*2 as the mantissa and -23 as the exponent?
I didn't write the function, so take this answer with a grain of salt, as it's more of a gut feeling than knowledge. The rationale behind it is not explained in the comments in the function implementation, so unless the author of the code speaks up, we can't deliver more than educated guesses.
f32 is based on IEEE-754, which specifies that a 2.0 shall be represented as the following three parts:
the sign bit 0
indicates that 2.0 is positive
the exponent 128
it's one byte that indicates the exponent, with a bias of 127 (127 represents 0, so 128 means 1)
the mantissa 0
the mantissa consists of 23 bits and has an implicit 1. in front of it. So 0 means 1.0.
To get the actual number, you compute (-1)^0 * 2^(128 - 127) * 1.0, which is 1 * 2^1 * 1.0 = 2.
Now this is not the only way to compute that. You could also do:
map the sign bit to 1 (for 0) and -1 (for 1)
instead of prefixing the mantissa with 1., add a 1 in front of it, making it an integer (this avoids having to use a float to decode a float, which would be nonsense for obvious reasons)
subtract 127 from the exponent, making it signed. Then remove 23 from it to compensate for the mantissa now being shifted by 23 bits (because the mantissa is 23 bits long and we moved the point all the way back to make it an integer).
This would, for 2.0, give us:
sign 1
mantissa 0b100000000000000000000000 = 8388608
exponent 128 - 127 - 23 = -22
Now we can do sign * mantissa * 2 ^ exponent, as specified in the documentation to get our value back.
Note how fast calculating those integers is: a binary decision for the sign, a bitwise OR for the mantissa, and a single subtraction for the exponent (a single one because - 127 - 23 can be combined into - 150 beforehand).
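If you want to see those steps outside of Rust, here is a small Python sketch of the same bit manipulation (my code, not the num_traits source, but it mirrors the steps described above):

import struct

def integer_decode_f32(x):
    # Reinterpret the float's 4 bytes as a u32 bit pattern.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = -1 if bits >> 31 else 1
    exponent = (bits >> 23) & 0xFF
    if exponent == 0:
        # Subnormal: no implicit leading 1.
        mantissa = (bits & 0x7FFFFF) << 1
    else:
        # Normal: prepend the implicit leading 1 to the 23 stored bits.
        mantissa = (bits & 0x7FFFFF) | 0x800000
    # Remove the bias (127) and compensate for the 23-bit shift.
    return mantissa, exponent - 127 - 23, sign

print(integer_decode_f32(2.0))  # (8388608, -22, 1)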
why we choose this particular decomposition and not 8388608*2 as mantissa and -23 as exponent
The short version is that this guarantees that all possible mantissas can be treated the same way. The mantissa is 23 bits long, and a 1 with the entire mantissa attached to it is always a valid 24-bit integer. In the case of 0, this is a 1 followed by 23 zeros, 0b100000000000000000000000, which is 8388608.
integer_decode() documentation is quite clear:
Returns the mantissa, base 2 exponent, and sign as integers, respectively. The original number can be recovered by sign * mantissa * 2 ^ exponent.
1 * 8388608 * 2^-22 == 2
Mostly, IEEE 754 uses this description of finite floating-point numbers: a floating-point number has the form (−1)^s × b^e × m, where:
s is 0 or 1.
e is any integer emin ≤ e ≤ emax.
m is a number represented by a digit string of the form d0.d1d2…d(p−1), where each di is an integer digit with 0 ≤ di < b.
This form is described in IEEE-754 2019 clause 3.3. b is the base, p is the precision (number of digits in base b), and emin and emax are bounds on the exponent. This form is useful for certain things, such as describing the normalized form as starting with a leading “1.” or “0.” when the base is two. However, the standard also says, in the same clause:
It is also convenient for some purposes to view the significand as an integer, in which case the finite floating-point numbers are described thus:
Signed zero and non-zero floating-point numbers of the form (−1)^s × b^q × c, where
s is 0 or 1.
q is any integer emin ≤ q + p − 1 ≤ emax.
c is a number represented by a digit string of the form d0d1d2…d(p−1) where each di is an integer digit with 0 ≤ di < b (c is therefore an integer with 0 ≤ c < b^p).
(My reproduction of the IEEE-754 text changes some of the typography slightly.) Note two things. First, this matches the results Float::integer_decode gives you. In the f32 format, p is 24, so the integer significand c has 24 bits, not 25: it can be 8,388,608 (2^23) and cannot be 16,777,216 (2^24).
Second, what makes this form useful is that c is always an integer and can be any integer in that range: the low digit of c is immediately to the left of the radix point, so the consecutive values representable in this range of the floating-point format are consecutive integers, and we can analyze them and write proofs using number theory.
You could use alternate forms that are mathematically equivalent and let the significand be 2^24, but then the low digit in such a form would have to be zero (because there is no bit in the format to represent it having any other value), so it is not particularly useful.
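As a quick sanity check of the two forms for 2.0, here is a short Python sketch (b = 2, p = 24):

# Fractional-significand form: (-1)^s * b^e * m, with 1.0 <= m < 2
assert (-1) ** 0 * 2 ** 1 * 1.0 == 2.0
# Integer-significand form: (-1)^s * b^q * c, with c an integer, 0 <= c < 2^24
assert (-1) ** 0 * 2 ** -22 * 8388608 == 2.0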

Understanding the Float conversion behaviour of pyspark

When I convert the python float 77422223.0 to a spark FloatType, I get 77422224. If I do so with DoubleType I get 77422223. How is this conversion working and is there a way to compute when such an error will happen?
from pyspark.sql.types import FloatType, DoubleType

df = spark.createDataFrame([77422223.0], FloatType())
display(df)

outputs 77422224, and as expected running
df = spark.createDataFrame([77422223.0], DoubleType())
display(df)

yields 77422223.
How is this conversion working…
I assume the Spark FloatType is IEEE-754 binary32. This format uses a 24-bit significand and an exponent range from −126 to +127. Each number is represented as a sign and a 24-bit binary numeral with a “.” after its first digit, multiplied by two to the power of an exponent, such as +1.01001100001111100000000₂ • 2^13.
In binary, 77,422,223 is 100100111010101111010001111₂. That is 27 bits, so it cannot be represented in the binary32 format. When it is converted to binary32, the conversion operation rounds it to the nearest representable value, which is 100100111010101111010010000₂; this has 23 significant digits.
… and is there a way to compute when such an error will happen?
When the number is written in binary, if the number of bits from the first 1 to its last 1, including both of those, is more than 24, then it cannot be represented in the binary32 format.
Also, if the magnitude of the number is less than 2^−126, it cannot be represented in binary32 unless it is a multiple of 2^−149 (including zero). Numbers in this range are subnormal and have a fixed exponent of −126, and the lowest bit of the significand has a position value of 2^−149. And if the magnitude of the number is 2^128 or greater, it cannot be represented, unless it is +∞ or −∞.
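If you just want to test representability programmatically, a round-trip through binary32 works; here is a Python sketch (the helper name fits_binary32 is mine):

import struct

def fits_binary32(x):
    # A double is exactly representable in binary32 iff it survives
    # packing to 4 bytes and unpacking back unchanged.
    return struct.unpack(">f", struct.pack(">f", x))[0] == x

print(fits_binary32(77422223.0))  # False: 27 significant bits, more than 24
print(fits_binary32(77422224.0))  # True: only 23 significant bits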

Why is the MSB for an int bit 31 in Java?

I am looking at my lecture notes and I see this:
Why is the MSB for an int the 31st bit and not the 32nd bit? If an int has 4 bytes, there are 32 bits, and the leftmost bit is the 32nd bit, right?
The notes say
the leftmost bit represents the sign of the integer... If the MSB is 1, the integer is negative. Note the MSB is the sign bit no matter what the size of the integer type... For example, for an int, it is bit 31. For a long, it is bit 63. For a byte, it is bit 7. To get the two's-complement negative of a positive number, first invert all the bits (change the 0s to 1s and the 1s to 0s), then add 1.
Is that right?
Also I don't understand why inverting all bits and adding one gives me the negative number. Can someone explain this better?
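As an illustration of the procedure the notes describe, here is a Python sketch; the 0xFFFFFFFF mask stands in for Java's fixed 32-bit int width:

n = 5
inverted = ~n & 0xFFFFFFFF             # invert all 32 bits
negated = (inverted + 1) & 0xFFFFFFFF  # then add 1
print(hex(negated))                    # 0xfffffffb: bit 31 (the MSB) is set
print(negated - (1 << 32))             # -5 when read back as a signed 32-bit int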

HEX2OCT formula in MS Excel returns incorrect result

While converting the hexadecimal value "FFFFFFFF00" into an octal value using HEX2OCT in MS Excel, it should return an error value as per the rules mentioned here:
If number is negative, HEX2OCT ignores places and returns a 10-character octal number.
If number is negative, it cannot be less than FFE0000000, and if number is positive, it cannot be greater than 1FFFFFFF.
If number is not a valid hexadecimal number, HEX2OCT returns the #NUM! error value.
If HEX2OCT requires more than places characters, it returns the #NUM! error value.
If places is not an integer, it is truncated.
If places is nonnumeric, HEX2OCT returns the #VALUE! error value.
If places is negative, HEX2OCT returns the #NUM! error value.
But it computes and returns "7777777400", without considering the rules/remarks mentioned in the link.
For example, as per the Excel rule, if number is positive it cannot be greater than 1FFFFFFF (hex) <-> 3777777777 (oct) <-> 536870911 (decimal).
But when calculating HEX2OCT for FFFFFFFF00 (hex) <-> 7777777400 (oct) <-> 1099511627520 (decimal), the hex value FFFFFFFF00 is greater than 1FFFFFFF, yet MS Excel does not return the error value; instead it returns the converted octal value.
Can anyone explain why?
FFFFFFFF00 is actually well within the range of hex2oct because it is a negative number.
According to that documentation, the most negative number it can handle is FFE0000000, which converted to decimal is -536870912. Converting your "big" hex to decimal yields -256.
The reason the value of FFFFFFFF00 looks so big is that it's a negative number. The first bit (when converted to binary) is set to 1, which signifies that the number is negative. Negatives are represented in binary using two's complement, which is found by flipping each bit and then adding 1 to the number.
Undoing the two's complement:
For your big number, the binary representation is:
1111111111111111111111111111111100000000
Subtracting 1:
1111111111111111111111111111111011111111
Flipping all the bits:
0000000000000000000000000000000100000000
Which is 256
So.. basically if the hex looks big, but the first bit is 1 then it's actually a small negative and well within your range of allowable values.
Lastly, when you hex2oct you don't get a negative sign for these because we are still not in decimal notation. The first bit of your octal is still a 1 (when converted to binary) since it's still the same number, just represented in a different counting system.
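Here is a Python sketch of that sign-bit reading (the helper name as_signed is mine):

def as_signed(value, bits):
    # If the top bit is set, subtract 2^bits to get the two's-complement value.
    return value - (1 << bits) if value >> (bits - 1) else value

print(as_signed(0xFFFFFFFF00, 40))  # -256
print(as_signed(0xFFE0000000, 40))  # -536870912, HEX2OCT's most negative input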
The clue lies earlier in the documentation page you quote:
The HEX2OCT function syntax has the following arguments:
Number    Required. The hexadecimal number you want to convert. Number cannot contain more than 10 characters. The most significant bit of number is the sign bit. The remaining 39 bits are magnitude bits. Negative numbers are represented using two's-complement notation.
The hex value FFFFFFFF00 corresponds to the binary value
1111 1111 1111 1111 1111 1111 1111 1111 0000 0000
and as the documentation says, "the most significant bit is the sign bit ... two's-complement notation". So this value represents a negative number. By the rules of two's complement, it actually represents -256. And this is fine, because it is not "less than FFE0000000": FFE0000000 is -536870912, and -256 is greater than that.
If you actually want to treat FFFFFFFF00 as an unsigned quantity, and get the octal representation of decimal 1099511627520, you'll need to use another method.
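In Python, for example, that unsigned reading is a one-liner:

print(oct(0xFFFFFFFF00))  # 0o17777777777400, octal for unsigned 1099511627520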

How to extract dyadic fraction from float

Floating-point and double-precision numbers, although they can approximate any sort of number (the same could be said of integers; floats are just more precise), are represented as binary fractions internally. For example, one tenth would be approximated as
0.00011001100110011... (the ... only goes to the computer's precision, not infinity)
Now, any number in binary with finitely many bits has something called a dyadic fraction representation in mathematics (which has nothing to do with p-adic numbers). This means you represent it as a fraction where the denominator is a power of 2. For example, let's say our computer approximates one tenth as 0.00011. The dyadic fraction for that is 3/32, or 3/(2^5), which is close to one tenth. Now for my technical question: what would be the simplest way to extract the dyadic fraction from a floating-point number?
Irrelevant note: if you are wondering why I would want to do this, it is because I am creating a surreal number library in Haskell. Dyadic fractions are easily translated into surreal numbers, which is why it is convenient that binary is easily translated into dyadic. (I'll surely have trouble with the rational numbers, though.)
The decodeFloat function seems useful for this. Technically, you should also check that floatRadix is 2, but as far as I can see this is always the case in GHC.
Just be careful, since it does not simplify mantissa and exponent: if I evaluate decodeFloat (1.0 :: Double) I get an exponent of -52 and a mantissa of 2^52, which is not what I expected.
Also, toRational seems to generate a dyadic fraction. I am not sure this is always the case, though.
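For comparison, the same extraction is easy to see in Python (just to illustrate the idea in another language), where as_integer_ratio returns the stored double's dyadic fraction already reduced:

x = 0.1
num, den = x.as_integer_ratio()  # exact dyadic fraction of the stored double
print(num, den)  # 3602879701896397 36028797018963968, and the denominator == 2**55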
Hold your numbers in binary and convert to decimal for display.
Binary numbers are all dyadic. The number of binary digits after the point gives the power of two in the denominator, and the digits read without the point give the numerator. That's binary numbers for you.
There is an ideal representation for surreal numbers in binary. I call them "sinary". It's this:
0s is Not a number
1s is zero
10s is neg one
11s is one
100s is neg two
101s is neg half
110s is half
111s is two
... etc...
So you see that the standard binary count matches the surreal birth order of numeric values when evaluated in sinary. To determine the numeric value of a sinary string, the 1s are rights and the 0s are lefts: we start with steps of ±1 and then 1/2, 1/4, 1/8, etc., with the sign being + for a 1 and − for a 0.
ex: evaluating sinary
1011011s
-> is the 91st surreal number (because 64+16+8+2+1 = 91)
-> with a value of −0.28125, because...
1011011
NLRRLRR
+-++-++
+ 0 − 1 + 1/2 + 1/4 − 1/8 + 1/16 + 1/32
= 0 − 32/32 + 16/32 + 8/32 − 4/32 + 2/32 + 1/32
= − 9/32
The surreal numbers form a binary tree, so there is an ideal binary format matching their location on the tree according to the left/right pattern needed to reach the number. Assign 1 to right and 0 to left. Then the birth order of a surreal number is equal to the binary count of this representation, i.e. the 15th surreal number value represented in sinary is the 15th number in the standard binary count. The value of a sinary is the surreal label value: strip the leading bit from the representation, and start adding +1s or −1s depending on whether the number starts with 1 or 0 after the first one. Then, once the bit flips, begin adding and subtracting halved values (1/2, 1/4, 1/8, etc.) using + or − according to the bit value 1/0.
I have tested this format and it seems to work well. And there are some other secrets... such as that the left and right of any sinary representation are the same binary format with the tail clipped to the last 0 and last 1 respectively. Conversion to a decimal or dyadic fraction is NOT required in order to perform the recursive functions requested by Conway.
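Here is a Python sketch of the evaluation rule described above (the function name sinary_value is mine):

from fractions import Fraction

def sinary_value(bits):
    # Leading 1 is the root (zero). After it, step by +/-1 per bit;
    # once the bit value first flips, halve the step for every later bit.
    assert bits[0] == "1"
    value = Fraction(0)
    step = Fraction(1)
    flipped = False
    prev = None
    for b in bits[1:]:
        if not flipped and prev is not None and b != prev:
            flipped = True
        if flipped:
            step /= 2
        value += step if b == "1" else -step
        prev = b
    return value

print(sinary_value("1011011"))  # -9/32, matching the worked example
print(sinary_value("110"))      # 1/2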
