Why is the MSB for an int bit 31 in Java? - long-integer

I am looking at my lecture notes and I see this:
Why is the MSB for an int the 31st bit and not the 32nd bit? If an int has 4 bytes, there are 32 bits and the leftmost bit is the 32nd bit right?
The notes say
the leftmost bit represents the sign of the integer... If the MSB is 1, the integer is negative. Note the MSB is the sign no matter what size of the integer type...For example, for an int, it is bit 31. For a long, it is bit 63. For a byte, it is bit 7. To get the two complements negative of a positive number, first invert all the bits, change the 0s to 1s and the 1s to 0s, then add 1.
Is that right?
Also I don't understand why inverting all bits and adding one gives me the negative number. Can someone explain this better?

Related

Why does integer_decode in Rust work that way?

I don't understand how integer_decode in num_traits works. For instance: we have
use num_traits::Float;
let num = 2.0f32;
// (8388608, -22, 1)
let (mantissa, exponent, sign) = Float::integer_decode(num);
But how we get those integers?
Binary representation of 2.0f32 has 0 sign bit, 1 bit as leading bit in exponent and mantissa consisting of zeros. How to get integer decode and why we choose this particular decomposition and not 8388608*2 as mantissa and -23 as exponent?
I didn't write the function, so take this answer with a grain of salt, as it's more of a gut feeling than knowledge. The rationale behind it is not explained in the comments in the function implementation, so unless the author of the code speaks up, we can't deliver more than educational guesses.
f32 is based on IEEE-754, which specifies that a 2.0 shall be represented as the following three parts:
the sign bit 0
indicates that 2.0 is positive
the exponent 128
it's one byte that indicates the exponent, with 127 representing 0. 128 means 1
the mantissa 0
the mantissa consists of 23 bits and has an implicit 1. in front of it. So 0 means 1.0.
To get the actual number, you need to do 0 * (-1) + 2 ^ (128 - 127) * 1.0, which is 2 ^ 1 * 1 = 2.
Now this is not the only way to compute that. You could also do:
map the sign bit to 1 and -1
instead of prefixing the mantissa with 1., add a 1 in front of it, making it an integer. (this avoids having to use a float to decode a float, which is nonsense for obvious reasons)
subtract 127 from the exponent, making it signed. Then, remove 23 from it to compensate that our mantissa is now shifted by 23 bits (because the mantissa is 23 bits long and we moved the comma all the way back to make it an integer).
This would, for 2.0 give us:
sign -1
mantissa 0b100000000000000000000000 = 8388608
exponent 128 - 127 - 23 = -22
Now we can do sign * mantissa * 2 ^ exponent, as specified in the documentation to get our value back.
Note how fast calculating those integers was: a binary decision for the sign, a binary or operation for the mantissa and a single u8 subtraction for the exponent (a single one because one can combine - 127 - 23 to - 150 beforehand).
why we choose this particular decomposition and not 8388608*2 as mantissa and -23 as exponent
The short version is that this guarantees that all possible mantissas can be treated the same way. It's 23 bits long and a 1 with the entire mantissa attached to it is always a valid integer. In the case of 0 this is a 1 with 23 0s, 0b100000000000000000000000, which is 8388608.
integer_decode() documentation is quite clear:
Returns the mantissa, base 2 exponent, and sign as integers, respectively. The original number can be recovered by sign * mantissa * 2 ^ exponent.
1 * 8388608 * 2^-22 == 2
Mostly IEEE 754 uses this description of finite floating-point numbers: A floating-point number has the form (−1)s×be×m, where:
s is 0 or 1.
e is any integer emin ≤ e ≤ emax.
m is a number represented by a digit string of the form d0.d1d2…dp−1 where di is an integer digit 0 ≤ di < b.
This form is described in IEEE-754 2019 clause 3.3. b is the base, p is the precision (number of digits in base b), and emin and emax are bounds on the exponent. This form is useful for certain things, such as describing the normalized form as starting with a leading “1.” or “0.” when the base is two. However, the standard also says, in the same clause:
It is also convenient for some purposes to view the significand as an 10 integer; in which case the finite floating-point numbers are described thus:
Signed zero and non-zero floating-point numbers of the form (−1)s×bp×c, where
s is 0 or 1.
q is any integer emin ≤ q+p−1 ≤ emax.
c is a number represented by a digit string of the form d0d1d2…dp−1 where di is an integer digit 0 ≤ di ≤ b (c is therefore an integer with 0 ≤ c < bp).
(My reproduction of the IEEE-754 text changes some of the typography slightly.) Note two things. First, this matches the results Float::integer_decode gives you. In the Float format, p is 24, so m should have 24 bits, not 25, so it can be 8,388,608 (223) and cannot be 16,777,216 (224).
Second, what makes this form useful is m is always an integer and can be any integer in that range—the low digit of m is immediately to the left of the radix point, so the consecutive values representable in this range of the floating-point format are consecutive integers, and we can analyze them and write proofs using number theory.
You could use alternate forms that are mathematically equivalent but let the significand be 224, but then the low digit in such a form would have to be zero (because there is no bit in the format to represent it having any other value), so it is not particularly useful.

Why does this hash calculating bit hack work?

For practice I've implemented the qoi specification in rust. In it there is a small hash function to store recently used pixels:
index_position = (r * 3 + g * 5 + b * 7 + a * 11) % 64
where r, g, b, and a are the red, green, blue and alpha channels respectively.
I assume this works as a hash because it creates a unique prime factorization for the numbers with the mod to limit the number of bytes. Anyways I implemented it naively in my code.
While looking at other implementations I came across this bit hack to optimize the hash calculation:
fn hash(rgba:[u8:4]) -> u8 {
let v = u32::from_ne_bytes(rgba);
let s = (((v as u64) << 32) | (v as u64)) & 0xFF00FF0000FF00FF;
s.wrapping_mul(0x030007000005000Bu64.to_le()).swap_bytes() as u8 & 63
}
I think I understand most of what's going on but I'm confused about the magic number (the multiplicand). To my understanding it should be flipped. As a step by step example:
let rgba = [0x12, 0x34, 0x56, 0x78].
On my machine (little endian) this gives v the value 0x78563412.
The bit shifting spreads the values, giving s = 0x7800340000560012.
Now here's where I get confused. The magic number has the values that should be multiplied aligned in a 64 bit field (3, 5, 7, 11), spaced the same way that the original values are. However they seem to be in reverse order from the values:
0x7800340000560012
0x030007000005000B
When multiplying it would seem that the highest value, the alpha channel (0x78), is being multiplied by 3, while the lowest value, the red channel (0x12), is being multiplied by 11. I'm also not entirely sure why this multiplication works anyway, after multiplying the values by various powers of 2.
I understand that the bytes are then swapped to big endian and trimmed, but that's not until after the multiplication step which loses me.
I know that the code produces the correct hash, but I don't understand why that's the case. Can anyone explain to me what I'm missing?
If you think about the way the math works, you want this flipped order, because it means all the results from each of the "logical" multiplications cluster in the same byte. The highest byte in the first value multiplied by the lowest byte in the second produces a result in the highest byte. The lowest byte in the first value's product with the highest byte in the second value produces a result in the same highest byte, and the same goes for the intermediate bytes.
Yes, the 0x78... and 0x03... are also multiplied by each other, but they overflow way past the top of the value and are lost. Having the order "backwards" means the result of the multiplications we care about all ends up summed in the uppermost byte (the total shift of the results we want is always 56 bits, because the 56th bit offset value is multiplied by the 0th, the 40th by the 16th, the 16th by the 40th, and the 0th by the 56th), with the rest of the multiplications we don't want having their results either overflow (and being lost) or appearing in lower bytes (which we ignore). If you flipped the bytes in the second value, the 0x78 * 0x0B (alpha value & multiplier) component would be lost to overflow, while the 0x12 * 0x03 (red value & multiplier) component wouldn't reach the target byte (every component we cared about would end up somewhat that wasn't the uppermost byte).
For a possibly more intuitive example, imagine doing the same work, but where all the bytes of one input except a single component are zero. If you multiply:
0x7800000000000000 * 0x030007000005000B
the logical result is:
0x1680348000258052800000000000000
but removing the overflow reduces that to:
0x2800000000000000
//^^ result we care about (actual product of 0x78 and 0x0B is 0x528, but only keeping low byte)
Similarly,
0x0000340000000000 * 0x030007000005000B
produces:
0x9c016c000104023c0000000000
overflowing to:
0x04023c0000000000
//^^ result we care about (actual product of 0x34 and 0x5 was 0x104, but only 04 kept)
In that case, the other multiplications did leave data in result (not all overflowed), but since we only look at the high byte, the rest gets ignored.
If you keep doing this math step by step and adding the results, you'll find that the high byte ends up the correct answer to the four individual multiplications you expected (mod 256); flip the order, and it won't work out that way.
The advantage to putting all the results in that high byte is that it allows you to use swap_bytes to move it cheaply to the low byte, and read the value directly (no need to even mask it on many architectures).

Maximum bit-width to store a summation of M n-bit binary numbers

I am trying to find the formula to calculate the maximum bit-width required to contain a sum of M n-bit unsigned binary numbers. Thanks!
The maximum bit-width needed should be ceil(log_2(M * (2^n - 1))).
Edit: Thanks to #MBurnham I realize now that it should be floor(log_2(M * (2^n - 1))) + 1 instead.
Assuming positive integers, you need floor(log2(x)) + 1 bits to store x. and the largest value the sum of m n-bit numbers can produce would be m * 2^n.
So I believe the formula should be
floor(log2(m * 2^n)) + 1
bits.
If I add 2 numbers the I need 1 bit more than the wider of the 2 numbers to store the result. So, if I add 2 n-bit numbers, I need n+1 bits to store the result.
if I add another n-bit number, I need (n+1)+1 bits to store the result (that's 3 n-bit numbers added so far)
if I add another n-bit number, I need ((n+1)+1)+1 bits to store the result (that's 4 n-bit numbers added so far)
if I add another n-bit number, I need (((n+1)+1)+1)+1 bits to store the result (that's 5 n-bit numbers added so far)
So, I think your formula is
n + M - 1

HEX2OCT formula in MS Excel returns incorrect result

While converting the hexadecimal value "FFFFFFFF00" into octal value using Hex2Oct of MS Excel, it should return "Error string" as per the rules mentioned here:
If number is negative, HEX2OCT ignores places and returns a 10-character octal number.
If number is negative, it cannot be less than FFE0000000, and if number is positive, it cannot be greater than 1FFFFFFF.
If number is not a valid hexadecimal number, HEX2OCT returns the #NUM! error value.
If HEX2OCT requires more than places characters, it returns the #NUM! error value.
If places is not an integer, it is truncated.
If places is nonnumeric, HEX2OCT returns the #VALUE! error value.
If places is negative, HEX2OCT returns the #NUM! error value.
But it computes and returns as "7777777400" without considering the rules/remarks mentioned in the link.
For example:
While calculating HEX2OCT,
As per Excel rule, If number is positive, it cannot be greater than 1FFFFFFF(hex)<->3777777777(oct)<->536870911(decimal).
But while calculating the HEX2OCT for FFFFFFFF00(hex) <-> 7777777400(oct) <-> 1099511627520(decimal).
Here the hex value FFFFFFFF00 is greater than 1FFFFFFF, but MS Excel does not return the error string instead it returns the converted octal value.
Can anyone explain why?
FFFFFFFF00 is actually well within the range of hex2oct because it is a negative number.
According to that documentation the largest negative number it can handle is FFE0000000 which when converted to decimal is -536870912. Converting your "big" hex over to decimal yields -256.
The reason the value of FFFFFFFF00 looks so big is because it's a negative number. The first bit is set to 1 (when converted to binary) which signifies that the number is negative. Negatives are computed in binary using two's complement which is found by flipping each bit and then adding 1 to the number.
Undoing the two's complement:
For your big number, the binary representation is:
1111111111111111111111111111111100000000
Subtracting 1:
1111111111111111111111111111111011111111
Flipping all the bits:
0000000000000000000000000000000100000000
Which is 256
So.. basically if the hex looks big, but the first bit is 1 then it's actually a small negative and well within your range of allowable values.
Lastly, when you hex2oct you don't get a negative sign for these because we are still not in decimal notation. The first bit of your octal is still a 1 (when converted to binary) since it's still the same number, just represented in a different counting system.
The clue lies earlier in the documentation page you quote:
The HEX2OCT function syntax has the following arguments:
Number Required. The hexadecimal number you want to convert. Number cannot contain more than 10 characters. The most significant
bit of number is the sign bit. The remaining 39 bits are magnitude
bits. Negative numbers are represented using two's-complement notation.
The hex value FFFFFFFF00 corresponds the binary value
1111 1111 1111 1111 1111 1111 1111 1111 0000 0000
and as the documentation says, "the most significant bit is the sign bit ... two's complement notation". So this value represents a negative number. By the rules of two's complement, it actually represents -256. And this is fine, because it is not "less than FFE0000000", as FFE0000000 is -2097152.
If you actually want to treat FFFFFFFF00 as an unsigned quantity, and get the octal representation of decimal 1099511627520, you'll need to use another method.

How to obtain the hex value 0xffff corresponding to the decimal value -0.000061 in the table below?

Right at the beginning of this page The OpenType Font File you'll find this table, with examples of the F2DOT14 format for a 16-bit signed fixed number with the low 14 bits of a fraction.
I couldn't obtain the hex value 0xffff for the decimal -0.000061. By the way the mantissa -1 seems to be wrong and the value for the fraction should be 1/16384, instead of 16383/16384, unless I'm missing something related to the two's complement notation used to express a negative value in code.
The mantissa and fraction values listed are entirely correct: the F2DOT14 field encodes numbers as the arithmetic computation mantissa + fraction, not as "signed mantissa with unsigned concatenated fraction remainder".
As such, if you want -0.000061, you have to start with the signed integer -1 in the first two bits (11) and then add the positive value 16383/16384 in the last 14 bits (11111111111111), such that mantissa + fraction = -1 + 16383/16384 = -1/16384, which in turn is encoded using the 16 bit code 0xFFFF

Resources