IEEE 754 Rounding to Positive Infinity

I'm having a difficult time understanding IEEE 754 Rounding conventions:
Round to positive Infinity
Round to negative Infinity
Unbiased to the nearest even
If I have a binary number with 9 bits to the right of the binary point and I need to use the 3 rightmost bits to determine rounding, what would I do?
This is homework, so that's why I'm being vague about the question... I need help with the concept.
Thank you!

Round towards positive infinity means the result of the rounding is never smaller than the argument.
Round towards negative infinity means the result of the rounding is never larger than the argument.
Round to nearest, ties to even means the result of rounding is sometimes larger, sometimes smaller than (and sometimes equal to) the argument.
Rounding the value +0.100101110 to six places after the binary point would result in
+0.100110 // for round towards positive infinity
+0.100101 // for round towards negative infinity
+0.100110 // for round to nearest, ties to even
The value is split
+0.100101 110
into the bits to be kept and the bits determining the result of the rounding.
Since the value is positive and the determining bits are not all 0, rounding towards positive infinity means incrementing the kept part by 1 ULP.
Since the value is positive, rounding towards negative infinity simply discards the last bits.
Since the first cut off bit is 1 and not all further bits are 0, the value +0.100110 is closer to the original than +0.100101, so the result is +0.100110.
More instructive for the nearest/even case would be an example or two where we actually have a tie, e.g. round +0.1001 to three bits after the binary point:
+0.100 1 // halfway between +0.100 and +0.101
Here the rule says to pick, of the two closest values, the one whose last bit is 0 (even), i.e. +0.100, so this value is rounded towards negative infinity. Rounding +0.1011, however, would round towards positive infinity, because this time the larger of the two closest values (+0.110) has last bit 0.
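To make the three rules concrete, here is a small Rust sketch (the helper names are mine, not from any standard library) that rounds a positive fixed-point fraction, stored as an integer of total fraction bits, down to keep fraction bits:

#[derive(Clone, Copy)]
enum Mode {
    TowardPosInf,
    TowardNegInf,
    NearestEven,
}

// Round a positive fixed-point fraction held as an integer of `total`
// fraction bits (0.100101110 -> 0b100101110 with total = 9), keeping the
// top `keep` bits.
fn round_frac(bits: u32, total: u32, keep: u32, mode: Mode) -> u32 {
    let drop = total - keep;                // number of bits cut off
    let kept = bits >> drop;                // the bits that are kept
    let cut = bits & ((1u32 << drop) - 1);  // the discarded bits
    let half = 1u32 << (drop - 1);          // weight of the first cut-off bit
    match mode {
        // never smaller than the argument: round up if anything nonzero was cut
        Mode::TowardPosInf => kept + (cut != 0) as u32,
        // never larger than the argument: just discard (the value is positive)
        Mode::TowardNegInf => kept,
        Mode::NearestEven => {
            if cut > half {
                kept + 1          // closer to the larger neighbour
            } else if cut < half {
                kept              // closer to the smaller neighbour
            } else {
                kept + (kept & 1) // exact tie: pick the neighbour with even last bit
            }
        }
    }
}

fn main() {
    let v = 0b100101110; // +0.100101110
    println!("0.{:06b}", round_frac(v, 9, 6, Mode::TowardPosInf)); // 0.100110
    println!("0.{:06b}", round_frac(v, 9, 6, Mode::TowardNegInf)); // 0.100101
    println!("0.{:06b}", round_frac(v, 9, 6, Mode::NearestEven));  // 0.100110
}

The printed results match the worked example above: rounding towards positive infinity and nearest/even both give 0.100110, rounding towards negative infinity gives 0.100101.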


Does Rust f64/f32 round correctly?

After a recent discussion about mathematical rounding with my colleagues, we analyzed the rounding of different languages.
For example:
MATLAB: "Round half away from zero"
Python: "Round half to even"
I would not say that one is correct and the other isn't, but what bothers me is the combination of what the Rust Book says and what the documentation says about the round function.
The book [1]:
Floating-point numbers are represented according to the IEEE-754 standard. The f32 type is a single-precision float, and f64 has double precision.
The documentation [2]:
Returns the nearest integer to self. Round half-way cases away from 0.0.
My concern is that the standard rounding for IEEE-754 is "Round half to even".
Most colleagues I ask tend to use, and mostly/only learned, "Round half away from zero", and they were actually confused when I came up with different rounding strategies. Did the Rust developers decide against the IEEE standard because of that possible confusion?
The documentation you cite is for an explicit function, round.
IEEE-754 specifies that the default rounding method for floating-point operations should be round-to-nearest, ties-to-even (with some embellishment for an unusual case). The rounding method specifies how to adjust (conceptually) the mathematical result of the function or operation to a number representable in the floating-point format. It does not apply to what functions calculate.
Functions like round, floor, and trunc exist to calculate a specific integer from the argument. The mathematical calculation they perform is to determine that integer. A rounding rule only applies in determining what floating-point result to return when the ideal mathematical result is not representable in the floating-point type.
E.g., sin(x) is defined to return a result computed as if:
The sine of x were determined exactly, with “infinite” precision.
That sine were then rounded to a number representable in the floating-point format according to the rounding method.
Similarly, round(x) can be thought of as being defined to return a result computed as if:
The nearest integer of x, rounding a half-way case away from zero, were determined exactly, with “infinite” precision.
That nearest integer were then rounded to a number representable in the floating-point format according to the rounding method.
However, because of the nature of the routine, that second step is never necessary: The nearest integer is always representable, so rounding never changes it. (Except, you could have abnormal floating-point formats with limited exponent range so that rounding up did yield an unrepresentable integer. For example, in a format with four-bit significands but an exponent range that limited numbers to less than 4, rounding 3.75 to the nearest integer would yield 4, but that is not representable, so +∞ would have to be returned. I cannot say I have ever seen this case explicitly addressed in a specification of the round function.)
Nobody has contradicted IEEE-754, which defines five different valid rounding methods.
The two methods relevant to this question are referred to as nearest roundings.
Round to nearest, ties to even – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even least significant digit.
Round to nearest, ties away from zero (or ties to away) – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value above (for positive numbers) or below (for negative numbers).
Python takes the first approach and Rust takes the second. Neither is contradicting the IEEE-754 standard, which defines and allows for both.
The other three are directed roundings: always rounding down (toward negative infinity), always rounding up (toward positive infinity), or always rounding toward zero, which is what we would colloquially call truncation.
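A quick way to see the two nearest roundings side by side in Rust (round_ties_even needs a reasonably recent toolchain; on older ones the same check can be done by hand):

fn main() {
    // round(): nearest integer, ties away from zero
    assert_eq!(2.5_f64.round(), 3.0);
    assert_eq!((-2.5_f64).round(), -3.0);

    // round_ties_even(): nearest integer, ties to even
    assert_eq!(2.5_f64.round_ties_even(), 2.0);
    assert_eq!(3.5_f64.round_ties_even(), 4.0);

    // The default IEEE-754 rounding of arithmetic itself is still
    // round-to-nearest, ties-to-even: 0.1 + 0.2 is rounded to the nearest
    // representable f64, which is why it is not exactly 0.3.
    assert_ne!(0.1_f64 + 0.2, 0.3);
    println!("all assertions passed");
}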

How is a packed ten bit value read and interpreted?

I would ideally like to store each component of normals and tangents in 10 bits each, and a format supported by graphics APIs is A2R10G10B10 format. However I don't understand how it works. I've seen questions such as this which show how the bits are laid out. I understand how the bits are laid out, but what I don't get is how the value is fetched and interpreted when unpacked by the graphics API (Vulkan/OpenGL).
I want to store each component in 10 bits and read it from the shader as signed normalised (-1.f to 1.f), so I'm looking at VK_FORMAT_A2B10G10R10_SNORM_PACK32 in Vulkan. Is one of the 10 bits used to store the sign of the value? How does it know if it's a negative or positive value? For an 8-, 16-, or 32-bit number the first bit represents the sign. How does this work for a 10-bit number? Do I have to manually use two's complement to form the negative version of the value using the ten bits?
Sorry if this is a dumb question, I can't understand how this works.
Normalized integer conversions are based on the bit-width of a number. An X-bit unsigned, normalized integer maps from the range [0, 2^X - 1] to the floating-point range [0.0, 1.0]. This is true for any bit-width X.
Signed, normalized conversion just uses two's complement signed integers and yields a [-1.0, 1.0] floating-point range. The only oddity is the input range. Two's complement encodes [-2^(X-1), 2^(X-1) - 1]; this is an uneven range, with slightly more negative storage than positive. A direct conversion would make an integer value of 0 a slightly positive floating-point value.
Therefore, the system converts from the range [-(2^(X-1) - 1), 2^(X-1) - 1] to [-1.0, 1.0]. The lowest input value of -2^(X-1) is also given the output value -1.0.
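As a sketch (my own helper, not lifted from any Vulkan or OpenGL implementation), the decode of one X-bit SNORM component then looks like this in Rust:

// Decode one X-bit signed-normalized component following the convention
// described above: sign-extend the two's complement field, divide by
// 2^(X-1) - 1, and clamp the extra most-negative code to -1.0.
fn snorm_decode(raw: u32, bits: u32) -> f32 {
    let shift = 32 - bits;
    let value = ((raw << shift) as i32) >> shift;  // sign extension
    let max = ((1u32 << (bits - 1)) - 1) as f32;   // 511 for 10 bits
    (value as f32 / max).max(-1.0)
}

fn main() {
    assert_eq!(snorm_decode(0b0111111111, 10), 1.0);   //  511 -> 1.0
    assert_eq!(snorm_decode(0b0000000000, 10), 0.0);   //    0 -> 0.0
    assert_eq!(snorm_decode(0b1000000001, 10), -1.0);  // -511 -> -1.0
    assert_eq!(snorm_decode(0b1000000000, 10), -1.0);  // -512 clamps to -1.0
}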

Limiting decimal precision in calculations

I currently have a very large dataset in SPSS where some variables have up to 8 decimal places. I know how to change the options so that SPSS only displays variables to 2 decimal places in the data view and output. However, SPSS still applies the 8 decimal precision in all calculations. Is there a way to change the precision to 2 decimal places without having to manually change every case?
You'll have to round each of the variables. You can loop through them like this:
do repeat vr=var1 var2 var3 var4 var5.
compute vr=rnd(vr, 0.01).
end repeat.
If the variables are consecutive in the file, you can use the "to" keyword like this:
do repeat vr=var1 to var5.
....
SPSS Statistics treats all numeric variables as double-precision floating point numbers, regardless of the display formats. Any computations (other than those in one older procedure, ALSCAL) are done in double precision, and you don't have any way to avoid that.
The solution that you've applied will not actually make any calculations use only two decimals of precision. You'll start from that point with numbers rounded to approximately what they would be to two decimals, but most of them probably aren't exactly expressible as double-precision floating point numbers, so what you're actually using isn't what you're seeing.
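You can see that last point by printing a "two-decimal" value with more digits; here in Rust, though any language using doubles behaves the same way:

fn main() {
    // 0.01 has no exact binary representation, so a value "rounded to two
    // decimals" is stored as the nearest representable double instead.
    let rounded = (1.0_f64 / 97.0 * 100.0).round() / 100.0; // intended: 0.01
    println!("{rounded:.20}");    // prints 0.01000000000000000021
    println!("{:.20}", 0.01_f64); // the same nearest double to 0.01
}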

How to detect if a result is due to an approximation in floating division?

From a data set I'm working on I have produced the following graph:
The graph is constructed as follows: each element of the data set is associated with the ratio between two natural numbers (sometimes big numbers), where the numerator is less than the denominator. Call this number k. Then, for each value n in [0,1], the number of elements with k > n is counted.
So, while the exponential decay is expected, the jumps come out of the blue. To calculate the ratio between a and b I have just done: c = a/b
I'm asking if there is a way to check whether the jumps are due to numerical approximation in the division or whether they are an actual property of my dataset.
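One way to answer that (a sketch, assuming the numerators and denominators are still available as integers): redo the count with exact integer comparisons and see whether the jumps survive. If the thresholds have the form n = j/steps, then a/b > n is equivalent to a*steps > j*b, with no floating-point division at all.

// Count elements with a/b > j/steps exactly (all quantities positive),
// using a*steps > j*b so no floating-point division is involved. If this
// count matches the float-based one at every threshold, the jumps are a
// property of the data, not of the division.
fn count_exact(pairs: &[(u64, u64)], j: u64, steps: u64) -> usize {
    pairs.iter()
        .filter(|&&(a, b)| (a as u128) * (steps as u128) > (j as u128) * (b as u128))
        .count()
}

fn count_float(pairs: &[(u64, u64)], j: u64, steps: u64) -> usize {
    let n = j as f64 / steps as f64;
    pairs.iter().filter(|&&(a, b)| a as f64 / b as f64 > n).count()
}

fn main() {
    let pairs = [(1_u64, 3_u64), (2, 3), (999_999_999, 1_000_000_000)];
    for j in 0..=10 {
        println!("n = {:.1}: exact {}, float {}",
                 j as f64 / 10.0,
                 count_exact(&pairs, j, 10),
                 count_float(&pairs, j, 10));
    }
}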

Working out decimal in Sign and Magnitude

I know how to work out -10 in Sign and Magnitude in a 9 bits memory location:
I first calculated the positive value (+10) and then added a sign bit (1) in front to indicate negative, which gives me the answer: 100001010
But when it comes to the decimal -256
I first calculated the positive number (256), which gives 100000000, but how can I turn this negative? There is no more room for me to add a sign bit in front, and the largest value the remaining 8 bits can hold is only 255, which cannot give me 256.
I was just wondering: is this even possible?
It is not possible to represent -256 with one sign bit and eight magnitude bits.
If you have eight bits to represent the number, then the largest number you can have is 11111111(2), i.e. 255(10) - where the number in parentheses represents the number base. If you have eight bits and a sign bit, then 100000000(2) represents -0 and 111111111(2) represents -255(10).
Given nine bits, you could use two's complement to represent numbers from -256 to 255.
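A small Rust sketch (the helper name is mine) that makes both limits visible:

// Encode an integer as 9-bit sign-and-magnitude: 1 sign bit + 8 magnitude
// bits, so the representable range is only -255..=255 and -256 is rejected.
fn sign_magnitude_9bit(value: i32) -> Option<u16> {
    let magnitude = value.unsigned_abs();
    if magnitude > 0xFF {
        return None; // the magnitude does not fit in 8 bits
    }
    let sign = if value < 0 { 1u16 << 8 } else { 0 };
    Some(sign | magnitude as u16)
}

fn main() {
    assert_eq!(sign_magnitude_9bit(-10), Some(0b1_00001010)); // the example above
    assert_eq!(sign_magnitude_9bit(-255), Some(0b1_11111111));
    assert_eq!(sign_magnitude_9bit(-256), None);              // not representable
    // Nine-bit two's complement does cover -256..=255:
    assert_eq!((-256_i16 as u16) & 0x1FF, 0b1_00000000);
}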

Resources