I know how to work out -10 in Sign and Magnitude in a 9 bits memory location:
I first encode the magnitude (+10) in eight bits and then add a sign bit (1) in front to indicate negative, which gives me the answer: 100001010
But when it comes to the decimal -256:
I first calculate the positive number (256), which gives 100000000, but how can I turn this negative? There is no room left to add a sign bit in front, and the largest magnitude the remaining eight bits can hold is only 255, which cannot represent 256.
I was just wondering is this even possible?
It is not possible to represent -256 with one sign bit and eight magnitude bits.
If you have eight bits to represent the number, then the largest number you can have is 11111111(2), i.e. 255(10) - where the number in parentheses represents the number base. If you have eight bits and a sign bit, then 100000000(2) represents -0 and 111111111(2) represents -255(10).
Given nine bits, you could use two's complement to represent numbers from -256 to 255.
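A short Python sketch (the helper names are mine) makes the limit concrete: with one sign bit and eight magnitude bits, -255 is the most negative representable value, while the same nine bits in two's complement reach -256.

```python
def sign_magnitude(value, bits=9):
    """Encode value as sign-and-magnitude: 1 sign bit + (bits - 1) magnitude bits."""
    mag_bits = bits - 1
    if abs(value) > 2**mag_bits - 1:
        raise OverflowError(f"|{value}| does not fit in {mag_bits} magnitude bits")
    sign = 1 if value < 0 else 0
    return f"{sign}{abs(value):0{mag_bits}b}"

def twos_complement(value, bits=9):
    """Encode value in two's complement; range is -2**(bits-1) .. 2**(bits-1) - 1."""
    return f"{value & ((1 << bits) - 1):0{bits}b}"

print(sign_magnitude(-10))    # 100001010
print(twos_complement(-256))  # 100000000
```

Note how the same bit pattern 100000000 means -0 in sign-and-magnitude but -256 in two's complement.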
After a recent discussion about mathematical rounding with my colleagues, we analyzed the rounding of different languages.
For example:
MATLAB: "Round half away from zero"
Python: "Round half to even"
I would not say that one is correct and the other isn't, but what bothers me is the combination of what the Rust Book says and what the documentation says about the round function.
The book [1]:
Floating-point numbers are represented according to the IEEE-754 standard. The f32 type is a single-precision float, and f64 has double precision.
The documentation [2]:
Returns the nearest integer to self. Round half-way cases away from 0.0.
My concern is that the standard rounding for IEEE-754 is "Round half to even".
Most colleagues I ask tend to have learned and used mostly/only "Round half away from zero", and they were actually confused when I came up with different rounding strategies. Did the developers of Rust decide against the IEEE standard because of that possible confusion?
The documentation you cite is for an explicit function, round.
IEEE-754 specifies the default rounding method for floating-point operations should be round-to-nearest, ties-to-even (with some embellishment for an unusual case). The rounding method specifies how to adjust (conceptually) the mathematical result of the function or operation to a number representable in the floating-point format. It does not apply to what functions are specified to calculate.
Functions like round, floor, and trunc exist to calculate a specific integer from the argument. The mathematical calculation they perform is to determine that integer. A rounding rule only applies in determining what floating-point result to return when the ideal mathematical result is not representable in the floating-point type.
E.g., sin(x) is defined to return a result computed as if:
The sine of x were determined exactly, with “infinite” precision.
That sine were then rounded to a number representable in the floating-point format according to the rounding method.
Similarly, round(x) can be thought of as defined to return a result computed as if:
The nearest integer of x, rounding a half-way case away from zero, were determined exactly, with “infinite” precision.
That nearest integer were then rounded to a number representable in the floating-point format according to the rounding method.
However, because of the nature of the routine, that second step is never necessary: The nearest integer is always representable, so rounding never changes it. (Except, you could have abnormal floating-point formats with limited exponent range so that rounding up did yield an unrepresentable integer. For example, in a format with four-bit significands but an exponent range that limited numbers to less than 4, rounding 3.75 to the nearest integer would yield 4, but that is not representable, so +∞ would have to be returned. I cannot say I have ever seen this case explicitly addressed in a specification of the round function.)
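The first step, taken alone, can be sketched in Python (the function name is mine; the naive trick of adding 0.5 and truncating can misround values such as 0.49999999999999994, so this version inspects the remainder instead):

```python
import math

def round_half_away(x):
    """Round to the nearest integer, breaking halves away from zero."""
    truncated = math.trunc(x)          # integer part, toward zero
    remainder = x - truncated          # exact for floats below 2**53
    if abs(remainder) >= 0.5:          # at or past the halfway point
        truncated += math.copysign(1.0, x)
    return float(truncated)

print(round_half_away(2.5))    # 3.0
print(round_half_away(-2.5))   # -3.0
```

Every result here is an exactly representable float, which is why the second (format-rounding) step never has anything to do.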
Nobody has contradicted IEEE-754, which defines five different valid rounding methods.
The two methods relevant to this question are referred to as nearest roundings.
Round to nearest, ties to even – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even least significant digit.
Round to nearest, ties away from zero (or ties to away) – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value above (for positive numbers) or below (for negative numbers).
Python takes the first approach and Rust takes the second. Neither is contradicting the IEEE-754 standard, which defines and allows for both.
The other three are directed roundings: always rounding down (toward negative infinity), always rounding up (toward positive infinity), and always rounding toward zero, the last of which is what we colloquially call truncation.
Each RGB value is represented as an 8-bit integer (0-255). Why not store it as a decimal number to increase the color space? It should give more realistic looking picture.
Colours are sometimes represented by three floats instead of as 24 bits.
The 8 bit standard is historical: it goes back to the days of 8-bit architectures. Three bytes give the colour of a pixel without wasting any memory, with the same number of bits for each colour component.
This has some advantages: you can write the colour as a 6 digit hexadecimal number and have some idea of what the colour will be:
0xff0000 : Red
0x00ff00 : Green
0x0000ff : Blue
And so on. This is quite compact and efficient and has stuck around, as a colour can be held in a single integer value instead of three floats.
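The packing is nothing more than bit shifts; a quick Python sketch (helper names mine):

```python
def pack_rgb(r, g, b):
    """Pack three 0-255 components into a single 0xRRGGBB integer."""
    return (r << 16) | (g << 8) | b

def unpack_rgb(color):
    """Split a 0xRRGGBB integer back into its three 8-bit components."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF

print(hex(pack_rgb(255, 0, 0)))  # 0xff0000 -> red
print(unpack_rgb(0x00FF00))      # (0, 255, 0) -> green
```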
I would ideally like to store each component of normals and tangents in 10 bits each, and a format supported by graphics APIs is A2R10G10B10 format. However I don't understand how it works. I've seen questions such as this which show how the bits are laid out. I understand how the bits are laid out, but what I don't get is how the value is fetched and interpreted when unpacked by the graphics API (Vulkan/OpenGL).
I want to store each component in 10 bits, and read it from the shader as signed normalised (-1.f to 1.f), so I'm looking at VK_FORMAT_A2B10G10R10_SNORM_PACK32 in Vulkan. Is one of the 10 bits used to store the sign of the value? How does it know whether the value is negative or positive? For an 8-, 16-, or 32-bit number the first bit represents the sign. How does this work for a 10-bit number? Do I have to manually use two's complement to form the negative version of the value using the ten bits?
Sorry if this is a dumb question, I can't understand how this works.
Normalized integer conversions are based on the bit-width of a number. An X-bit unsigned, normalized integer maps from the range [0, 2^X − 1] to the floating-point range [0.0, 1.0]. This is true for any bit-width X.
Signed, normalized conversion just uses two's complement signed integers and yields a [-1.0, 1.0] floating-point range. The only oddity is the input range. Two's complement encodes [−2^(X−1), 2^(X−1) − 1]; this is an uneven range, with slightly more negative storage than positive. A direct conversion would make an integer value of 0 a slightly positive floating-point value.
Therefore, the system converts from the range [−(2^(X−1) − 1), 2^(X−1) − 1] to [-1.0, 1.0]. The lowest input value of −2^(X−1) is also given the output value -1.0.
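A Python sketch of that rule for the 10-bit case (the function name is mine; this mirrors the conversion described above, not any particular driver's code):

```python
def snorm_to_float(raw, bits=10):
    """Decode a `bits`-wide two's-complement bit pattern as a signed normalized float."""
    if raw & (1 << (bits - 1)):          # sign bit set: sign-extend
        raw -= 1 << bits
    # divide by 2**(bits-1) - 1, then clamp so -2**(bits-1) also maps to -1.0
    return max(raw / (2**(bits - 1) - 1), -1.0)

print(snorm_to_float(0b0111111111))  # 511  -> 1.0
print(snorm_to_float(0b1000000001))  # -511 -> -1.0
print(snorm_to_float(0b1000000000))  # -512 -> clamped to -1.0
```

So no bit is handled specially by you: the top bit of the 10-bit field is the two's-complement sign bit, and the API does the sign extension and scaling when it unpacks the value.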
I understand the hexadecimal system is built on 0123456789ABCDEF, representing 16 values, 0 being the darkest up to F being a pure form of that color. But why are there 2 digits representing each color (red, green, blue)? And how do those two digits work together to form each color's value?
It's because the colors are represented as R-G-B; each primary color has a value between 0 and 255, which makes 256 possibilities. Hexadecimal is a way to write numbers, just like binary or decimal, and hexadecimal requires two digits (00 to FF) to represent those 256 values.
00 to FF represents, in decimal, 0-255: 256 values, which is also the number of unique values you can represent in a single byte.
In programming, colors typically consist of 4 bytes, each with a 00-FF hexadecimal value. There's a red byte, green byte, blue byte, and there's a byte to represent the alpha channel.
Sometimes, however, rather than RGB, the three non-alpha bytes are representative of Hue, Saturation, and Brightness. The fourth one still is for the alpha channel though.
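For example, splitting an RGBA hex string into its four byte values in Python (the function name and the RGBA byte order are assumptions for illustration):

```python
def parse_rgba(color):
    """Split '#RRGGBBAA' into four 0-255 components, two hex digits per byte."""
    s = color.lstrip('#')
    return tuple(int(s[i:i + 2], 16) for i in (0, 2, 4, 6))

print(parse_rgba('#FF8000FF'))  # (255, 128, 0, 255): opaque orange
```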
I'm having a difficult time understanding IEEE 754 Rounding conventions:
Round to positive Infinity
Round to negative Infinity
Unbiased to the nearest even
If I have a binary number composed of 9 bits to the right of a binary point and I need to use the 3 right most bits to determine rounding what would I do?
This is homework so that's why I'm being vague about the question...I need help with the concept.
Thank you!
Round towards positive infinity means the result of the rounding is never smaller than the argument.
Round towards negative infinity means the result of the rounding is never larger than the argument.
Round to nearest, ties to even means the result of rounding is sometimes larger, sometimes smaller than (and sometimes equal to) the argument.
Rounding the value +0.100101110 to six places after the binary point would result in
+0.100110 // for round towards positive infinity
+0.100101 // for round towards negative infinity
+0.100110 // for round to nearest, ties to even
The value is split
+0.100101 110
into the bits to be kept and the bits determining the result of the rounding.
Since the value is positive and the determining bits are not all 0, rounding towards positive infinity means incrementing the kept part by 1 ULP.
Since the value is positive, rounding towards negative infinity simply discards the last bits.
Since the first cut off bit is 1 and not all further bits are 0, the value +0.100110 is closer to the original than +0.100101, so the result is +0.100110.
More instructive for the nearest/even case would be an example or two where we actually have a tie, e.g. round +0.1001 to three bits after the binary point:
+0.100 1 // halfway between +0.100 and +0.101
Here, the rule says to pick the one of the two closest values whose last bit is 0 (last bit even), i.e. +0.100, and this value is rounded towards negative infinity. But rounding +0.1011 would round towards positive infinity, because this time the larger of the two closest values has last bit 0.
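Since this is homework, here is only a checker rather than the worked method: a Python sketch (names mine) using exact fractions to reproduce the three roundings above.

```python
import math
from fractions import Fraction

def round_to_bits(value, bits, mode):
    """Round an exact binary fraction to `bits` places after the binary point."""
    scaled = value * 2**bits            # the kept bits become the integer part
    if mode == "toward_pos_inf":
        n = math.ceil(scaled)
    elif mode == "toward_neg_inf":
        n = math.floor(scaled)
    else:                               # nearest, ties to even
        lo = math.floor(scaled)
        frac = scaled - lo
        if frac != Fraction(1, 2):
            n = lo if frac < Fraction(1, 2) else lo + 1
        else:                           # exact tie: keep the even candidate
            n = lo if lo % 2 == 0 else lo + 1
    return Fraction(n, 2**bits)

x = Fraction(0b100101110, 2**9)                # +0.100101110
print(round_to_bits(x, 6, "toward_pos_inf"))   # 19/32, i.e. +0.100110
print(round_to_bits(x, 6, "toward_neg_inf"))   # 37/64, i.e. +0.100101
print(round_to_bits(Fraction(0b1001, 16), 3, "nearest_even"))  # 1/2, i.e. +0.100
```

Working in Fraction keeps the tie test exact, so the "are the cut-off bits exactly 100…0?" question has an unambiguous answer.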