What is the maximum value of the exponent of single precision floating point on MSVC? - visual-c++

I've been trying to figure out how biased exponents work.
8 bits are reserved for the exponent itself, so it's either -127 to 127 or 0 to 255. When I want to store a number (exponent part) that doesn't fit into 8 bits, where does it obtain the additional bits to store that data?
If you are going to say it uses the bias as an offset, then please provide additional info on where exactly that data is stored.

To a first approximation, the value of a float with exponent e and significand f is 1.f x 2^e. There are special cases to consider regarding subnormals, infinities, NaN, etc., but you can ignore those for starters. Essentially, the exponent field really is the base-2 exponent in IEEE 754 notation. So the comment you made about how 30020.3f fits in 8 bits has a simple answer: easily. You only need an exponent of 14 to represent it, and 14 is comfortably covered by the 8-bit biased exponent.
In fact, here's the exact binary representation of 30020.3 as a single-precision IEEE-754 float:
3 2 1 0
1 09876543 21098765432109876543210
S ---E8--- ----------F23----------
Binary: 0 10001101 11010101000100010011010
Hex: 46EA 889A
Precision: SP
Sign: Positive
Exponent: 14 (Stored: 141, Bias: 127)
Hex-float: +0x1.d51134p14
Value: +30020.3 (NORMAL)
As you can see we just store 14 in the exponent. The sign is 0, and the fraction covers the rest so that 1.f * 2^14 gives you the correct value.
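You can verify this layout yourself. Here is a small Python sketch (using struct to reinterpret the float's bit pattern; the variable names are my own):

```python
import struct

# Round 30020.3 to single precision, then reinterpret as a 32-bit integer
bits = struct.unpack('>I', struct.pack('>f', 30020.3))[0]

sign    = bits >> 31           # 0 -> positive
raw_exp = (bits >> 23) & 0xFF  # 141 stored
frac    = bits & 0x7FFFFF      # the 23 fraction bits

print(hex(bits))               # 0x46ea889a
print(raw_exp - 127)           # 14, the real exponent
```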

What is the maximum value of the exponent of single precision floating point on MSVC?
Maximum value of the binary exponent: 254 - bias --> 127.
For a decimal perspective, <float.h> defines FLT_MAX_10_EXP as the "maximum integer such that 10 raised to that power is in the range of representable finite floating-point numbers": 38.
printf("%d %g\n", FLT_MAX_10_EXP, FLT_MAX);
// common result
38 3.40282e+38
8 bits are reserved for the exponent itself, so it's either -127 to 127 or 0 to 255.
Pretty close: for finite values, the raw exponent range is more like [0...254], with a raw exponent of 0 having special meaning: the value is decoded as if the raw exponent were 1, and the implied leading digit is 0 (subnormals).
The exponent is then raw exponent - bias of 127 or [-126 to 127].
Recall this is an exponent for 2 raised to some power.
Using binary32, the maximum value of a biased exponent for finite values is 254.
Applying the bias of -127, the maximum value of the exponent is 254-127 or 127 in the form of:
biased_exponent > 0
(-1)^neg_sign * 1.(23-bit significand fraction) * 2^(biased_exponent - 127)
And for completeness, subnormals and zero:
biased_exponent == 0
(-1)^neg_sign * 0.(23-bit significand fraction) * 2^(1 - 127)
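Putting the formulas together, the largest finite binary32 value has biased exponent 254 and an all-ones fraction, i.e. bit pattern 0x7F7FFFFF. A quick Python check (a sketch; the constant it reproduces is FLT_MAX):

```python
import struct

# (2 - 2^-23) * 2^127: maximum significand times maximum exponent
max_finite = (2 - 2**-23) * 2.0**127

# Decode the bit pattern 0x7F7FFFFF as a big-endian single-precision float
flt_max = struct.unpack('>f', bytes.fromhex('7f7fffff'))[0]

print(max_finite == flt_max)   # True
print(max_finite)              # ~3.4028235e+38
```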
How does 30020.3 fit into 8 bits? Is 30020 the exponent and .3 the fraction?
Mathematically 30020.3f has a whole-number portion of 30020 and a fractional portion, but 30020 is not stored as the exponent with .3 used elsewhere as the fraction. The entire value contributes to both the exponent and the significand: floats are typically encoded as a binary 1.xxxxx (base 2) * 2^exponent.
printf("%a\n", 30020.3f); // 0x1.d51134p+14
+1.11010101000100010011010 (base 2) * 2^14
Encoding with binary32
sign: + (bit 0),
biased exponent: 14 + 127 = 141, or 10001101 (base 2),
fraction of the significand: 11010101000100010011010
0 10001101 11010101000100010011010 (base 2)
01000110 11101010 10001000 10011010 (base 2)
46 EA 88 9A (base 16)

Related

Why does EXIF geodata need so much precision?

According to the spec, EXIF stores latitude and longitude with 192 bits of precision each. But a simple calculation shows that you only need 32 bits to divide the circumference of the Earth into segments of 9 mm:
r = 6378 km = 6.378 × 10^6 m
C = 2πr = 4.007 × 10^7 m
stepSize = C / 2^32 = 0.009 m = 9 mm
That's assuming you store the data in steps of equal size, i.e. as an unsigned int. I can understand that would make the handling code harder to write, so what the hell: let's use 64 bits. Treating those as equal-size steps, we can divide the Earth's circumference into steps of 2 picometers. A helium atom has a diameter of 62 picometers. So at 64 bits, we have enough precision to divide the Earth's surface at subatomic scales.
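The step-size arithmetic can be checked directly (note that C = 2πr comes out to about 4.007 × 10^7 m, not 10^6):

```python
import math

r = 6.378e6                 # Earth radius in metres
C = 2 * math.pi * r         # circumference, ~4.007e7 m
step32 = C / 2**32          # ~0.0093 m, i.e. about 9 mm
step64 = C / 2**64          # ~2.2e-12 m, about 2 picometres

print(round(step32 * 1000, 1))  # step in mm
```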
What on Earth do we need 192 bits per angle for?
The format stores latitude and longitude each as 6 32-bit integer values, which adds up to 192 bits. The 6 integers store each of degrees, minutes and seconds as rational numbers with a numerator and denominator.
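As a sketch of that layout (the helper name to_exif_dms and the limit_denominator choice are my own, not from the EXIF spec; negative angles are ignored for brevity):

```python
from fractions import Fraction

def to_exif_dms(deg):
    """Split non-negative decimal degrees into three
    (numerator, denominator) rationals - degrees, minutes, seconds -
    i.e. six 32-bit integers, 192 bits in total."""
    d = int(deg)
    m = int((deg - d) * 60)
    s = Fraction(deg).limit_denominator(10**6) * 3600 - d * 3600 - m * 60
    return ((d, 1), (m, 1), (s.numerator, s.denominator))

print(to_exif_dms(48.5))   # ((48, 1), (30, 1), (0, 1))
```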
Why this format? Presumably it's designed for very simple processors that can't handle floating point, and might not even be able to do division. The format is more than 25 years old (though I'm not sure when GPS data was added), and cameras weren't as smart back then. Cameras needed to be able to store lots of data (pictures are big), but they didn't need to do a lot of mathematical operations on it. So they wasted some bits to make manipulation easier.

Range of Values Represented by 1's Complement with 7 bits

Assume that there are 7 bits available to store a binary number. Specify the range of numbers that can be represented in 1's complement. I found that the range for 2's complement is -64 ≤ x ≤ 63. How do I do this for 1's complement?
In the 2's complement representation of signed binary numbers, the range of an N-bit number is -2^(N-1) to 2^(N-1)-1.
That is why you obtained the range -64 to 63 for a 7-bit number.
In the 1's complement representation, the range is -(2^(N-1)-1) to 2^(N-1)-1.
That gives a range of -63 to 63 for a 7-bit number in 1's complement.
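A minimal check of both formulas in Python:

```python
def twos_complement_range(n):
    # -2^(N-1) ... 2^(N-1) - 1
    return -(2**(n - 1)), 2**(n - 1) - 1

def ones_complement_range(n):
    # -(2^(N-1) - 1) ... 2^(N-1) - 1 (one fewer value: +0 and -0 coincide)
    return -(2**(n - 1) - 1), 2**(n - 1) - 1

print(twos_complement_range(7))  # (-64, 63)
print(ones_complement_range(7))  # (-63, 63)
```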

Normalized values, when summed are more than 1

I have two files:
File 1:
TOPIC:topic_0 1294
aa 234
bb 123
TOPIC:topic_1 2348
aa 833
cc 239
bb 233
File 2:
0.1 0.2 0.3 0.4
This is just the format of my files. Basically, when the second column (omitting the first "TOPIC" line) is summed for each topic, it adds up to 1, as the values are normalized. Similarly, the values in file 2 are normalized and also sum to 1.
I perform multiplication of the values from file 1 and 2. The resulting output file looks like:
aa 231
bb 379
cc 773
The second column of the output file, when summed, should give 1. But a few files have values slightly over 1, like 1.1 or 1.00038. How can I get exactly 1 for the output file? Is it some rounding off that I should do, or something else?
PS: The formats are just examples, the values and words are different. This is just for understanding purposes. Please help me sort this.
Python stores floating point decimals in base-2.
https://docs.python.org/2/tutorial/floatingpoint.html
This means that some decimals could be terminating in base-10, but are repeating in base-2, hence the floating-point error when you add them up.
This gets into some math, but imagine trying to express the value 2/6 in base-10. When you eliminate the common factors from the numerator and denominator, it's 1/3.
That's 0.333333333... repeating forever. I'll explain why in a moment, but for now, understand that if you only store the first 16 digits of the decimal, for example, then when you multiply the number by 3 you won't get 1, you'll get 0.9999999999999999, which is a little off.
This rounding error occurs whenever there's a repeating decimal.
Here's why your numbers don't repeat in base-10, but they do repeat in base-2.
Decimals are in base-10, which prime factors out to 2^1 * 5^1. Therefore for any ratio to terminate in base-10, its denominator must prime factor to a combination of 2's and 5's, and nothing else.
Now let's get back to Python. Every decimal is stored as binary. This means that in order for a ratio's "decimal" to terminate, the denominator must prime factor to only 2's and nothing else.
Your numbers repeat in base-2.
1/10 has (2*5) in the denominator.
2/10 reduces to 1/5 which still has five in the denominator.
3/10... well you get the idea.
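You can see the effect, and two common fixes, directly in Python: math.fsum computes a correctly rounded sum, and Fraction avoids binary floats entirely. (A sketch using the classic 0.1 example, not your actual data.)

```python
import math
from fractions import Fraction

vals = [0.1] * 10   # each stored 0.1 is already slightly off in binary

print(sum(vals) == 1.0)        # False: naive left-to-right sum drifts
print(math.fsum(vals) == 1.0)  # True: correctly rounded summation
print(sum(Fraction(v).limit_denominator() for v in vals) == 1)  # True: exact rationals
```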

How much precision does a binary fixed-point representation have?

I am trying to measure how much accuracy I get when I convert to a binary fixed-point representation.
First I tried 0.9375, and I got the binary 0.1111.
Second I tried 0.9377, and I also got the binary 0.1111.
There is no difference between them.
How can I solve this problem?
Is there any other way to do the conversion?
To aid understanding, let me give one more example.
If I convert 3.575 to binary, I get 11.1001,
but if I convert that back to decimal, I get 3.5625, which is quite different from the original value.
From a similar question we have:
Base 2: Twos complement 4 integer, 4 bit fractional
-2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4
-8 4 2 1 . 0.5 0.25 0.125 0.0625
With only 4 fractional bits the represented number only has a resolution of 0.0625.
3.575 could be 11.1001 = 2 + 1 + 0.5 + 0.0625 => 3.5625, too low
or 11.1010 = 2 + 1 + 0.5 + 0.125 => 3.625, too high
This should indicate that 4 bits is just not enough to represent "3.575" exactly.
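This is also why 0.9375 and 0.9377 came out identical: with 4 fractional bits every input is quantized to the nearest multiple of 0.0625. A small sketch (the helper name to_fixed is my own):

```python
def to_fixed(x, frac_bits):
    # Quantize x to the nearest multiple of 2^-frac_bits,
    # returned as the scaled integer code
    return round(x * 2**frac_bits)

print(bin(to_fixed(0.9375, 4)))  # 0b1111
print(bin(to_fixed(0.9377, 4)))  # 0b1111 - same code, the difference is lost
print(to_fixed(0.9377, 16) != to_fixed(0.9375, 16))  # True: 16 bits can tell them apart
```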
To figure out how many bits you would need, multiply by a power of 2 until you get an integer. For "3.575" it is rather a lot (50 fractional bits; note that what you are really converting is the double nearest to 3.575, since the exact decimal 3.575 = 143/40 has a factor of 5 in its denominator and never terminates in binary):
3.575 * 2^2 = 14.3 (not integer)
3.575 * 2^20 = 3748659.2
3.575 * 2^30 = 3838627020.8
3.575 * 2^40 = 3930754069299.2 (not integer)
3.575 * 2^50 = 4025092166962381.0 (INTEGER) we need 50 bits!
3.575 => 11.10010011001100110011001100110011001100110011001101
Multiplying by a power of two shifts the word to the left (<<). When there are no fractional bits left, the number is fully represented; the number of shifts is the number of fractional bits required.
For fixed point you are better off thinking about the level of precision your application requires.
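The shift-until-integer procedure above can be sketched directly (the function name is my own; it counts the fractional bits of the double nearest to the literal, since that is what the source code stores):

```python
def fractional_bits(x):
    """Count how many fractional bits are needed to represent the
    float x exactly. Doubling a float is always exact, so the loop
    terminates for any finite x."""
    n = 0
    while x != int(x):
        x *= 2
        n += 1
    return n

print(fractional_bits(0.9375))  # 4
print(fractional_bits(3.575))   # 50
```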

Verilog - Floating points multiplication

We have a problem with Verilog.
We have to multiply two binary numbers with a fractional part, but it doesn't work 100% perfectly.
We have a reg m[31:0]. The digits before the binary point are m[31:16] and the digits after the point are m[15:0], so we have:
m[31:16] = 1000000000000000;
m[15:0] = 1000000000000000;
m[31:0] = 1000000000000000(.)1000000000000000;
The problem is: we want to multiply numbers with decimal places, but we don't know how.
For example: m = 2.5 in binary; the result of m*m should be 6.25.
The question does not fully explain what is already understood about fixed-point numbers, so I will cover a little background which might not be relevant to the OP.
The decimal weighting of unsigned binary (base 2) numbers, 4bit for the example follows this rule:
2^3 2^2 2^1 2^0 (Base 2)
8 4 2 1
Just for reference, the powers stay the same and only the base changes. For a 4-digit hex number it would be:
16^3 16^2 16^1 16^0
4096 256 16 1
Back to base 2, for twos complement signed number the MSB (Most Significant Bit) becomes negative.
-2^3 2^2 2^1 2^0 (Base 2, Twos complement)
-8 4 2 1
When we insert a binary point or fractional bit the pattern continues. 4 Integer bits 4 fractional bits.
Base 2: Twos complement 4 integer, 4 bit fractional
-2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4
-8 4 2 1 . 0.5 0.25 0.125 0.0625
Unfortunately Verilog does not have a fixed-point format, so the user has to keep track of the binary point and work with scaled numbers. Decimal points (.) cannot be used in Verilog numbers stored as reg or logic, as these are essentially integer formats. However, Verilog does ignore _ when placed in number declarations, so it can be used as the binary point in numbers. Its use is only symbolic and has no meaning to the language.
In the above format 2.5 would be represented by 8'b0010_1000, the question has 16 fractional bits therefore you need to place 16 bits after the _ to keep the binary point in the correct place.
Fixed-point Multiplication bit widths
If we have two numbers A and B the width of the result A*B will be:
Integer bits = A.integer_bits + B.integer_bits.
Fractional bits = A.fractional_bits + B.fractional_bits.
Therefore [4 Int, 4 Frac] * [4 Int, 4 Frac] => [8 Int, 8 Frac]
reg [7:0] a = 8'b0010_1000;
reg [7:0] b = 8'b0010_1000;
reg [15:0] sum;
always @* begin
  sum = a * b;
  $displayb(sum); // Binary
  $display(sum);  // Decimal
end
// sum == 00000110_01000000; //Decimal->6.25
Example on EDA Playground.
From this you should be able to change the widths to suit any type of fixed-point number, and casting back to a 16 Int, 16 fractional number can be done by part-selecting the correct bits. Be careful if you need to saturate instead of overflowing.
There is a related Q&A that has 22 fractional bits.
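The same computation can be cross-checked outside Verilog by treating the operands as plain scaled integers (a Python sketch of the 4.4-format example above, not part of the original design):

```python
FRAC = 4                      # 4.4 fixed point: scale factor is 2^4

a = int(2.5 * 2**FRAC)        # 40 == 0b0010_1000, i.e. 2.5 in 4.4 format
b = int(2.5 * 2**FRAC)

prod = a * b                  # plain integer multiply; result is 8.8 format
print(bin(prod))              # 0b11001000000 -> 00000110_01000000
print(prod / 2**(2 * FRAC))   # 6.25, scaling back by the combined fraction bits
```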
