We have a problem with Verilog.
We have to multiply two fixed-point (binary) numbers, but it doesn't work 100% correctly.
We have a reg m[31:0]. The integer bits (before the binary point) are m[31:16] and the fractional bits (after the binary point) are m[15:0], so we have something like:
m[31:16] = 1000000000000000;
m[15:0] = 1000000000000000;
m[31:0] = 1000000000000000(.)1000000000000000;
The problem is: we want to multiply numbers with fractional places, but we don't know how.
For example: m = 2.5 in binary. The result of m*m should be 6.25.
It is not clear from the question how much is already understood about fixed-point numbers, so I will cover a little background which might not be relevant to the OP.
The decimal weighting of unsigned binary (base 2) numbers, 4 bits for this example, follows this rule:
2^3 2^2 2^1 2^0 (Base 2)
8 4 2 1
Just for reference, the powers stay the same and only the base changes. For 4 hex digits it would be:
16^3 16^2 16^1 16^0
4096 256 16 1
Back to base 2: for a two's-complement signed number, the MSB (Most Significant Bit) carries a negative weight.
-2^3 2^2 2^1 2^0 (Base 2, Twos complement)
-8 4 2 1
When we insert a binary point or fractional bit the pattern continues. 4 Integer bits 4 fractional bits.
Base 2: Two's complement, 4 integer bits, 4 fractional bits
-2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4
-8 4 2 1 . 0.5 0.25 0.125 0.0625
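As a concrete illustration of that weight table (my addition, not part of the original answer), here is a short Python sketch that evaluates an 8-bit two's-complement word with 4 integer and 4 fractional bits:

```python
# Evaluate a two's-complement fixed-point word with 4 integer
# and 4 fractional bits, using the weights from the table above.
def q4_4_value(bits):
    """bits is an 8-character string, MSB first (weight -2^3)."""
    weights = [-2**3, 2**2, 2**1, 2**0, 2**-1, 2**-2, 2**-3, 2**-4]
    return sum(w for b, w in zip(bits, weights) if b == "1")

print(q4_4_value("00101000"))  # 2 + 0.5 = 2.5
print(q4_4_value("11111111"))  # -8 + 4 + 2 + 1 + 0.5 + 0.25 + 0.125 + 0.0625 = -0.0625
```

Note how the all-ones pattern is just below zero, as expected for two's complement.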
Unfortunately Verilog does not have a fixed-point format, so the user has to keep track of the binary point and work with scaled numbers. Decimal points (.) cannot be used in Verilog numbers stored as reg or logic, as these are essentially integer formats. However, Verilog does ignore _ when placed in number declarations, so it can be used as the binary point within numbers. Its use is purely symbolic and has no meaning to the language.
In the above format, 2.5 would be represented by 8'b0010_1000. The question has 16 fractional bits, therefore you need to place 16 bits after the _ to keep the binary point in the correct place.
Fixed-point Multiplication bit widths
If we have two numbers A and B the width of the result A*B will be:
Integer bits = A.integer_bits + B.integer_bits.
Fractional bits = A.fractional_bits + B.fractional_bits.
Therefore [4 Int, 4 Frac] * [4 Int, 4 Frac] => [8 Int, 8 Frac]
reg [7:0] a = 8'b0010_1000;
reg [7:0] b = 8'b0010_1000;
reg [15:0] sum;

initial begin
  sum = a * b;
  $displayb(sum); // Binary
  $display(sum);  // Decimal
end
// sum == 00000110_01000000; //Decimal->6.25
Example on EDA Playground.
From this you should be able to change the widths to suit any type of fixed-point number. Casting back to a 16 integer, 16 fractional number can be done by part-selecting the correct bits. Be careful if you need to saturate instead of overflowing.
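The same arithmetic can be checked outside the simulator. Here is a Python sketch (my addition) of Q16.16 multiplication using scaled integers, with simple truncation rather than saturation:

```python
# Fixed-point multiply using scaled integers: a Qi.f number is stored
# as round(value * 2**f). Multiplying two Q16.16 values yields Q32.32;
# casting back to Q16.16 means shifting right by 16 (truncating).
def to_fixed(value, frac_bits):
    return int(round(value * (1 << frac_bits)))

def fixed_mul_q16_16(a, b):
    product = a * b       # Q32.32 result
    return product >> 16  # truncate back to Q16.16

a = to_fixed(2.5, 16)
b = to_fixed(2.5, 16)
result = fixed_mul_q16_16(a, b)
print(result / (1 << 16))  # 6.25
```

The right shift here discards the low fractional bits, mirroring the part-select in the Verilog version.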
There is a related Q&A that has 22 fractional bits.
I've been trying to figure out how biased exponents work.
8 bits are reserved for the exponent itself, so it's either -127 to 127 or 0 to 255. When I want to store a number whose exponent part doesn't fit into 8 bits, where does it obtain additional bits to store that data?
In case you are going to say it uses the bias as an offset, then please provide additional info on where exactly the data is stored.
To a first approximation, the value of a float with exponent e and significand f is 1.f x 2^e. There are special cases regarding subnormals, infinities, NaN, etc., but you can ignore those for starters. Essentially, the exponent really is the base-2 exponent in IEEE 754 notation. So the answer to your comment about how 30020.3f fits in 8 bits is simple: easily. You only need an exponent of 14 to represent it, which fits comfortably in the 8-bit biased exponent.
In fact, here's the exact binary representation of 30020.3 as a single-precision IEEE-754 float:
3 2 1 0
1 09876543 21098765432109876543210
S ---E8--- ----------F23----------
Binary: 0 10001101 11010101000100010011010
Hex: 46EA 889A
Precision: SP
Sign: Positive
Exponent: 14 (Stored: 141, Bias: 127)
Hex-float: +0x1.d51134p14
Value: +30020.3 (NORMAL)
As you can see we just store 14 in the exponent. The sign is 0, and the fraction covers the rest so that 1.f * 2^14 gives you the correct value.
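If you want to reproduce that breakdown, here is a small Python sketch (my addition, using only the standard struct module) that extracts the fields:

```python
import struct

# Reinterpret 30020.3 as its single-precision (binary32) bit pattern,
# then split off the sign, stored exponent, and fraction fields.
bits = struct.unpack(">I", struct.pack(">f", 30020.3))[0]
sign     = bits >> 31
stored_e = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF

print(hex(bits))                 # 0x46ea889a
print(stored_e, stored_e - 127)  # 141 14
print(format(fraction, "023b"))  # the 23 fraction bits
```

Subtracting the bias of 127 from the stored exponent 141 recovers the true exponent 14.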
What is the maximum value of the exponent of single precision floating point on MSVC?
Maximum value of the binary exponent: 254-BIAS --> 127
For a decimal perspective, <float.h> defines FLT_MAX_10_EXP as the "maximum integer such that 10 raised to that power is in the range of representable finite floating-point numbers", which is 38:
printf("%d %g\n", FLT_MAX_10_EXP, FLT_MAX);
// common result
38 3.40282e+38
8 bits are reserved for the exponent itself, so it's either -127 to 127 or 0 to 255.
Pretty close: for finite values, the raw exponent is more like [0...254], with the value 0 having special meaning: it is treated as if the raw exponent were 1, but with a 0. implied leading digit.
The exponent is then raw exponent - bias of 127 or [-126 to 127].
Recall this is an exponent for 2 raised to some power.
Using binary32, the maximum value of a biased exponent for finite values is 254.
Applying the bias of -127, the maximum value of the exponent is 254-127 or 127 in the form of:
biased_exponent > 0
(-1)^neg_sign * 1.(23-bit significand fraction) * 2^(biased_exponent - 127)
And for completeness, for subnormals and zero:
biased_exponent == 0
(-1)^neg_sign * 0.(23-bit significand fraction) * 2^(1 - 127)
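Combining the two cases, the largest finite binary32 value has biased exponent 254 and an all-ones fraction. A quick Python check (my addition) confirms it matches FLT_MAX:

```python
import struct

# Largest finite binary32: biased exponent 254 (true exponent 127)
# with all 23 fraction bits set, i.e. significand 2 - 2**-23.
max_finite = (2 - 2**-23) * 2.0**127
print(max_finite)  # 3.4028234663852886e+38, i.e. FLT_MAX

# Its bit pattern should be 0x7F7FFFFF (0 11111110 111...111).
word = struct.unpack(">I", struct.pack(">f", max_finite))[0]
print(hex(word))
```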
How does 30k fit into 8 bits? 30020 is for the exponent and .3 for the fraction.
Mathematically, 30020.3f has a whole-number portion of 30020 and a fraction. 30020 is not the exponent, nor is .3 the fraction stored elsewhere; all of the value contributes to both the exponent and the significand. floats are typically encoded as a binary 1.xxxxx (base 2) * 2^exponent.
printf("%a\n", 30020.3f); // 0x1.d51134p+14
+1.11010101000100010011010 (base 2) * 2^14
Encoding with binary32
sign + or 0,
biased exponent (14 + 127 = 141, or 10001101 in base 2)
fraction of significand 11010101000100010011010
0 10001101 11010101000100010011010 (base 2)
01000110 11101010 10001000 10011010 (base 2)
46 EA 88 9A (base 16)
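The assembly of those fields can be double-checked in Python (my addition; struct is used to decode the 32-bit word back to the nearest float32 value, which differs slightly from the decimal literal 30020.3):

```python
import struct

# Reassemble sign, biased exponent, and fraction into one 32-bit word.
sign     = 0
stored_e = 14 + 127                   # biased exponent: 141
fraction = 0b11010101000100010011010  # the 23 fraction bits
word = (sign << 31) | (stored_e << 23) | fraction
print(hex(word))                      # 0x46ea889a

# Decode the word as a float: the binary32 value nearest 30020.3.
value = struct.unpack(">f", struct.pack(">I", word))[0]
print(value)                          # 30020.30078125
```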
Is it possible to represent an RGBA color as a single value that resembles the retinal stimulation? The idea is something like:
0.0 value for black (no stimulation)
1.0 for white (full stimulation)
The RGBA colors in between should be represented by values that capture the amount of stimulation they cause to the eye like:
a very light yellow should have a very high value
a very dark brown should have a low value
Any ideas on this? Is converting to grayscale the only solution?
Thanks in advance!
Assign specific bits of a single number to each part of RGBA to represent your number.
If each part is 8 bits, the first 8 bits can be assigned to R, the second 8 bits to G, the third 8 bits to B, and the final 8 bits to A.
Let's say your RGBA values are 15, 4, 2, 1, and each one is given 4 bits.
In binary, R is 1111, G is 0100, B is 0010, A is 0001.
In a simple concatenation, your final number would be 1111010000100001 in binary, which is 62497. To get G out of this, divide 62497 by 256, truncate to an integer, then take it modulo 16. The divisor 256 is 16 to the second power because G is the third 4-bit group from the right (R would need the third power, B the first power). The modulus 16 is 2 to the fourth power because I used 4 bits per part.
62497 / 256 = 244, 244 % 16 = 4.
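That pack/unpack arithmetic can be sketched in Python (my own illustration of the scheme above, with R in the highest nibble):

```python
# Pack four 4-bit channels (R highest, A lowest) into one integer,
# then recover a single channel with divide-and-modulo.
def pack_rgba4(r, g, b, a):
    return (r << 12) | (g << 8) | (b << 4) | a

def channel(value, position):
    """position: 3=R, 2=G, 1=B, 0=A (nibbles from the right)."""
    return (value // 16**position) % 16

packed = pack_rgba4(15, 4, 2, 1)
print(packed)              # 62497
print(channel(packed, 2))  # 4  (the G channel)
```

Note this packs the channels losslessly; it does not model perceived brightness, which was the original question.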
We can round off a number, say 23 or 74, to 20 and 70 by looking at the number's least significant digit, and 26 or 78 to 30 and 80. My doubt is whether this is possible in Verilog code. I want to know whether this concept is possible after converting to digital.
In case you just want to make the LSB of any register zero, you can always do something like this:
reg [7:0] a;
reg [7:0] b = 8'd23;

always @* a = {b[7:1], 1'b0}; // replace the LSB with a constant 0
In general to round off integers of base n you add n/2 and the divide by n (discarding the remainder) and then multiply by n. So your examples above:
(23+5)/10 * 10 = 20
(28+5)/10 * 10 = 30
Doing this with binary logic is a bit expensive, since you need to divide and multiply. However, if you are rounding to a multiple of a power of 2, then those operations are free, since each is just a bit shift.
For an example in binary, let's say you want to round 61 to the nearest multiple of 8, which would be 64. In binary, 61 is 6'b111101. Add 8/2 (4, or 7'b0000100) to this and you get 7'b1000001, which is 65. Now divide by 8 and multiply by 8. Since that is just a shift right by 3 followed by a shift left by 3, we can simply zero out the 3 LSBs to get 7'b1000000, which is 64.
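The add-half-then-mask trick generalizes to any power-of-two step; here is a Python sketch (my addition):

```python
# Round to the nearest multiple of 2**k by adding half the step
# (2**(k-1)) and then zeroing the k least significant bits.
def round_to_pow2(x, k):
    return (x + (1 << (k - 1))) & ~((1 << k) - 1)

print(round_to_pow2(61, 3))  # 64 (61 + 4 = 65, clear 3 LSBs)
print(round_to_pow2(59, 3))  # 56 (59 is closer to 56 than to 64)
```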
I am trying to measure how much accuracy I lose when converting to a binary fixed-point representation.
First I tried 0.9375, and I got the binary 0.1111.
Second I tried 0.9377, and I also got the binary 0.1111.
There is no difference between them.
How can I solve this problem?
Is there any other way to do the conversion?
To help you understand, here is one more example:
If I want to convert 3.575 to binary, then 3.575 becomes 11.1001.
But if I convert back to decimal, I get 3.5625, which is quite different from the original value.
From a similar question we have:
Base 2: Twos complement 4 integer, 4 bit fractional
-2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4
-8 4 2 1 . 0.5 0.25 0.125 0.0625
With only 4 fractional bits, the represented number only has a resolution of 0.0625.
3.575 could be 11.1001 = 2 + 1 + 0.5 + 0.0625 => 3.5625, too low,
or 11.1010 = 2 + 1 + 0.5 + 0.125 => 3.625, too high.
This should indicate that 4 bits is just not enough to represent "3.575" exactly.
To figure out how many bits you would need, multiply by a power of 2 until you get an integer. For "3.575" it is rather a lot (50 fractional bits; strictly speaking, the decimal value 3.575 is not exactly representable in binary at all, and the 50 bits represent the nearest double-precision value, which is what a program actually starts from):
3.575 * 2^2 = 14.3 (not integer)
3.575 * 2^20 = 3748659.2
3.575 * 2^30 = 3838627020.8
3.575 * 2^40 = 3930754069299.2 (not integer)
3.575 * 2^50 = 4025092166962381.0 (INTEGER) we need 50 bits!
3.575 => 11.10010011001100110011001100110011001100110011001101
Multiplying by a power of two shifts the word to the left (<<). When there are no fractional bits left, the number is fully represented; the number of shifts is the number of fractional bits required.
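The shift-until-integer procedure above can be automated; here is a Python sketch (my addition; note the input is a Python float, i.e. already the double nearest the decimal literal):

```python
# Count how many fractional bits are needed to represent a value
# exactly, by doubling until no fractional part remains.
def fractional_bits(x):
    bits = 0
    while x != int(x):
        x *= 2
        bits += 1
    return bits

print(fractional_bits(0.9375))  # 4   -> 0.1111
print(fractional_bits(3.575))   # 50  (for the double nearest 3.575)
```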
For fixed point you are better off thinking about the level of precision your application requires.