Fixed point to Floating point - point

I have the following code, I have just copied some data from external RAM to the MCU into a buffer called "data"
double p32 = 4.294967296e+009; /// equals to 2^32 in decimal notation
int32_t longhigh;
uint32_t longlow;
offset = mapdata(); //Points to the data I want, 55 bit fixed point on HW
longhigh = data[2*offset+1]; //Gets upperpart of data
longlow = data[2*offset]; //Gets lower part
double floating = (longhigh*p32 + longlow); // What is this doing? How does it work?
Can someone explain that last line of code for me? Why are we multiplying by p32? Thanks.

Multiplying by p32 is equivalent to a left shift by 32 bits. It also results in a type conversion for the product (from int to double), as well as for the sum. This way you can essentially keep 64-bit ints in the buffer and convert them to doubles when required.
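The same arithmetic can be sketched in Python to see what the last line does. This is only an illustration: the helper names `words_to_double` and `words_to_int` are made up here, and the values stand in for the `longhigh` (signed) and `longlow` (unsigned) words from the question.

```python
P32 = 2.0 ** 32  # same value as p32 in the question (4294967296.0)

def words_to_double(high: int, low: int) -> float:
    """Mimic `longhigh * p32 + longlow`: scale the high word by 2^32, add the low word."""
    return high * P32 + low

def words_to_int(high: int, low: int) -> int:
    """Exact integer equivalent: shift the high word left by 32 bits and add the low word."""
    return (high << 32) + low

print(words_to_double(1, 5))   # 4294967301.0
print(words_to_int(1, 5))      # 4294967301

# A double carries only 53 significant bits, so a value that needs more
# (here 2^54 + 1) loses its lowest bits in the floating-point version.
print(words_to_int(2 ** 22, 1))          # 18014398509481985
print(int(words_to_double(2 ** 22, 1)))  # 18014398509481984
```

If the hardware value really uses 55 bits, the lowest bits can therefore be rounded away by the conversion to double.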

Related

Float to Int type conversion in Python for large integers/numbers

I need some help with the piece of code below. Why is the original number in "a" different from "c" after it goes through a type conversion? Is there any way to make "a" and "c" the same after the float -> int conversion?
a = '46700000000987654321'
b = float(a) => 4.670000000098765e+19
c = int(b) => 46700000000987652096
a == c => False
Please read this document about Floating Point Arithmetic: Issues and Limitations :
https://docs.python.org/3/tutorial/floatingpoint.html
for your example:
from decimal import Decimal
a='46700000000987654321'
b=Decimal(a)
print(b) #46700000000987654321
c=int(b)
print(c) #46700000000987654321
Modified version of my answer to another question (reasonably) duped to this one:
This happens because 46700000000987654321 is greater than the integer representational limits of a C double (which is what a Python float is implemented in terms of).
Typically, C doubles are IEEE 754 64 bit binary floating point values, which means they have 53 bits of integer precision (the last consecutive integer values float can represent are 2 ** 53 - 1 followed by 2 ** 53; it can't represent 2 ** 53 + 1). Problem is, 46700000000987654321 requires 66 bits of integer precision to store ((46700000000987654321).bit_length() will provide this information). When a value is too large for the significand (the integer component) alone, the exponent component of the floating point value is used to scale a smaller integer value by powers of 2 to be roughly in the ballpark of the original value, but this means that the representable integers start to skip, first by 2 (as you require >53 bits), then by 4 (for >54 bits), then 8 (>55 bits), then 16 (>56 bits), etc., skipping twice as far between representable values for each additional bit of magnitude you have beyond 53 bits.
In your case, 46700000000987654321, converted to float, has an integer value of 46700000000987652096 (as you noted), having lost precision in the low digits.
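The skipping described above can be checked directly. This is a small sketch; `math.ulp` requires Python 3.9 or later, and the variable names are only illustrative.

```python
import math

a = 46700000000987654321

# Number of bits needed to store the integer exactly.
print(a.bit_length())            # 66

# 2**53 + 1 is the first integer a float cannot represent.
print(float(2 ** 53 + 1) == float(2 ** 53))   # True

# Gap between neighbouring floats at this magnitude: 2**(66 - 53).
print(math.ulp(float(a)))        # 8192.0

# The nearest representable value, as seen in the question.
print(int(float(a)))             # 46700000000987652096
```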
If you need arbitrarily precise base-10 floating point math, replace your use of float with decimal.Decimal (conveniently, your initial value is already a string, so you don't risk loss of precision between how you type a float and the actual value stored); the default precision will handle these values, and you can increase it if you need larger values. If you do that (and convert a to an int for the comparison, since a str is never equal to any numeric type), you get the behavior you expected:
from decimal import Decimal as Dec, getcontext
a = "46700000000987654321"
b = Dec(a); print(b) # => 46700000000987654321
c = int(b); print(c) # => 46700000000987654321
print(int(a) == c) # => True
If you echo the Decimals in an interactive interpreter instead of using print, you'd see Decimal('46700000000987654321') instead, which is the repr form of Decimals, but it's numerically 46700000000987654321, and if converted to int, or stringified via any method that doesn't use the repr, e.g. print, it displays as just 46700000000987654321.

System Verilog DPI Checker masking 128-bit value

I am writing a DPI checker (a .cpp file). The checker reads a 128-bit value on every line, and I want to mask it with a 128-bit mask and compare the result with the RTL value. The issue I am seeing is that the data type I am creating the mask from only holds a 32-bit value, and I need to do a bitwise AND with the original data. Can anyone provide any suggestions?
typedef struct {
BitVector<128> data;
} read_line;
svBitVecVal mask;
mask = 0xffffffffffffffffffffffffffffffff;
BitVector<128> data_masked;
data_masked = read_line->data & mask;
Here svBitVecVal can only hold a maximum of 32bit value. data_masked will not show the correct value if the mask is more than 32-bit.
I don't know where you got your BitVector template from, but svBitVecVal is usually used as svBitVecVal* pointing to an array of 32-bit chunks.
mask needs to be a 4 element array to hold 128 bits.
And you'll need to either convert mask to a BitVector type first, or make sure it has the right overloaded function to do it for you.
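The chunked layout can be modelled in Python to show the idea. This is a sketch of the word-by-word masking only, not of the DPI API: the helper names are hypothetical, and storing the least-significant 32-bit word first is an assumption about the array order.

```python
WORD_BITS = 32
WORD_MASK = 0xFFFFFFFF

def to_words(value: int, n_words: int = 4) -> list:
    """Split an integer into n_words 32-bit chunks, least-significant word first."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(n_words)]

def from_words(words: list) -> int:
    """Reassemble 32-bit chunks (least-significant first) into one integer."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def and_words(data_words: list, mask_words: list) -> list:
    """Apply the mask one 32-bit chunk at a time."""
    return [d & m for d, m in zip(data_words, mask_words)]

data = 0x0123456789ABCDEF0123456789ABCDEF
mask = 0xFFFFFFFF00000000FFFFFFFF00000000

masked = from_words(and_words(to_words(data), to_words(mask)))
print(hex(masked))
print(masked == (data & mask))  # True
```

In C++ the same loop would run over a four-element svBitVecVal array instead of a Python list.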

Vulkan - strange mapping of float shader color values to uchar values in a read buffer

I know that a float color value in the range [0..1] in a shader is mapped to the range [0..255] in a UCHAR buffer.
Accordingly, I was expecting the UCHAR value to change at steps of 1/255 in the shader color value.
But the results were surprisingly different. Here is for the first two steps:
Red float value in Shader -> UCHAR value in a read Buffer
0.000000 -> 0
0.002197 -> 0
0.002198 -> 1
0.006102 -> 1
0.006105 -> 2
The first two steps are around 0.002197 and 0.006102 which are different than the expected steps: 0.00392 and 0.00784.
So what is the mapping formula ?
Unsigned integer normalization is based on the formula f = i/INT_MAX, where f is the floating point value (after clamping to [0, 1]), i is the integer value, and INT_MAX is the maximum integer value for the integer's bitdepth (255 in this case).
So if you have a float, and want the unsigned, normalized integer value of it, you use i = f * INT_MAX. Of course... integers do not have the same precision as floats. So if the result of f * INT_MAX is 0.5, what is the integer value of that? It could be 0, or it could be 1, depending on how things are rounded.
Implementations are permitted to round integer values in any way they prefer. They are encouraged to use nearest rounding (the post-conversion 0.49 would become 0, and 0.5 would become 1), but that is not a requirement. The only requirements are that it must pick one of the two nearest values (it can't turn 0.5 into 3) and that the exact floating-point values of 0.0 and 1.0 (which includes any values clamped to them) must be exactly represented as integer 0 and INT_MAX.
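As a point of comparison, here is a sketch of the conversion under nearest rounding, which is the encouraged but not required behaviour. The function name `unorm8_nearest` is made up for this illustration, and an actual implementation may round differently within the limits described above.

```python
import math

INT_MAX_8BIT = 255  # maximum value for an 8-bit unsigned normalized integer

def unorm8_nearest(f: float) -> int:
    """Clamp f to [0, 1], scale by 255, and round to the nearest integer."""
    f = min(max(f, 0.0), 1.0)
    return int(math.floor(f * INT_MAX_8BIT + 0.5))

print(unorm8_nearest(0.0))    # 0
print(unorm8_nearest(1.0))    # 255
print(unorm8_nearest(0.001))  # 0
print(unorm8_nearest(0.003))  # 1

# Under nearest rounding the output changes halfway between integer steps,
# i.e. near 0.5/255 and 1.5/255 rather than at 1/255 and 2/255.
print(round(0.5 / 255, 5), round(1.5 / 255, 5))  # 0.00196 0.00588
```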
If you have an explicit need to have direct rounding, you can always do the normalization yourself. In fact, GLSL has specific functions to help you. The following assumes that you are trying to write to a texture with the Vulkan format R8G8B8A8_UNORM, and we're assuming you're writing to a storage image, not via outputs from the fragment shader (you can do that too, but you lose blending).
So, step 1 is to change your layout format to be r32ui. That is, you are now writing an unsigned 32-bit value, rather than 4 unsigned 8-bit normalized values. That's perfectly valid.
Step 2 is to employ the packUnorm4x8 function. This function does float-to-integer normalization, and its specification explicitly defines the rounding (round to nearest). Use the return value of that function in your imageStore function, and you're fine.
If you want to write to a fragment shader output, that's a bit more complex. There, you will need to use a different image view, one that uses the R32_UINT format. So you're creating a 32-bit unsigned integer view of a 4x8-bit normalized texture. That has to become a render target, so you're going to have to do subpass surgery. From there, just write the result of packUnorm4x8.
Of course, you immediately lose blending and similar operations, since you're writing integer values. And since you had to do that subpass surgery, it's likely that any shader writing to it will need to do this too.
Also, note that in both cases, you will likely need to adjust the order of the components of the value you write. packUnorm4x8 is explicitly defined to be little endian, whereas (I believe?) R8G8B8A8 is specified to be in that order, most-significant to least. So you'll probably need to essentially do endian swapping with packUnorm4x8(value.abgr).

Add floats yields Double

This groovy:
float a = 1;
float b = 2;
def r = a + b;
Creates this Java code when reversed from .class with IntelliJ:
float a = (float)1;
float b = (float)2;
Object r = null;
double var7 = (double)a + (double)b;
r = Double.valueOf(var7);
So r contains a Double.
If I do this:
float a = 1;
float b = 2;
float r = a + b;
It generates code that performs the addition with doubles and converts back to float:
float a = (float)1;
float b = (float)2;
float r = 0.0F;
double var7 = (double)a + (double)b;
r = (float)var7;
So should one abandon floats with groovy as it seems to not want to use them anyway?
Groovy falls back to five standard result types for numeric operations: int, long, BigInteger, double and BigDecimal. Thus adding or multiplying two floats returns a double. Division and pow are special.
From http://www.groovy-lang.org/syntax.html
Division and power binary operations aside,
binary operations between byte, char, short and int result in int
binary operations involving long with byte, char, short and int result in long
binary operations involving BigInteger and any other integral type result in BigInteger
binary operations between float, double and BigDecimal result in double
binary operations between two BigDecimal result in BigDecimal
As for if you should abandon float... normally it is good enough to convert the double to float, especially since groovy is doing that automatically for you.
.net (C#) does something similar with 16-bit integers: Addition of Bytes or Int16s yield Int32. Possibly to prevent overflows.
Operations with "smaller" data types may result in the "bigger" data types. And with bigger, I mean more bits.
As illustrated in this example (more digits also means more bits)
15 (2 digits) x 15 (2 digits) = 225 (3 digits)
1.5 (2 digits) x 1.5 (2 digits) = 2.25 (3 digits)
However, adding two 32-bit integers returns just a 32-bit integer, and adding two doubles just returns a double. This is because the (virtual) machine is optimized for working with these sizes, which in turn is because physical processors used to be optimized for them. Some of them still are. 32-bit operations are often still faster than 64-bit operations, even on 64-bit processors. 16-bit operations, however, are barely faster, if at all.
Your compiler attempts to protect you against overflows, and allows you to check for them explicitly. So unless you have a good reason not to, I'd default to using these types, and optionally truncate to a more compact type when storing the data.
Good reasons not to do so include scenarios where you process large amounts (1000s) of numbers, e.g. for graphics processing.

Convert UInt32 to float without rounding?

I need to convert a UInt32 type to a float without having it rounded. Say I do
float num = 4278190335;
uint num1 = num;
The value instantly gets changed to 4278190336. Is there any way around this?
I need to convert a UInt32 type to a float without having it rounded.
That can't be done.
There are 2^32 possible uint values. There are fewer than 2^32 float values (there are 2^32 bit patterns, but that includes various NaN values). Add to that the fact that there are obviously a lot of float values which can't be represented as uint (e.g. 0.5) and it becomes clear that you can't represent every uint value exactly in a float. However, every uint (and every int) can be represented exactly as a double, so that might be a solution to your problem.
The problem you're seeing in your original source code is that 4278190335 isn't exactly representable as a float; the closest float value is 4278190336. This isn't a problem with the conversion from float to uint - it's a problem with the conversion from the exact value you've specified in your source code into a float; the float to uint conversion happens separately (and again, can easily lose information).
float has only 23 bits of mantissa. Together with the implicit leading 1 bit, it can exactly represent every integer that fits in 24 bits. For larger numbers it can only store the nearest representable value. 4278190335 = 0xFF0000FF > 2^24, so it is rounded to 4278190336 when converted to float.
Similarly, double has 52 bits of mantissa and can represent every integer in the range [-2^53, 2^53] exactly, so it can store any value that fits in a 32-bit int, including 4278190335. But again, double can't store every number in long's range, even though both types have the same size (64 bits).
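The rounding can be reproduced outside C# as well. The sketch below uses Python's `struct` module to force a value through a 32-bit float; the helper name `to_float32` is only illustrative.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a Python float (a double)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_float32(4278190335.0))    # 4278190336.0, rounded as in the question
print(to_float32(16777216.0))      # 16777216.0, 2^24 is still exact
print(to_float32(16777217.0))      # 16777216.0, 2^24 + 1 is not representable

# A double holds the same value without loss.
print(int(float(4278190335)))      # 4278190335
```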
Aside from your question being worded backward; I think what you are saying is.
You need to get the integer portion of a float value, i.e. its whole number value without the fractional part. In that case you can simply cast the float to an int; a cast truncates toward zero rather than rounding.
e.g.
float myFloat = 1.5f;
uint myInt = (uint)myFloat; //myInt == 1
Keep in mind, though, that this isn't always clear to others reading your code. To help, there are Math.Floor and Math.Ceiling: Floor returns the largest whole number not greater than the value, and Ceiling returns the smallest whole number not less than it.
e.g.
float myFloat = 1.5f;
uint myFloorInt = (uint)Math.Floor(myFloat); //myFloorInt == 1
uint myCeilingInt = (uint)Math.Ceiling(myFloat); //myCeilingInt == 2
You will need to cast or convert the value from float to uint, int, etc. as your needs dictate. Most frown on casting as the resulting value isn't always clear to people ... Convert has various methods to help you convert one value to another in nice clearly understandable way.
There is no way to get the original value back with your method.
I suggest trying a byte-for-byte copy so that the data can be retrieved later; a float typecast can change the original value.
If your processor is 32-bit, this could help you:
uint32 x;
float y;
memcpy((uint8*)&y,(uint8*)&x,4);
(mohandes...)
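The byte-copy idea can be sketched in Python with the `struct` module. Note that this stores the bit pattern of the integer inside a float, not its numeric value, so the float itself reads as an unrelated number. The helper names are hypothetical, and bit patterns that happen to encode NaN are not guaranteed to survive every round trip.

```python
import struct

def uint32_bits_to_float(x: int) -> float:
    """Reinterpret the 4 bytes of a uint32 as a 32-bit float (like the memcpy above)."""
    return struct.unpack("<f", struct.pack("<I", x))[0]

def float_to_uint32_bits(y: float) -> int:
    """Reinterpret the 4 bytes of a 32-bit float as a uint32."""
    return struct.unpack("<I", struct.pack("<f", y))[0]

original = 4278190335                   # 0xFF0000FF
as_float = uint32_bits_to_float(original)
restored = float_to_uint32_bits(as_float)

print(restored == original)             # True: the bits survive the round trip
print(as_float == float(original))      # False: the float's value is unrelated
print(uint32_bits_to_float(0x3F800000)) # 1.0
```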
