Encoding data in an 8-bit float (cg) - graphics

The language is Cg.
I have an 8-bit float that must be between 0 and 1 (it's the 'a' component of a float4 rgba color value). I want to store a 6-bit unsigned integer and a 2-bit unsigned integer in that. How can I safely store these two values in those 8 bits?
I couldn't find any documentation on the format of an 8-bit float, especially one limited to values between 0 and 1. I'm assuming it's more complicated than just data / 255?

The OpenGL standard guarantees that at least 256 distinct values are preserved when writing to an 8-bit framebuffer. I'm pretty sure Cg does the same. (The stored value isn't really a float at all: it's an 8-bit unsigned normalized integer b, read back as b / 255.)
So you should be able to write your two values this way:
output.a = (4.0 * clamp(val1, 0.0, 63.0) + clamp(val2, 0.0, 3.0)) / 255.0;
And retrieve them like this:
float bits = floor(input.a * 255.0 + 0.5); // round to the nearest integer to absorb quantization error
float val1 = floor(bits / 4.0);
float val2 = fmod(bits, 4.0);
This is equivalent to the integer bitwise operations (val1 << 2) | val2 for packing, and shifting/masking for unpacking.
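As a quick sanity check outside the shader, here's a minimal Python sketch of the same round trip (my illustration, not part of the original answer), simulating the 8-bit quantization with / 255 and round:

def pack(val1, val2):
    # (val1 << 2) | val2, stored as an 8-bit normalized value
    byte = 4 * min(max(val1, 0), 63) + min(max(val2, 0), 3)
    return byte / 255.0

def unpack(a):
    bits = round(a * 255.0)      # undo the normalization
    return bits // 4, bits % 4   # high 6 bits, low 2 bits

# round-trips for every combination
assert all(unpack(pack(v1, v2)) == (v1, v2)
           for v1 in range(64) for v2 in range(4))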

Related

Vulkan - strange mapping of float shader color values to uchar values in a read buffer

I knew that a float color value in a shader, in the range [0..1], is mapped into the range [0..255] in a UCHAR buffer.
Accordingly, I expected steps of 1/255 in the shader color values for each change in the UCHAR buffer.
But the results were surprisingly different. Here is for the first two steps:
Red float value in Shader -> UCHAR value in a read Buffer
0.000000 -> 0
0.002197 -> 0
0.002198 -> 1
0.006102 -> 1
0.006105 -> 2
The first two steps are around 0.002197 and 0.006102, which differ from the expected steps of 0.00392 and 0.00784.
So what is the mapping formula?
Unsigned integer normalization is based on the formula f = i/INT_MAX, where f is the floating-point value (after clamping to [0, 1]), i is the integer value, and INT_MAX is the maximum integer value for the integer's bit depth (255 in this case).
So if you have a float, and want the unsigned, normalized integer value of it, you use i = f * INT_MAX. Of course... integers do not have the same precision as floats. So if the result of f * INT_MAX is 0.5, what is the integer value of that? It could be 0, or it could be 1, depending on how things are rounded.
Implementations are permitted to round integer values in any way they prefer. They are encouraged to use nearest rounding (the post-conversion 0.49 would become 0, and 0.5 would become 1), but that is not a requirement. The only requirements are that it must pick one of the two nearest values (it can't turn 0.5 into 3) and that the exact floating-point values of 0.0 and 1.0 (which includes any values clamped to them) must be exactly represented as integer 0 and INT_MAX.
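To make that latitude concrete, here's a small Python sketch (my illustration, not from the answer) contrasting truncation with round-to-nearest for the 8-bit case:

import math

def normalize_trunc(f):
    # truncate: always round toward zero
    return math.floor(max(0.0, min(1.0, f)) * 255)

def normalize_nearest(f):
    # round to the nearest integer
    return math.floor(max(0.0, min(1.0, f)) * 255 + 0.5)

f = 0.6 / 255.0              # 0.6 of the way between integer steps 0 and 1
print(normalize_trunc(f))    # 0: truncation drops the fraction
print(normalize_nearest(f))  # 1: nearest rounding goes up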
If you have an explicit need to have direct rounding, you can always do the normalization yourself. In fact, GLSL has specific functions to help you. The following assumes that you are trying to write to a texture with the Vulkan format R8G8B8A8_UNORM, and we're assuming you're writing to a storage image, not via outputs from the fragment shader (you can do that too, but you lose blending).
So, step 1 is to change your layout format to be r32ui. That is, you are now writing an unsigned 32-bit value, rather than 4 unsigned 8-bit normalized values. That's perfectly valid.
Step 2 is to employ the packUnorm4x8 function. This function does float-to-integer normalization, and the specification explicitly requires correct rounding. Use the return value of that function in your imageStore function, and you're fine.
If you want to write to a fragment shader output, that's a bit more complex. There, you will need to use a different image view, one that uses the R32_UINT format. So you're creating a 32-bit unsigned integer view of a 4x8-bit normalized texture. That has to become a render target, so you're going to have to do subpass surgery. From there, just write the result of packUnorm4x8.
Of course, you immediately lose blending and similar operations, since you're writing integer values. And since you had to do that subpass surgery, it's likely that any shader writing to it will need to do this too.
Also, note that in both cases you will likely need to adjust the order of the components of the value you write. packUnorm4x8 is explicitly defined to be little endian, whereas (I believe?) R8G8B8A8 is specified to be in that order, most significant to least. So you'll probably need to essentially do endian swapping with packUnorm4x8(value.abgr).
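To illustrate why the swizzle is needed, here's a Python model of the pack function's byte layout (my sketch of the spec's formula, not code from the answer):

def pack_unorm_4x8(rgba):
    # per the GLSL formula round(clamp(c, 0, 1) * 255);
    # component i lands in bits 8*i .. 8*i+7, i.e. the first
    # component occupies the least significant byte (little endian)
    out = 0
    for i, c in enumerate(rgba):
        out |= round(max(0.0, min(1.0, c)) * 255.0) << (8 * i)
    return out

print(hex(pack_unorm_4x8((1.0, 0.5, 0.25, 0.0))))  # 0x4080ff: R ends up in the low byte
print(hex(pack_unorm_4x8((0.0, 0.25, 0.5, 1.0))))  # 0xff804000: passing .abgr puts R in the high byte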

Adding floats yields Double

This Groovy:
float a = 1;
float b = 2;
def r = a + b;
Creates this Java code when decompiled from the .class file with IntelliJ:
float a = (float)1;
float b = (float)2;
Object r = null;
double var7 = (double)a + (double)b;
r = Double.valueOf(var7);
So r contains a Double.
If I do this:
float a = 1;
float b = 2;
float r = a + b;
It generates code that performs the addition with doubles and converts back to float:
float a = (float)1;
float b = (float)2;
float r = 0.0F;
double var7 = (double)a + (double)b;
r = (float)var7;
So should one abandon floats in Groovy, as it seems not to want to use them anyway?
Groovy decided to fall back to five standard result types for numeric operations: int, long, BigInteger, double and BigDecimal. Thus adding or multiplying two floats returns a double. Division and power are special cases.
From http://www.groovy-lang.org/syntax.html
Division and power binary operations aside,
- binary operations between byte, char, short and int result in int
- binary operations involving long with byte, char, short and int result in long
- binary operations involving BigInteger and any other integral type result in BigInteger
- binary operations between float, double and BigDecimal result in double
- binary operations between two BigDecimal result in BigDecimal
As for whether you should abandon float... normally it is good enough to convert the double back to float, especially since Groovy does that automatically for you.
.NET (C#) does something similar with 16-bit integers: addition of Byte or Int16 values yields an Int32, possibly to prevent overflows. Operations on "smaller" data types may result in "bigger" data types, where bigger means more bits.
As illustrated in this example (more digits also means more bits)
15 (2 digits) x 15 (2 digits) = 225 (3 digits)
1.5 (2 digits) x 1.5 (2 digits) = 2.25 (3 digits)
However, adding two 32-bit integers returns just a 32-bit integer, and adding two doubles just returns a double. This is because the (virtual) machine is optimized for working with these sizes, which is because physical processors used to be optimized for working with these sizes. Some of them still are: 32-bit operations are often still faster than 64-bit operations, even on 64-bit processors, while 16-bit operations are not, or barely.
Your compiler attempts to protect you against overflows and allows you to check for them explicitly. So unless you have a good reason not to, I'd default to using these types, and optionally truncate to a more compact type when storing the data.
Good reasons not to include scenarios where you process large quantities (thousands) of numbers, e.g. for graphics processing.

Convert UInt32 to float without rounding?

I need to convert a UInt32 type to a float without having it rounded. Say I do
float num = 4278190335;
uint num1 = num;
The value instantly gets changed to 4278190336. Is there any way around this?
I need to convert a UInt32 type to a float without having it rounded.
That can't be done.
There are 2^32 possible uint values. There are fewer than 2^32 float values (there are 2^32 bit patterns, but that includes various NaN values). Add to that the fact that there are obviously a lot of float values which can't be represented as uint (e.g. 0.5) and it becomes clear that you can't represent every uint value exactly in a float. However, every uint (and every int) can be represented exactly as a double, so that might be a solution to your problem.
The problem you're seeing in your original source code is that 4278190335 isn't exactly representable as a float; the closest float value is 4278190336. This isn't a problem with the conversion from float to uint - it's a problem with the conversion from the exact value you've specified in your source code into a float; the float to uint conversion happens separately (and again, can easily lose information).
float has only 23 bits of mantissa. Along with the implicit leading 1 bit, it can represent exactly only those numbers that fit in 24 bits. For anything larger it can only store the nearest representable value. 4278190335 = 0xFF0000FF > 2^24, so it gets rounded to 4278190336 when converted to float.
Similarly, double has 52 bits of mantissa and can represent all integers within the range [-2^53, 2^53] exactly, so it can store any value that fits in a 32-bit int, including 4278190335. But again, double can't store all numbers in long's range, even though both types have the same size (64 bits).
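You can reproduce the rounding with Python's struct module ('f' is a 32-bit IEEE float); this quick check is my addition, not part of the answers:

import struct

# squeeze 4278190335 through a 32-bit float and back
as_float32 = struct.unpack('f', struct.pack('f', 4278190335.0))[0]
print(as_float32)         # 4278190336.0: the nearest representable float
print(float(4278190335))  # 4278190335.0: a Python float is a 64-bit double, so it's exact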
Aside from your question being worded backward, I think what you are saying is:
You need to get the integer portion of a float value, i.e. its whole-number part rather than its decimal part. In that case you can simply cast the float to an int; casting truncates, it does not round.
e.g.
float myFloat = 1.5f;
uint myInt = (uint)myFloat; // myInt == 1
Keep in mind, though, that this isn't always clear to others reading your code. To help, there are Math.Floor and Math.Ceiling: Floor returns the whole number below the current value, Ceiling returns the whole number above it.
e.g.
float myFloat = 1.5f;
uint myFloorInt = (uint)Math.Floor(myFloat);     // myFloorInt == 1
uint myCeilingInt = (uint)Math.Ceiling(myFloat); // myCeilingInt == 2
You will need to cast or convert the value from float to uint, int, etc. as your needs dictate. Most frown on casting, as the resulting value isn't always clear to people; the Convert class has various methods to help you convert one value to another in a clear, understandable way.
There is no way to get the original value back with your method. I suggest byte-to-byte copying instead, so the data can be retrieved later; a float typecast can change the original value. Since both types are 4 bytes wide, something like this could help (C):
#include <stdint.h>
#include <string.h>

uint32_t x;
float y;
memcpy(&y, &x, sizeof y); /* copies the raw bits; no value conversion takes place */
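In Python terms, the same bit-for-bit reinterpretation can be sketched with struct (my illustration of the idea, not from the original answer):

import struct

x = 0xFF0000FF  # 4278190335
# reinterpret the 4 bytes of the uint as a float, the way memcpy does
y = struct.unpack('<f', struct.pack('<I', x))[0]
x_back = struct.unpack('<I', struct.pack('<f', y))[0]
print(x_back == x)  # True: the bit pattern survives the round trip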

Un/pack additional set of UV coordinates into a 32bit RGBA field

I'm modding a game called Mount&Blade, currently trying to implement lightmapping through custom shaders.
As the in-game format doesn't allow more than one UV map per model and I need to carry the info of a second, non-overlapping parametrization somewhere, a field of four uints (RGBA, used for per-vertex coloring) is my only possibility.
At first I thought about just using U,V = R,G, but the precision isn't good enough.
Now I'm trying to encode them with the maximum precision available, using two fields (16 bits) per coordinate. A snippet of my Python exporter:
def decompose(a):
    a = int(a * 0xffff)     # fill the entire 16-bit range to get the maximum precision
    aa = (a & 0xff00) >> 8  # the first half, saved as an 8-bit uint
    ab = a & 0x00ff         # the second half
    return aa, ab

def compose(na, nb):
    return (na << 8 | nb) / 65535.0  # float division (plain 0xffff would truncate under Python 2)
I'd like to know how to do the second part (composing, or unpacking it) in HLSL (DX9, shader model 2.0). Here's my try; it's close, but doesn't work:
//compose UV from n = (na << 8 | nb) / 0xffff
float2 thingie = float2(
    float( ((In.Color.r * 255.f) * 256.f) + (In.Color.g * 255.f) ) / 65535.f,
    float( ((In.Color.b * 255.f) * 256.f) + (In.Color.w * 255.f) ) / 65535.f
);
//sample the lightmap at that position
Output.RGBColor = tex2D(MeshTextureSamplerHQ, thingie);
Any suggestion or ingenious alternative is welcome.
Remember to normalize aa and ab after you decompose a.
Something like this:
(u1, u2) = decompose(u)
(v1, v2) = decompose(v)
color.r = float(u1) / 255.f
color.g = float(u2) / 255.f
color.b = float(v1) / 255.f
color.a = float(v2) / 255.f
The pixel shader:
float2 texc;
// channels arrive as v/255, so r*256 + g equals (u1*256 + u2)/255;
// dividing by 257 completes the /65535, because 255 * 257 = 65535
texc.x = (In.Color.r * 256.f + In.Color.g) / 257.f;
texc.y = (In.Color.b * 256.f + In.Color.a) / 257.f;
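As an end-to-end check of that math, here's a Python sketch of mine mirroring the exporter and the pixel shader (not part of the original answer):

def decompose(a):
    a = int(a * 0xffff)
    return (a & 0xff00) >> 8, a & 0x00ff

def shader_compose(c1, c2):
    # c1, c2 are the normalized channel values the shader sees (v / 255)
    return (c1 * 256.0 + c2) / 257.0

u = 0.123456
u1, u2 = decompose(u)
u_back = shader_compose(u1 / 255.0, u2 / 255.0)
print(abs(u - u_back) < 1.0 / 0xffff)  # True: error stays below one 16-bit step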

formula for alpha value when blending two transparent colors

Let's assume an alpha of 1 means fully opaque and 0 means fully transparent.
Let's say I have two black images which have 50% transparency (alpha = 0.5).
If they are laid on top of each other, the resulting alpha is 0.75, right?
If they had an alpha of 0.25, the result would be around 0.5, right?
If they had an alpha of 0.9, the result would be around 0.97, right?
How can you get to these numbers?
In other words, I am looking for a function that computes the resulting alpha value from two other alpha values.
float alpha = f(float alphaBelow, float alphaAbove)
{
    //TODO implement
}
This answer is mathematically the same as Jason's answer, but this is the actual formula as you'll find it in reference material.
float blend(float alphaBelow, float alphaAbove)
{
    return alphaBelow + (1.0 - alphaBelow) * alphaAbove;
}
float blend(float alphaBelow, float alphaAbove)
{
    return alphaBelow + alphaAbove - alphaBelow * alphaAbove;
}
This function assumes both parameters are 0..1, where 0 is fully transparent and 1 is fully opaque.
Photoshop does the following calculation:
float blend(float alphaBelow, float alphaAbove)
{
    return min(1, alphaBelow + (1 - alphaBelow) * alphaAbove);
}
