How do you count registers in HLSL? - graphics

With shader model 2.0, you can have 256 constant registers. I have been looking at various shaders, and trying to figure out what constitutes a single register?
For example, in my instancing shader, I have the following variables declared at the top, outside of functions:
float4x4 InstanceTransforms[40];
float4 InstanceDiffuses[40];
float4x4 View;
float4x4 Projection;
float3 LightDirection = normalize(float3(-1, -1, -1));
float3 DiffuseLight = 1;
float3 AmbientLight = 0.66;
float Alpha;
texture Texture;
How many registers have I consumed? How do I count them?

Each constant register is a float4.
float3, float2 and float will each allocate a whole register. float4x4 will use 4 registers. Arrays will simply multiply the number of registers allocated by the number of elements. And the compiler will probably allocate a few registers itself to use as constants in various calculations.
The only way to really tell what the shader is using is to disassemble it. To that end you may be interested in this question that I asked a while ago: HLSL: Enforce Constant Register Limit at Compile Time
You might also find this one worth a look: HLSL: Index to unaligned/packed floats. It explains why an array of 40 floats will use 40 registers, and how you can make it use 10 instead.
Your texture will use a texture sampler (you have 16 of these), not a constant register.
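As a rough worked example of the counting rules above, the tally for the declarations in the question can be written out in plain C++. This is only a sketch: the helper names are made up for illustration, and the number it gives ignores constants the compiler adds itself and any variables it strips as unused, so the disassembly is still the authoritative answer.

```cpp
#include <cassert>

// Each constant register is one float4, so anything up to four floats
// still occupies a whole register, and a float4x4 occupies four.
constexpr int kRegistersPerVector = 1;  // float, float2, float3, float4
constexpr int kRegistersPerMatrix = 4;  // float4x4

// Arrays multiply the per-element cost by the element count.
// (Hypothetical helper, not part of any HLSL tooling.)
constexpr int arrayRegisters(int perElement, int count) {
    return perElement * count;
}

// Tally for the declarations in the question. The texture is left out
// because it binds to a sampler, not to a constant register.
constexpr int questionShaderRegisters() {
    return arrayRegisters(kRegistersPerMatrix, 40)  // InstanceTransforms[40]
         + arrayRegisters(kRegistersPerVector, 40)  // InstanceDiffuses[40]
         + kRegistersPerMatrix                      // View
         + kRegistersPerMatrix                      // Projection
         + kRegistersPerVector                      // LightDirection
         + kRegistersPerVector                      // DiffuseLight
         + kRegistersPerVector                      // AmbientLight
         + kRegistersPerVector;                     // Alpha
}
```

By this count the declarations take 212 of the 256 registers before the compiler adds anything of its own.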
For reference, here are the lists of ps_2_0 registers and vs_2_0 registers.


Is there a more efficient way of texturing a circle?

I'm trying to create a randomly generated "planet" (circle), and I want the areas of water, land and foliage to be decided by Perlin noise, or something similar. Currently I have this (pseudo)code:
for (int radius = 0; radius < circleRadius; radius++) {
    for (float theta = 0; theta < TWO_PI; theta += 0.1) {
        float x = radius * cosine(theta);
        float y = radius * sine(theta);
        int colour = whateverFunctionIMake(x, y);
        setPixel(x, y, colour);
    }
}
Not only does this not work (there are "gaps" in the circle because of precision issues), it's incredibly slow. Even if I increase the resolution by changing the increment to 0.01, it still has missing pixels and is even slower: I get 10 fps on my mediocre computer using Java (I know, not the best) with an increment of 0.01, which is certainly not acceptable for a game.
How might I achieve a similar result whilst being much less computationally expensive?
Thanks in advance.
Why not use:
(x-x0)^2 + (y-y0)^2 <= r^2
so simply:
int x0=?,y0=?,r=?; // your planet position and size
int x,y,xx,rr,col;
for (rr=r*r,x=-r;x<=r;x++)
 for (xx=x*x,y=-r;y<=r;y++)
  if (xx+(y*y)<=rr)
   {
   // only pixels inside the circle are colored and drawn
   col = whateverFunctionIMake(x, y);
   setPixel(x0+x, y0+y, col);
   }
This works entirely on integers, with no floating-point or other slow operations, and leaves no gaps. Do not forget to set randseed for the coloring function.
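To make the bounding-box scan concrete, here is a minimal self-contained sketch that fills a circle into a plain pixel buffer instead of calling setPixel. The buffer layout (row-major std::vector) and the colourAt function are assumptions standing in for whatever image type and whateverFunctionIMake you actually use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the question's whateverFunctionIMake(x, y).
// The alpha byte is always set, so a written pixel is never zero.
inline std::uint32_t colourAt(int x, int y) {
    return 0xFF000000u |
           (static_cast<std::uint32_t>(x * 31 + y * 17) & 0x00FFFFFFu);
}

// Fills every pixel with dx*dx + dy*dy <= r*r around (x0, y0) in a
// row-major image and returns how many pixels were written.
// Pixels that fall outside the image are skipped.
int fillCircle(std::vector<std::uint32_t>& image, int width, int height,
               int x0, int y0, int r) {
    const int rr = r * r;
    int written = 0;
    for (int dy = -r; dy <= r; ++dy) {
        const int py = y0 + dy;
        if (py < 0 || py >= height) continue;
        for (int dx = -r; dx <= r; ++dx) {
            const int px = x0 + dx;
            if (px < 0 || px >= width) continue;
            if (dx * dx + dy * dy <= rr) {
                image[static_cast<std::size_t>(py) * width + px] = colourAt(dx, dy);
                ++written;
            }
        }
    }
    return written;
}
```

Because every pixel of the bounding box is tested exactly once, there can be no gaps, and the cost is proportional to the area of the box rather than to the angular step.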
[Edit1] some more stuff
Now if you want speed then you need direct pixel access (on most platforms Pixels, SetPixel, PutPixels etc. are slow because they perform a lot of extra work such as range checking and color conversions). If you have direct pixel access, or render into your own array/image, you need to clip against the screen up front (so you do not have to check each pixel against the screen bounds) to avoid access violations when your circle overlaps the screen edge.
As mentioned in the comments, you can get rid of the x*x and y*y inside the loop by updating the previous value (as both x and y only ever increment). For more info about it see:
32bit SQRT in 16T without multiplication
the math is like this:
(x+1)^2 = (x+1)*(x+1) = x^2 + 2x + 1
So instead of xx = x*x we just do xx+=x+x+1 if x has not been incremented yet, or xx+=x+x-1 if x has already been incremented.
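A quick way to convince yourself that the incremental update is exact is to run it next to the plain multiplication. This small sketch (the function name is made up for illustration) seeds xx once and then only uses additions to advance it:

```cpp
#include <cassert>

// Walks x from first to last while maintaining xx == x*x using only
// additions, based on (x+1)^2 = x^2 + 2x + 1. Returns true if the running
// value matched x*x at every step. The multiplication inside the loop is
// there purely as a check.
bool incrementalSquaresMatch(int first, int last) {
    int xx = first * first;  // one multiplication to seed the sequence
    for (int x = first; x <= last; ++x) {
        if (xx != x * x) return false;
        xx += x + x + 1;     // advance to (x+1)^2 before x is incremented
    }
    return true;
}
```

The same update works for negative starting values, which matters here because the circle loops start at -r.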
Putting it all together, I got this:
void circle(int x,int y,int r,DWORD c)
    {
    // my Pixel access
    int **Pixels=Main->pyx; // Pixels[y][x]
    int xs=Main->xs; // resolution
    int ys=Main->ys;
    // circle
    int sx,sy,sx0,sx1,sy0,sy1; // [screen]
    int cx,cy,cx0, cy0 ; // [circle]
    int rr=r*r,cxx,cyy,cxx0,cyy0; // [circle^2]
    // BBOX + screen clip
    sx0=x-r; if (sx0>=xs) return; if (sx0< 0) sx0=0;
    sy0=y-r; if (sy0>=ys) return; if (sy0< 0) sy0=0;
    sx1=x+r; if (sx1< 0) return; if (sx1>=xs) sx1=xs-1;
    sy1=y+r; if (sy1< 0) return; if (sy1>=ys) sy1=ys-1;
    cx0=sx0-x; cxx0=cx0*cx0;
    cy0=sy0-y; cyy0=cy0*cy0;
    // render
    for (cxx=cxx0,cx=cx0,sx=sx0;sx<=sx1;sx++,cxx+=cx,cx++,cxx+=cx)
     for (cyy=cyy0,cy=cy0,sy=sy0;sy<=sy1;sy++,cyy+=cy,cy++,cyy+=cy)
      if (cxx+cyy<=rr)
       Pixels[sy][sx]=c; // inside the circle: write the color directly
    }
This renders a circle with a radius of 512 px in ~35 ms, i.e. a fill rate of about 23.5 Mpx/s on my setup (AMD A8-5500 3.2GHz, Win7 64bit, single-threaded VCL/GDI 32bit app built with BDS2006 C++). Just change the direct pixel access to the style/API you use.
To measure speed on x86/x64 you can use the RDTSC asm instruction. Here is some ancient C++ code I used ages ago (in a 32bit environment without native 64bit support):
double _rdtsc()
    {
    LARGE_INTEGER x; // unsigned 64bit integer variable from windows.h I think
    DWORD l,h; // standard unsigned 32 bit variables
    asm {
        rdtsc // read the time stamp counter into edx:eax
        mov l,eax
        mov h,edx
        }
    x.LowPart=l;
    x.HighPart=h;
    return double(x.QuadPart);
    }
It returns the number of clock cycles your CPU has counted since power-up. Beware that you should account for overflows, as on fast machines the 32bit counter overflows within seconds. Each core also has a separate counter, so set the affinity to a single CPU. On a variable-clock CPU, warm it up with some computation before measuring. To convert to time, just divide by the CPU clock frequency. To obtain it, just do this:
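t0=_rdtsc();
Sleep(250); // wait a quarter of a second (implied by the *4 below)
t1=_rdtsc();
fcpu = (t1-t0)*4; // clocks per second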
and measurement:
t0=_rdtsc();
// measured stuff goes here
t1=_rdtsc();
time = (t1-t0)/fcpu; // elapsed time in seconds
If t1<t0 you overflowed and you need to add a constant to the result or measure again. Also, the measured process must take less than the overflow period. To enhance precision, ignore OS granularity. For more info see:
Measuring Cache Latencies
Cache size estimation on your system? setting affinity example
Negative clock cycle measurements with back-to-back rdtsc?
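The overflow correction mentioned above can be sketched without any special case if the counter readings are kept as unsigned 32-bit values: unsigned subtraction is modulo 2^32, which has the same effect as adding the constant 2^32 whenever t1<t0. The function names below are made up for illustration, and the readings are passed in as plain numbers rather than taken from RDTSC so the sketch stays portable. It only covers a single wrap between the two readings.

```cpp
#include <cassert>
#include <cstdint>

// Ticks elapsed between two readings of a 32-bit counter. Because unsigned
// subtraction wraps modulo 2^32, one overflow between t0 and t1 is handled
// automatically. More than one overflow cannot be detected this way.
std::uint32_t elapsedTicks(std::uint32_t t0, std::uint32_t t1) {
    return t1 - t0;
}

// Converts a tick count to seconds for a counter running at frequencyHz.
double ticksToSeconds(std::uint32_t ticks, double frequencyHz) {
    return static_cast<double>(ticks) / frequencyHz;
}
```

For example, a reading of 0xFFFFFFF0 followed by 0x00000010 yields 0x20 ticks, not a negative number.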

Why is the transitive closure of matrix multiplication not working in this vertex shader?

This can probably be filed under premature optimization, but since the vertex shader executes on each vertex for each frame, it seems like something worth doing (I have a lot of vars I need to multiply before going to the pixel shader).
Essentially, the vertex shader performs this operation to convert a vector to projected space, like so:
// Transform the vertex position into projected space.
pos = mul(pos, model);
pos = mul(pos, view);
pos = mul(pos, projection);
output.pos = pos;
Since I'm doing this operation to multiple vectors in the shader, it made sense to combine those matrices into a cumulative matrix on the CPU and then flush that to the GPU for calculation, like so:
// VertexShader.hlsl
cbuffer ModelViewProjectionConstantBuffer : register (b0)
{
    matrix model;
    matrix view;
    matrix projection;
    matrix cummulative;
    float3 eyePosition;
};
// Transform the vertex position into projected space.
pos = mul(pos, cummulative);
output.pos = pos;
And on the CPU:
// Renderer.cpp
// now is also the time to update the cummulative matrix
m_constantMatrixBufferData->cummulative =
    m_constantMatrixBufferData->model *
    m_constantMatrixBufferData->view *
    m_constantMatrixBufferData->projection;
// NOTE: each of the above vars is an XMMATRIX
My intuition was that there was some mismatch of row-major/column-major, but XMMATRIX is a row-major struct (and all of its operators treat it as such) and mul(...) interprets its matrix parameter as row-major. So that doesn't seem to be the problem but perhaps it still is in a way that I don't understand.
I've also checked the contents of the cumulative matrix and they appear correct, further adding to the confusion.
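For what it's worth, the claim itself can be checked on the CPU with a few lines of plain C++. Matrix multiplication is associative, so transforming step by step and transforming by the combined matrix must agree as long as the operand order is kept. The sketch below uses small hand-rolled Vec4/Mat4 types as stand-ins for XMVECTOR/XMMATRIX (an assumption, so that it compiles without DirectXMath). It also includes a transpose, since (A*B)^T = B^T * A^T is the reason a reversed multiplication order can become the right one when matrices end up stored transposed.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // m[row][column]

Mat4 identity() {
    Mat4 m{};  // value-initialised to all zeros
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    return m;
}

// Row vector times matrix: same operand order as mul(pos, M) in the shader.
Vec4 mul(const Vec4& v, const Mat4& m) {
    Vec4 out{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            out[c] += v[r] * m[r][c];
    return out;
}

// Matrix product a * b.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            for (int k = 0; k < 4; ++k)
                out[r][c] += a[r][k] * b[k][c];
    return out;
}

Mat4 transpose(const Mat4& m) {
    Mat4 out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c][r] = m[r][c];
    return out;
}
```

If mul(mul(mul(pos, model), view), projection) and mul(pos, mul(mul(model, view), projection)) agree here but the shader output still differs, the math is fine and the difference has to come from how the data reaches the GPU.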
Thanks for reading, I'll really appreciate any hints you can give me.
EDIT (additional information requested in comments):
This is the struct that I am using as my matrix constant buffer:
// a constant buffer that contains the 3 matrices needed to
// transform points so that they're rendered correctly
struct ModelViewProjectionConstantBuffer
{
    DirectX::XMMATRIX model;
    DirectX::XMMATRIX view;
    DirectX::XMMATRIX projection;
    DirectX::XMMATRIX cummulative;
    DirectX::XMFLOAT3 eyePosition;
    // and padding to make the size divisible by 16
    float padding;
};
I create the matrix stack in CreateDeviceResources (along with my other constant buffers), like so:
void ModelRenderer::CreateDeviceResources()
{
    // Let's take this moment to create some constant buffers
    ... // creation of other constant buffers
    // and lastly, the all mighty matrix buffer
    CD3D11_BUFFER_DESC constantMatrixBufferDesc(sizeof(ModelViewProjectionConstantBuffer), D3D11_BIND_CONSTANT_BUFFER);
    ... // and the rest of the initialization (reading in the shaders, loading assets, etc)
}
I write into the matrix buffer inside a matrix stack class I created. The client of the class calls Update() once they are done modifying the matrices:
void MatrixStack::Update()
{
    // then update the buffers
    m_constantMatrixBufferData->model = model.front();
    m_constantMatrixBufferData->view = view.front();
    m_constantMatrixBufferData->projection = projection.front();
    // NOTE: the eye position has no stack, as it's kept updated by the trackball
    // now is also the time to update the cummulative matrix
    m_constantMatrixBufferData->cummulative =
        m_constantMatrixBufferData->model *
        m_constantMatrixBufferData->view *
        m_constantMatrixBufferData->projection;
    // and flush
}
Given your code snippets, it should work.
Possible causes of your problem:
Have you tried the inverted multiplication: projection * view * model?
Are you sure you are setting cummulative correctly (register index = 12, offset = 192) in your constant buffer?
Same for eyePosition (register index = 16, offset = 256)?
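The offsets in the list above can be double-checked on the CPU side with offsetof. The struct below is a hypothetical mirror of the constant buffer that uses plain float arrays in place of DirectX::XMMATRIX and DirectX::XMFLOAT3 (which have the same sizes), so the sketch compiles without the DirectX headers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical CPU-side mirror of the constant buffer from the question.
struct ConstantBufferMirror {
    float model[16];
    float view[16];
    float projection[16];
    float cummulative[16];
    float eyePosition[3];
    float padding;  // keeps the total size divisible by 16
};

constexpr std::size_t kBytesPerRegister = 16;  // one float4 register

// Register index that a given byte offset falls into.
constexpr std::size_t registerIndex(std::size_t byteOffset) {
    return byteOffset / kBytesPerRegister;
}

constexpr std::size_t kCummulativeOffset = offsetof(ConstantBufferMirror, cummulative);
constexpr std::size_t kEyePositionOffset = offsetof(ConstantBufferMirror, eyePosition);
```

Each matrix takes 64 bytes (four registers), so cummulative starts at byte 192 (register 12) and eyePosition at byte 256 (register 16).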

How can I display multiple separate textures (not multi-texturing) with OpenGL ES 2.0?

My iOS 4 app uses OpenGL ES 2.0 and renders elements with a single texture. I would like to draw elements using multiple different textures and am having problems getting things to work.
I added a variable to my vertex shader to indicate which texture to apply:
attribute float TextureIn;
varying float TextureOut;
void main(void)
{
    TextureOut = TextureIn;
}
I use that value in the fragment shader to select the texture:
varying lowp float TextureOut;
uniform sampler2D Texture0;
uniform sampler2D Texture1;
void main(void)
{
    if (TextureOut == 1.0)
        gl_FragColor = texture2D(Texture1, TexCoordOut);
    else // 0
        gl_FragColor = texture2D(Texture0, TexCoordOut);
}
Compile shaders:
_texture = glGetAttribLocation(programHandle, "TextureIn");
_textureUniform0 = glGetUniformLocation(programHandle, "Texture0");
_textureUniform1 = glGetUniformLocation(programHandle, "Texture1");
GLuint _texture;
GLuint _textureUniform0;
GLuint _textureUniform1;
glEnable(GL_TEXTURE_2D); // ?
glBindTexture(GL_TEXTURE_2D, _textureUniform0);
glUniform1i(_textureUniform0, 0);
glEnable(GL_TEXTURE_2D); // ?
glBindTexture(GL_TEXTURE_2D, _textureUniform1);
glUniform1i(_textureUniform1, 1);
glVertexAttribPointer(_texture, 1, GL_FLOAT, GL_FALSE, sizeof(Vertex), (GLvoid*) (sizeof(float) * 13));
glBindTexture(GL_TEXTURE_2D, _textureUniform0);
glUniform1i(_textureUniform0, 0);
glBindTexture(GL_TEXTURE_2D, _textureUniform1);
glUniform1i(_textureUniform1, 1);
glDrawElements(GL_TRIANGLES, indicesCountA, GL_UNSIGNED_SHORT, (GLvoid*) (sizeof(GLushort) * 0));
glDrawElements(GL_TRIANGLES, indicesCountB, GL_UNSIGNED_SHORT, (GLvoid*) (sizeof(GLushort) * indicesCountA));
glDrawElements(GL_TRIANGLES, indicesCountC, GL_UNSIGNED_SHORT, (GLvoid*) (sizeof(GLushort) * (indicesCountA + indicesCountB)));
My hope was to dynamically apply the texture associated with a vertex but it seems to only recognize GL_TEXTURE0.
The only way I have been able to change textures is to associate each texture with GL_TEXTURE0 and then draw:
glBindTexture(GL_TEXTURE_2D, _textureUniformX);
glUniform1i(_textureUniformX, 0);
glDrawElements(GL_TRIANGLES, indicesCountA, GL_UNSIGNED_SHORT, (GLvoid*) (sizeof(GLushort) * 0));
In order to render all the textures, I would need a separate glDrawElements() call for each texture, and I have read that glDrawElements() calls are a big hit to performance and the number of calls should be minimized. That's why I was trying to dynamically specify which texture to use for each vertex.
It's entirely possible that my understanding is wrong or I am missing something important. I'm still new to OpenGL and the more I learn the more I feel I have more to learn.
It must be possible to use textures other than just GL_TEXTURE0 but I have yet to figure out how.
Any guidance or direction would be greatly appreciated.
Could it be that you're just experiencing floating-point rounding issues? There shouldn't be any (unless a single primitive has vertices with different textures), but just to be sure, replace TextureOut == 1.0 with TextureOut > 0.5 or something like that.
As general advice, you are correct that the number of draw calls should be reduced as much as possible, but your approach is quite odd. You are buying draw call reduction with fragment shader branching. Your approach also doesn't scale well with the overall number of textures, since you always need all textures bound to separate texture units.
The usual approach to reduce texture switches is to put all the textures into a single large texture, a so-called texture atlas, and use the texture coordinates to select the appropriate subregion in this texture. This also has some pitfalls (which are an entirely different question), but nothing comes for free.
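To give an idea of what "use the texture coordinates to select the subregion" means in practice, here is a minimal sketch that computes the UV rectangle of one tile. It assumes the simplest possible atlas, a regular grid of equally sized tiles numbered left to right and top to bottom. Real atlases are often packed irregularly and need padding between tiles, which this ignores. All names are made up for illustration.

```cpp
#include <cassert>

// UV rectangle of one tile inside the atlas, in [0, 1] texture space.
struct UvRect {
    float u0, v0, u1, v1;
};

// Sub-rectangle of tile `index` in an atlas laid out as a columns x rows
// grid of equal tiles (an assumed layout, see the text above).
UvRect atlasTile(int index, int columns, int rows) {
    const float tileWidth = 1.0f / columns;
    const float tileHeight = 1.0f / rows;
    const int column = index % columns;
    const int row = index / columns;
    return {column * tileWidth, row * tileHeight,
            (column + 1) * tileWidth, (row + 1) * tileHeight};
}

// Maps a per-tile coordinate t in [0, 1] into the range [lo, hi].
float remap(float t, float lo, float hi) {
    return lo + t * (hi - lo);
}
```

The vertex data then carries the remapped coordinates, so a single bound texture and a single draw call can cover geometry that appears to use different textures.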
EDIT: Oh wait, I see what you're actually doing wrong
glBindTexture(GL_TEXTURE_2D, _textureUniform0);
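You're binding a texture to the current texture unit, but instead of the texture object you give this function a uniform location, which is complete rubbish (but might even work in some weird circumstances, since both uniform locations and texture objects are themselves just integers). Of course you have to bind the actual texture object, and to use more than one texture at once you have to select the unit with glActiveTexture(GL_TEXTURE0 + n) before each glBindTexture, so that it matches the value n passed to glUniform1i.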

Weird values when passing an array of structs as an openCL kernel argument

When passing an array of structs to my kernel as an argument, I get weird values for the items after the first (array[1], array[2], etc). It seems to be an alignment issue maybe?
Here is the struct:
typedef struct Sphere
{
    float3 color;
    float3 position;
    float3 reflectivity;
    float radius;
    int phong;
    bool isReflective;
} Sphere;
Here is the host side init code:
cl::Buffer cl_spheres = cl::Buffer(context, CL_MEM_READ_ONLY, sizeof(Sphere) * MAX_SPHERES, NULL, &err);
err = queue.enqueueWriteBuffer(cl_spheres, CL_TRUE, 0, sizeof(Sphere) * MAX_SPHERES, spheres, NULL, &event);
err = kernel.setArg(3, cl_spheres);
What happens is that the color of the second Sphere struct in the array actually holds the last value of what I set color to on the host side (s3 or z), an uninitialized zero value, and the first value of what I set position to on the host side (s0 or x). I noticed that the float3 datatype actually still has a fourth value (s3) that does not get initialized, and I think that is where the uninitialized zero value is coming from. So it seems to be an alignment issue, but I really am at a loss as to how to fix it. I was hoping someone could shed some light on this problem. I have ensured that my struct definitions are exactly the same on both sides.
From the OpenCL 1.2 specs, section 6.11.1:
Note that the alignment of any given struct or union type is required
by the ISO C standard to be at least a perfect multiple of the lowest
common multiple of the alignments of all of the members of the struct
or union in question and must also be a power of two.
Also cl_float3 counts as a cl_float4, see section 6.1.5.
Finally, in section 6.9.k:
Arguments to kernel functions in a program cannot be declared with the
built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and
uintptr_t or a struct and/or union that contain fields declared to be
one of these built-in scalar types.
To comply with these rules, and probably make accesses faster, you can try (OpenCL C side; on the host use cl_float4):
typedef struct Sphere
{
    float4 color;
    float4 position;
    float4 reflectivity;
    float4 radiusPhongReflective; // each value uses 1 float
} Sphere;
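On the host, the layout of the struct above can be checked at compile time. The sketch below uses a hypothetical Float4 type as a stand-in for cl_float4 (four floats, 16-byte aligned) so that it compiles without the OpenCL headers. In real code you would use cl_float4 itself. The packScalars helper and the order of the three scalars inside the last member are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cl_float4: four floats, 16-byte aligned.
struct alignas(16) Float4 {
    float s[4];
};

// Host-side mirror of the float4-only Sphere struct.
struct SphereHost {
    Float4 color;
    Float4 position;
    Float4 reflectivity;
    Float4 radiusPhongReflective;  // radius, phong, isReflective, unused
};

static_assert(sizeof(Float4) == 16, "Float4 must match a 16-byte float4");
static_assert(sizeof(SphereHost) == 64, "no hidden padding expected");

// Packs the three scalars into one Float4 (assumed order: radius, phong,
// isReflective as 0/1, last component unused).
Float4 packScalars(float radius, int phong, bool isReflective) {
    return Float4{{radius, static_cast<float>(phong),
                   isReflective ? 1.0f : 0.0f, 0.0f}};
}
```

With every member 16 bytes wide and no bool in the struct, element i of the array starts at byte 64 * i on both the host and the device.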

GLSL - Front vs. Back faces of polygons

I made some simple shading in GLSL of a checkers board:
f(P) = [ floor(Px)+floor(Py)+floor(Pz) ] mod 2
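Written out on the CPU, the formula looks like the following sketch (the function name is made up for illustration). One detail worth noting: GLSL's mod(x, y) is defined as x - y * floor(x / y), so it never goes negative for a positive y, whereas the C++ % operator can, hence the extra correction step.

```cpp
#include <cassert>
#include <cmath>

// f(P) = (floor(Px) + floor(Py) + floor(Pz)) mod 2, returning 0 or 1.
int checker(float px, float py, float pz) {
    const long sum = static_cast<long>(std::floor(px)) +
                     static_cast<long>(std::floor(py)) +
                     static_cast<long>(std::floor(pz));
    // % can yield -1 for negative sums, so fold the result back into {0, 1}.
    return static_cast<int>(((sum % 2) + 2) % 2);
}
```

Neighbouring unit cells along any axis get alternating values, which is what produces the checker pattern.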
It seems to work well, except that I see the interior of the objects, whereas I want to see only the front faces.
Any ideas how to fix this? Thanks!
Teapot (glutSolidTeapot()):
Cube (glutSolidCube):
The vertex shader file is:
varying float x,y,z;
void main(){
    gl_Position = gl_ProjectionMatrix * gl_ModelViewMatrix * gl_Vertex;
    x = gl_Position.x;
    y = gl_Position.y;
    z = gl_Position.z;
}
And the fragment shader file is:
varying float x,y,z;
void main(){
    float _x=x;
    float _y=y;
    float _z=z;
    float sum = (_x+_y+_z);
    sum = mod(sum,2.0);
    gl_FragColor = vec4(sum,sum,sum,1.0);
}
The shaders are not the problem - the face culling is.
You should either disable face culling (which is not recommended, since it's bad for performance):
glDisable(GL_CULL_FACE); // disables face culling
or use glCullFace and glFrontFace to set the culling mode, i.e.:
glEnable(GL_CULL_FACE); // enables face culling
glCullFace(GL_BACK); // tells OpenGL to cull back faces (the sane default setting)
glFrontFace(GL_CW); // tells OpenGL which faces are considered 'front' (use GL_CW or GL_CCW)
The argument to glFrontFace depends on application conventions, i.e. the matrix handedness.
