Direct3D 11/HLSL Texture3D<float3> False Error? - direct3d

I am getting this error:
D3D11 ERROR: ID3D11DeviceContext::Dispatch: The Shader Resource View in slot 0 of the Compute Shader unit is using the Format (R32G32B32_FLOAT). This format does not support 'Sample', 'SampleLevel', 'SampleBias' or 'SampleGrad', at least one of which may being used on the Resource by the shader. This mismatch is invalid if the shader actually uses the view (e.g. it is not skipped due to shader code branching). [ EXECUTION ERROR #371: DEVICE_DRAW_RESOURCE_FORMAT_SAMPLE_UNSUPPORTED]
Here's my working code:
Texture3D< float4 > g_VectorField;
float3 ... = g_VectorField.SampleLevel( ... ).rgb;
Here's my code that is causing the error:
Texture3D< float3 > g_VectorField;
float3 ... = g_VectorField.SampleLevel( ... );
My application works just fine when I'm not catching Direct3D errors (disabled D3D11_CREATE_DEVICE_DEBUG). SampleLevel's behavior is not undefined as far as I can see. It's behaving in exactly the same as the first snippet, yet it is giving me this error. This format does not support 'SampleLevel' my ass.
It seems that this error can be ignored without causing undefined behavior, so why is it an error?

It's perfectly valid for an "undefined behaviour" to give you the results you expected. The problem is that it's also perfactly valid for it to crash the graphics driver or reinstall your operating system, so to speak. The "warning" comes from DirectX, but the actual operation is performed by the graphics card. So you're doing an operation that's invalid for DirectX 11, but a specific GPU might support it. Others don't have to. Future cards don't have to either. And it might be doing something really stupid to allow you to use the sampling or converting the resource to float4.
Take C++ for example - the order of evaluation of function parameters is undefined. That means that while it might be doing the thing you expect it to do on your compiler and your computer, it might be different in a different place (different optimizations possible, for example), or on a different compiler / computer. The code is no longer portable or reliable.
As for why this particular operation is undefined, it's hard to tell. Different texture formats behave in very different ways - some recalibrate gamma, some premultiply the alpha... It might very well be that noone expected your particular type to be widely used, so it wasn't worth the extra branch in the specification. It might be that they didn't include it in the specification because enough GPUs didn't support it. Does it actually give you a measurable performance benefit? If the support is done by translating the texture, for example, the float3 variant might actually be slower.
Of course, in general computing, powers of two are usually preferred when storing data, because they make some operations much easier. GPUs might very well still depend on the old-school bit-hacks to handle some operations - 128 is nice, 96... not so much. Maybe they still store data in 128-bit registers, so there's no point in using 96. Maybe, maybe, maybe all the way :D
EDIT: I found something relevant in the DirectX documentation:
A resource declared with the DXGI_FORMAT_R32G32B32 family of formats
cannot be used simultaneously for vertex and texture data. That is,
you may not create a buffer resource with the DXGI_FORMAT_R32G32B32
family of formats that uses any of the following bind flags:
D3D10_BIND_VERTEX_BUFFER, D3D10_BIND_INDEX_BUFFER,
D3D10_BIND_CONSTANT_BUFFER, or D3D10_BIND_STREAM_OUTPUT
So it seems that it's fine to use R32G32B32 for an actual texture, but not for a vertex/index/constant buffer.

Related

Are floating point operations deterministic when running in multiple threads?

Suppose I have a function that runs calculations, example being something like a dot product - I pass in an arrays A, B of vectors and a float array C, and the functions assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both the threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a a bit more worried that CPU cores might not have the same instruction sets, even on consumer level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, c++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
Floating point (FP) operations are not associative (but it is commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely-used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant and mainstream compilers (eg. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default since it assumes the FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make such assumption so to speed up codes. It is generally combined with other ones like the assumption that all FP values are not NaN (eg. -ffast-math). Such flags break the IEEE-754 compliance, but they are often used in the 3D or video game industry so to speed up codes. IEEE-754 is not required by the C++ standard, but you can check this with std::numeric_limits<T>::is_iec559.
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very-high overhead (see this for more information).
Assuming the IEEE-754 compliance is not broken, the rounding mode is the same and the threads does the operations in the same order, then the result should be identical up to at least 1 ULP. In practice, if they are compiled using a same mainstream compiler, the result should be exactly the same.
The thing is using multiple threads often result in a non-deterministic order of the applied FP operations which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use a static partitioning, avoid atomic operations on FP variables or more generally atomic operations that could result in a different ordering. The same thing applies for locks or any synchronization mechanisms.
The same thing is true for GPUs. In fact, such problem is very frequent when developers use atomic FP operations for example to sum values. They often do that because implementing fast reductions is complex (though it is more deterministic) and atomic operations as pretty fast on modern GPUs (since they use dedicated efficient units).
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.

Why is VkShaderStageFlagBits a bitmask?

In Vulkan you specify the VkPipelineShaderStageCreateInfo's to the VkGraphicsPipelineCreateInfo structure, and presumably there is supposed to be one VkPipelineShaderStageCreateInfo for each shader stage (for example the vertex, and fragment shaders).
So why exactly is the field stage field of type vkShaderStageFlagBits is this just because it sits closer to some kind of Vulkan convention?
My confusion is I am led to believe that the only reason you would use a Bitmask in this way, is if you need to combine bits together. (For example for the general flags field in all Vulkan structures). I was trying to find the answer for this, so I looked at the Vulkan Spec, and this confused me even more! This is because they have two bits VK_SHADER_STAGE_ALL_GRAPHICS and VK_SHADER_STAGE_ALL these are defined as:
VK_SHADER_STAGE_ALL_GRAPHICS is a combination of bits used as shorthand to specify all graphics stages defined above (excluding the compute stage).
VK_SHADER_STAGE_ALL is a combination of bits used as shorthand to specify all shader stages supported by the device, including all additional stages which are introduced by extensions.
Well if they are supposed to be "shorthand" for specifying all bits, does this mean one shader stage, is supposed to be able to represent a version of all the stages?
Thanks in advance!
Exactly, this is mostly to keep the api consistent. VkShaderStageFlagBits is used in several spots where a bit mask makes more sense than at pipeline creation time.
An example where it makes sense are descriptor set layout bindings where you use the flag mask to specify what stages can access your descriptors (samplers, uniform buffer object, etc.).
So if you want one UBO to be accessible from the vertex and fragment stage and another one from the geometry and tessellation stage you'd use different stage flag bit combinations when setting up the VkDescriptorSetLayoutBinding. Pipeline state combinations are pretty common here.
Vulkan uses fields of type Vk*FlagBits (e.g. VkShaderStageFlagBits) when exactly one of the defined values is expected, and uses the corresponding Vk*Flags type (always a typedef for VkFlags which is just a typedef for uint32_t (e.g. typedef VkFlags VkShaderStageFlags) when a combination zero, one, or more of the defined values is expected.
There are two reasons for this:
It gives a signal (albeit subtle) about whether exactly one value is expected/allowed or some combination of values is expected.
Many compilers will give warnings when assigning a combination of bit values to a field of enum type, which in practice helps enforce (1). This is because to do bitwise operations on enum values, they're first promoted to an integer type, and the result is an integer type, and typical settings for most compilers yield a warning (often promoted to error) when doing an implicit conversion from integer to enum type, since the integer may not be one of the enumerated values.
So VkPipelineShaderStageCreateInfo::stage is VkShaderStageFlagBits because exactly one shader stage is valid there, and you'll probably get a warning if you try to set it to something silly like VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT.
But VkDescriptorSetLayoutBinding::stageFlags is VkShaderStageFlags because it's common and expected to include multiple stages there, and you won't get a compiler warning if you set it to VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT.

What does type level programming mean at runtime?

I am very new to Haskell, so sorry if this is a basic question, or a question founded on shaky understanding
Type level programming is a fascinating idea to me. I think I get the basic premise, but I feel like there is another side to it that is fuzzy to me. I get that the idea is to bring logic and computation into the compiletime instead of runtime, using types. This way you turn what is normally runtime logic/state/data into static logic, e.g. the size of collections.
So I get that for example you can have type level natural numbers, and do type level arithmetic on those natural numbers, and all this calculation and type safety is going on at compile time.
But what does such arithmetic imply at runtime? Especially since Haskell has full type erasure. So for example
If I concatenate two type level lists, then does the type level safety imply something about the behavior or performance of that concatenation at runtime? Or does the type level programming aspect only have meaning at compile time, when the programmer is grappling the code and putting things together?
Or if I have two type level numbers, and then multiply them, what does that mean at runtime? If these operations on large numbers are slow at compile time, are they instantaneous at runtime?
Or if we implemented type level RSA and then use it, what does that even mean at runtime?
Is it purely a compiletime safety/coherence tool? or does type level programming buy us anything for the runtime too? Is the logic and arithmetic 'paid for at compile time' or merely 'assured at compile time' (if that even makes sense)?
As you rightly say, Haskell [without weird extensions] has full type erasure. So that means anything computed purely at the type level is erased at runtime.
However, to do useful stuff, you connect the type-level stuff with your value-level stuff to provide useful properties.
Suppose, for example, you want to write a function that takes a pair of lists, treats them as mathematical vectors, and performs a vector dot-product with them. Now the dot-product is only defined on pairs of vectors of the same size. So if the size of the vectors doesn't match, you can't return a sensible answer.
Without type-level programming, your options are:
Require that the caller always supplies vectors of the same dimension, and cheerfully return gibberish if that requirement is not met. (I.e., ignore the problem.)
Perform an explicit check at run-time, and throw an exception or return Nothing or similar if the dimension don't match.
With type-level programming, you can make it so that if the dimensions don't match, the code does not compile! So that means at run-time you don't need to care about mismatched dimension, because... well, if your code is running, then the dimension cannot be mismatched.
The types have all been erased by this point, but you are still guaranteed that your code cannot crash / return gibberish, because the compiler has checked that that cannot happen.
It's really the same as the ordinary checks the compiler does to make sure you don't try to multiply an integer by a string or something. The types are all erased before runtime, and yet the code does not crash.
Of course, to do a dot-product, we merely have to check that two numbers are equal. We don't need any arithmetic yet. But it should be clear that to check whether the dimensions of our vectors match, we need to know the dimensions of our vectors. And that means that any operations that change the dimension of our vectors needs to do compile-time calculations, so the compiler can know the result size and check it satisfies the requirements.
You can also do more elaborate stuff. Somewhere I saw a library that lets you define a client/server communications protocol, but because it encodes the protocol into ludicrously complicated type signatures [which the compiler automatically infers], it can statically prove that the client and server implement exactly the same protocol (i.e., no bugs with the server not handling one of the messages the client can send). The types get erased at runtime, but we still know the wire protocol can't go wrong.

Vulkan Shader & Resources: Why Uniform and not Const Resources

We usually use const in c++ to imply that the value does not change (read only), why in GLSL/VK in the shader or resource definition they choose the word uniform ? Wodn`t be more consistent and use the keyword borrowed from c/c++
Beside that probably the uniform keyword in shader definitions give clues to the compiler to attach those resources as close to the hardware as possible, probably shared memory or registers ? Not sure on that.
That also probably why they mention in the VkSpec. that we need small ammounts of data for those type of resources. Like for eg: values of cosmological constants..etc
Is anything that I`m missing, or some bit of history that passed away ?
Uniforms in GPU programming and const in C++ are focused on different things.
C++ const documents that a variable is not intended to be changed, with some compiler enforcement. As such it's more about using the type system to improve clarity and enforce intended usage -- important for large-project software engineering. You can still get around it with const_cast or other tricks, and the compiler can't assume you didn't, so it's not strictly enforced.
The important thing about uniforms is that they're, well, uniform. Meaning they have the same value whenever they are read within a draw call. Since there might be hundreds to millions of reads of that value in a single draw call, this allows it to be cached, and just one copy of it to be cached, or that it can be preloaded into registers (or cache) before shaders run, that it can be cached in a non-coherent cache, that a single read result can be broadcast across all SIMD lanes in a core, etc. For this to work, the fact that the contents can't change must be strictly enforced (with memory aliasing you can get around even this, now, but results are very much undefined if you do). So uniform really isn't about declaring intent to other programmers for software engineering benefits like const is, it's about declaring intent to the compiler and driver so they can optimize based on it.
D3D uses "const" and "constant buffer" rather than uniform, so clearly there is some overlap. Though that does lead to saying things like "how many times do you update constants per frame?" which when you think about it is kind of a weird thing to say :). The values are constant within shader code, but very much aren't constant at the API level.
The etymology of the word is important here. The term "uniform" is derived from GLSL, which was inspired by the Renderman standard's shader terminology. In Renderman, "uniform" was used for values "whose values are constant over whatever portion of the surface begin shaded". This was an alternative to "varying" which represented values interpolated across the surface.
"Constant" would imply that the value never changes. Uniform values do change; they simply don't change at the same frequency as other values. Input values change per-invocation, uniform values change per-draw call, and constant values don't change. Note that in GLSL, const usually means "compile-time constant": a value that is set at compile time and is never changed.
A uniform variable in Vulkan ultimately comes from a resource that exists outside of the shader. Blocks of uniform variables fed by buffers, uniforms in push constants fed by push constant state are both external resources, set by the user. That's a fundamentally different concept from having a compile-time constant struct.
Since it's different from a constant struct, it needs a different term to request it.

6502 and little-endian conversion

For fun I'm implementing an NES emulator. I'm currently reading through documentation for the 6502 CPU and I'm a little confused.
I've seen documentation stating because the 6502 is little-endian so when using absolute addressing mode you need to swap the bytes. I'm writing this on an x86 machine which is also little-endian, so I don't understand why I couldn't simply cast to a uint16_t*, dereference that, and let the compiler work out the details.
I've written some simple tests in google test and they seem to agree with me.
// implementation of READ16
#define READ16(addr) (*(uint16_t*)addr)
TEST(MemMacro, READ16) {
uint8_t arr[] = {0xFF,0xCC};
uint8_t *mem = (&arr[0]);
EXPECT_EQ(0xCCFF, READ16(mem));
}
This passes, so it appears my supposition is correct, but I thought I'd ask someone with more experience than I.
Is this correct for pulling out the operand in 6502 absolute addressing mode? Am I possibly missing something?
It will work for simple cases on little-endian systems, but tying your implementation to those feels unnecessary when the corresponding portable implementation is simple. Sticking to the macro, you could do this instead:
#define READ16(addr) (addr[0] + (addr[1] << 8))
(Just to be pedantic, you should also make sure that addr[1] can't be out-of-bounds, and would need to add some more parentheses if addr could be a complex expression.)
However, as you keep developing your emulator, you will find that it's most natural to use a pair of general-purpose read_mem() and write_mem() functions that operate on single bytes. Remember that the address space is split up into multiple regions (RAM, ROM, and memory-mapped registers from the PPU and APU), so having e.g. a single array that you index into won't work well. The fact that memory regions can be remapped by mappers also complicates things. (You won't have to worry about that for simple games though -- I recommend starting with Donkey Kong.)
What you need to do is to figure out what region or memory-mapped register the address belongs to inside your read_mem() and write_mem() functions (this is called address decoding), and do the right thing for the address.
Returning to the original question, the fact that you'll end up using read_mem() to read the individual bytes of the address anyway means that the uint16_t casting trickery is even less likely to be useful. This is the simplest and most robust approach w.r.t. handling corner cases, and what every emulator I've seen does in practice (Nestopia, Nintendulator, and FCEUX).
In case you've missed it, the #nesdev channel on EFNet is very active and a good resource by the way. I assume you're already familiar with the NESDev wiki. :)
I've also been working on an emulator which can be found here.

Resources