What may be wrong about my use of SetGraphicsRootDescriptorTable in D3D12?

For the 7 meshes I want to draw, I load 7 textures and create the corresponding SRVs in a descriptor heap. Then there's another SRV for ImGui, and there are also 3 CBVs for triple buffering. So the heap layout should be: | srv x7 | srv x1 | cbv x3 |.
The problem is that when I call SetGraphicsRootDescriptorTable on range 0, which should be an SRV (the texture, in fact), something goes wrong. Here's the code:
ID3D12DescriptorHeap* ppHeaps[] = { pCbvSrvDescriptorHeap, pSamplerDescriptorHeap };
pCommandList->SetDescriptorHeaps(_countof(ppHeaps), ppHeaps);
pCommandList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
pCommandList->IASetIndexBuffer(pIndexBufferViewDesc);
pCommandList->IASetVertexBuffers(0, 1, pVertexBufferViewDesc);
CD3DX12_GPU_DESCRIPTOR_HANDLE srvHandle(pCbvSrvDescriptorHeap->GetGPUDescriptorHandleForHeapStart(), indexMesh, cbvSrvDescriptorSize);
pCommandList->SetGraphicsRootDescriptorTable(0, srvHandle);
pCommandList->SetGraphicsRootDescriptorTable(1, pSamplerDescriptorHeap->GetGPUDescriptorHandleForHeapStart());
If indexMesh is 5, SetGraphicsRootDescriptorTable causes the following error, even though the render output still looks fine. When indexMesh is 6, the same error still occurs, along with another identical error except that the offset 8 becomes 9.
D3D12 ERROR: CGraphicsCommandList::SetGraphicsRootDescriptorTable: Specified GPU Descriptor Handle (ptr = 0x400750000002c0 at 8 offsetInDescriptorsFromDescriptorHeapStart) of type CBV, for Root Signature (0x0000020A516E8BF0:'m_rootSignature')'s Descriptor Table (at Parameter Index [0])'s Descriptor Range (at Range Index [0] of type D3D12_DESCRIPTOR_RANGE_TYPE_SRV) have mismatching types. All descriptors of descriptor ranges declared STATIC (not-DESCRIPTORS_VOLATILE) in a root signature must be initialized prior to being set on the command list. [ EXECUTION ERROR #646: INVALID_DESCRIPTOR_HANDLE]
That is really weird, because the only thing I can think of that could cause this is cbvSrvDescriptorSize being wrong. It is 64, and it is set by m_device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV), which I think should be correct. Besides, if I set it to another value such as 32, the application crashes.
So if cbvSrvDescriptorSize is right, why would a correct indexMesh produce the wrong descriptor handle offset? The consequence of this error is that it seems to affect my CBV, which breaks the render output. Any suggestion would be appreciated, thanks!
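For reference, here is roughly what the CD3DX12_GPU_DESCRIPTOR_HANDLE constructor used above computes (a sketch based on my reading of the d3dx12.h helper, using the same variables as the snippet):
D3D12_GPU_DESCRIPTOR_HANDLE handle = pCbvSrvDescriptorHeap->GetGPUDescriptorHandleForHeapStart();
handle.ptr += UINT64(indexMesh) * UINT64(cbvSrvDescriptorSize); // offset measured in whole descriptors
// The "offsetInDescriptorsFromDescriptorHeapStart" in the debug message is likewise
// counted in descriptor increments from the start of the heap.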
Thanks to Chuck for the suggestion; here's the root signature code:
CD3DX12_DESCRIPTOR_RANGE1 ranges[3];
ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 4, 0, 0, D3D12_DESCRIPTOR_RANGE_FLAG_DATA_STATIC);
ranges[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SAMPLER, 1, 0);
ranges[2].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 0, 0, D3D12_DESCRIPTOR_RANGE_FLAG_DATA_STATIC);
CD3DX12_ROOT_PARAMETER1 rootParameters[3];
rootParameters[0].InitAsDescriptorTable(1, &ranges[0], D3D12_SHADER_VISIBILITY_PIXEL);
rootParameters[1].InitAsDescriptorTable(1, &ranges[1], D3D12_SHADER_VISIBILITY_PIXEL);
rootParameters[2].InitAsDescriptorTable(1, &ranges[2], D3D12_SHADER_VISIBILITY_ALL);
CD3DX12_VERSIONED_ROOT_SIGNATURE_DESC rootSignatureDesc;
rootSignatureDesc.Init_1_1(_countof(rootParameters), rootParameters, 0, nullptr, D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);
ComPtr<ID3DBlob> signature;
ComPtr<ID3DBlob> error;
ThrowIfFailed(D3DX12SerializeVersionedRootSignature(&rootSignatureDesc, featureData.HighestVersion, &signature, &error));
ThrowIfFailed(m_device->CreateRootSignature(0, signature->GetBufferPointer(), signature->GetBufferSize(), IID_PPV_ARGS(&m_rootSignature)));
NAME_D3D12_OBJECT(m_rootSignature);
And here are some declarations in the pixel shader:
Texture2DArray g_textures : register(t0);
SamplerState g_sampler : register(s0);
cbuffer cb0 : register(b0)
{
    float4x4 g_mWorldViewProj;
    float3 g_lightPos;
    float3 g_eyePos;
    ...
};

It's not very often I come across the exact problem I'm experiencing (my code is almost verbatim) and it's an in-progress post! Let's suffer together.
My problem turned out to be the calls to CreateConstantBufferView()/CreateShaderResourceView() - I was passing srvHeap->GetCPUDescriptorHandleForHeapStart() as the destDescriptor handle. These need to be offset to match your table layout (the offsetInDescriptorsFromTableStart param of CD3DX12_DESCRIPTOR_RANGE1).
I found it easier to just maintain one D3D12_CPU_DESCRIPTOR_HANDLE to the heap and just increment handle.ptr after every call to CreateSomethingView() which uses that heap.
CD3DX12_DESCRIPTOR_RANGE1 rangesV[1] = {{}};
CD3DX12_DESCRIPTOR_RANGE1 rangesP[1] = {{}};
// Vertex
rangesV[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 0, 0, D3D12_DESCRIPTOR_RANGE_FLAG_NONE, 0); // b0 at desc offset 0
// Pixel
rangesP[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1, 0, 0, D3D12_DESCRIPTOR_RANGE_FLAG_NONE, 1); // t0 at desc offset 1
CD3DX12_ROOT_PARAMETER1 rootParameters[2] = {{}};
rootParameters[0].InitAsDescriptorTable(1, &rangesV[0], D3D12_SHADER_VISIBILITY_VERTEX);
rootParameters[1].InitAsDescriptorTable(1, &rangesP[0], D3D12_SHADER_VISIBILITY_PIXEL);
D3D12_CPU_DESCRIPTOR_HANDLE srvHeapHandle = srvHeap->GetCPUDescriptorHandleForHeapStart();
// ----
device->CreateConstantBufferView(&cbvDesc, srvHeapHandle);
srvHeapHandle.ptr += device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
// ----
device->CreateShaderResourceView(texture, &srvDesc, srvHeapHandle);
srvHeapHandle.ptr += device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
Perhaps an enum would help keep it tidier and more maintainable, though. I'm still experimenting.
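For example, something along these lines (a minimal sketch of that idea; the slot names and the helper are made up, and device/srvHeap are assumed to be the same objects as above):
enum DescriptorSlot : UINT
{
    kSlotSceneCbv = 0,   // b0 at descriptor offset 0
    kSlotDiffuseSrv = 1, // t0 at descriptor offset 1
    kSlotCount
};
// Compute the CPU handle for a slot instead of bumping a running pointer.
D3D12_CPU_DESCRIPTOR_HANDLE SlotHandle(ID3D12Device* device, ID3D12DescriptorHeap* heap, DescriptorSlot slot)
{
    D3D12_CPU_DESCRIPTOR_HANDLE handle = heap->GetCPUDescriptorHandleForHeapStart();
    handle.ptr += SIZE_T(slot) * device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
    return handle;
}
// device->CreateConstantBufferView(&cbvDesc, SlotHandle(device, srvHeap, kSlotSceneCbv));
// device->CreateShaderResourceView(texture, &srvDesc, SlotHandle(device, srvHeap, kSlotDiffuseSrv));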


CUDA equivalent of pragma omp task

I am working on a problem where the amount of work per thread may vary drastically: for example, one thread may handle 1000000 elements while another handles only 1 or 2. I stumbled upon this, where the answer solves the unbalanced workload by using OpenMP tasks on the CPU, so my question is: can I achieve the same on CUDA?
In case you want more context:
The problem I'm trying to solve is: I have n tuples, each with a starting point, an ending point, and a value.
(0, 3, 1), (3, 6, 2), (6, 10, 3), ...
For each tuple, I want to write its value to every position between the starting point and the ending point in another, initially empty, array.
1, 1, 1, 2, 2, 2, 3, 3, 3, 3, ...
It is guaranteed that the ranges do not overlap.
My current approach is one thread per tuple, but since the ranges can vary a lot in length, the imbalanced workload between threads might become a bottleneck for the program; it may be rare, but it could very well happen.
The most common thread strategy I can think of in CUDA is to assign one thread per output point, and then have each thread do the work necessary to populate its output point.
For your stated objective (have each thread do roughly equal work) this is a useful strategy.
I will suggest using thrust for this. The basic idea is to:
determine the necessary size of the output based on the input
spin up a set of threads equal to the output size, where each thread determines its "insert index" in the output array by using a vectorized binary search on the input
with the insert index, insert the appropriate value in the output array.
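To make step 2 concrete, here is a plain host-side C++ illustration of the same binary-search idea (just for exposition; the values mirror the data used below and this is not part of the thrust solution itself):
#include <algorithm>
#include <iostream>
#include <vector>
int main(){
    std::vector<int> boundaries = {0, 3, 6, 10}; // tuple start points plus the final end point
    std::vector<int> values     = {5, 2, 7};     // value carried by each tuple
    int outputSize = boundaries.back();
    for (int i = 0; i < outputSize; ++i){
        // the first boundary greater than i, minus one, is the tuple whose
        // [start, end) range covers output position i
        int t = int(std::upper_bound(boundaries.begin(), boundaries.end(), i) - boundaries.begin()) - 1;
        std::cout << values[t] << ",";
    }
    std::cout << std::endl; // prints 5,5,5,2,2,2,7,7,7,7,
}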
I have used your data; the only change is that I changed the inserted values from 1, 2, 3 to 5, 2, 7:
$ cat t1871.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
typedef thrust::tuple<int,int,int> mt;
// returns selected item from tuple
struct my_cpy_functor1
{
    __host__ __device__ int operator()(mt d){ return thrust::get<1>(d); }
};
struct my_cpy_functor2
{
    __host__ __device__ int operator()(mt d){ return thrust::get<2>(d); }
};
int main(){
    mt my_data[] = {{0, 3, 5}, {3, 6, 2}, {6, 10, 7}};
    int ds = sizeof(my_data)/sizeof(my_data[0]); // determine data size
    int os = thrust::get<1>(my_data[ds-1]) - thrust::get<0>(my_data[0]); // and output size
    thrust::device_vector<mt> d_data(my_data, my_data+ds); // transfer data to device
    thrust::device_vector<int> d_idx(ds+1); // create index array for searching of insertion points
    thrust::transform(d_data.begin(), d_data.end(), d_idx.begin()+1, my_cpy_functor1()); // set index array
    thrust::device_vector<int> d_ins(os); // create array to hold insertion points
    thrust::upper_bound(d_idx.begin(), d_idx.end(), thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(os), d_ins.begin()); // identify insertion points
    thrust::transform(thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.begin(), _1 -1)), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.end(), _1 -1)), d_ins.begin(), my_cpy_functor2()); // insert
    thrust::copy(d_ins.begin(), d_ins.end(), std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl;
}
$ nvcc -o t1871 t1871.cu -std=c++14
$ ./t1871
5,5,5,2,2,2,7,7,7,7,
$

Flushing on UART doesn't work as expected

I need to write a sequence of values (a buffer, ~10 bytes) via UART.
This sequence needs to start with a BREAK delimiter, which in my case I produce by temporarily decreasing the baud rate.
Details about my environment:
Development board: BeagleBone Black.
Linux Kernel version: 3.8.13-bone70.
Serial driver used by the tty discipline: omap-serial.
What I ended up with is something like this:
UARTIOHandler->setBaudRate(B9600);
unsigned char breakChar[] = { 0 };
UARTIOHandler->write(breakChar, 1);
UARTIOHandler->setBaudRate(B19200);
UARTIOHandler->write({1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
The write method is implemented this way:
int UARTIOHandler::write(const std::initializer_list<uchar8> &data) {
    uchar8 buffer[data.size()];
    int counter = 0;
    for(auto i : data) {
        buffer[counter++] = i;
    }
    auto output = ::write(this->fd_write, buffer, data.size());
    this->flush();
    return output;
}
And finally the flush() method:
void UARTIOHandler::flush() {
    tcflush(this->fd_write, TCIOFLUSH);
}
The problem with this code is that the flushing doesn't always work: sometimes the distance between the BREAK and the first byte of data (observed on a scope) is ~500us (which is fine for my application), and sometimes it is up to ~3ms.
EDIT: This is the actual behavior:
For the first five seconds everything works fine (the distance between the BREAK and the rest of the message doesn't exceed ~1ms); after five seconds, some frames exceed this inter-byte timing (by up to ~3ms).
The code I posted is always the code that gets executed, so there is no way I am somehow forgetting to flush the buffers.
Why do these variations happen?
I have searched for relevant problems and found this; one workaround described there is to add a delay before the tcflush(...) call. I can't use that approach in my application because it would affect its functionality.
Another comment in that topic suggests that this was a bug in the Linux kernel; could that also be the case here?

What is PTHREAD_MUTEX_INITIALIZER?

I have read a lot of articles about PTHREAD_MUTEX_INITIALIZER and I understand what it does; however, I can't understand how it does it. How can a macro initialize a variable just by having its name assigned to that variable?
What I know about macros is that they can be used like functions, for example:
#define MAX(a, b) ((a) > (b) ? (a) : (b))
Now we can use this macro as if it were a function: MAX(a, b).
But how can we write a macro that can be used the way PTHREAD_MUTEX_INITIALIZER is used, i.e.:
int x = Macro_Name;
so that x is initialized to a specific value (just as a mutex is initialized once PTHREAD_MUTEX_INITIALIZER is assigned to it)?
Here is a snippet from the source code of libpthread, taken from http://git.savannah.gnu.org/cgit/hurd/libpthread.git/tree/sysdeps/pthread/bits/types/struct___pthread_mutex.h (I only removed comments that are irrelevant to the question)
/* User visible part of a mutex. */
struct __pthread_mutex
{
    __pthread_spinlock_t __held;
    __pthread_spinlock_t __lock;
    char *__cthreadscompat1;
    struct __pthread *__queue;
    struct __pthread_mutexattr *__attr;
    void *__data;
    void *__owner;
    unsigned __locks;
};
# define __PTHREAD_MUTEX_INITIALIZER \
  { __PTHREAD_SPIN_LOCK_INITIALIZER, __PTHREAD_SPIN_LOCK_INITIALIZER, 0, 0, 0, 0, 0, 0 }
From that, it can be seen that the macro hides an initializer list for the structure that represents the "user visible part of a mutex". Most members of the struct (including pointers) are set to 0, and internal spin locks are initialized with their own initializer macro, which is probably defined similarly.
Of course it's just one implementation, but I guess other implementations might have something similar.
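To see the mechanism in isolation, here is a minimal sketch of the same pattern applied to a made-up type (the names are hypothetical and unrelated to the real pthread types):
#include <stdio.h>
struct my_lock
{
    int held;
    void *owner;
};
/* The macro is nothing more than a brace-enclosed initializer list. */
#define MY_LOCK_INITIALIZER { 0, 0 }
int main(void)
{
    struct my_lock lock = MY_LOCK_INITIALIZER; /* expands to: struct my_lock lock = { 0, 0 }; */
    printf("%d %p\n", lock.held, lock.owner);
    return 0;
}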

"Invalid Handle" Create CGBitmapContext

I've got a problem with CGBitmapContext.
I get an error while creating the CGBitmapContext, with the message "Invalid Handle".
Here is my code:
var previewContext = new CGBitmapContext(null, (int)ExportedImage.Size.Width, (int)ExportedImage.Size.Height, 8, (int)ExportedImage.Size.Height * 4, CGColorSpace.CreateDeviceRGB(), CGImageAlphaInfo.PremultipliedFirst);
Thank you;
That is because you are passing null to the first parameter. The CGBitmapContext is for drawing directly into a memory buffer. The first parameter in all the overloads of the constructor is (Apple docs):
data
A pointer to the destination in memory where the drawing is to be rendered. The size of this memory block should be at least (bytesPerRow*height) bytes.
In MonoTouch, we get two overloads that accept a byte[] for convenience. So you should use it like this:
int bytesPerRow = (int)ExportedImage.Size.Width * 4; // note that bytes per row should be based on width, not height
byte[] ctxBuffer = new byte[bytesPerRow * (int)ExportedImage.Size.Height];
var colorSpace = CGColorSpace.CreateDeviceRGB();
var bitmapFlags = CGImageAlphaInfo.PremultipliedFirst;
var previewContext =
    new CGBitmapContext(ctxBuffer, (int)ExportedImage.Size.Width,
        (int)ExportedImage.Size.Height, 8, bytesPerRow, colorSpace, bitmapFlags);
This can also happen if the width or height parameters passed into the method have a value of 0.

What is openCL equivalent for this cuda "cudaMallocPitch "code.?

My PC has an AMD processor with an ATI 3200 GPU, which doesn't support OpenCL; everything else runs by falling back to the CPU.
I am converting one piece of code from CUDA to OpenCL but am stuck on a particular part for which there is no exact OpenCL equivalent. Since I have little experience with OpenCL I can't work this out, so please suggest a solution if you think one will work.
The CUDA code is,
size_t pitch = 0;
cudaError error = cudaMallocPitch((void**)&gpu_data, (size_t*)&pitch,
                                  instances->cols * sizeof(float), instances->rows);
for( int i = 0; i < instances->rows; i++ ){
    error = cudaMemcpy((void*)(gpu_data + (pitch/sizeof(float))*i),
                       (void*)(instances->data + (instances->cols*i)),
                       instances->cols * sizeof(float), cudaMemcpyHostToDevice);
}
If I remove the pitch value from the above, I end up with a problem where nothing is written to the device memory "gpu_data".
Could somebody please convert this code to OpenCL? I have converted it myself, but it's not working and the data is not written to "gpu_data". My converted OpenCL code is:
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE, ((instances->cols)*(instances->rows))*sizeof(float), NULL, &ret);
for( int i = 0; i < instances->rows; i++ ){
    ret = clEnqueueWriteBuffer(command_queue, gpu_data, CL_TRUE, 0, ((instances->cols)*(instances->rows))*sizeof(float), (void*)(instances->data + (instances->cols*i)), 0, NULL, NULL);
}
Sometimes it runs fine and then gets stuck in the reading part, i.e.
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0, sizeof(float) * instances->cols * 1, instances->data, 0, NULL, NULL);
over here, and it gives an error like:
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
When break is pressed, it gives:
No symbols are loaded for any call stack frame. The source code cannot be displayed.
while debugging. In the call stack it is displaying:
OCL8CA9.tmp.dll!10001098()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]
amdocl.dll!5c39de16()
I really don't know what this means. Could someone please help me get rid of this problem?
First of all, in the CUDA code you're doing a horribly inefficient thing to copy the data. The CUDA runtime has the function cudaMemcpy2D that does exactly what you are trying to do by looping over different rows.
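For reference, the loop in the question could probably be collapsed into a single call along these lines (untested here, and using the same variables as the question):
// dst, dst pitch (bytes), src, src pitch (bytes), width in bytes, height in rows
cudaError_t err = cudaMemcpy2D(gpu_data, pitch,
                               instances->data, instances->cols * sizeof(float),
                               instances->cols * sizeof(float), instances->rows,
                               cudaMemcpyHostToDevice);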
What cudaMallocPitch does is to compute an optimal pitch (= distance in byte between rows in a 2D array) such that each new row begins at an address that is optimal for coalescing, and then allocates a memory area as large as pitch times the number of rows you specify. You can emulate the same thing in OpenCL by first computing the optimal pitch and then doing the allocation of the correct size.
The optimal pitch is computed by (1) getting the base address alignment preference for your card (the CL_DEVICE_MEM_BASE_ADDR_ALIGN property queried with clGetDeviceInfo: note that the returned value is in bits, so you have to divide it by 8 to get it in bytes); let's call this base. (2) Finding the smallest multiple of base that is no less than your natural data pitch (sizeof(type) times the number of columns); this will be your pitch.
You then allocate pitch times number of rows bytes, and pass the pitch information to kernels.
Also, when copying data from the host to the device and conversely, you want to use clEnqueue{Read,Write}BufferRect, which are specifically designed to copy 2D data (they are the counterparts of cudaMemcpy2D).
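Putting the pieces together, a sketch of the pitch computation could look like this (host-side code; device selection and error checking are assumed and omitted):
#include <CL/cl.h>
size_t compute_pitch(cl_device_id device, size_t natural_pitch_bytes)
{
    cl_uint align_bits = 0;
    // base address alignment preference, reported in bits
    clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN, sizeof(align_bits), &align_bits, NULL);
    size_t base = align_bits / 8; // convert to bytes
    // round the natural pitch up to the next multiple of base
    return ((natural_pitch_bytes + base - 1) / base) * base;
}
// Then allocate pitch * rows bytes with clCreateBuffer, and copy the 2D data with
// clEnqueueWriteBufferRect / clEnqueueReadBufferRect, passing the pitch as the buffer row pitch.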
