Is it possible to write to a non-4-byte-aligned address with an HLSL compute shader? - direct3d

I am trying to convert an existing OpenCL kernel to an HLSL compute shader.
The OpenCL kernel samples each pixel in an RGBA texture and writes each color channel to a tightly packed array.
So basically, I need to write to a tightly packed uchar array in a pattern that goes somewhat like this:
r r r ... r g g g ... g b b b ... b a a a ... a
where each letter stands for a single byte (red / green / blue / alpha) that originates from a pixel channel.
Going through the documentation for the RWByteAddressBuffer Store method, it clearly states:
void Store(
in uint address,
in uint value
);
address [in]
Type: uint
The input address in bytes, which must be a multiple of 4.
In order to write the correct pattern to the buffer, I must be able to write a single byte to an unaligned address. In OpenCL / CUDA this is pretty trivial.
Is it technically possible to achieve that with HLSL?
Is this a known limitation? Are there possible workarounds?

As far as I know, it is not possible to write directly to an unaligned address in this scenario. You can, however, use a little trick to achieve what you want. Below you can see the code of the entire compute shader, which does exactly what you want. The function StoreValueAtByte in particular is what you are looking for.
Texture2D<float4> Input;
RWByteAddressBuffer Output;
void StoreValueAtByte(in uint index_of_byte, in uint value) {
// Calculate the address of the 4-byte-slot in which index_of_byte resides
uint addr_align4 = (index_of_byte / 4) * 4;
// Calculate which byte within the 4-byte-slot it is
uint location = index_of_byte % 4;
// Shift the byte into position 'location' within its 4-byte-slot
// (byte address buffer data is little-endian: byte 0 is the least significant byte)
value = value << (location * 8);
// Write value to buffer
Output.InterlockedOr(addr_align4, value);
}
[numthreads(20, 20, 1)]
void CSMAIN(uint3 ID : SV_DispatchThreadID) {
// Get width and height of texture
uint tex_width, tex_height;
Input.GetDimensions(tex_width, tex_height);
// Make sure thread does not operate outside the texture
if(tex_width > ID.x && tex_height > ID.y) {
uint num_pixels = tex_width * tex_height;
// Calculate address of where to write color channel data of pixel
uint addr_red = 0 * num_pixels + ID.y * tex_width + ID.x;
uint addr_green = 1 * num_pixels + ID.y * tex_width + ID.x;
uint addr_blue = 2 * num_pixels + ID.y * tex_width + ID.x;
uint addr_alpha = 3 * num_pixels + ID.y * tex_width + ID.x;
// Get color of pixel and convert from [0,1] to [0,255]
float4 color = Input[ID.xy];
uint4 color_final = uint4(round(color.x * 255), round(color.y * 255), round(color.z * 255), round(color.w * 255));
// Store color channel values in output buffer
StoreValueAtByte(addr_red, color_final.x);
StoreValueAtByte(addr_green, color_final.y);
StoreValueAtByte(addr_blue, color_final.z);
StoreValueAtByte(addr_alpha, color_final.w);
}
}
I hope the code is self-explanatory, since this is hard to explain, but I'll try anyway.
The first thing the function StoreValueAtByte does is calculate the address of the 4-byte-slot enclosing the byte you want to write to. After that, the position of the byte inside the 4-byte-slot is calculated (is it the first, second, third or fourth byte in the slot?). Since the byte you want to write is already inside a 4-byte variable (namely value) and occupies its least significant byte, you then just have to shift it to its proper position inside the 4-byte variable. After that you just write the variable value to the buffer at the 4-byte-aligned address. This is done with an atomic bitwise OR because multiple threads write to the same address and would otherwise interfere with each other, leading to write-after-write hazards. This of course only works if you initialize the entire output buffer with zeros before issuing the dispatch call.
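For completeness, here is a rough C++/D3D11 sketch of the CPU side (the names RunPackingShader, context, computeShader, inputSRV, outputUAV, texWidth and texHeight are placeholders of mine, not part of the answer above) showing the zero-clear followed by the dispatch:
#include <d3d11.h>

// Clear the output UAV to zero, then run one thread per pixel for the shader above.
void RunPackingShader(ID3D11DeviceContext* context,
                      ID3D11ComputeShader* computeShader,
                      ID3D11ShaderResourceView* inputSRV,
                      ID3D11UnorderedAccessView* outputUAV,
                      UINT texWidth, UINT texHeight)
{
    const UINT zeros[4] = { 0, 0, 0, 0 };
    context->ClearUnorderedAccessViewUint(outputUAV, zeros); // required so InterlockedOr works
    context->CSSetShader(computeShader, nullptr, 0);
    context->CSSetShaderResources(0, 1, &inputSRV);
    context->CSSetUnorderedAccessViews(0, 1, &outputUAV, nullptr);
    // Round up so partially filled 20x20 thread groups still cover the whole texture.
    context->Dispatch((texWidth + 19) / 20, (texHeight + 19) / 20, 1);
}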

Related

How to get the color of a specific pixel drawn using SDL_RenderDrawPoint() on SDL2?

SDL_SetRenderDrawColor(renderer, 255, 0, 0, 255);
SDL_RenderDrawPoint(renderer, (window_height / 2) + xxi[i], -yyi[i] + (window_width / 2));
SDL_RenderPresent(renderer);
Now I want to get the color of the point at (xxi[i], yyi[i]), but I don't know how to get it.
Use SDL_RenderReadPixels():
/**
* Read pixels from the current rendering target to an array of pixels.
*
* **WARNING**: This is a very slow operation, and should not be used
* frequently.
*
* `pitch` specifies the number of bytes between rows in the destination
* `pixels` data. This allows you to write to a subrectangle or have padded
* rows in the destination. Generally, `pitch` should equal the number of
* pixels per row in the `pixels` data times the number of bytes per pixel,
* but it might contain additional padding (for example, 24bit RGB Windows
* Bitmap data pads all rows to multiples of 4 bytes).
*
* \param renderer the rendering context
* \param rect an SDL_Rect structure representing the area to read, or NULL
* for the entire render target
* \param format an SDL_PixelFormatEnum value of the desired format of the
* pixel data, or 0 to use the format of the rendering target
* \param pixels a pointer to the pixel data to copy into
* \param pitch the pitch of the `pixels` parameter
* \returns 0 on success or a negative error code on failure; call
* SDL_GetError() for more information.
*/
extern DECLSPEC int SDLCALL SDL_RenderReadPixels(SDL_Renderer * renderer,
const SDL_Rect * rect,
Uint32 format,
void *pixels, int pitch);
Set the rect argument's width & height to 1 to read a single pixel.
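For example, a small helper along these lines should do it (a sketch of mine; the ReadPixel name, the RGBA8888 format choice and the error handling are assumptions, not from the SDL docs above):
#include <SDL.h>

// Read back one pixel at (x, y) from the current render target and split it into
// channels. This is slow, so don't call it every frame.
SDL_Color ReadPixel(SDL_Renderer* renderer, int x, int y)
{
    SDL_Rect rect = { x, y, 1, 1 };     // 1x1 area: a single pixel
    Uint32 pixel = 0;
    SDL_Color color = { 0, 0, 0, 0 };
    if (SDL_RenderReadPixels(renderer, &rect, SDL_PIXELFORMAT_RGBA8888,
                             &pixel, sizeof(pixel)) != 0) {
        SDL_Log("SDL_RenderReadPixels failed: %s", SDL_GetError());
        return color;
    }
    SDL_PixelFormat* fmt = SDL_AllocFormat(SDL_PIXELFORMAT_RGBA8888);
    SDL_GetRGBA(pixel, fmt, &color.r, &color.g, &color.b, &color.a);
    SDL_FreeFormat(fmt);
    return color;
}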

GLSL Compute Shader only Partially writing onto Buffer in Vulkan

I've created this GLSL compute shader and compiled it using "glslangValidator.exe". However, it only ever updates the "Particles[i].Velocity" values and none of the other members, and even that only happens in some instances. I've checked with "RenderDoc" that the correct input values are sent in.
Buffer Usage Flag Bits
VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT
And the Property Flag Bits
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
GLSL Shader
#version 450
#extension GL_ARB_separate_shader_objects : enable
struct Particle
{
vec3 Position;
vec3 Velocity;
vec3 IPosition;
vec3 IVelocity;
float LifeTime;
float ILifetime;
};
layout(binding = 0) buffer Source
{
Particle Particles[ ];
};
layout(binding = 1) uniform UBO
{
mat4 model;
mat4 view;
mat4 proj;
float time;
};
vec3 Gravity = vec3(0.0f,-0.98f,0.0f);
float dampeningFactor = 0.5;
void main(){
uint i = gl_GlobalInvocationID.x;
if(Particles[i].LifeTime > 0.0f){
Particles[i].Velocity = Particles[i].Velocity + Gravity * dampeningFactor * time;
Particles[i].Position = Particles[i].Position + Particles[i].Velocity * time;
Particles[i].LifeTime = Particles[i].LifeTime - time;
}else{
Particles[i].Velocity = Particles[i].IVelocity;
Particles[i].Position = Particles[i].IPosition;
Particles[i].LifeTime = Particles[i].ILifetime;
}
}
Descriptor Set Layout Binding
VkDescriptorSetLayoutBinding descriptorSetLayoutBindings[2] = {
{ 0, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0 },
{ 1, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0 }
};
The Command Dispatch
vkCmdDispatch(computeCommandBuffers, MAX_PARTICLES , 1, 1);
The Submitting of the Queue
VkSubmitInfo cSubmitInfo = {};
cSubmitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
cSubmitInfo.commandBufferCount = 1;
cSubmitInfo.pCommandBuffers = &computeCommandBuffers;
if (vkQueueSubmit(computeQueue.getQueue(), 1, &cSubmitInfo, computeFence) != VK_SUCCESS) {
throw std::runtime_error("failed to submit compute command buffer!");
}
vkWaitForFences(device.getDevice(), 1, &computeFence, VK_TRUE, UINT64_MAX);
UPDATE: 13/05/2017 (More Information Added)
Particle Struct Definition in CPP
struct Particle {
glm::vec3 location;
glm::vec3 velocity;
glm::vec3 initLocation;
glm::vec3 initVelocity;
float lifeTime;
float initLifetime;
};
Data Mapping to Storage Buffer
void* data;
vkMapMemory(device.getDevice(), stagingBufferMemory, 0, bufferSize, 0, &data);
memcpy(data, particles, (size_t)bufferSize);
vkUnmapMemory(device.getDevice(), stagingBufferMemory);
copyBuffer(stagingBuffer, computeBuffer, bufferSize);
Copy Buffer Function (by Alexander Overvoorde from vulkan-tutorial.com)
void copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size) {
VkCommandBufferAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
allocInfo.commandPool = commandPool.getCommandPool();
allocInfo.commandBufferCount = 1;
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(device.getDevice(), &allocInfo, &commandBuffer);
VkCommandBufferBeginInfo beginInfo = {};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
vkBeginCommandBuffer(commandBuffer, &beginInfo);
VkBufferCopy copyRegion = {};
copyRegion.size = size;
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
VkSubmitInfo submitInfo = {};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
vkQueueSubmit(graphicsQueue.getQueue(), 1, &submitInfo, VK_NULL_HANDLE);
vkQueueWaitIdle(graphicsQueue.getQueue());
vkFreeCommandBuffers(device.getDevice(), commandPool.getCommandPool(), 1, &commandBuffer);
}
Have a look at this StackOverflow question:
Memory allocation with std430 qualifier
FINAL, CORRECTED ANSWER:
In Your case the biggest member of Your structure is vec3 (3-element vector of floats). Base alignment of vec3 is the same as alignment of vec4. So the base alignment of Your array's elements is equal to 16 bytes. This means that each element of Your array has to start at an address that is a multiple of 16.
But alignment rules have to be applied to each structure member recursively. 3-element vectors have the same alignment as 4-element vectors. This means that:
Position member starts at the same alignment as each array member
Velocity, IPosition and IVelocity members must start at multiples of 16 bytes after the beginning of a given array element.
LifeTime and ILifeTime members have 4-byte alignment.
So the total size of Your struct in bytes is equal to:
Position - 16 bytes (Position itself takes 12 bytes, but next member has a 16-byte alignment)
Velocity - 16 bytes
IPosition - 16 bytes
IVelocity + LifeTime - 16 bytes
ILifeTime - 4 bytes
which gives 68 bytes. So, as far as I understand it, You need 12 bytes of padding at the end of Your structure (an additional 12 bytes between array elements) because each array element must start at an address that is a multiple of 16.
So the first array element starts at offset 0 of the memory bound to the storage buffer. But the second array element must start at offset 80 from the beginning of the memory (the nearest multiple of 16 greater than 68), and so on.
Or, as @NicolBolas commented, to make life easier, pack everything in vec4 members only ;-).
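For reference, a possible CPU-side counterpart based on the offsets worked out above might look like this (a sketch assuming GLM; the alignas placement, the pad member and the static_assert are my additions):
#include <glm/glm.hpp>

// Force each vec3 onto a 16-byte boundary and pad the struct to the 80-byte
// std430 array stride so the CPU and GPU agree on the layout.
struct Particle {
    alignas(16) glm::vec3 location;      // offset  0
    alignas(16) glm::vec3 velocity;      // offset 16
    alignas(16) glm::vec3 initLocation;  // offset 32
    alignas(16) glm::vec3 initVelocity;  // offset 48
    float lifeTime;                      // offset 60 (packs into initVelocity's 16-byte slot)
    float initLifetime;                  // offset 64
    float pad[3];                        // trailing padding up to the next multiple of 16
};
static_assert(sizeof(Particle) == 80, "must match the std430 array stride of the GLSL struct");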
BETTER THOUGH NOT FULLY CORRECT ANSWER:
In Your case the biggest member of Your structure is vec3 (3-element vector of floats). So the base alignment of Your array's elements is equal to 12 bytes (in case of arrays of structs in std430 layout, the base alignment doesn't have to be rounded up to match the alignment of 4-element vectors. <- Here I was wrong. We don't have to round up the structure's base alignment, but the alignment of its members is calculated normally, with vec3 alignment being the same as vec4 alignment). This means that each element of Your array has to start at an address that is a multiple of 12 (no, in this case it should start at a multiple of 16).
But alignment rules have to be applied to each structure member recursively. 3-element vectors have the same alignment as 4-element vectors. This means that:
Position member starts at the same alignment as each array member
Velocity, IPosition and IVelocity members must start at multiples of 16 bytes after the beginning of a given array element.
LifeTime and ILifeTime members have 4-byte alignment.
So the total size of Your struct in bytes is equal to:
Position - 16 bytes (Position itself takes 12 bytes, but next member has a 16-byte alignment)
Velocity - 16 bytes
IPosition - 16 bytes
IVelocity + LifeTime - 16 bytes
ILifeTime - 4 bytes
which gives 68 bytes. So, as far as I understand it, You need a 4-byte padding at the end of Your structure (an additional 4 bytes between array elements) because each array element must start at an address that is a multiple of 12 (again, we actually need 12 bytes of padding here so the next array element starts at a multiple of 16, not 12).
So the first array element starts at offset 0 of the memory bound to the storage buffer. But the second array element must start at offset 72 from the beginning of the memory (the nearest multiple of 12 greater than 68), and so on.
PREVIOUS, WRONG ANSWER:
In Your case the biggest member is vec3 (3-element vector of floats). Its alignment is equal to 12 bytes (in case of arrays of structs we don't have to round the alignment of 3-element vectors up to match the alignment of 4-element vectors). The size of Your struct in bytes equals 56. So, as far as I understand it, You need a 4-byte padding at the end of Your structure (an additional 4 bytes between array elements) because each array element must start at an address that is a multiple of 12.

C++ size_t and double type calculation

I am not familiar with C++ and current face a problem about size_t calculation with double type.
I provide part of the source code below. The variable "storage" is defined as a pointer to double and "pos" as size_t. How come they can be calculated together? I reviewed the value of "pos" and it shows values like 0, 1, 2 and so on. Moreover, in the case of double* result = storage + pos, it shows that 108 + 2 comes out as 117.
Further, sometimes 108 + 0 comes out as zero. What condition leads to that result?
How do I know the exact value of size_t before the calculation?
Any advice & suggestion is appreciated.
double* getPosValue(size_t pos, IdentifierType *idRule, unsigned int *errorNumber, bool *found)
{
double* storage = /* from other function, with value 108 */;
double* result = storage + pos;
uint16_t* stat = status + pos;
}
The size of a variable (or type) can be obtained with:
sizeof(variableNameOrTypeName)
If you're after the address of a given array element such as variableName[42], it's simply:
&(variableName[42])
with no explicit mucking about with pointers.
If you want to manipulate the actual double value when you only have a pointer to it, you need to dereference the pointer. For example:
double xyzzy = 108.0; // this is the VALUE.
double *pXyzzy = &xyzzy; // this is a POINTER to it.
double plugh = *pXyzzy + 12.0;
The final line above gets the value from the pointer (*pXyzzy) and adds twelve to that, before storing it into another variable named plugh.
You should be very wary of things like:
double * storage = 108;
That creates a pointer to a double with the address of 108. In no way does it create an actual double with the value 108. Dereferencing that pointer is likely to lead to, shall we say, interesting results :-)
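To see what storage + pos actually computes when storage is a valid pointer, here is a small self-contained sketch with made-up values:
#include <cstddef>
#include <iostream>

int main() {
    double data[4] = { 1.0, 2.0, 3.0, 4.0 };
    double* storage = data;         // points at data[0]
    std::size_t pos = 2;

    // Pointer + integer is pointer arithmetic: the result points 'pos' ELEMENTS
    // past 'storage', i.e. pos * sizeof(double) bytes, not "address value + pos".
    double* result = storage + pos; // same as &storage[pos], here &data[2]

    std::cout << "value at result: " << *result << "\n";   // prints 3
    std::cout << "byte offset: "
              << reinterpret_cast<char*>(result) - reinterpret_cast<char*>(storage)
              << "\n";                                      // prints 16
    return 0;
}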

How to interpret the field 'data' of an XImage

I am trying to understand how the data obtained from XGetImage is disposed in memory:
XImage *img = XGetImage(display, root, 0, 0, width, height, AllPlanes, ZPixmap);
Now suppose I want to decompose each pixel value in red, blue, green channels. How can I do this in a portable way? The following is an example, but it depends on a particular configuration of the XServer and does not work in every case:
for (int x = 0; x < width; x++)
for (int y = 0; y < height; y++) {
unsigned long pixel = XGetPixel(img, x, y);
unsigned char blue = pixel & blue_mask;
unsigned char green = (pixel & green_mask) >> 8;
unsigned char red = (pixel & red_mask) >> 16;
//...
}
In the above example I am assuming a particular order of the RGB channels in pixel and also that pixels have 24-bit depth: in fact, I have img->depth=24 and img->bits_per_pixel=32 (the screen also has 24-bit depth). But this is not the generic case.
As a second step I want to get rid of XGetPixel and use or describe img->data directly. The first thing I need to know is whether there is anything in Xlib which gives me exactly the information I need to interpret how the image is built starting from the img->data field, namely:
the order of R,G,B channels in each pixel;
the number of bits for each pixel;
the number of bits for each channel;
if possible, a corresponding FOURCC
The shift is a simple function of the mask:
int get_shift (int mask) {
int shift = 0;
while (mask) {
if (mask & 1) break;
shift++;
mask >>=1;
}
return shift;
}
Number of bits in each channel is just the number of 1 bits in its mask (count them). The channel order is determined by the shifts (if the red shift is 0, then the first channel is R, etc).
I think the valid values for bits_per_pixel are 1, 2, 4, 8, 15, 16, 24 and 32 (15 and 16 bits are the same 2 bytes per pixel format, but the former has 1 bit unused). I don't think it's worth anyone's time to support anything but 24 and 32 bpp.
X11 is not concerned with media files, so no 4CC code.
This can be read from the XImage structure itself.
the order of R,G,B channels in each pixel;
This is contained in this field of the XImage structure:
int byte_order; /* data byte order, LSBFirst, MSBFirst */
which tells you whether it's RGB or BGR (because it only depends on the endianness of the machine).
the number of bits for each pixels;
can be obtained from this field:
int bits_per_pixel; /* bits per pixel (ZPixmap) */
which is basically the number of bits set in each of the channel masks:
unsigned long red_mask; /* bits in z arrangement */
unsigned long green_mask;
unsigned long blue_mask;
the number of bits for each channel;
See above, or you can use the code from @n.m.'s answer to count the bits yourself.
Yeah, it would be great if they put the bit shift constants in that structure too, but apparently they decided not to, since the pixels are aligned to bytes anyway, in "standard order" (RGB). Xlib makes sure to convert it to that order for you when it retrieves the data from the X server, even if they are stored internally in a different format server-side. So it's always in RGB format, byte-aligned, but depending on the endianness of the machine, the bytes inside an unsigned long can appear in a reverse order, hence the byte_order field to tell you about that.
So in order to extract these channels, just use the 0, 8 and 16 shifts after masking with red_mask, green_mask and blue_mask, just make sure you shift the right bytes depending on the byte_order and it should work fine.
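Putting the two answers together, a sketch along these lines (assuming the img pointer from the question; the helper names mask_shift, mask_bits and decompose are mine) derives the shifts from the masks at runtime instead of hard-coding 0, 8 and 16:
#include <X11/Xlib.h>

// Position of the lowest set bit in a channel mask (how far to shift right).
int mask_shift(unsigned long mask) {
    int shift = 0;
    while (mask && !(mask & 1UL)) { mask >>= 1; ++shift; }
    return shift;
}

// Number of bits in a channel mask (the channel's bit depth).
int mask_bits(unsigned long mask) {
    int bits = 0;
    while (mask) { bits += (int)(mask & 1UL); mask >>= 1; }
    return bits;
}

void decompose(XImage* img, int width, int height) {
    const int rshift = mask_shift(img->red_mask);
    const int gshift = mask_shift(img->green_mask);
    const int bshift = mask_shift(img->blue_mask);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned long pixel = XGetPixel(img, x, y);
            unsigned long red   = (pixel & img->red_mask)   >> rshift;
            unsigned long green = (pixel & img->green_mask) >> gshift;
            unsigned long blue  = (pixel & img->blue_mask)  >> bshift;
            /* ... use red, green, blue ... */
            (void)red; (void)green; (void)blue;
        }
    }
}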

What is the return value of sched_find_first_bit if it doesn't find anything?

The kernel is 2.4.
On a side note, does anybody know a good place where I can search for that kind of information? Searching Google for function definitions is frustrating.
If you plan on spending any significant time searching through or understanding the Linux kernel, I recommend downloading a copy and using Cscope.
Using Cscope on large projects (example: the Linux kernel)
I found the following in a copy of the Linux kernel 2.4.18.
The key seems to be the comment before this last piece of code below. It appears that the return value of sched_find_first_bit is undefined if no bit is set.
From linux-2.4/include/linux/sched.h:185
/*
* The maximum RT priority is configurable. If the resulting
* bitmap is 160-bits , we can use a hand-coded routine which
* is optimal. Otherwise, we fall back on a generic routine for
* finding the first set bit from an arbitrarily-sized bitmap.
*/
#if MAX_PRIO < 160 && MAX_PRIO > 127
#define sched_find_first_bit(map) _sched_find_first_bit(map)
#else
#define sched_find_first_bit(map) find_first_bit(map, MAX_PRIO)
#endif
From linux-2.4/include/asm-i386/bitops.h:303
/**
* find_first_bit - find the first set bit in a memory region
* @addr: The address to start the search at
* @size: The maximum size to search
*
* Returns the bit-number of the first set bit, not the number of the byte
* containing a bit.
*/
static __inline__ int find_first_bit(void * addr, unsigned size)
{
int d0, d1;
int res;
/* This looks at memory. Mark it volatile to tell gcc not to move it around */
__asm__ __volatile__(
"xorl %%eax,%%eax\n\t"
"repe; scasl\n\t"
"jz 1f\n\t"
"leal -4(%%edi),%%edi\n\t"
"bsfl (%%edi),%%eax\n"
"1:\tsubl %%ebx,%%edi\n\t"
"shll $3,%%edi\n\t"
"addl %%edi,%%eax"
:"=a" (res), "=&c" (d0), "=&D" (d1)
:"1" ((size + 31) >> 5), "2" (addr), "b" (addr));
return res;
}
From linux-2.4/include/asm-i386/bitops.h:425
/*
* Every architecture must define this function. It's the fastest
* way of searching a 140-bit bitmap where the first 100 bits are
* unlikely to be set. It's guaranteed that at least one of the 140
* bits is cleared.
*/
static inline int _sched_find_first_bit(unsigned long *b)
{
if (unlikely(b[0]))
return __ffs(b[0]);
if (unlikely(b[1]))
return __ffs(b[1]) + 32;
if (unlikely(b[2]))
return __ffs(b[2]) + 64;
if (b[3])
return __ffs(b[3]) + 96;
return __ffs(b[4]) + 128;
}
From linux-2.4/include/asm-i386/bitops.h:409
/**
* __ffs - find first bit in word.
* @word: The word to search
*
* Undefined if no bit exists, so code should check against 0 first.
*/
static __inline__ unsigned long __ffs(unsigned long word)
{
__asm__("bsfl %1,%0"
:"=r" (word)
:"rm" (word));
return word;
}
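To make the caller's side of that contract concrete, here is a user-space analogue (a sketch assuming GCC/Clang's __builtin_ctz, not kernel code) that adds the explicit "nothing found" check the kernel version deliberately omits:
#include <cstdint>

// Scan a 5-word (160-bit) bitmap for the lowest set bit. Unlike __ffs()/bsf,
// whose result is undefined for an all-zero input, this returns a sentinel.
int find_first_set_bit(const uint32_t b[5]) {
    for (int word = 0; word < 5; ++word) {
        if (b[word])
            return word * 32 + __builtin_ctz(b[word]); // index of lowest set bit
    }
    return -1; // no bit set anywhere: a defined "not found" value
}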
