GLSL Compute Shader only Partially writing onto Buffer in Vulkan - graphics

I've created this GLSL compute shader and compiled it using "glslangValidator.exe". However, it only ever updates the "Particles[i].Velocity" values and none of the others, and even this only happens in some instances. I've checked with "RenderDoc" that the correct input values are being sent in.
Buffer Usage Flag Bits
VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT
And the Property Flag Bits
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
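For context, a storage buffer with these usage and property flags would typically be created along the following lines. This is only a sketch, not the asker's actual code; "findMemoryType" is a hypothetical helper that picks a memory type index matching the requested property flags.
VkBufferCreateInfo bufferInfo = {};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = bufferSize;
bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

VkBuffer computeBuffer;
vkCreateBuffer(device.getDevice(), &bufferInfo, nullptr, &computeBuffer);

VkMemoryRequirements memReq;
vkGetBufferMemoryRequirements(device.getDevice(), computeBuffer, &memReq);

VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReq.size;
// findMemoryType is a hypothetical helper (not from the question) that returns
// a memory type index supporting the requested property flags.
allocInfo.memoryTypeIndex = findMemoryType(memReq.memoryTypeBits,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);

VkDeviceMemory computeBufferMemory;
vkAllocateMemory(device.getDevice(), &allocInfo, nullptr, &computeBufferMemory);
vkBindBufferMemory(device.getDevice(), computeBuffer, computeBufferMemory, 0);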
GLSL Shader
#version 450
#extension GL_ARB_separate_shader_objects : enable
struct Particle
{
    vec3 Position;
    vec3 Velocity;
    vec3 IPosition;
    vec3 IVelocity;
    float LifeTime;
    float ILifetime;
};
layout(binding = 0) buffer Source
{
    Particle Particles[ ];
};
layout(binding = 1) uniform UBO
{
    mat4 model;
    mat4 view;
    mat4 proj;
    float time;
};
vec3 Gravity = vec3(0.0f,-0.98f,0.0f);
float dampeningFactor = 0.5;
void main(){
    uint i = gl_GlobalInvocationID.x;
    if(Particles[i].LifeTime > 0.0f){
        Particles[i].Velocity = Particles[i].Velocity + Gravity * dampeningFactor * time;
        Particles[i].Position = Particles[i].Position + Particles[i].Velocity * time;
        Particles[i].LifeTime = Particles[i].LifeTime - time;
    }else{
        Particles[i].Velocity = Particles[i].IVelocity;
        Particles[i].Position = Particles[i].IPosition;
        Particles[i].LifeTime = Particles[i].ILifetime;
    }
}
Descriptor Set Layout Binding
VkDescriptorSetLayoutBinding descriptorSetLayoutBindings[2] = {
    { 0, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0 },
    { 1, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, 0 }
};
The Command Dispatch
vkCmdDispatch(computeCommandBuffers, MAX_PARTICLES, 1, 1);
The Submitting of the Queue
VkSubmitInfo cSubmitInfo = {};
cSubmitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
cSubmitInfo.commandBufferCount = 1;
cSubmitInfo.pCommandBuffers = &computeCommandBuffers;
if (vkQueueSubmit(computeQueue.getQueue(), 1, &cSubmitInfo, computeFence) != VK_SUCCESS) {
    throw std::runtime_error("failed to submit compute command buffer!");
}
vkWaitForFences(device.getDevice(), 1, &computeFence, VK_TRUE, UINT64_MAX);
UPDATE: 13/05/2017 (More Information Added)
Particle Struct Definition in CPP
struct Particle {
    glm::vec3 location;
    glm::vec3 velocity;
    glm::vec3 initLocation;
    glm::vec3 initVelocity;
    float lifeTime;
    float initLifetime;
};
Data Mapping to Storage Buffer
void* data;
vkMapMemory(device.getDevice(), stagingBufferMemory, 0, bufferSize, 0, &data);
memcpy(data, particles, (size_t)bufferSize);
vkUnmapMemory(device.getDevice(), stagingBufferMemory);
copyBuffer(stagingBuffer, computeBuffer, bufferSize);
Copy Buffer Function (by Alexander Overvoorde from vulkan-tutorial.com)
void copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size) {
    VkCommandBufferAllocateInfo allocInfo = {};
    allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    allocInfo.commandPool = commandPool.getCommandPool();
    allocInfo.commandBufferCount = 1;

    VkCommandBuffer commandBuffer;
    vkAllocateCommandBuffers(device.getDevice(), &allocInfo, &commandBuffer);

    VkCommandBufferBeginInfo beginInfo = {};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

    vkBeginCommandBuffer(commandBuffer, &beginInfo);

    VkBufferCopy copyRegion = {};
    copyRegion.size = size;
    vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);

    vkEndCommandBuffer(commandBuffer);

    VkSubmitInfo submitInfo = {};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &commandBuffer;

    vkQueueSubmit(graphicsQueue.getQueue(), 1, &submitInfo, VK_NULL_HANDLE);
    vkQueueWaitIdle(graphicsQueue.getQueue());

    vkFreeCommandBuffers(device.getDevice(), commandPool.getCommandPool(), 1, &commandBuffer);
}

Have a look at this StackOverflow question:
Memory allocation with std430 qualifier
FINAL, CORRECTED ANSWER:
In your case the biggest member of your structure is vec3 (a 3-element vector of floats). The base alignment of vec3 is the same as the alignment of vec4. So the base alignment of your array's elements is equal to 16 bytes. This means that each element of your array has to start at an address that is a multiple of 16.
But alignment rules have to be applied to each structure member recursively. 3-element vectors have the same alignment as 4-element vectors. This means that:
The Position member starts at the same alignment as each array element
The Velocity, IPosition and IVelocity members must start at multiples of 16 bytes after the beginning of a given array element.
The LifeTime and ILifetime members have a 4-byte alignment.
So the total size of your struct in bytes is equal to:
Position - 16 bytes (Position itself takes 12 bytes, but the next member has a 16-byte alignment)
Velocity - 16 bytes
IPosition - 16 bytes
IVelocity + LifeTime - 16 bytes
ILifetime - 4 bytes
which gives 68 bytes. So, as far as I understand it, you need 12 bytes of padding at the end of your structure (an additional 12 bytes between array elements) because each array element must start at an address that is a multiple of 16.
So the first array element starts at offset 0 of the memory bound to the storage buffer. But the second array element must start at offset 80 from the beginning of the memory (the nearest multiple of 16 greater than 68), and so on.
Or, as @NicolBolas commented, to make life easier, pack everything into vec4 members only ;-).
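As an illustration (not the asker's original code), a CPU-side struct matching that 80-byte std430 stride could be padded explicitly like the sketch below; rounding every vec3 member up to a full vec4, as suggested above, achieves the same layout with less bookkeeping.
#include <glm/glm.hpp>

// Hypothetical CPU-side mirror of the std430 layout described above.
// Every vec3 is padded out to 16 bytes and the trailing pad brings the
// array stride to 80 bytes, so each element starts on a 16-byte boundary.
struct ParticleStd430 {
    glm::vec3 position;   float pad0;   // offset  0, 16 bytes
    glm::vec3 velocity;   float pad1;   // offset 16, 16 bytes
    glm::vec3 iPosition;  float pad2;   // offset 32, 16 bytes
    glm::vec3 iVelocity;                // offset 48, 12 bytes
    float lifeTime;                     // offset 60,  4 bytes
    float iLifetime;                    // offset 64,  4 bytes
    float pad3[3];                      // offset 68, pads the stride to 80
};
static_assert(sizeof(ParticleStd430) == 80, "array stride must match std430");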
BETTER THOUGH NOT FULLY CORRECT ANSWER:
In your case the biggest member of your structure is vec3 (a 3-element vector of floats). So the base alignment of your array's elements is equal to 12 bytes (in the case of arrays of structs in std430 layout, the base alignment doesn't have to be rounded up to match the alignment of 4-element vectors. <- Here I was wrong. We don't have to round up the structure's base alignment, but the alignment of its members is calculated normally, with vec3 alignment being the same as vec4 alignment). This means that each element of your array has to start at an address that is a multiple of 12 (no, in this case it should start at a multiple of 16).
But alignment rules have to be applied to each structure member recursively. 3-element vectors have the same alignment as 4-element vectors. This means that:
The Position member starts at the same alignment as each array element
The Velocity, IPosition and IVelocity members must start at multiples of 16 bytes after the beginning of a given array element.
The LifeTime and ILifetime members have a 4-byte alignment.
So the total size of your struct in bytes is equal to:
Position - 16 bytes (Position itself takes 12 bytes, but the next member has a 16-byte alignment)
Velocity - 16 bytes
IPosition - 16 bytes
IVelocity + LifeTime - 16 bytes
ILifetime - 4 bytes
which gives 68 bytes. So, as far as I understand it, you need a 4-byte padding at the end of your structure (an additional 4 bytes between array elements) because each array element must start at an address which is a multiple of 12 (again, we need 12 bytes of padding here so the next array element starts at a multiple of 16, not 12).
So the first array element starts at offset 0 of the memory bound to the storage buffer. But the second array element must start at offset 72 from the beginning of the memory (the nearest multiple of 12 greater than 68), and so on.
PREVIOUS, WRONG ANSWER:
In your case the biggest member is vec3 (a 3-element vector of floats). Its alignment is equal to 12 bytes (in the case of arrays of structs we don't have to round the alignment of 3-element vectors up to match the alignment of 4-element vectors). The size of your struct in bytes equals 56. So, as far as I understand it, you need a 4-byte padding at the end of your structure (an additional 4 bytes between array elements) because each array element must start at an address which is a multiple of 12.

Related

Estimating max payload size on a compressed list of integers

I have 1 million rows in an application. It makes a request to a server such as the following:
/search?q=hello
And the search returns a sorted list of integers, representing the rows that have been matched in the input dataset (which the user has in their browser). How would I go about estimating the maximum size of the payload that would be returned? For example, to start we have:
# ~7 MB if we stored "all results" uncompressed
6888887
# ~ 3.5MB if we stored "all results" relative to 0 or ALL matches (cuts it down by two)
3444443
And then we would want to compress these integers using some sort of encoding (Elias-Fano?). What would be the "worst-case" scenario for the size of 1M sorted integers? And how would that calculation be made?
The application has one million rows of data, so let's say R1 --> R1000000, or if zero-indexing, range(int(1e6)). The server will respond with something like: [1,2,3], indicating that (only) rows 1, 2, and 3 were matched.
There are 2^(10^6) different sorted (duplicate-less) lists of integers < 10^6. Mapping each such list, say [0, 4, ...], to the corresponding bit array (say 10001....) yields 10^6 bits, i.e. 125 kB of information. As each bit array corresponds to a unique possible sorted list, and vice versa, this is the most compact (in the sense of: having the smallest maximum size) representation.
Of course, if some results are more probable than others, there may be more efficient (in the sense of: having a smaller average size) representations.
For example, if most result sets are small, a simple run-length encoding may generally yield smaller encodings.
Inevitably, in that case the maximal size of the encoding (the max. payload size you were asking about) will be more than 125 kB.
Compressing the above-mentioned 125 kB bit array with e.g. zlib will yield an acceptably compact encoding for small result sets. Moreover, zlib has a function deflateBound() that, given the uncompressed size, will calculate the max payload size (which, in your case, will definitely be larger than 125 kB, but not by much).
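If you go the bitmap-plus-zlib route, that worst-case bound can be queried up front with deflateBound(); a small sketch, assuming zlib is available:
#include <zlib.h>
#include <cstdio>

int main()
{
    z_stream strm = {};                        // zalloc/zfree/opaque left as Z_NULL
    deflateInit(&strm, Z_BEST_COMPRESSION);
    // Upper bound on the deflated size of the 125,000-byte bitmap.
    uLong bound = deflateBound(&strm, 125000);
    deflateEnd(&strm);
    std::printf("worst-case payload: %lu bytes\n", static_cast<unsigned long>(bound));
    return 0;
}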
Input specification:
Row numbers between 0 and 999999 (if you need 1-indexing, you can apply offset)
Each row number appears only once
The numbers are sorted in ascending order (useful, we'd want to sort them anyway)
A great idea you had was to invert the meaning of the result when the number of matches is more than half the possible values. Let us retain that, and assume we are given a flag and a list of matches/misses.
Your initial attempt at coding this encoded the numbers as text with comma separation. That means that for 90% of the possible values you need 6 characters + 1 separator -- so 7 bytes on average. However, since the maximum value is 999999, you really only need 20 bits to encode each entry.
Hence, the first idea to reducing the size is to use binary encoding.
Binary Encoding
The simplest approach is to write the number of values sent, followed by a stream of 32-bit integers.
A more efficient approach would be to pack two 20-bit values into each 5 bytes written. In case of an odd count, you would just pad the 4 excess bits with zeros.
Those approaches may be good for small amounts of matches (or misses). However, the important thing to note is that for each row, we only need to track 1 bit of information -- whether it's present or not. That means that we can encode the results as a bitmap of 1000000 bits.
Combining those two approaches, we can use a bitmap when there are many matches or misses, and switch to binary coding when it's more efficient.
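As a small illustration of the 5-byte packing mentioned above (my own sketch, not part of the original answer; row indices are assumed to be below 2^20):
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs pairs of 20-bit row indices into 5 bytes; an odd trailing value is
// padded with 4 zero bits, as described above.
std::vector<uint8_t> pack20(std::vector<uint32_t> const& values)
{
    std::vector<uint8_t> out;
    for (size_t i = 0; i < values.size(); i += 2) {
        uint64_t pair = uint64_t(values[i]) << 20;      // first value in the high 20 bits
        if (i + 1 < values.size()) {
            pair |= values[i + 1];                      // second value in the low 20 bits
        }
        for (int shift = 32; shift >= 0; shift -= 8) {  // emit the 40 bits as 5 bytes
            out.push_back(uint8_t((pair >> shift) & 0xFF));
        }
    }
    return out;
}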
Range Reduction
The next potential improvement to use when coding sorted sequences of integers is to use range reduction.
The idea is to code the values from largest to smallest, reducing the number of bits per value as they get smaller; a small encoder sketch follows the steps below.
First, we encode the number of bits N necessary to represent the first value.
We encode the first value using N bits
For each following value
Encode the value using N bits
If the value requires fewer bits to encode, reduce N appropriately
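A rough encoder sketch of those steps (my own illustration), assuming the element count is transmitted separately so the decoder knows when to stop:
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Appends "count" bits of "value" to a bit stream, MSB first
// (one bit per byte for simplicity; a real implementation would pack them).
static void put_bits(std::vector<uint8_t>& bits, uint32_t value, int count)
{
    for (int i = count - 1; i >= 0; --i) {
        bits.push_back(uint8_t((value >> i) & 1u));
    }
}

// Number of bits needed to represent "value" (at least 1).
static int bit_length(uint32_t value)
{
    int n = 1;
    while (value >>= 1) {
        ++n;
    }
    return n;
}

// Range-reduction encoder: values are coded from largest to smallest and the
// field width N shrinks whenever a value fits into fewer bits. The decoder can
// mirror this because it sees each value before the width is reduced.
std::vector<uint8_t> encode_range_reduced(std::vector<uint32_t> values)
{
    std::vector<uint8_t> bits;
    if (values.empty()) {
        return bits;
    }
    std::sort(values.begin(), values.end(), std::greater<uint32_t>());

    int n = bit_length(values.front());
    put_bits(bits, uint32_t(n), 5);           // widths up to 20 fit in 5 bits
    for (uint32_t v : values) {
        put_bits(bits, v, n);                 // code the value with the current width N
        n = std::min(n, bit_length(v));       // reduce N if the value allows it
    }
    return bits;
}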
Entropy Coding
Let's go back to the bitmap encoding. Based on Shannon entropy, the worst case is when we have 50% matches. The further the probabilities are skewed, the fewer bits we need on average to code each entry.
Matches | Bits
--------+-----------
0 | 0
1 | 22
2 | 41
3 | 60
4 | 78
5 | 96
10 | 181
100 | 1474
1000 | 11408
10000 | 80794
100000 | 468996
250000 | 811279
500000 | 1000000
To do this, we need to use an entropy coder that can code fractional bits -- something like an arithmetic or range coder, or some of the new ANS-based coders like FSE. Alternatively, we could group symbols together and use Huffman coding.
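For reference, the numbers in the table above follow from the binary entropy function; a quick sketch that reproduces them (the table appears to list the ceiling of these values):
#include <cmath>
#include <cstdio>

// Shannon bound for a bitmap of n bits containing k ones: n * H(k/n) bits,
// where H is the binary entropy function.
double entropy_bits(double n, double k)
{
    if (k <= 0.0 || k >= n) {
        return 0.0;
    }
    double p = k / n;
    return n * (-p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p));
}

int main()
{
    const double ks[] = { 1.0, 100.0, 10000.0, 100000.0, 500000.0 };
    for (double k : ks) {
        std::printf("%8.0f matches -> %.1f bits\n", k, entropy_bits(1000000.0, k));
    }
    return 0;
}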
Prototypes and Measurements
I've written a test using a 32-bit implementation of FastAC by Amir Said, which limits the model to 4 decimal places.
(This is not really a problem, since we shouldn't be feeding such data to the codec directly. This is just a demonstration.)
First some common code:
typedef std::vector<uint8_t> match_symbols_t;
typedef std::vector<uint32_t> match_list_t;
typedef std::set<uint32_t> match_set_t;
typedef std::vector<uint8_t> buffer_t;
// ----------------------------------------------------------------------------
static uint32_t const NUM_VALUES(1000000);
// ============================================================================
size_t symbol_count(uint8_t bits)
{
    size_t count(NUM_VALUES / bits);
    if (NUM_VALUES % bits > 0) {
        return count + 1;
    }
    return count;
}
// ----------------------------------------------------------------------------
void set_symbol(match_symbols_t& symbols, uint8_t bits, uint32_t match, bool state)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    if (state) {
        symbols[index] |= 1 << offset;
    } else {
        symbols[index] &= ~(1 << offset);
    }
}
// ----------------------------------------------------------------------------
bool get_symbol(match_symbols_t const& symbols, uint8_t bits, uint32_t match)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    return (symbols[index] & (1 << offset)) != 0;
}
// ----------------------------------------------------------------------------
match_symbols_t make_symbols(match_list_t const& matches, uint8_t bits)
{
    assert((bits > 0) && (bits <= 8));
    match_symbols_t symbols(symbol_count(bits), 0);
    for (auto match : matches) {
        set_symbol(symbols, bits, match, true);
    }
    return symbols;
}
// ----------------------------------------------------------------------------
match_list_t make_matches(match_symbols_t const& symbols, uint8_t bits)
{
    match_list_t result;
    for (uint32_t i(0); i < 1000000; ++i) {
        if (get_symbol(symbols, bits, i)) {
            result.push_back(i);
        }
    }
    return result;
}
The first, simpler variant writes the number of matches, determines the probability of match/miss, and clamps it to the supported range.
Then it simply encodes each value of the bitmap using this static probability model.
class arithmetic_codec_v1
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();
        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);
        if (match_count > 0) {
            // Initialize the model
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));
            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                codec.encode(entry, model);
            }
        }
        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }

    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();
        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));
            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                uint32_t entry = codec.decode(model);
                if (entry == 1) {
                    result.push_back(i);
                }
            }
        }
        codec.stop_decoder();
        return result;
    }

private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};
The second approach is to adapt the model based on the symbols we code.
After each match is encoded, reduce the probability of the next match.
Once all matches are coded, stop.
The second variation compresses slightly better, but at a noticeable performance cost.
class arithmetic_codec_v2
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        uint32_t total_count(NUM_VALUES);
        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();
        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);
        if (match_count > 0) {
            static_bit_model model;
            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                model.set_probability_0(get_probability_0(match_count, total_count));
                codec.encode(entry, model);
                --total_count;
                if (entry) {
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }
        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }

    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();
        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        uint32_t total_count(NUM_VALUES);
        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                model.set_probability_0(get_probability_0(match_count, NUM_VALUES - i));
                if (codec.decode(model) == 1) {
                    result.push_back(i);
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }
        codec.stop_decoder();
        return result;
    }

private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};
Practical Approach
Practically, it's probably not worth designing a new compression format.
In fact, it might not even be worth writing the results as bits; just make an array of bytes with values 0 or 1.
Then use an existing compression library -- zlib is very common, or you could try lz4 or snappy, bzip2, lzma... the choices are plentiful.
ZLib Example
class zlib_codec
{
public:
    zlib_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}

    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));

        z_stream defstream;
        defstream.zalloc = nullptr;
        defstream.zfree = nullptr;
        defstream.opaque = nullptr;
        deflateInit(&defstream, Z_BEST_COMPRESSION);

        size_t max_compress_size = deflateBound(&defstream, static_cast<uLong>(symbols.size()));
        buffer_t compressed(max_compress_size);

        defstream.avail_in = static_cast<uInt>(symbols.size());
        defstream.next_in = &symbols[0];
        defstream.avail_out = static_cast<uInt>(max_compress_size);
        defstream.next_out = &compressed[0];

        deflate(&defstream, Z_FINISH);
        deflateEnd(&defstream);

        compressed.resize(defstream.total_out);
        return compressed;
    }

    match_list_t decompress(buffer_t& compressed)
    {
        z_stream infstream;
        infstream.zalloc = nullptr;
        infstream.zfree = nullptr;
        infstream.opaque = nullptr;
        inflateInit(&infstream);

        match_symbols_t symbols(symbol_count(bits_per_symbol));

        infstream.avail_in = static_cast<uInt>(compressed.size());
        infstream.next_in = &compressed[0];
        infstream.avail_out = static_cast<uInt>(symbols.size());
        infstream.next_out = &symbols[0];

        inflate(&infstream, Z_FINISH);
        inflateEnd(&infstream);

        return make_matches(symbols, bits_per_symbol);
    }

private:
    uint32_t bits_per_symbol;
};
BZip2 Example
class bzip2_codec
{
public:
    bzip2_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}

    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));

        uint32_t compressed_size = symbols.size() * 2;
        buffer_t compressed(compressed_size);

        int err = BZ2_bzBuffToBuffCompress((char*)&compressed[0]
            , &compressed_size
            , (char*)&symbols[0]
            , symbols.size()
            , 9
            , 0
            , 30);
        if (err != BZ_OK) {
            throw std::runtime_error("Compression error.");
        }

        compressed.resize(compressed_size);
        return compressed;
    }

    match_list_t decompress(buffer_t& compressed)
    {
        match_symbols_t symbols(symbol_count(bits_per_symbol));
        uint32_t decompressed_size = symbols.size();

        int err = BZ2_bzBuffToBuffDecompress((char*)&symbols[0]
            , &decompressed_size
            , (char*)&compressed[0]
            , compressed.size()
            , 0
            , 0);
        if (err != BZ_OK) {
            throw std::runtime_error("Compression error.");
        }
        if (decompressed_size != symbols.size()) {
            throw std::runtime_error("Size mismatch.");
        }

        return make_matches(symbols, bits_per_symbol);
    }

private:
    uint32_t bits_per_symbol;
};
Comparison
The code repository, including dependencies for 64-bit Visual Studio 2015, is at https://github.com/dan-masek/bounded_sorted_list_compression.git
Storing a compressed list of sorted integers is extremely common in data retrieval and database applications, and a variety of techniques have been developed.
I'm pretty sure that an unguessably random selection of about half of the items in your list is going to be your worst case.
Many popular integer-list-compression techniques, such as Roaring bitmaps, fall back to using (with such worst-case input data) a 1-bit-per-index bitmap.
So in your case, with 1 million rows, the maximum-size payload returned would be (in the worst case) a header with the "using a bitmap" flag set, followed by a bitmap of 1 million bits (125,000 bytes), where for example the 700th bit of the bitmap is set to 1 if the 700th row in the database is a match, or set to 0 if the 700th row in the database does not match. (Thanks, Dan Mašek!)
My understanding is that, while quasi-succinct Elias-Fano compression and other techniques are very useful for compressing many "naturally-occurring" sets of sorted integers, for this worst-case data set, none of them give better compression, and most of them give far worse "compression", than a simple bitmap.
(This is analogous to the way most general-purpose data compression algorithms, such as DEFLATE, when fed "worst-case" data such as indistinguishable-from-random encrypted data, create "compressed" files with a few bytes of overhead with the "stored/raw/literal" flag set, followed by a simple copy of the uncompressed file).
Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression"
https://en.wikipedia.org/wiki/Bitmap_index#Compression
https://en.wikipedia.org/wiki/Inverted_index#Compression

GoLang - memory allocation - []byte vs string

In the below code:
c := "fool"
d := []byte("fool")
fmt.Printf("c: %T, %d\n", c, unsafe.Sizeof(c)) // 16 bytes
fmt.Printf("d: %T, %d\n", d, unsafe.Sizeof(d)) // 24 bytes
To decide the datatype needed to receive JSON data from CloudFoundry, I am testing the above sample code to understand the memory allocation for the []byte vs string types.
The expected size of the string variable c is 1 byte x 4 ASCII-encoded letters = 4 bytes, but the size shows 16 bytes.
For the byte-slice variable d, Go embeds the string in the executable program as a string literal. It converts the string literal to a byte slice at runtime using the runtime.stringtoslicebyte function. Something like... []byte{102, 111, 111, 108}
The expected size of the byte-slice variable d is again 1 byte x 4 ASCII values = 4 bytes, but the size of variable d shows 24 bytes as its underlying array capacity.
Why is the size of both variables not 4 bytes?
Both slices and strings in Go are struct-like headers:
reflect.SliceHeader:
type SliceHeader struct {
    Data uintptr
    Len  int
    Cap  int
}
reflect.StringHeader:
type StringHeader struct {
    Data uintptr
    Len  int
}
The sizes reported by unsafe.Sizeof() are the sizes of these headers, excluding the size of the pointed-to arrays:
Sizeof takes an expression x of any type and returns the size in bytes of a hypothetical variable v as if v was declared via var v = x. The size does not include any memory possibly referenced by x. For instance, if x is a slice, Sizeof returns the size of the slice descriptor, not the size of the memory referenced by the slice.
To get the actual ("recursive") size of some arbitrary value, use Go's builtin testing and benchmarking framework. For details, see How to get memory size of variable in Go?
For strings specifically, see String memory usage in Golang. The complete memory required by a string value can be computed like this:
var str string = "some string"
stringSize := len(str) + int(unsafe.Sizeof(str))

Is it possible to write to a non 4-bytes aligned address with HLSL compute shader?

I am trying to convert an existing OpenCL kernel to an HLSL compute shader.
The OpenCL kernel samples each pixel in an RGBA texture and writes each color channel to a tightly packed array.
So basically, I need to write to a tightly packed uchar array in a pattern that goes somewhat like this:
r r r ... r g g g ... g b b b ... b a a a ... a
where each letter stands for a single byte (red / green / blue / alpha) that originates from a pixel channel.
Going through the documentation for the RWByteAddressBuffer Store method, it clearly states:
void Store(
    in uint address,
    in uint value
);
address [in]
Type: uint
The input address in bytes, which must be a multiple of 4.
In order to write the correct pattern to the buffer, I must be able to write a single byte to an unaligned address. In OpenCL / CUDA this is pretty trivial.
Is it technically possible to achieve that with HLSL?
Is this a known limitation? possible workarounds?
As far as I know, it is not possible to write directly to an unaligned address in this scenario. You can, however, use a little trick to achieve what you want. Below you can see the code of the entire compute shader which does exactly what you want. The function StoreValueAtByte in particular is what you are looking for.
Texture2D<float4> Input;
RWByteAddressBuffer Output;

void StoreValueAtByte(in uint index_of_byte, in uint value) {
    // Calculate the address of the 4-byte-slot in which index_of_byte resides
    uint addr_align4 = floor(float(index_of_byte) / 4.0f) * 4;
    // Calculate which byte within the 4-byte-slot it is
    uint location = index_of_byte % 4;
    // Shift bits to their proper location within its 4-byte-slot
    value = value << ((3 - location) * 8);
    // Write value to buffer
    Output.InterlockedOr(addr_align4, value);
}

[numthreads(20, 20, 1)]
void CSMAIN(uint3 ID : SV_DispatchThreadID) {
    // Get width and height of texture
    uint tex_width, tex_height;
    Input.GetDimensions(tex_width, tex_height);

    // Make sure thread does not operate outside the texture
    if(tex_width > ID.x && tex_height > ID.y) {
        uint num_pixels = tex_width * tex_height;

        // Calculate address of where to write color channel data of pixel
        uint addr_red = 0 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_green = 1 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_blue = 2 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_alpha = 3 * num_pixels + ID.y * tex_width + ID.x;

        // Get color of pixel and convert from [0,1] to [0,255]
        float4 color = Input[ID.xy];
        uint4 color_final = uint4(round(color.x * 255), round(color.y * 255), round(color.z * 255), round(color.w * 255));

        // Store color channel values in output buffer
        StoreValueAtByte(addr_red, color_final.x);
        StoreValueAtByte(addr_green, color_final.y);
        StoreValueAtByte(addr_blue, color_final.z);
        StoreValueAtByte(addr_alpha, color_final.w);
    }
}
I hope the code is self-explanatory, since it is hard to explain, but I'll try anyway.
The first thing the function StoreValueAtByte does is calculate the address of the 4-byte slot enclosing the byte you want to write to. After that, the position of the byte inside the 4-byte slot is calculated (is it the first, second, third or fourth byte in the slot?). Since the byte you want to write is already inside a 4-byte variable (namely value) and occupies the rightmost byte, you then just have to shift it to its proper position inside that 4-byte variable. After that you just have to write the variable value to the buffer at the 4-byte-aligned address. This is done using a bitwise OR because multiple threads writing to the same address would otherwise interfere with each other, leading to write-after-write hazards. This of course only works if you initialize the entire output buffer with zeros before issuing the dispatch call.
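Since the trick relies on OR-ing into an already-zeroed buffer, the host side has to clear it before every dispatch. A minimal D3D11 sketch (my addition; "context", "computeShader", "inputSRV", "outputUAV" and the texture dimensions are assumed to be created elsewhere):
// Clear the RWByteAddressBuffer's UAV to zero before dispatching, so that
// InterlockedOr only ever sets bits and never sees stale data.
const UINT zeros[4] = { 0, 0, 0, 0 };
context->ClearUnorderedAccessViewUint(outputUAV, zeros);

// Bind resources and launch enough 20x20 thread groups to cover the texture.
context->CSSetShader(computeShader, nullptr, 0);
context->CSSetShaderResources(0, 1, &inputSRV);
context->CSSetUnorderedAccessViews(0, 1, &outputUAV, nullptr);
context->Dispatch((tex_width + 19) / 20, (tex_height + 19) / 20, 1);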

Why is the transitive closure of matrix multiplication not working in this vertex shader?

This can probably be filed under premature optimization, but since the vertex shader executes on each vertex for each frame, it seems like something worth doing (I have a lot of vars I need to multiply before going to the pixel shader).
Essentially, the vertex shader performs this operation to convert a vector to projected space, like so:
// Transform the vertex position into projected space.
pos = mul(pos, model);
pos = mul(pos, view);
pos = mul(pos, projection);
output.pos = pos;
Since I'm doing this operation to multiple vectors in the shader, it made sense to combine those matrices into a cumulative matrix on the CPU and then flush that to the GPU for calculation, like so:
// VertexShader.hlsl
cbuffer ModelViewProjectionConstantBuffer : register (b0)
{
    matrix model;
    matrix view;
    matrix projection;
    matrix cummulative;
    float3 eyePosition;
};
...
// Transform the vertex position into projected space.
pos = mul(pos, cummulative);
output.pos = pos;
And on the CPU:
// Renderer.cpp
// now is also the time to update the cummulative matrix
m_constantMatrixBufferData->cummulative =
    m_constantMatrixBufferData->model *
    m_constantMatrixBufferData->view *
    m_constantMatrixBufferData->projection;
// NOTE: each of the above vars is an XMMATRIX
My intuition was that there was some mismatch of row-major/column-major, but XMMATRIX is a row-major struct (and all of its operators treat it as such) and mul(...) interprets its matrix parameter as row-major. So that doesn't seem to be the problem but perhaps it still is in a way that I don't understand.
I've also checked the contents of the cumulative matrix and they appear correct, further adding to the confusion.
Thanks for reading, I'll really appreciate any hints you can give me.
EDIT (additional information requested in comments):
This is the struct that I am using as my matrix constant buffer:
// a constant buffer that contains the 3 matrices needed to
// transform points so that they're rendered correctly
struct ModelViewProjectionConstantBuffer
{
    DirectX::XMMATRIX model;
    DirectX::XMMATRIX view;
    DirectX::XMMATRIX projection;
    DirectX::XMMATRIX cummulative;
    DirectX::XMFLOAT3 eyePosition;
    // and padding to make the size divisible by 16
    float padding;
};
I create the matrix stack in CreateDeviceResources (along with my other constant buffers), like so:
void ModelRenderer::CreateDeviceResources()
{
    Direct3DBase::CreateDeviceResources();

    // Let's take this moment to create some constant buffers
    ... // creation of other constant buffers

    // and lastly, the all mighty matrix buffer
    CD3D11_BUFFER_DESC constantMatrixBufferDesc(sizeof(ModelViewProjectionConstantBuffer), D3D11_BIND_CONSTANT_BUFFER);
    DX::ThrowIfFailed(
        m_d3dDevice->CreateBuffer(
            &constantMatrixBufferDesc,
            nullptr,
            &m_constantMatrixBuffer
        )
    );

    ... // and the rest of the initialization (reading in the shaders, loading assets, etc)
}
I write into the matrix buffer inside a matrix stack class I created. The client of the class calls Update() once they are done modifying the matrices:
void MatrixStack::Update()
{
    // then update the buffers
    m_constantMatrixBufferData->model = model.front();
    m_constantMatrixBufferData->view = view.front();
    m_constantMatrixBufferData->projection = projection.front();
    // NOTE: the eye position has no stack, as it's kept updated by the trackball

    // now is also the time to update the cummulative matrix
    m_constantMatrixBufferData->cummulative =
        m_constantMatrixBufferData->model *
        m_constantMatrixBufferData->view *
        m_constantMatrixBufferData->projection;

    // and flush
    m_d3dContext->UpdateSubresource(
        m_constantMatrixBuffer.Get(),
        0,
        NULL,
        m_constantMatrixBufferData,
        0,
        0
    );
}
Given your code snippets, it should work.
Possible causes of your problem:
Have you tried the inverted multiplication: projection * view * model?
Are you sure you set the cummulative matrix correctly in your constant buffer (register index = 12, offset = 192)?
Same for eyePosition (register index = 16, offset = 256)? One way to double-check the offsets is shown in the sketch below.
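A sketch of how the compiled vertex shader could be reflected to print where each constant-buffer variable actually lands (my addition; "vsBytecode" and "vsBytecodeSize" are assumed to hold the compiled shader blob):
#include <d3dcompiler.h>
#include <cstdio>
#pragma comment(lib, "d3dcompiler.lib")

// Prints the offset and size of selected variables inside the
// ModelViewProjectionConstantBuffer, as seen by the compiled shader.
void DumpConstantBufferLayout(const void* vsBytecode, size_t vsBytecodeSize)
{
    ID3D11ShaderReflection* reflector = nullptr;
    if (FAILED(D3DReflect(vsBytecode, vsBytecodeSize,
                          IID_ID3D11ShaderReflection,
                          reinterpret_cast<void**>(&reflector)))) {
        return;
    }

    ID3D11ShaderReflectionConstantBuffer* cb =
        reflector->GetConstantBufferByName("ModelViewProjectionConstantBuffer");

    const char* names[] = { "cummulative", "eyePosition" };
    for (const char* name : names) {
        D3D11_SHADER_VARIABLE_DESC desc = {};
        if (SUCCEEDED(cb->GetVariableByName(name)->GetDesc(&desc))) {
            std::printf("%s: offset %u, size %u\n", name, desc.StartOffset, desc.Size);
        }
    }
    reflector->Release();
}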

How to interpret the field 'data' of an XImage

I am trying to understand how the data obtained from XGetImage is laid out in memory:
XImage *img = XGetImage(display, root, 0, 0, width, height, AllPlanes, ZPixmap);
Now suppose I want to decompose each pixel value in red, blue, green channels. How can I do this in a portable way? The following is an example, but it depends on a particular configuration of the XServer and does not work in every case:
for (int x = 0; x < width; x++)
    for (int y = 0; y < height; y++) {
        unsigned long pixel = XGetPixel(img, x, y);
        unsigned char blue = pixel & blue_mask;
        unsigned char green = (pixel & green_mask) >> 8;
        unsigned char red = (pixel & red_mask) >> 16;
        //...
    }
In the above example I am assuming a particular order of the RGB channels in pixel and also that pixels have a 24-bit depth: in fact, I have img->depth=24 and img->bits_per_pixel=32 (the screen is also 24-bit depth). But this is not a generic case.
As a second step I want to get rid of XGetPixel and use or describe img->data directly. The first thing I need to know is whether there is anything in Xlib which gives me exactly all the information I need to interpret how the image is built, starting from the img->data field, namely:
the order of R,G,B channels in each pixel;
the number of bits for each pixels;
the number of bits for each channel;
if possible, a corresponding FOURCC
The shift is a simple function of the mask:
int get_shift (int mask) {
    int shift = 0;
    while (mask) {
        if (mask & 1) break;
        shift++;
        mask >>= 1;
    }
    return shift;
}
The number of bits in each channel is just the number of 1 bits in its mask (count them). The channel order is determined by the shifts (if the red shift is 0, then the first channel is R, etc.).
I think the valid values for bits_per_pixel are 1, 2, 4, 8, 15, 16, 24 and 32 (15 and 16 bits are the same 2 bytes per pixel format, but the former has 1 bit unused). I don't think it's worth anyone's time to support anything but 24 and 32 bpp.
X11 is not concerned with media files, so no 4CC code.
This can be read from the XImage structure itself.
the order of R,G,B channels in each pixel;
This is contained in this field of the XImage structure:
int byte_order; /* data byte order, LSBFirst, MSBFirst */
which tells you whether it's RGB or BGR (because it only depends on the endianness of the machine).
the number of bits for each pixels;
can be obtained from this field:
int bits_per_pixel; /* bits per pixel (ZPixmap) */
which is basically the number of bits set in each of the channel masks:
unsigned long red_mask; /* bits in z arrangement */
unsigned long green_mask;
unsigned long blue_mask;
the number of bits for each channel;
See above, or you can use the code from #n.m.'s answer to count the bits yourself.
Yeah, it would be great if they put the bit shift constants in that structure too, but apparently they decided not to, since the pixels are aligned to bytes anyway, in "standard order" (RGB). Xlib makes sure to convert it to that order for you when it retrieves the data from the X server, even if they are stored internally in a different format server-side. So it's always in RGB format, byte-aligned, but depending on the endianness of the machine, the bytes inside an unsigned long can appear in a reverse order, hence the byte_order field to tell you about that.
So in order to extract these channels, just use the 0, 8 and 16 shifts after masking with red_mask, green_mask and blue_mask, just make sure you shift the right bytes depending on the byte_order and it should work fine.
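Putting the pieces together, a portable extraction loop based only on the masks and computed shifts might look like the sketch below (my addition; it reuses the get_shift helper from the answer above):
#include <X11/Xlib.h>
#include <X11/Xutil.h>

// Portable per-pixel channel extraction: shifts and widths are derived from
// the masks stored in the XImage itself, so no particular channel order or
// depth is assumed.
void extract_channels(XImage *img, int width, int height)
{
    int red_shift   = get_shift((int)img->red_mask);
    int green_shift = get_shift((int)img->green_mask);
    int blue_shift  = get_shift((int)img->blue_mask);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned long pixel = XGetPixel(img, x, y);
            unsigned long red   = (pixel & img->red_mask)   >> red_shift;
            unsigned long green = (pixel & img->green_mask) >> green_shift;
            unsigned long blue  = (pixel & img->blue_mask)  >> blue_shift;
            /* ... use red, green, blue ... */
        }
    }
}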
