In bitmap.h, why does bitmap_zero need memset?

In include/linux/bitmap.h, why does bitmap_zero() use memset?
static inline void bitmap_zero(unsigned long *dst, int nbits)
{
	if (small_const_nbits(nbits))
		*dst = 0UL;
	else {
		int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
		memset(dst, 0, len);
	}
}
Isn't *dst = 0UL enough?

The definition of small_const_nbits is:
#define small_const_nbits(nbits) \
(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
BITS_PER_LONG is normally 32 or 64, depending on what machine you're on.
So, if nbits is a compile-time constant no larger than that, the whole bitmap fits in a single unsigned long and can be cleared with one store - that's the first half of the if statement. If the bitmap is longer than 32 or 64 bits, or its size is only known at run time, several words may need clearing, and that's done by the memset call.
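To make the two branches concrete, here is a minimal userspace sketch. The macro definitions are stand-ins I am assuming for illustration (the kernel's own live in its headers), bitmap_zero_sketch is a hypothetical name, and the sketch drops the __builtin_constant_p part of small_const_nbits, so it only shows the size check.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Userspace stand-ins for the kernel macros (assumed definitions). */
#define BITS_PER_LONG    ((int)(sizeof(unsigned long) * CHAR_BIT))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical helper: clears every word that backs an nbits-wide bitmap. */
static void bitmap_zero_sketch(unsigned long *dst, int nbits)
{
    assert(nbits > 0);
    if (nbits <= BITS_PER_LONG) {
        *dst = 0UL;                       /* one word covers the whole bitmap */
    } else {
        size_t len = (size_t)BITS_TO_LONGS(nbits) * sizeof(unsigned long);
        memset(dst, 0, len);              /* several words have to be cleared */
    }
}
```

With a bitmap of 2 * BITS_PER_LONG + 1 bits, three words are cleared; a single store would leave two of them untouched.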


C Function to return a String resulting in corrupted top size

I am trying to write a program that calls upon an [external library (?)] (I'm not sure that I'm using the right terminology here) that I am also writing to clean up a provided string. For example, if my main.c program were to be provided with a string such as:
asdfFAweWFwseFL Wefawf JAWEFfja FAWSEF
it would call upon a function in externalLibrary.c (let's call it externalLibrary_Clean for now) that would take in the string, and return all characters in upper case without spaces:
ASDFFAWEWFWSEFLWEFAWFJAWEFFJAFAWSEF
The crazy part is that I have this working... so long as my string doesn't exceed 26 characters in length. As soon as I add a 27th character, I end up with an error that says
malloc(): corrupted top size.
Here is externalLibrary.c:
#include "externalLibrary.h"
#include <ctype.h>
#include <malloc.h>
#include <assert.h>
#include <string.h>
char * restrict externalLibrary_Clean(const char* restrict input) {
    // first we define the return value as a pointer and initialize
    // an integer to count the length of the string
    char * returnVal = malloc(sizeof(input));
    char * initialReturnVal = returnVal; //point to the start location
    // until we hit the end of the string, we use this while loop to
    // iterate through it
    while (*input != '\0') {
        if (isalpha(*input)) { // if we encounter an alphabet character (a-z/A-Z)
            // then we convert it to an uppercase value and point our return value at it
            *returnVal = toupper(*input);
            returnVal++; //we use this to move our return value to the next location in memory
        }
        input++; // we move to the next memory location on the provided character pointer
    }
    *returnVal = '\0'; //once we have exhausted the input character pointer, we terminate our return value
    return initialReturnVal;
}
int * restrict externalLibrary_getFrequencies(char * ar, int length){
    static int freq[26];
    for (int i = 0; i < length; i++){
        /* ... */
    }
    return freq;
}
the header file for it (externalLibrary.h):
#ifdef __cplusplus
extern "C" {
#endif
char * restrict externalLibrary_Clean(const char* restrict input);
int * restrict externalLibrary_getFrequencies(char * ar, int length);
#ifdef __cplusplus
}
#endif
my main.c file from where all the action is happening:
#include <stdio.h>
#include "externalLibrary.h"
int main() {
    char * unfilteredString = "ASDFOIWEGOASDGLKASJGISUAAAA";//if this exceeds 26 characters, the program breaks
    char * cleanString = externalLibrary_Clean(unfilteredString);
    //int * charDist = externalLibrary_getFrequencies(cleanString, 25); //this works just fine... for now
    printf("\nOutput: %s\n", unfilteredString);
    printf("\nCleaned Output: %s\n", cleanString);
    /*for(int i = 0; i < 26; i++){
        if(charDist[i] == 0){
        }
        else {
            printf("%c: %d \n", (i + 65), charDist[i]);
        }
    }*/
    return 0;
}
I'm extremely well versed in Java programming and I'm trying to translate my knowledge over to C as I wish to learn how my computer works in more detail (and have finer control over things such as memory).
If I were solving this problem in Java, it would be as simple as creating two class files: one called and one called, where I would have static String Clean(string input) and then call upon it in with String cleanString = externalLibrary.Clean(unfilteredString).
Clearly this isn't how C works, but I want to learn how (and why my code is crashing with corrupted top size)
The bug is this line:
char * returnVal = malloc(sizeof(input));
The reason it is a bug is that sizeof(input) is the size of the pointer, not of the string it points to, so the call requests only enough space to store a pointer: 8 bytes in a 64-bit program. What you want is enough space to store the modified string, which you can get with the following line:
char *returnVal = malloc(strlen(input) + 1);
So the other part of your question is why the program doesn't crash when your string is no longer than 26 characters. The reason is that malloc is allowed to give the caller slightly more than the caller requested.
In your case, the message "malloc(): corrupted top size" suggests that you are using libc malloc, which is the default on Linux. That variant of malloc, in a 64-bit process, would always give you at least 0x18 (24) bytes (minimum chunk size 0x20 - 8 bytes for the size/status). In the specific case that the allocation immediately precedes the "top" allocation, writing past the end of the allocation will clobber the "top" size.
If your string is longer than 23 (0x17) characters, you will start to clobber the size/status of the subsequent allocation, because you also need 1 byte to store the terminating NUL. However, any string of 23 characters or fewer will not cause a problem.
As to why you didn't get an error with a string of 26 characters: one would have to see the exact program, with the 26-character string that does not crash, to give a more precise answer. For example, if the program was given a 26-character input that contained 3 blanks, it would require only 26 + 1 - 3 = 24 bytes in the allocation, which would fit.
If you are not interested in that level of detail, fixing the malloc call to request the proper amount will fix your crash.
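For reference, a minimal sketch of the function with the allocation fixed. The name clean_upper_alpha is hypothetical (it mirrors externalLibrary_Clean), and the NULL checks and the unsigned char cast are additions of mine rather than part of the original code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical name; mirrors externalLibrary_Clean from the question. */
char *clean_upper_alpha(const char *input)
{
    assert(input != NULL);

    /* Worst case every character is kept, plus one byte for '\0'. */
    char *out = malloc(strlen(input) + 1);
    if (out == NULL)
        return NULL;

    char *dst = out;
    for (; *input != '\0'; input++) {
        /* The <ctype.h> functions expect an unsigned char value. */
        unsigned char c = (unsigned char)*input;
        if (isalpha(c))
            *dst++ = (char)toupper(c);
    }
    *dst = '\0';
    return out;   /* the caller is responsible for free() */
}
```

Unlike Java, nothing frees the returned buffer for you, so main should call free(cleanString) once it is done with it.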

Estimating max payload size on a compressed list of integers

I have 1 million rows in an application. It makes a request to a server such as the following:
And the search returns a sorted list of integers, representing the rows that have been matched in the input dataset (which the user has in their browser). How would I go about estimating the maximum size that the payload would return? For example, to start we have:
# ~7 MB if we stored "all results" uncompressed
# ~ 3.5MB if we stored "all results" relative to 0 or ALL matches (cuts it down by two)
And then we would want to compress these integers using some sort of compression scheme (Elias-Fano?). What would be the "worst-case" scenario for the size of 1M sorted integers? And how would that calculation be made?
The application has one million rows of data, so let's say R1 --> R1000000, or if zero-indexing, range(int(1e6)). The server will respond with something like: [1,2,3], indicating that (only) rows 1, 2, and 3 were matched.
There are 2^(10^6) different sorted (duplicate-free) lists of integers < 10^6. Mapping each such list, say [0, 4, ...], to the corresponding bit array (say 10001...) yields 10^6 bits, i.e. 125 kB of information. As each bit array corresponds to a unique possible sorted list, and vice versa, this is the most compact (in the sense of: having the smallest max. size) representation.
Of course, if some results are more probable than others, there may be more efficient (in the sense of: having a smaller average size) representations.
For example, if most result sets are small, a simple run-length encoding may generally yield smaller encodings.
Inevitably, in that case the maximal size of the encoding (the max. payload size you were asking about) will be more than 125 kB.
Compressing the above-mentioned 125 kB bit array with e.g. zlib will yield an acceptably compact encoding for small result sets. Moreover, zlib has a function deflateBound() that, given the uncompressed size, will calculate the max payload size (which, in your case, will definitely be larger than 125 kB, but not by much).
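As an illustration of the bit-array mapping, here is a small sketch in C. The helper names and the bit order within each byte (least significant bit first) are assumptions for this example, not something the question prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_ROWS     1000000u               /* rows in the dataset */
#define BITMAP_BYTES ((NUM_ROWS + 7u) / 8u) /* 125000 bytes */

/* Hypothetical helper: set bit r for every matched row r. */
static void rows_to_bitmap(const uint32_t *rows, size_t count,
                           uint8_t bitmap[BITMAP_BYTES])
{
    memset(bitmap, 0, BITMAP_BYTES);
    for (size_t i = 0; i < count; i++) {
        assert(rows[i] < NUM_ROWS);
        bitmap[rows[i] / 8u] |= (uint8_t)(1u << (rows[i] % 8u));
    }
}

/* Hypothetical helper: test whether row r is marked as a match. */
static int bitmap_has_row(const uint8_t bitmap[BITMAP_BYTES], uint32_t r)
{
    return (bitmap[r / 8u] >> (r % 8u)) & 1u;
}
```

Whatever the result set looks like, the payload before any further compression is always BITMAP_BYTES long, which is what makes it a convenient worst-case bound.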
Input specification:
Row numbers between 0 and 999999 (if you need 1-indexing, you can apply offset)
Each row number appears only once
The numbers are sorted in ascending order (useful, we'd want to sort them anyway)
A great idea you've had was to invert the meaning of the result when the number of matches is more than half the possible values. Let us retain that, and assume we are given a flag and a list of matches/misses.
Your initial attempt at coding this encoded the numbers as text with comma separation. That means that for 90% of the possible values you need 6 characters + 1 separator -- so 7 bytes on average. However, since the maximum value is 999999, you really only need 20 bits to encode each entry.
Hence, the first idea for reducing the size is to use binary encoding.
Binary Encoding
The simplest approach is to write the number of values sent followed by a stream of 32-bit integers.
A more efficient approach would be to pack two 20-bit values into each 5 bytes written. In case of an odd count, you would just pad the 4 excess bits with zeros.
Those approaches may be good for small amounts of matches (or misses). However, the important thing to note is that for each row, we only need to track 1 bit of information -- whether it's present or not. That means that we can encode the results as a bitmap of 1000000 bits.
Combining those two approaches, we can use a bitmap when there are many matches or misses, and switch to binary coding when it's more efficient.
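The 5-byte packing of two 20-bit values mentioned above could look roughly like this. The function names and the big-endian bit order are arbitrary choices for the sketch; any fixed order works as long as encoder and decoder agree.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 20-bit values into 5 bytes, most significant bits first. */
static void pack20x2(uint32_t a, uint32_t b, uint8_t out[5])
{
    assert(a < (1u << 20) && b < (1u << 20));
    uint64_t v = ((uint64_t)a << 20) | b;      /* 40 significant bits */
    for (int i = 0; i < 5; i++)
        out[i] = (uint8_t)(v >> (8 * (4 - i)));
}

/* Inverse of pack20x2. */
static void unpack20x2(const uint8_t in[5], uint32_t *a, uint32_t *b)
{
    uint64_t v = 0;
    for (int i = 0; i < 5; i++)
        v = (v << 8) | in[i];
    *a = (uint32_t)(v >> 20);
    *b = (uint32_t)(v & 0xFFFFFu);
}
```

At 2.5 bytes per entry this beats the 125,000-byte bitmap only while there are fewer than 50,000 entries, which gives one possible threshold for switching between the two encodings.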
Range Reduction
The next potential improvement to use when coding sorted sequences of integers is to use range reduction.
The idea is to code the values from largest to smallest, reducing the number of bits per value as they get smaller.
1. First, we encode the number of bits N necessary to represent the first value.
2. We encode the first value using N bits.
3. For each following value:
   - Encode the value using N bits.
   - If the value requires fewer bits to encode, reduce N appropriately.
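To get a feel for what the steps above cost, the following sketch only counts the bits such an encoding would use; it does not write a bit stream. The 5-bit header for the initial N and the rule that the value 0 still costs one bit are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed to represent v (at least 1, so 0 still costs a bit). */
static unsigned bits_needed(uint32_t v)
{
    unsigned n = 1;
    while (v >>= 1)
        n++;
    return n;
}

/* Total encoded size in bits for a list sorted in ascending order.
 * Assumes a 5-bit header carrying the initial width N (hypothetical choice). */
static size_t range_reduced_bits(const uint32_t *sorted, size_t count)
{
    if (count == 0)
        return 0;
    unsigned n = bits_needed(sorted[count - 1]);  /* width of the largest value */
    size_t total = 5;                             /* header: initial N */
    for (size_t i = count; i-- > 0; ) {
        assert(i == 0 || sorted[i - 1] < sorted[i]);
        total += n;                   /* write the value using N bits */
        n = bits_needed(sorted[i]);   /* later values are smaller, so N may shrink */
    }
    return total;
}
```

For the list [1, 2, 3] this gives 5 + 2 + 2 + 2 = 11 bits. The saving is small when most values are close to the maximum, since N then stays near 20 for most of the list.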
Entropy Coding
Let's go back to the bitmap encoding. Based on Shannon entropy, the worst case is where we have 50% matches. The further the probabilities are skewed, the fewer bits we need on average to code each entry.
Matches | Bits
0 | 0
1 | 22
2 | 41
3 | 60
4 | 78
5 | 96
10 | 181
100 | 1474
1000 | 11408
10000 | 80794
100000 | 468996
250000 | 811279
500000 | 1000000
To do this, we need to use an entropy coder that can code fractional bits -- something like arithmetic or range coder or some of the new ANS based coders like FSE. Alternatively, we could group symbols together and use Huffman coding.
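The figures in the table above appear to be consistent with the binary entropy function applied to all N = 1,000,000 bitmap entries and rounded up; this is my reading of the numbers, not something stated explicitly. With k matches:

```latex
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p), \qquad p = \frac{k}{N},
\qquad \text{bits}(k) \approx \left\lceil N \cdot H\!\left(\tfrac{k}{N}\right) \right\rceil

% Worked example, k = 250000 (p = 0.25):
H(0.25) = 0.25 \cdot 2 + 0.75 \cdot \log_2 \tfrac{4}{3} \approx 0.811278,
\qquad N \cdot H(0.25) \approx 811278.1 \;\Rightarrow\; 811279 \text{ bits}
```

For k = 500000 the entropy is exactly 1 bit per entry, which reproduces the 1,000,000-bit worst case.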
Prototypes and Measurements
I've written a test using a 32-bit implementation of FastAC by Amir Said, which limits the model to 4 decimal places.
(This is not really a problem, since we shouldn't be feeding such data to the codec directly. This is just a demonstration.)
First some common code:
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

typedef std::vector<uint8_t> match_symbols_t;
typedef std::vector<uint32_t> match_list_t;
typedef std::set<uint32_t> match_set_t;
typedef std::vector<uint8_t> buffer_t;
// ----------------------------------------------------------------------------
static uint32_t const NUM_VALUES(1000000);
// ============================================================================
// Number of symbols needed to hold NUM_VALUES entries at `bits` entries per symbol
size_t symbol_count(uint8_t bits)
{
    size_t count(NUM_VALUES / bits);
    if (NUM_VALUES % bits > 0) {
        return count + 1;
    }
    return count;
}
// ----------------------------------------------------------------------------
void set_symbol(match_symbols_t& symbols, uint8_t bits, uint32_t match, bool state)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    if (state) {
        symbols[index] |= 1 << offset;
    } else {
        symbols[index] &= ~(1 << offset);
    }
}
// ----------------------------------------------------------------------------
bool get_symbol(match_symbols_t const& symbols, uint8_t bits, uint32_t match)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    return (symbols[index] & (1 << offset)) != 0;
}
// ----------------------------------------------------------------------------
// Convert a list of matching row numbers into a bitmap of symbols
match_symbols_t make_symbols(match_list_t const& matches, uint8_t bits)
{
    assert((bits > 0) && (bits <= 8));
    match_symbols_t symbols(symbol_count(bits), 0);
    for (auto match : matches) {
        set_symbol(symbols, bits, match, true);
    }
    return symbols;
}
// ----------------------------------------------------------------------------
// Convert a bitmap of symbols back into a list of matching row numbers
match_list_t make_matches(match_symbols_t const& symbols, uint8_t bits)
{
    match_list_t result;
    for (uint32_t i(0); i < NUM_VALUES; ++i) {
        if (get_symbol(symbols, bits, i)) {
            result.push_back(i);
        }
    }
    return result;
}
The first, simpler variant is to write the number of matches, determine the probability of match/miss and clamp it to the supported range.
Then simply encode each value of the bitmap using this static probability model.
class arithmetic_codec_v1
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();
        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);
        if (match_count > 0) {
            // Initialize the model
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));
            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                codec.encode(entry, model);
            }
        }
        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }
    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();
        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        match_list_t result;
        if (match_count > 0) {
            // Must use the same static probability as the encoder
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                uint32_t entry = codec.decode(model);
                if (entry == 1) {
                    result.push_back(i);
                }
            }
        }
        return result;
    }
private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};
The second approach is to adapt the model based on the symbols we code.
After each match is encoded, reduce the probability of the next match.
Once all matches are coded, stop.
The second variation compresses slightly better, but at a noticeable performance cost.
class arithmetic_codec_v2
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        uint32_t total_count(NUM_VALUES);
        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();
        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);
        if (match_count > 0) {
            static_bit_model model;
            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                model.set_probability_0(get_probability_0(match_count, total_count));
                codec.encode(entry, model);
                // Keep in step with the decoder, which uses NUM_VALUES - i
                --total_count;
                if (entry) {
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }
        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }
    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();
        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                model.set_probability_0(get_probability_0(match_count, NUM_VALUES - i));
                if (codec.decode(model) == 1) {
                    result.push_back(i);
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }
        return result;
    }
private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};
Practical Approach
Practically, it's probably not worth designing a new compression format.
In fact, it might not even be worth writing the results as bits; just make an array of bytes with values 0 or 1.
Then use an existing compression library -- zlib is very common, or you could try lz4 or snappy, bzip2, lzma... the choices are plentiful.
ZLib Example
class zlib_codec
{
public:
    zlib_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}
    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));
        z_stream defstream;
        defstream.zalloc = nullptr;
        defstream.zfree = nullptr;
        defstream.opaque = nullptr;
        deflateInit(&defstream, Z_BEST_COMPRESSION);
        size_t max_compress_size = deflateBound(&defstream, static_cast<uLong>(symbols.size()));
        buffer_t compressed(max_compress_size);
        defstream.avail_in = static_cast<uInt>(symbols.size());
        defstream.next_in = &symbols[0];
        defstream.avail_out = static_cast<uInt>(max_compress_size);
        defstream.next_out = &compressed[0];
        deflate(&defstream, Z_FINISH);
        // Trim the buffer to the number of bytes actually produced
        compressed.resize(defstream.total_out);
        deflateEnd(&defstream);
        return compressed;
    }
    match_list_t decompress(buffer_t& compressed)
    {
        z_stream infstream;
        infstream.zalloc = nullptr;
        infstream.zfree = nullptr;
        infstream.opaque = nullptr;
        match_symbols_t symbols(symbol_count(bits_per_symbol));
        infstream.avail_in = static_cast<uInt>(compressed.size());
        infstream.next_in = &compressed[0];
        infstream.avail_out = static_cast<uInt>(symbols.size());
        infstream.next_out = &symbols[0];
        inflateInit(&infstream);
        inflate(&infstream, Z_FINISH);
        inflateEnd(&infstream);
        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};
BZip2 Example
class bzip2_codec
{
public:
    bzip2_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}
    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));
        uint32_t compressed_size = symbols.size() * 2;
        buffer_t compressed(compressed_size);
        int err = BZ2_bzBuffToBuffCompress((char*)&compressed[0]
            , &compressed_size
            , (char*)&symbols[0]
            , symbols.size()
            , 9
            , 0
            , 30);
        if (err != BZ_OK) {
            throw std::runtime_error("Compression error.");
        }
        // Trim the buffer to the number of bytes actually produced
        compressed.resize(compressed_size);
        return compressed;
    }
    match_list_t decompress(buffer_t& compressed)
    {
        match_symbols_t symbols(symbol_count(bits_per_symbol));
        uint32_t decompressed_size = symbols.size();
        int err = BZ2_bzBuffToBuffDecompress((char*)&symbols[0]
            , &decompressed_size
            , (char*)&compressed[0]
            , compressed.size()
            , 0
            , 0);
        if (err != BZ_OK) {
            throw std::runtime_error("Decompression error.");
        }
        if (decompressed_size != symbols.size()) {
            throw std::runtime_error("Size mismatch.");
        }
        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};
Code repository, including dependencies for 64bit Visual Studio 2015 is at
Storing a compressed list of sorted integers is extremely common in data retrieval and database applications, and a variety of techniques have been developed.
I'm pretty sure that an unguessably random selection of about half of the items in your list is going to be your worst case.
Many popular integer-list-compression techniques, such as Roaring bitmaps, fall back (with such worst-case input data) to using a 1-bit-per-index bitmap.
So in your case, with 1 million rows, the maximum size payload returned would be (in the worst case) a header with the "using a bitmap" flag set, followed by a bitmap of 1 million bits (125,000 bytes), where for example the 700th bit of the bitmap is set to 1 if the 700th row in the database is a match, or set to 0 if the 700th row in the database does not match. (Thanks, Dan Mašek!)
My understanding is that, while quasi-succinct Elias-Fano compression and other techniques are very useful for compressing many "naturally-occurring" sets of sorted integers, for this worst-case data set, none of them give better compression, and most of them give far worse "compression", than a simple bitmap.
(This is analogous to the way most general-purpose data compression algorithms, such as DEFLATE, when fed "worst-case" data such as indistinguishable-from-random encrypted data, create "compressed" files with a few bytes of overhead with the "stored/raw/literal" flag set, followed by a simple copy of the uncompressed file).
Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression"

using bitfields as a sorting key in modern C (C99/C11 union)

For my tiny graphics engine, I need an array of all objects to draw. For performance reasons this array needs to be sorted on the attributes. In short:
Store a lot of attributes per struct, add the struct to an array of structs
Efficiently sort the array
walk over the array and perform operations (modesetting and drawing) depending on the attributes
Approach: bitfields in a union (i.e.: let the compiler do the masking and shifting for me)
I thought I had an elegant plan to accomplish this, based on this article: The idea is as follows: each attribute is a bitfield, which can be read and written to (step 1). After writing, the sorting procedure looks at the bitfield struct as an integer, and sorts on it (step 2). Afterwards (step 3), the bitfields are read again.
Sometimes code says more than a thousand words. A high-level view:
union key {
    /* useful for accessing */
    struct {
        unsigned int some_attr : 2;
        unsigned int another_attr : 3;
        /* ... */
    } bitrep;
    /* useful for sorting */
    uint64_t intrep;
};
I would just make sure that the bit-representation was as large as the integer representation (64 bits in this case). My first approach went like this:
union key {
    /* useful for accessing */
    struct {
        /* generic part: 11 bits */
        unsigned int layer : 2;
        unsigned int viewport : 3;
        unsigned int viewportLayer : 3;
        unsigned int translucency : 2;
        unsigned int type : 1;
        /* depends on type-bit: 64 - 11 bits = 53 bits */
        union {
            struct {
                unsigned int sequence : 8;
                unsigned int id : 32;
                unsigned int padding : 13;
            } cmd;
            struct {
                unsigned int depth : 24;
                unsigned int material : 29;
            } normal;
        };
    } bitrep;
    /* useful for sorting */
    uint64_t intrep;
};
Note that in this case, there is a decision bitfield called type. Based on that, either the cmd struct or the normal struct gets filled in, just like in the mentioned article. However this failed horribly. With clang 3.3 on OSX 10.9 (x86 macbook pro), the key union is 16 bytes, while it should be 8.
Unable to coerce clang to pack the struct better, I took another approach based on some other stack overflow answers and the preprocessor to avoid me having to repeat myself:
/* 2 + 3 + 3 + 2 + 1 + 5 = 16 bits */
#define GENERIC_FIELDS \
    unsigned int layer : 2; \
    unsigned int viewport : 3; \
    unsigned int viewportLayer : 3; \
    unsigned int translucency : 2; \
    unsigned int type : 1; \
    unsigned int : 5;
/* 8 + 32 + 8 = 48 bits */
#define COMMAND_FIELDS \
    unsigned int sequence : 8; \
    unsigned int id : 32; \
    unsigned int : 8;
/* 24 + 24 = 48 bits */
#define MODEL_FIELDS \
    unsigned int depth : 24; \
    unsigned int material : 24;
struct generic {
    /* 16 bits */
    GENERIC_FIELDS
};
struct command {
    /* 16 bits */
    GENERIC_FIELDS
    /* 48 bits */
    COMMAND_FIELDS
} __attribute__((packed));
struct model {
    /* 16 bits */
    GENERIC_FIELDS
    /* 48 bits */
    MODEL_FIELDS
} __attribute__((packed));
union alkey {
    struct generic gen;
    struct command cmd;
    struct model mod;
    uint64_t intrep;
};
Without including the __attribute__((packed)), the command and model structs are 12 bytes. But with the __attribute__((packed)), they are 8 bytes, exactly what I wanted! So it would seem that I have found my solution. However, my small experience with bitfields has taught me to be leery. Which is why I have a few questions:
My questions are:
Can I get this to be cleaner (i.e.: more like my first big union-within-struct-within-union) and still keep it 8 bytes for the key, for fast sorting?
Is there a better way to accomplish this?
Is this safe? Will this fail on x86/ARM? (really exotic architectures are not much of a concern, I'm targeting the 2 most prevalent ones). What about setting a bitfield and then finding out that an adjacent one has already been written to. Will different compilers vary wildly on this?
What issues can I expect from different compilers? Currently I'm just aiming for clang 3.3+ and gcc 4.9+ with -std=c11. However it would be quite nice if I could use MSVC as well in the future.
Related question and webpages I've looked up:
Variable-sized bitfields with aliasing
Unions within unions
for those (like me) scratching their heads about what happens with bitfields that are not byte-aligned and endianness, look no further:
Sadly no answer got me the entirety of the way there.
EDIT: While experimenting, setting some values and reading the integer representation. I noticed something that I had forgotten about: endianness. This opens up another can of worms. Is it even possible to do what I want using bitfields or will I have to go for bitshifting operations?
The layout of bitfields is highly implementation (= compiler) dependent. In essence, compilers are free to place consecutive bitfields in the same byte/word if they see fit, or not. Thus without extensions like the packed attribute that you mention, you can never be sure that your bitfields are squeezed into one word.
Then, if the bitfields are not squeezed into one word, or if you have just some spare bits that you don't use, you may be even more in trouble. These so-called padding bits can have arbitrary values, thus your sorting idea could never work in a portable setting.
For all these reasons, bitfields are relatively rarely used in real code. What you can see more often is the use of macros for the bits of your uint64_t that you need. For each of your bitfields that you have now, you'd need two macros, one to extract the bits and one to set them. Such a code then would be portable on all platforms that have a C99/C11 compiler without problems.
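A minimal sketch of that macro approach follows. The macro names, the two fields shown and their bit positions are hypothetical; the only design rule assumed here is that the field which should dominate the sort order sits in the most significant bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the most significant field sorts first. */
#define KEY_LAYER_SHIFT  62
#define KEY_LAYER_BITS    2
#define KEY_DEPTH_SHIFT  24
#define KEY_DEPTH_BITS   24

#define KEY_MASK(bits)            ((UINT64_C(1) << (bits)) - 1)
#define KEY_GET(key, shift, bits) (((key) >> (shift)) & KEY_MASK(bits))
#define KEY_SET(key, shift, bits, val) \
    (((key) & ~(KEY_MASK(bits) << (shift))) | \
     (((uint64_t)(val) & KEY_MASK(bits)) << (shift)))

static uint64_t key_set_layer(uint64_t key, unsigned layer)
{
    assert(layer <= KEY_MASK(KEY_LAYER_BITS));
    return KEY_SET(key, KEY_LAYER_SHIFT, KEY_LAYER_BITS, layer);
}

static unsigned key_get_layer(uint64_t key)
{
    return (unsigned)KEY_GET(key, KEY_LAYER_SHIFT, KEY_LAYER_BITS);
}

static uint64_t key_set_depth(uint64_t key, uint32_t depth)
{
    assert(depth <= KEY_MASK(KEY_DEPTH_BITS));
    return KEY_SET(key, KEY_DEPTH_SHIFT, KEY_DEPTH_BITS, depth);
}

static uint32_t key_get_depth(uint64_t key)
{
    return (uint32_t)KEY_GET(key, KEY_DEPTH_SHIFT, KEY_DEPTH_BITS);
}
```

Because the key is a plain uint64_t built with shifts, comparing two keys as integers gives the same order on little- and big-endian machines, and there are no padding bits to worry about.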
Minor point:
In the declaration of a union it is better to put the basic integer field first. The default initializer for a union uses the first field, so this would then ensure that your union would be initialized to all bits zero by such an initializer. The initializer of the struct would only guarantee that the individual fields are set to 0; the padding bits, if any, would be unspecified.

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    // Make sure every work-item has stored its value before anyone reads it
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId+AOffset];
    barrier(CLK_LOCAL_MEM_FENCE);
    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize>0) {
        if (localId<halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize*2)<blockSize) { // uneven block division
                if (localId==0) { // when localID==0
                    target[localId] += target[localId + (blockSize-1)];
                }
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId==0) {
        output[get_group_id(0)] = target[0];
    }
}
You can use the new work_group_reduce_add() function for sum reduction inside a single work-group if you have support for OpenCL C 2.0 features.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    // memArVal: the cl_mem buffer holding arVal (name assumed here)
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(memArVal));
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
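To check the "m" and "n" bookkeeping without a GPU, the same folding can be simulated on the CPU. This is only a sketch: fold_once is a hypothetical stand-in for one launch of foldKernel with m work-items, and the loop in fold_sum plays the role of the host code.

```c
#include <assert.h>
#include <stddef.h>

/* CPU stand-in for one launch of foldKernel over m work-items. */
static void fold_once(float *ar, size_t m, size_t offset)
{
    for (size_t gid = 0; gid < m; gid++)
        ar[gid] = ar[gid] + ar[gid + offset];
}

/* Reduce ar[0..total) in place; the sum ends up in ar[0]. */
static float fold_sum(float *ar, size_t total)
{
    assert(total > 0);
    size_t t = total;
    while (t > 1) {
        size_t m = t / 2;        /* number of additions in this pass */
        size_t n = (t + 1) / 2;  /* elements that survive the pass */
        fold_once(ar, m, n);
        t = n;
    }
    return ar[0];
}
```

With five elements the passes use (m, n) = (2, 3), (1, 2) and (1, 1); when t is odd, the middle element is simply left in place for the next pass.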

What is the return value of sched_find_first_bit if it doesn't find anything?

The kernel is 2.4.
On a side note, does anybody know a good place where I can search for that kind of information? Searching Google for function definitions is frustrating.
If you plan on spending any significant time searching through or understanding the Linux kernel, I recommend downloading a copy and using Cscope.
Using Cscope on large projects (example: the Linux kernel)
I found the following in a copy of the Linux kernel 2.4.18.
The key seems to be the comment before this last piece of code below. It appears that the return value of sched_find_first_bit is undefined if no bit is set.
From linux-2.4/include/linux/sched.h:185
/*
 * The maximum RT priority is configurable. If the resulting
 * bitmap is 160-bits , we can use a hand-coded routine which
 * is optimal. Otherwise, we fall back on a generic routine for
 * finding the first set bit from an arbitrarily-sized bitmap.
 */
#if MAX_PRIO < 160 && MAX_PRIO > 127
#define sched_find_first_bit(map) _sched_find_first_bit(map)
#else
#define sched_find_first_bit(map) find_first_bit(map, MAX_PRIO)
#endif
From linux-2.4/include/asm-i386/bitops.h:303
/**
 * find_first_bit - find the first set bit in a memory region
 * @addr: The address to start the search at
 * @size: The maximum size to search
 *
 * Returns the bit-number of the first set bit, not the number of the byte
 * containing a bit.
 */
static __inline__ int find_first_bit(void * addr, unsigned size)
{
	int d0, d1;
	int res;
	/* This looks at memory. Mark it volatile to tell gcc not to move it around */
	__asm__ __volatile__(
		"xorl %%eax,%%eax\n\t"
		"repe; scasl\n\t"
		"jz 1f\n\t"
		"leal -4(%%edi),%%edi\n\t"
		"bsfl (%%edi),%%eax\n"
		"1:\tsubl %%ebx,%%edi\n\t"
		"shll $3,%%edi\n\t"
		"addl %%edi,%%eax"
		:"=a" (res), "=&c" (d0), "=&D" (d1)
		:"1" ((size + 31) >> 5), "2" (addr), "b" (addr));
	return res;
}
From linux-2.4/include/asm-i386/bitops.h:425
/*
 * Every architecture must define this function. It's the fastest
 * way of searching a 140-bit bitmap where the first 100 bits are
 * unlikely to be set. It's guaranteed that at least one of the 140
 * bits is cleared.
 */
static inline int _sched_find_first_bit(unsigned long *b)
{
	if (unlikely(b[0]))
		return __ffs(b[0]);
	if (unlikely(b[1]))
		return __ffs(b[1]) + 32;
	if (unlikely(b[2]))
		return __ffs(b[2]) + 64;
	if (b[3])
		return __ffs(b[3]) + 96;
	return __ffs(b[4]) + 128;
}
From linux-2.4/include/asm-i386/bitops.h:409
/**
 * __ffs - find first bit in word.
 * @word: The word to search
 *
 * Undefined if no bit exists, so code should check against 0 first.
 */
static __inline__ unsigned long __ffs(unsigned long word)
{
	__asm__("bsfl %1,%0"
		:"=r" (word)
		:"rm" (word));
	return word;
}
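If you need a defined result for an empty bitmap, the check against 0 has to happen before the bit scan. Below is a portable sketch, not kernel code: the function name and the "return the total number of bits" sentinel are my own choices, and __builtin_ctzl is a GCC/Clang builtin that, like __ffs, is undefined for 0.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: index of the first set bit in map[0..nwords),
 * or nwords * WORD_BITS if no bit is set. */
static size_t first_set_bit_or_size(const unsigned long *map, size_t nwords)
{
    assert(map != NULL);
    for (size_t w = 0; w < nwords; w++) {
        if (map[w] != 0) {
            /* Safe: the word is known to be non-zero here. */
            return w * WORD_BITS + (size_t)__builtin_ctzl(map[w]);
        }
    }
    return nwords * WORD_BITS;   /* sentinel: nothing found */
}
```

A caller can then compare the result against the bitmap size to tell "no bit set" apart from a real index.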
