Estimating max payload size on a compressed list of integers - search

I have 1 million rows in an application. It makes a request to a server such as the following:
/search?q=hello
And the search returns a sorted list of integers, representing the rows that have been matched in the input dataset (which the user has in their browser). How would I go about estimating the maximum size that the payload would return? For example, to start we have:
# ~7 MB if we stored "all results" uncompressed
6888887
# ~ 3.5MB if we stored "all results" relative to 0 or ALL matches (cuts it down by two)
3444443
And then we would want to compress these integers using some sort of compression (Elias-Fano?). What would be the "worst-case" scenario for the size of 1M sorted integers? And how would that calculation be made?
The application has one million rows of data, so let's say R1 --> R1000000, or if zero-indexing, range(int(1e6)). The server will respond with something like: [1,2,3], indicating that (only) rows 1, 2, and 3 were matched.

There are 2^(10^6) different sorted (duplicate-free) lists of integers < 10^6. Mapping each such list, say [0, 4, ...], to the corresponding bit array (say 10001...) yields 10^6 bits, i.e. 125 kB of information. As each bit array corresponds to a unique possible sorted list, and vice versa, this is the most compact (in the sense of: having the smallest max. size) representation.
Of course, if some results are more probable than others, there may be more efficient (in the sense of: having a smaller average size) representations.
For example, if most result sets are small, a simple run-length encoding may generally yield smaller encodings.
Inevitably, in that case the maximal size of the encoding (the max. payload size you were asking about) will be more than 125 kB.
Compressing the above-mentioned 125 kB bit array with e.g. zlib will yield an acceptably compact encoding for small result sets. Moreover, zlib has a function deflateBound() that, given the uncompressed size, will calculate the max payload size (which, in your case, will definitely be larger than 125 kB, but not by much).

Input specification:
Row numbers between 0 and 999999 (if you need 1-indexing, you can apply offset)
Each row number appears only once
The numbers are sorted in ascending order (useful, we'd want to sort them anyway)
A great idea you've had was to invert the meaning of the result when the number of matches is more than half the possible values. Let us retain that, and assume we are given a flag and a list of matches/misses.
Your initial attempt at coding this encoded the numbers as text with comma separation. That means that for 90% of the possible values you need 6 characters + 1 separator -- so 7 bytes on average. However, since the maximum value is 999999, you really only need 20 bits to encode each entry.
Hence, the first idea for reducing the size is to use binary encoding.
Binary Encoding
The simplest approach is to write the number of values sent followed by a stream of 32-bit integers.
A more efficient approach would be to pack two 20-bit values into each 5 bytes written. In case of an odd count, you would just pad the 4 excess bits with zeros.
Those approaches may be good for small amounts of matches (or misses). However, the important thing to note is that for each row, we only need to track 1 bit of information -- whether it's present or not. That means that we can encode the results as a bitmap of 1000000 bits.
Combining those two approaches, we can use a bitmap when there are many matches or misses, and switch to binary coding when it's more efficient.
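To make the trade-off between the two approaches concrete, here is a minimal sketch of the 20-bit packing and of the size comparison against the bitmap. The names (kNumRows, bitmap_bytes, packed_bytes, prefer_packed_list, pack20, unpack20) are made up for this illustration, and it assumes the number of values is transmitted separately, as in the simplest approach above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumRows = 1000000;  // row count taken from the question

// Bytes needed for a flat bitmap with one bit per row.
constexpr std::size_t bitmap_bytes(std::size_t rows = kNumRows) {
    return (rows + 7) / 8;
}

// Bytes needed for `count` values at 20 bits each, two values per 5 bytes.
constexpr std::size_t packed_bytes(std::size_t count) {
    return ((count + 1) / 2) * 5;
}

// True when the packed list is strictly smaller than the bitmap.
constexpr bool prefer_packed_list(std::size_t count, std::size_t rows = kNumRows) {
    return packed_bytes(count) < bitmap_bytes(rows);
}

// Pack values (each below 2^20) into 5-byte groups, most significant byte first.
// With an odd count, the last 20 bits of the final group are zero padding.
std::vector<std::uint8_t> pack20(const std::vector<std::uint32_t>& values) {
    std::vector<std::uint8_t> out;
    out.reserve(packed_bytes(values.size()));
    for (std::size_t i = 0; i < values.size(); i += 2) {
        const std::uint64_t a = values[i];
        const std::uint64_t b = (i + 1 < values.size()) ? values[i + 1] : 0;
        assert(a < (1u << 20) && b < (1u << 20));
        const std::uint64_t group = (a << 20) | b;  // 40 bits
        for (int shift = 32; shift >= 0; shift -= 8) {
            out.push_back(static_cast<std::uint8_t>(group >> shift));
        }
    }
    return out;
}

// Inverse of pack20; `count` tells the decoder whether the last slot is padding.
std::vector<std::uint32_t> unpack20(const std::vector<std::uint8_t>& bytes, std::size_t count) {
    assert(bytes.size() == packed_bytes(count));
    std::vector<std::uint32_t> values;
    values.reserve(count);
    for (std::size_t offset = 0; offset < bytes.size(); offset += 5) {
        std::uint64_t group = 0;
        for (std::size_t k = 0; k < 5; ++k) {
            group = (group << 8) | bytes[offset + k];
        }
        values.push_back(static_cast<std::uint32_t>(group >> 20));
        if (values.size() < count) {
            values.push_back(static_cast<std::uint32_t>(group & 0xFFFFF));
        }
    }
    return values;
}
```

Under these assumptions the packed list only wins while it stays below the 125,000-byte bitmap, which is roughly up to 50,000 matches (or misses, when the inverted flag is used).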
Range Reduction
The next potential improvement to use when coding sorted sequences of integers is to use range reduction.
The idea is to code the values from largest to smallest, and reduce the number of bits per value as they get smaller.
First, we encode the number of bits N necessary to represent the first (largest) value.
We encode the first value using N bits.
For each following value:
    Encode the value using N bits.
    If the value requires fewer bits to encode, reduce N appropriately.
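As a rough way to estimate what this scheme costs, the steps above can be sketched as a bit counter. This is only an illustration under assumptions that are not in the original answer: N is sent in a fixed 5-bit header, the number of values is transmitted separately, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to represent v (0 needs 0 bits).
inline unsigned bit_width_u32(std::uint32_t v) {
    unsigned n = 0;
    while (v != 0) {
        ++n;
        v >>= 1;
    }
    return n;
}

// Total bits used when a strictly ascending list is coded from its largest
// value down to its smallest, shrinking the width N as the values shrink.
std::size_t range_reduced_bits(const std::vector<std::uint32_t>& ascending) {
    const std::size_t header_bits = 5;  // assumed fixed-size field holding N
    if (ascending.empty()) {
        return header_bits;
    }
    std::size_t total = header_bits;
    unsigned n = bit_width_u32(ascending.back());  // width of the largest value
    for (std::size_t i = ascending.size(); i-- > 0;) {
        const std::uint32_t v = ascending[i];
        assert(bit_width_u32(v) <= n);  // holds because the values only decrease
        total += n;                     // this value is written with N bits
        n = bit_width_u32(v);           // every later value is smaller, so it fits
    }
    return total;
}
```

For example, the list [1, 2, 3] costs 5 + 2 + 2 + 2 = 11 bits under these assumptions, while a single value 999999 costs 5 + 20 = 25 bits. The saving over a fixed 20 bits per value only becomes noticeable when many of the values are small.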
Entropy Coding
Let's go back to the bitmap encoding. According to Shannon's entropy theory, the worst case is where we have 50% matches. The further the probabilities are skewed, the fewer bits we need on average to code each entry.
Matches |    Bits
--------+--------
      0 |       0
      1 |      22
      2 |      41
      3 |      60
      4 |      78
      5 |      96
     10 |     181
    100 |    1474
   1000 |   11408
  10000 |   80794
 100000 |  468996
 250000 |  811279
 500000 | 1000000
To achieve these sizes, we need an entropy coder that can code fractional bits: something like an arithmetic or range coder, or one of the newer ANS-based coders such as FSE. Alternatively, we could group symbols together and use Huffman coding.
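The figures in the table are consistent with the binary entropy of the match probability, multiplied by the number of rows and rounded up. A small sketch of that calculation follows; the function name is hypothetical, and it assumes a static model in which every row matches independently with probability matches/total.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Estimated size, in whole bits, of an ideally entropy-coded bitmap with
// `matches` ones among `total` entries: ceil(total * H(matches / total)).
std::uint64_t entropy_bound_bits(std::uint64_t matches, std::uint64_t total = 1000000) {
    assert(total > 0 && matches <= total);
    if (matches == 0 || matches == total) {
        return 0;  // the outcome is certain, so no bits are needed
    }
    const double p = static_cast<double>(matches) / static_cast<double>(total);
    const double h = -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    return static_cast<std::uint64_t>(std::ceil(h * static_cast<double>(total)));
}
```

The estimate is symmetric around 50% matches, where the entropy reaches 1 bit per row and the result is no better than the plain bitmap.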
Prototypes and Measurements
I've written a test using a 32-bit implementation of FastAC by Amir Said, which limits the model to 4 decimal places.
(This is not really a problem, since we shouldn't be feeding such data to the codec directly. This is just a demonstration.)
First some common code:
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <stdexcept>
#include <vector>

typedef std::vector<uint8_t> match_symbols_t;
typedef std::vector<uint32_t> match_list_t;
typedef std::set<uint32_t> match_set_t;
typedef std::vector<uint8_t> buffer_t;
// ----------------------------------------------------------------------------
static uint32_t const NUM_VALUES(1000000);
// ============================================================================
// Number of symbols needed when each symbol carries `bits` bitmap entries
size_t symbol_count(uint8_t bits)
{
    size_t count(NUM_VALUES / bits);
    if (NUM_VALUES % bits > 0) {
        return count + 1;
    }
    return count;
}
// ----------------------------------------------------------------------------
void set_symbol(match_symbols_t& symbols, uint8_t bits, uint32_t match, bool state)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    if (state) {
        symbols[index] |= 1 << offset;
    } else {
        symbols[index] &= ~(1 << offset);
    }
}
// ----------------------------------------------------------------------------
bool get_symbol(match_symbols_t const& symbols, uint8_t bits, uint32_t match)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    return (symbols[index] & (1 << offset)) != 0;
}
// ----------------------------------------------------------------------------
// Convert a list of matching row numbers to a bitmap
match_symbols_t make_symbols(match_list_t const& matches, uint8_t bits)
{
    assert((bits > 0) && (bits <= 8));
    match_symbols_t symbols(symbol_count(bits), 0);
    for (auto match : matches) {
        set_symbol(symbols, bits, match, true);
    }
    return symbols;
}
// ----------------------------------------------------------------------------
// Convert a bitmap back to a sorted list of matching row numbers
match_list_t make_matches(match_symbols_t const& symbols, uint8_t bits)
{
    match_list_t result;
    for (uint32_t i(0); i < NUM_VALUES; ++i) {
        if (get_symbol(symbols, bits, i)) {
            result.push_back(i);
        }
    }
    return result;
}
The first, simpler variant is to write the number of matches, determine the probability of match/miss and clamp it to the supported range.
Then simply encode each value of the bitmap using this static probability model.
class arithmetic_codec_v1
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();
        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);
        if (match_count > 0) {
            // Initialize the model
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));
            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                codec.encode(entry, model);
            }
        }
        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }
    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();
        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));
            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                uint32_t entry = codec.decode(model);
                if (entry == 1) {
                    result.push_back(i);
                }
            }
        }
        codec.stop_decoder();
        return result;
    }
private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};
The second approach is to adapt the model based on the symbols we code.
After each match is encoded, reduce the probability of the next match.
Once all matches are coded, stop.
The second variation compresses slightly better, but at a noticeable performance cost.
class arithmetic_codec_v2
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        uint32_t total_count(NUM_VALUES);
        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();
        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);
        if (match_count > 0) {
            static_bit_model model;
            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                model.set_probability_0(get_probability_0(match_count, total_count));
                codec.encode(entry, model);
                --total_count;
                if (entry) {
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }
        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }
    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();
        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        uint32_t total_count(NUM_VALUES);
        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                model.set_probability_0(get_probability_0(match_count, NUM_VALUES - i));
                if (codec.decode(model) == 1) {
                    result.push_back(i);
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }
        codec.stop_decoder();
        return result;
    }
private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};
Practical Approach
Practically, it's probably not worth designing a new compression format.
In fact, it might not even be worth writing the results as bits; just make an array of bytes with values 0 or 1.
Then use an existing compression library: zlib is very common, or you could try lz4, snappy, bzip2, lzma... the choices are plentiful.
ZLib Example
class zlib_codec
{
public:
    zlib_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}
    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));
        z_stream defstream;
        defstream.zalloc = nullptr;
        defstream.zfree = nullptr;
        defstream.opaque = nullptr;
        deflateInit(&defstream, Z_BEST_COMPRESSION);
        // Upper bound on the compressed size for this input length
        size_t max_compress_size = deflateBound(&defstream, static_cast<uLong>(symbols.size()));
        buffer_t compressed(max_compress_size);
        defstream.avail_in = static_cast<uInt>(symbols.size());
        defstream.next_in = &symbols[0];
        defstream.avail_out = static_cast<uInt>(max_compress_size);
        defstream.next_out = &compressed[0];
        deflate(&defstream, Z_FINISH);
        deflateEnd(&defstream);
        compressed.resize(defstream.total_out);
        return compressed;
    }
    match_list_t decompress(buffer_t& compressed)
    {
        z_stream infstream;
        infstream.zalloc = nullptr;
        infstream.zfree = nullptr;
        infstream.opaque = nullptr;
        inflateInit(&infstream);
        match_symbols_t symbols(symbol_count(bits_per_symbol));
        infstream.avail_in = static_cast<uInt>(compressed.size());
        infstream.next_in = &compressed[0];
        infstream.avail_out = static_cast<uInt>(symbols.size());
        infstream.next_out = &symbols[0];
        inflate(&infstream, Z_FINISH);
        inflateEnd(&infstream);
        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};
BZip2 Example
class bzip2_codec
{
public:
    bzip2_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}
    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));
        uint32_t compressed_size = symbols.size() * 2;
        buffer_t compressed(compressed_size);
        int err = BZ2_bzBuffToBuffCompress((char*)&compressed[0]
            , &compressed_size
            , (char*)&symbols[0]
            , symbols.size()
            , 9
            , 0
            , 30);
        if (err != BZ_OK) {
            throw std::runtime_error("Compression error.");
        }
        compressed.resize(compressed_size);
        return compressed;
    }
    match_list_t decompress(buffer_t& compressed)
    {
        match_symbols_t symbols(symbol_count(bits_per_symbol));
        uint32_t decompressed_size = symbols.size();
        int err = BZ2_bzBuffToBuffDecompress((char*)&symbols[0]
            , &decompressed_size
            , (char*)&compressed[0]
            , compressed.size()
            , 0
            , 0);
        if (err != BZ_OK) {
            throw std::runtime_error("Decompression error.");
        }
        if (decompressed_size != symbols.size()) {
            throw std::runtime_error("Size mismatch.");
        }
        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};
Comparison
The code repository, including dependencies for 64-bit Visual Studio 2015, is at https://github.com/dan-masek/bounded_sorted_list_compression.git

Storing a compressed list of sorted integers is extremely common in data retrieval and database applications, and a variety of techniques have been developed.
I'm pretty sure that an unguessably random selection of about half of the items in your list is going to be your worst case.
Many popular integer-list-compression techniques, such as Roaring bitmaps, fall back to using (with such worst-case input data) a 1-bit-per-index bitmap.
So in your case, with 1 million rows, the maximum size payload returned would be (in the worst case) a header with the "using a bitmap" flag set, followed by a bitmap of 1 million bits (125,000 bytes), where for example the 700th bit of the bitmap is set to 1 if the 700th row in the database is a match, or set to 0 if the 700th row in the database does not match. (Thanks, Dan Mašek!)
My understanding is that, while quasi-succinct Elias-Fano compression and other techniques are very useful for compressing many "naturally-occurring" sets of sorted integers, for this worst-case data set, none of them give better compression, and most of them give far worse "compression", than a simple bitmap.
(This is analogous to the way most general-purpose data compression algorithms, such as DEFLATE, when fed "worst-case" data such as indistinguishable-from-random encrypted data, create "compressed" files with a few bytes of overhead with the "stored/raw/literal" flag set, followed by a simple copy of the uncompressed file).
Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression"
https://en.wikipedia.org/wiki/Bitmap_index#Compression
https://en.wikipedia.org/wiki/Inverted_index#Compression


android AudioTrack playback short array (16bit)

I have an application that plays back audio. It takes encoded audio data over RTP and decodes it to a 16-bit array. The decoded 16-bit array is converted to an 8-bit array (byte array), as this is required for some other functionality.
Even though audio playback is working, it breaks up continuously and the audio output is very hard to recognise. If I listen carefully I can tell it is playing the correct audio.
I suspect this is due to the fact that I convert the 16-bit data stream into a byte array and use the write(byte[], int, int, AudioTrack.WRITE_NON_BLOCKING) method of the AudioTrack class for audio playback.
Therefore I converted the byte array back to a short array and used the write(short[], int, int, AudioTrack.WRITE_NON_BLOCKING) method to see if it could resolve the problem.
However, now there is no sound at all. In the debug output I can see the short array has data.
What could be the reason?
Here is the AudioTrack initialization:
sampleRate = AudioTrack.getNativeOutputSampleRate(AudioManager.STREAM_MUSIC);
minimumBufferSize = AudioTrack.getMinBufferSize(sampleRate, AudioFormat.CHANNEL_OUT_STEREO, AudioFormat.ENCODING_PCM_16BIT);
audioTrack = new AudioTrack(AudioManager.STREAM_MUSIC, sampleRate,
        AudioFormat.CHANNEL_OUT_STEREO,
        AudioFormat.ENCODING_PCM_16BIT,
        minimumBufferSize,
        AudioTrack.MODE_STREAM);
Here is the code that converts the short array to a byte array:
for (int i=0;i<internalBuffer.length;i++){
    bufferIndex = i*2;
    buffer[bufferIndex] = shortToByte(internalBuffer[i])[0];
    buffer[bufferIndex+1] = shortToByte(internalBuffer[i])[1];
}
Here is the method that converts byte array to short array.
public short[] getShortAudioBuffer(byte[] b){
    short audioBuffer[] = null;
    int index = 0;
    int audioSize = 0;
    ByteBuffer byteBuffer = ByteBuffer.allocate(2);
    if ((b ==null) && (b.length<2)){
        return null;
    }else{
        audioSize = (b.length - (b.length%2));
        audioBuffer = new short[audioSize/2];
    }
    if ((audioSize/2) < 2)
        return null;
    byteBuffer.order(ByteOrder.LITTLE_ENDIAN);
    for(int i=0;i<audioSize/2;i++){
        index = i*2;
        byteBuffer.put(b[index]);
        byteBuffer.put(b[index+1]);
        audioBuffer[i] = byteBuffer.getShort(0);
        byteBuffer.clear();
        System.out.print(Integer.toHexString(audioBuffer[i]) + " ");
    }
    System.out.println();
    return audioBuffer;
}
Audio is decoded using the Opus library, and the configuration is as follows:
opus_decoder_ctl(dec,OPUS_SET_APPLICATION(OPUS_APPLICATION_AUDIO));
opus_decoder_ctl(dec,OPUS_SET_SIGNAL(OPUS_SIGNAL_MUSIC));
opus_decoder_ctl(dec,OPUS_SET_FORCE_CHANNELS(OPUS_AUTO));
opus_decoder_ctl(dec,OPUS_SET_MAX_BANDWIDTH(OPUS_BANDWIDTH_FULLBAND));
opus_decoder_ctl(dec,OPUS_SET_PACKET_LOSS_PERC(0));
opus_decoder_ctl(dec,OPUS_SET_COMPLEXITY(10)); // highest complexity
opus_decoder_ctl(dec,OPUS_SET_LSB_DEPTH(16)); // 16bit = two byte samples
opus_decoder_ctl(dec,OPUS_SET_DTX(0)); // default - not using discontinuous transmission
opus_decoder_ctl(dec,OPUS_SET_VBR(1)); // use variable bit rate
opus_decoder_ctl(dec,OPUS_SET_VBR_CONSTRAINT(0)); // unconstrained
opus_decoder_ctl(dec,OPUS_SET_INBAND_FEC(0)); // no forward error correction
Let's assume you have a short[] array which contains the 16-bit, single-channel data to be played.
Then each sample is a value between -32768 and 32767 which represents the signal amplitude at that exact moment, and a value of 0 represents the middle point (no signal). This array can be passed to the audio track with the ENCODING_PCM_16BIT format.
But things get weird when ENCODING_PCM_8BIT is used (see AudioFormat).
In this case each sample is encoded by one byte, but each byte is unsigned. That means its value is between 0 and 255, with 128 representing the middle point.
Java has no unsigned byte type; byte is signed, i.e. values -128...-1 represent the actual values 128...255. So you have to be careful when converting to the byte array, otherwise the result will be noise with a barely recognizable source sound.
short[] input16 = ... // the source 16-bit audio data;
byte[] output8 = new byte[input16.length];
for (int i = 0 ; i < input16.length ; i++) {
    // To convert 16 bit signed sample to 8 bit unsigned
    // We add 128 (for rounding), then shift it right 8 positions
    // Then add 128 to be in range 0..255
    int sample = ((input16[i] + 128) >> 8) + 128;
    if (sample > 255) sample = 255; // strip out overload
    output8[i] = (byte)(sample); // cast to signed byte type
}
The reverse conversion follows the same principle: each single sample is converted to exactly one sample of the output signal.
byte[] input8 = ... // the source 8-bit unsigned audio data;
short[] output16 = new short[input8.length];
for (int i = 0 ; i < input8.length ; i++) {
    // to convert signed byte back to unsigned value just use bitwise AND with 0xFF
    // then we need subtract 128 offset
    // Then, just scale up the value by 256 to fit 16-bit range
    output16[i] = (short)(((input8[i] & 0xFF) - 128) * 256);
}
The problem of converting the byte array to a short array was resolved when I used bitwise operators instead of ByteBuffer. It could be that I was not setting the correct parameters on the ByteBuffer, or that it is not suitable for this kind of conversion.
Nevertheless, implementing the conversion using bitwise operators resolved the problem. Since the original question has been resolved by this approach, please consider this as the final answer.
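The exact code used is not shown in this answer, so the following is only a sketch of what a bitwise conversion can look like, written in C++ to match the other examples on this page. It assumes little-endian byte pairs (as in the question's ByteOrder.LITTLE_ENDIAN) and a two's-complement platform; the function names are hypothetical. The Java equivalent of the inner expression would be (short)((b[i] & 0xFF) | (b[i + 1] << 8)).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Combine little-endian byte pairs into signed 16-bit samples.
std::vector<std::int16_t> bytes_to_samples_le(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 2 == 0);  // one sample is exactly two bytes
    std::vector<std::int16_t> samples;
    samples.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Low byte first, then the high byte shifted into the upper 8 bits.
        const std::uint16_t raw = static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8));
        samples.push_back(static_cast<std::int16_t>(raw));
    }
    return samples;
}

// Split signed 16-bit samples back into little-endian byte pairs.
std::vector<std::uint8_t> samples_to_bytes_le(const std::vector<std::int16_t>& samples) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(samples.size() * 2);
    for (const std::int16_t s : samples) {
        const std::uint16_t raw = static_cast<std::uint16_t>(s);
        bytes.push_back(static_cast<std::uint8_t>(raw & 0xFF));  // low byte
        bytes.push_back(static_cast<std::uint8_t>(raw >> 8));    // high byte
    }
    return bytes;
}
```

Masking the low byte matters in Java, where byte is signed; without it, a negative low byte would sign-extend and overwrite the high byte.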
I will raise a separate topic for playback issue.
Thank you for all your support.

Write data to an SD card through a buffer without a race condition

I am writing firmware for a data logging device. It reads data from sensors at 20 Hz and writes data to an SD card. However, the time to write data to SD card is not consistent (about 200-300 ms). Thus one solution is writing data to a buffer at a consistent rate (using a timer interrupt), and have a second thread that writes data to the SD card when the buffer is full.
Here is my naive implementation:
#define N 64
char buffer[N];
int count;
ISR() {
    if (count < N) {
        char a = analogRead(A0);
        buffer[count] = a;
        count = count + 1;
    }
}
void loop() {
    if (count == N) {
        myFile.open("data.csv", FILE_WRITE);
        int i = 0;
        for (i = 0; i < N; i++) {
            myFile.print(buffer[i]);
        }
        myFile.close();
        count = 0;
    }
}
The code has the following problems:
Writing data to the SD card blocks reading when the buffer is full.
It might have race conditions.
What is the best way to solve this problem? Using a circular buffer, or double buffering? How do I ensure that a race condition does not happen?
You have rather answered your own question; you should use either double buffering or a circular buffer. Double-buffering is probably simpler to implement and appropriate for devices such as an SD card for which block-writes are generally more efficient.
Buffer length selection may need some consideration; generally you would make the buffer the same as the SD sector buffer size (typically 512 bytes), but that may not be practical, and with a sample rate as low as 20 sps, optimising SD write performance is perhaps not an issue.
Another consideration is that you need to match your sample rate to the file-system latency by selecting an appropriate buffer size. In this case the 64-sample buffer will fill in a little more than three seconds, but the block write takes only up to 300 ms, so you could use a much smaller buffer if required; 8 samples would be sufficient. Be careful, though: you may have observed a latency of 300 ms, but it may be larger when specific boundaries are crossed in the physical flash memory. I have seen significant latency on some cards at 1 Mbyte boundaries, and card performance varies significantly between sizes and manufacturers.
An adaptation of your implementation with double-buffering is below. I have reduced the buffer length to 32 samples; with double-buffering the total is unchanged at 64, but the write lag is reduced to 1.6 seconds.
// Double buffer and its management data/constants
static volatile char buffer[2][32];
static const int BUFFLEN = sizeof(buffer[0]);
static const unsigned char EMPTY = 0xff;
static volatile unsigned char inbuffer = 0;
static volatile unsigned char outbuffer = EMPTY;
ISR()
{
    static int count = 0;
    // Write to current buffer
    char a = analogRead(A0);
    buffer[inbuffer][count] = a;
    count++ ;
    // If buffer full...
    if( count >= BUFFLEN )
    {
        // Signal to loop() that data available (not EMPTY)
        outbuffer = inbuffer;
        // Toggle input buffer
        inbuffer = inbuffer == 0 ? 1 : 0;
        count = 0;
    }
}
void loop()
{
    // If buffer available...
    if( outbuffer != EMPTY )
    {
        // Write buffer
        myFile.open("data.csv", FILE_WRITE);
        for( int i = 0; i < BUFFLEN; i++)
        {
            myFile.print(buffer[outbuffer][i]);
        }
        myFile.close();
        // Set the buffer to empty
        outbuffer = EMPTY;
    }
}
Note the use of volatile and unsigned char for the shared data. It is important that data shared between concurrent execution contexts is accessed explicitly and atomically; access to an int on 8-bit AVR based Arduino requires multiple machine instructions and the interrupt may occur part way through a read/write in loop() and cause an incorrect value to be read.
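For comparison, the circular-buffer alternative mentioned at the start of this answer could be sketched as follows. This is an illustration rather than tested firmware: the RingBuffer class is hypothetical, and it assumes exactly one producer (the ISR) and one consumer (loop()) on a single-core target where one-byte loads and stores are atomic, in line with the note above about 8-bit AVR. On a multi-core host, volatile alone would not be sufficient.

```cpp
#include <cassert>
#include <cstdint>

// Single-producer/single-consumer ring buffer with one-byte indices.
// It holds at most Capacity - 1 samples; one slot stays free so that
// "full" and "empty" can be told apart.
template <std::uint8_t Capacity>
class RingBuffer {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    // Producer side only (e.g. the ISR). Returns false if the sample was dropped.
    bool push(char sample) {
        const std::uint8_t head = head_;
        const std::uint8_t next = static_cast<std::uint8_t>((head + 1) & (Capacity - 1));
        if (next == tail_) {
            return false;  // buffer full
        }
        data_[head] = sample;
        head_ = next;  // publish the slot only after it has been written
        return true;
    }

    // Consumer side only (e.g. loop()). Returns false if nothing is available.
    bool pop(char& sample) {
        const std::uint8_t tail = tail_;
        if (tail == head_) {
            return false;  // buffer empty
        }
        sample = data_[tail];
        tail_ = static_cast<std::uint8_t>((tail + 1) & (Capacity - 1));
        return true;
    }

private:
    volatile char data_[Capacity];
    volatile std::uint8_t head_ = 0;  // written by the producer only
    volatile std::uint8_t tail_ = 0;  // written by the consumer only
};
```

Because each index is written by only one side, neither side needs to disable interrupts. The price is that a slow SD write shows up as dropped samples (push returning false) rather than a blocked ISR, so the capacity still has to cover the worst-case write latency.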

How to sort a variable-length string array with radix sort?

I know that radix sort can sort same-length string arrays, but is it possible to do so with variable-length strings? If it is, what is the C-family code or pseudo-code to implement this?
It might not be a fast algorithm for variable-length strings, but it is easy to implement radix sort, so it's useful if a sort needs to be coded quickly.
I'm not quite sure what you mean by "variable-length strings" but you can perform a binary MSB radix sort in-place so the length of the string doesn't matter since there are no intermediate buckets.
#include <stdio.h>
#include <algorithm>

static void display(const char *str, const int *data, int size)
{
    printf("%s: ", str);
    for(int v=0;v<size;v++) {
        printf("%d ", data[v]);
    }
    printf("\n");
}

// In-place binary MSB radix sort. Assumes non-negative values:
// negative numbers have the top bit set and would end up after the positive ones.
static void sort(int *data, int size, int bit)
{
    // Stop only after bit 0 has been processed as well
    if (bit < 0 || size <= 1)
        return;
    int b = 0;
    int e = size;
    // Partition: values with the current bit clear first, bit set last
    while (b != e) {
        if (data[b] & (1u << bit)) {
            std::swap(data[b], data[--e]);
        }
        else {
            b++;
        }
    }
    sort(data, e, bit - 1);
    sort(data + b, size - b, bit - 1);
}

int main()
{
    int data[] = { 13, 12, 22, 20, 3, 4, 14, 92, 11 };
    int size = sizeof(data) / sizeof(data[0]);
    display("Before", data, size);
    sort(data, size, sizeof(int)*8 - 1);
    display("After", data, size);
}
You can do a MSB-first radix sort on variable-length strings.
There are a couple of non-obvious details:
Pass #N will partition (scatter) strings from the input vector into 256 partitions, according to strvec[i][N]. It then will scan the partitions in order, and put (reinsert) strings back into the input vector.
Now the slightly complicated bit...
When you reach the end of a string, it is in its final position, and should never be touched again. That splits the strings before and after it into separate RANGES. The result of each pass is a set of ranges of yet-unsorted rows.
That means that pass #N, after the first, scans the strings in each range, and stores the source range id (index) along with the string, in the partition. In the "reinsert" step, it puts the string back into its source range; and again, it generates a new set of unsorted-row ranges.
You keep the stable-sort bonus of radix sort, if you forward-scan the input ranges and then backward-scan the partitions and reinsert starting at the back of each source range.
You can also use recursion (doing a complete sort from scratch on any subrange) but the above saves on setup and is faster.
There are more details ... quicksort falls through to doing an insertion sort for tiny ranges (e.g. up to 16); radix sort benefits from doing the same.
Using multiple bytes as a partition index is possible. One approach for that is in: Radix Sort, Mischa Sandberg, 2010. There are other approaches.
Sorry I can't post code; it's now proprietary.
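As a rough illustration of the simpler recursive variant mentioned above (not the range-tracking scheme this answer describes, and not the author's code), an MSB-first radix sort on variable-length strings can be sketched like this. The function names are hypothetical. Strings that end at the current position go into a dedicated first bucket, which is what makes variable lengths work.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sort strs[lo, hi), assuming all of them share the same first `depth` bytes.
void msd_radix_sort(std::vector<std::string>& strs,
                    std::size_t lo, std::size_t hi, std::size_t depth) {
    assert(lo <= hi && hi <= strs.size());
    if (hi - lo < 2) {
        return;
    }
    // Bucket 0 holds strings that end at `depth`; bucket c + 1 holds byte value c.
    std::array<std::vector<std::string>, 257> buckets;
    for (std::size_t i = lo; i < hi; ++i) {
        const std::size_t b = depth < strs[i].size()
            ? static_cast<std::size_t>(static_cast<unsigned char>(strs[i][depth])) + 1
            : 0;
        buckets[b].push_back(std::move(strs[i]));
    }
    // Put the strings back in bucket order and recurse on each byte bucket.
    std::size_t pos = lo;
    for (std::size_t b = 0; b < buckets.size(); ++b) {
        const std::size_t start = pos;
        for (std::string& s : buckets[b]) {
            strs[pos++] = std::move(s);
        }
        if (b != 0) {  // strings in bucket 0 are finished and already in place
            msd_radix_sort(strs, start, pos, depth + 1);
        }
    }
}

void radix_sort_strings(std::vector<std::string>& strs) {
    msd_radix_sort(strs, 0, strs.size(), 0);
}
```

This version allocates 257 buckets per recursion level, which is exactly the setup cost the range-based approach avoids. Falling back to insertion sort for tiny ranges, as suggested above, would reduce that overhead.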

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
    sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length), with 1 thread computing the total sum at the end.
In pseudo code, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
    res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else passed to another kernel call that does this last step on the GPU.
This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId+AOffset];
    barrier(CLK_LOCAL_MEM_FENCE);
    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize>0) {
        if (localId<halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize*2)<blockSize) { // uneven block division
                if (localId==0) { // when localID==0
                    target[localId] += target[localId + (blockSize-1)];
                }
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId==0) {
        output[get_group_id(0)] = target[0];
    }
}
https://pastebin.com/xN4yQ28N
You can use the new work_group_reduce_add() function for sum reduction inside a single work-group if you have support for OpenCL C 2.0 features.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
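To see how the "m" and "n" bookkeeping copes with odd sizes, the same loop can be simulated on the CPU. This is only a stand-in for the kernel launches, with hypothetical function names: fold_pass plays the role of one foldKernel launch with global size m and offset n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for one launch of foldKernel: work-item gid adds element gid + n
// into element gid, for gid in [0, m).
void fold_pass(std::vector<float>& values, std::size_t m, std::size_t n) {
    assert(m + n <= values.size());
    for (std::size_t gid = 0; gid < m; ++gid) {
        values[gid] += values[gid + n];
    }
}

// Mirrors the host loop: once it finishes, values[0] holds the total.
float fold_sum(std::vector<float> values) {
    if (values.empty()) {
        return 0.0f;
    }
    std::size_t t = values.size();
    while (t > 1) {
        const std::size_t m = t / 2;        // number of additions in this pass
        const std::size_t n = (t + 1) / 2;  // offset, and size after this pass
        fold_pass(values, m, n);
        t = n;
    }
    return values[0];
}
```

With 5 elements the first pass uses m = 2 and n = 3: elements 3 and 4 are folded onto 0 and 1, while the middle element 2 is left untouched and simply takes part in the next pass. No element is ever read and written in the same pass, which is why the kernel needs no barrier.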

In bitmap.h, why does bitmap_zero need memset?

In include/linux/bitmap.h, why does bitmap_zero() use memset?
static inline void bitmap_zero(unsigned long *dst, int nbits)
{
    if (small_const_nbits(nbits))
        *dst = 0UL;
    else {
        int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
        memset(dst, 0, len);
    }
}
Is the *dst = 0UL not enough?
The definition of small_const_nbits is:
#define small_const_nbits(nbits) \
    (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
BITS_PER_LONG is normally 32 or 64, depending on what machine you're on.
So, if the number of bits is a compile-time constant that fits in a single long, you can certainly clear it in a single operation; that's the first branch of the if statement. If the bitmap is longer than 32 or 64 bits (or its size is not known at compile time), you need to set multiple words, and that's done by the memset call.
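To make the multi-word case concrete, here is a userspace sketch of the arithmetic involved. It is not the kernel code: kBitsPerLong, bits_to_longs and bitmap_zero_sketch are stand-ins written for this illustration, with bits_to_longs doing the same round-up division that BITS_TO_LONGS performs.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>

// Stand-in for BITS_PER_LONG on the machine this is compiled for.
constexpr std::size_t kBitsPerLong = sizeof(unsigned long) * CHAR_BIT;

// Number of unsigned long words needed to hold nbits bits (rounded up).
constexpr std::size_t bits_to_longs(std::size_t nbits) {
    return (nbits + kBitsPerLong - 1) / kBitsPerLong;
}

// The general path: clear every word that the bitmap occupies.
void bitmap_zero_sketch(unsigned long* dst, std::size_t nbits) {
    assert(dst != nullptr);
    std::memset(dst, 0, bits_to_longs(nbits) * sizeof(unsigned long));
}
```

A bitmap of kBitsPerLong + 1 bits already spans two words, so a single *dst = 0UL would leave the second word untouched.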
