How to concatenate bounded-length strings in SIMD/AVX2 code

I have 32 length-1-to-4 strings stored in AVX2 uint8x32 registers, one register for each of length, byte0, byte1, byte2, byte3. I'd like to concatenate all the strings and write them densely to memory. If all the strings were equal length this would be straightforward: I'd shuffle the bytes to their target positions using pshufb and use some blend calls to mix the byte0/byte1/byte2/byte3 registers together. (Alternatively perhaps I could use vpunpck* instructions. Not yet figured out...)
However, the variable-length aspect makes this harder: where each output byte comes from is now a nontrivial function of the lengths. I can't figure out how to implement this efficiently in AVX2 code. Help?
Bottom line: I'd like a reimplementation of the following function, written in (as fast as possible) vector code rather than scalar code:
(godbolt link)
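#include <immintrin.h>

// Scalar reference. dst needs up to 3 bytes of slack past the returned length,
// because every iteration writes 4 bytes but only advances by len[i].
int concat_strings(char* dst, __m256i len_v, __m256i byte0_v, __m256i byte1_v, __m256i byte2_v, __m256i byte3_v) {
    // _mm256_store_si256 requires 32-byte-aligned destinations.
    alignas(32) char len[32];
    alignas(32) char byte0[32];
    alignas(32) char byte1[32];
    alignas(32) char byte2[32];
    alignas(32) char byte3[32];
    _mm256_store_si256(reinterpret_cast<__m256i*>(len), len_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte0), byte0_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte1), byte1_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte2), byte2_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte3), byte3_v);
    int pos = 0;
    for (int i = 0; i < 32; ++i) {
        // Bytes beyond len[i] are overwritten by the next string.
        dst[pos + 0] = byte0[i];
        dst[pos + 1] = byte1[i];
        dst[pos + 2] = byte2[i];
        dst[pos + 3] = byte3[i];
        pos += len[i];
    }
    return pos;
}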
Help?
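To make the dependence on the lengths concrete, here is a scalar sketch (not a vectorized answer) of the same concatenation written as a gather: the exclusive prefix sum of the lengths gives each string's start offset, and every output byte is then looked up from those offsets. The names start_offsets and concat_gather, and the bytes[4][32] layout, are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Exclusive prefix sum of the lengths: start[i] is where string i begins in
// the output. Returns the total number of output bytes.
int start_offsets(const uint8_t len[32], int start[32]) {
    int pos = 0;
    for (int i = 0; i < 32; ++i) {
        assert(len[i] >= 1 && len[i] <= 4);  // lengths are bounded to 1..4
        start[i] = pos;
        pos += len[i];
    }
    return pos;
}

// Gather formulation: for each output position, find the source string s and
// the byte index within it. bytes[k][i] is byte k of string i.
int concat_gather(uint8_t* dst, const uint8_t len[32], const uint8_t bytes[4][32]) {
    int start[32];
    const int total = start_offsets(len, start);
    int s = 0;
    for (int out = 0; out < total; ++out) {
        while (out >= start[s] + len[s]) ++s;  // advance to the owning string
        dst[out] = bytes[out - start[s]][s];
    }
    return total;
}
```

Unlike the scatter loop in the question, this version never writes past the returned length, at the cost of a data-dependent lookup per output byte. That lookup is the part a vectorized version would have to express as shuffles.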

Related

unsigned integer division rounding AVR GCC

I am struggling to understand how to divide an unsigned integer by 10 with rounding, the way a float would round.
uint16_t val = 331 / 10; // two decimal places to one: 3.31 to 3.3
uint16_t val = 329 / 10; // two decimal places to one: 3.29 to 3.3 (not 3.2)
I would like both of these to round to 33 (3.3 when read with one decimal place).
I'm sure the answer is simple if I were more knowledgeable about how processors perform integer division.
Since integer division truncates toward zero, just add half of the divisor before dividing, whatever the divisor is. I.e.:
uint16_t val = ((uint32_t)x + 5) / 10; // convert to 32-bit to avoid overflow
or in more general form:
static inline uint16_t divideandround(uint16_t dividend, uint16_t divisor) {
    return ((uint32_t)dividend + (divisor >> 1)) / divisor;
}
If you are sure there will be no 16-bit overflow (i.e. values will never exceed 65530), you can speed up the calculation by keeping the values 16-bit:
uint16_t val = (x + 5) / 10;
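As a rough illustration of why the 65530 limit matters, the sketch below compares the widened version with one that truncates the sum to 16 bits, which is what happens on AVR where int is 16 bits wide. The function names are made up for this example, and the explicit cast only imitates the AVR behaviour on a host with a wider int.

```cpp
#include <cassert>
#include <cstdint>

// Rounded division with the sum widened to 32 bits, so it cannot wrap.
uint16_t div_round_wide(uint16_t dividend, uint16_t divisor) {
    assert(divisor != 0);
    return (uint16_t)(((uint32_t)dividend + (divisor >> 1)) / divisor);
}

// Same formula with the sum truncated to 16 bits, imitating 16-bit int
// arithmetic; the sum wraps around once it exceeds 65535.
uint16_t div_round_narrow(uint16_t dividend, uint16_t divisor) {
    assert(divisor != 0);
    uint16_t sum = (uint16_t)(dividend + (divisor >> 1));  // may wrap
    return (uint16_t)(sum / divisor);
}
```

For 65535 and a divisor of 10, the widened version gives 6554, while the 16-bit version wraps to 4 before dividing and gives 0. Up to 65530 both agree.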
I think I have worked it out. This seems to give me the right answer; please let me know if I am wrong and it fails.
uint16_t val = 329;
if (val % 10 >= 5)
{
    val = (val + 5) / 10;
}
else
{
    val = val / 10;
}
You can do it with just one 16-bit divmod operation:
#include <stdint.h>
uint16_t udiv10_round (uint16_t n)
{
    uint16_t q = n / 10;
    uint16_t r = n % 10;
    return r >= 5 ? q + 1 : q;
}
When you are optimizing for size (-Os), avr-gcc will compute both quotient and remainder by means of one library call to __udivmodhi4.
When you are optimizing for speed (-O2), avr-gcc might avoid [1] __udivmodhi4 altogether and instead perform a 16×16=32 multiplication with a factor of 0xcccd, so that the quotient and remainder are easy to compute from the high part of that product.
[1] This happens if the MCU you are compiling for supports MUL. If MUL is not supported, avr-gcc still uses the divmod operation, as a 16×16=32 multiplication would not gain anything.
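The arithmetic behind the 0xcccd factor can be sketched in portable code. 0xcccd is 52429, which is 2^19 / 10 rounded up, so the quotient is the 32-bit product shifted right by 19 (the high 16 bits, shifted by a further 3). This only illustrates the arithmetic; it is not the code avr-gcc emits, and the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// n / 10 via one 16x16=32 multiplication by 0xCCCD = ceil(2^19 / 10).
uint16_t udiv10_mul(uint16_t n) {
    uint32_t product = (uint32_t)n * 0xCCCDu;
    return (uint16_t)(product >> 19);
}

// Rounded division by 10 built on the multiply-based quotient.
uint16_t udiv10_round_mul(uint16_t n) {
    uint16_t q = udiv10_mul(n);
    uint16_t r = (uint16_t)(n - q * 10u);  // remainder without a second division
    return r >= 5 ? (uint16_t)(q + 1) : q;
}
```

The multiply-and-shift quotient matches n / 10 for every 16-bit n, because the rounding error of the factor contributes less than 0.1 to the result over that range.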

How to sort a variable-length string array with radix sort?

I know that radix sort can sort same-length string arrays, but is it possible to do so with variable-length strings? If it is, what is the C-family code or pseudo-code to implement this?
It might not be a fast algorithm for variable-length strings, but radix sort is easy to implement, so it's useful if a sort needs to be coded quickly.
I'm not quite sure what you mean by "variable-length strings" but you can perform a binary MSB radix sort in-place so the length of the string doesn't matter since there are no intermediate buckets.
#include <stdio.h>
#include <algorithm>

static void display(const char *str, const int *data, int size)
{
    printf("%s: ", str);
    for (int v = 0; v < size; v++) {
        printf("%d ", data[v]);
    }
    printf("\n");
}

// In-place binary MSB radix sort; values are assumed to be non-negative.
static void sort(int *data, int size, int bit)
{
    if (bit < 0)    // bit 0 must still be partitioned, so stop only below it
        return;
    int b = 0;
    int e = size;
    if (size > 1) {
        // Partition: elements with the bit clear go first, bit set go last.
        while (b != e) {
            if (data[b] & (1u << bit)) {
                std::swap(data[b], data[--e]);
            }
            else {
                b++;
            }
        }
        sort(data, e, bit - 1);
        sort(data + b, size - b, bit - 1);
    }
}

int main()
{
    int data[] = { 13, 12, 22, 20, 3, 4, 14, 92, 11 };
    int size = sizeof(data) / sizeof(data[0]);
    display("Before", data, size);
    sort(data, size, sizeof(int)*8 - 2);   // start just below the sign bit
    display("After", data, size);
}
You can do a MSB-first radix sort on variable-length strings.
There are a couple non-obvious details:
Pass #N will partition (scatter) strings from the input vector into 256 partitions, according to strvec[i][N]. It then will scan the partitions in order, and put (reinsert) strings back into the input vector.
Now the slightly complicated bit...
When you reach the end of a string, it is in its final position, and should never be touched again. That splits the strings before and after it into separate RANGES. The result of each pass is a set of ranges of yet-unsorted rows.
That means that pass #N, after the first, scans the strings in each range, and stores the source range id (index) along with the string, in the partition. In the "reinsert" step, it puts the string back into its source range; and again, it generates a new set of unsorted-row ranges.
You keep the stable-sort bonus of radix sort, if you forward-scan the input ranges and then backward-scan the partitions and reinsert starting at the back of each source range.
You can also use recursion (doing a complete sort from scratch on any subrange) but the above saves on setup and is faster.
There are more details ... quicksort falls through to doing an insertion sort for tiny ranges (e.g. up to 16); radix sort benefits from doing the same.
Using multiple bytes as a partition index is possible. One approach for that is in: Radix Sort-Mischa Sandberg-2010 There are other approaches.
Sorry I can't post code; it's now proprietary.
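Since no code accompanies the description above, here is a minimal sketch of MSB-first radix sort on variable-length strings. It uses the simpler recursive variant mentioned above rather than the range-tracking scheme, and it copies strings into per-byte buckets instead of working in place. The function names are made up, and bucket 0 is reserved for strings that have already ended.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sorts strs[lo, hi), assuming all strings there share their first `depth` bytes.
static void msd_radix(std::vector<std::string>& strs,
                      std::size_t lo, std::size_t hi, std::size_t depth) {
    if (hi - lo < 2) return;
    // Bucket 0: strings that end at `depth` (already in final order).
    // Buckets 1..256: byte values 0..255 at position `depth`.
    std::vector<std::vector<std::string>> buckets(257);
    for (std::size_t i = lo; i < hi; ++i) {
        std::size_t b = depth < strs[i].size()
                            ? (unsigned char)strs[i][depth] + 1u
                            : 0u;
        buckets[b].push_back(std::move(strs[i]));
    }
    // Reinsert in bucket order, then sort each still-unsorted range one byte deeper.
    std::size_t pos = lo;
    for (std::size_t b = 0; b < buckets.size(); ++b) {
        std::size_t start = pos;
        for (std::string& s : buckets[b]) strs[pos++] = std::move(s);
        if (b != 0) msd_radix(strs, start, pos, depth + 1);
    }
}

void radix_sort_strings(std::vector<std::string>& strs) {
    msd_radix(strs, 0, strs.size(), 0);
}
```

Each level allocates 257 buckets, so this is far from the tuned approach described above. It is only meant to show how a string that has ended drops out of further passes while the rest are partitioned on the next byte.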

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data, __local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else passed to another kernel call to do this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank switching issues by reading from local memory essentially sequentially. I would love confirmation on that from anyone who knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId+AOffset];
    barrier(CLK_LOCAL_MEM_FENCE);
    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize>0) {
        if (localId<halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize*2)<blockSize) { // uneven block division
                if (localId==0) { // when localID==0
                    target[localId] += target[localId + (blockSize-1)];
                }
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId==0) {
        output[get_group_id(0)] = target[0];
    }
}
https://pastebin.com/xN4yQ28N
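Because debugging on the GPU is awkward, the folding logic can be checked on the host first. The sketch below simulates one work-group sequentially: each inner loop stands in for the work-items that run between two barriers. fold_sum mirrors the odd-size handling of the Sum kernel, and fold_sum_pow2 mirrors a plain halving loop that assumes a power-of-two size. Both function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fold with odd-size handling: when the block does not split evenly, the
// leftover last element is added into element 0.
float fold_sum(std::vector<float> target) {
    std::size_t blockSize = target.size();
    if (blockSize == 0) return 0.0f;
    std::size_t half = blockSize / 2;
    while (half > 0) {
        for (std::size_t id = 0; id < half; ++id)
            target[id] += target[id + half];
        if (half * 2 < blockSize)               // uneven block division
            target[0] += target[blockSize - 1];
        blockSize = half;
        half = blockSize / 2;
    }
    return target[0];
}

// Plain halving loop (i >>= 1); only correct when the size is a power of two.
float fold_sum_pow2(std::vector<float> v) {
    if (v.empty()) return 0.0f;
    for (std::size_t i = v.size() / 2; i > 0; i >>= 1)
        for (std::size_t id = 0; id < i; ++id)
            v[id] += v[id + i];
    return v[0];
}
```

With six elements 1..6 the plain halving loop returns 12 instead of 21, because the third element is dropped when the stride goes from 3 to 1. The version with the odd-size branch returns 21.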
If you have support for OpenCL C 2.0 features, you can use the work_group_reduce_add() function for a sum reduction inside a single work-group.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"

allocating enough memory using typedef struct object whose size varies in another typedef struct

I have defined two typedef structs, and the second has the first as an object:
typedef struct
{
int numFeatures;
float* levelNums;
} Symbol;
typedef struct
{
int numSymbols;
Symbol* symbols;
} Data_Set;
I then define numFeatures and numSymbols and allocate memory for both symbols and levelNums, then fill levelNums inside a for loop with the value of the inner loop index, just to verify it is working as expected.
Data_Set lung_cancer;
lung_cancer.numSymbols = 5;
lung_cancer.symbols = (Symbol*)malloc( lung_cancer.numSymbols * sizeof( Symbol ) );
lung_cancer.symbols->numFeatures = 3;
lung_cancer.symbols->levelNums = (float*)malloc( lung_cancer.symbols->numFeatures * sizeof( float ) );
for(int symbol = 0; symbol < lung_cancer.numSymbols; symbol++ )
    for( int feature = 0; feature < lung_cancer.symbols->numFeatures; feature++ )
        *(lung_cancer.symbols->levelNums + symbol * lung_cancer.symbols->numFeatures + feature ) = feature;
for(int symbol = 0; symbol < lung_cancer.numSymbols; symbol++ )
    for( int feature = 0; feature < lung_cancer.symbols->numFeatures; feature++ )
        cout << *(lung_cancer.symbols->levelNums + symbol * lung_cancer.symbols->numFeatures + feature ) << endl;
return 0;
When levelNums are int I get what I expect (i.e. 0,1,2,0,1,2,...), but when they are float, only the first 3 are correct and the remaining are very small or very large values, not 0,1,2 as expected. I then have two questions:
When allocating memory for symbols, how does it know how big a Symbol is, since I have not yet defined how large levelNums will be?
How do I get float values into levelNums correctly?
The reason I am doing it like this is that this is a data structure that will be sent to a GPU for GPGPU programming in CUDA, and arrays are not recognized. I can only send in a contiguous block of memory explicitly, and the typedef structs are only there for conveying/defining the memory structure of the data.
A couple of things jump out at me. For one thing, you only allocated a buffer for levelNums of the first symbol. Similarly, your inner loops always loop over the numFeatures of the first symbol.
You're doing a whole lot of dereferencing of arrays, which is fine in general, but the assignment in particular (inside the first set of loops) looks very strange. It's entirely possible I just don't understand what you're trying to do there, but I think it'd be a lot less confusing if you used some square bracket array accessors.
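A minimal sketch of what that could look like, with a buffer allocated for every symbol and square-bracket accessors throughout. The helper names make_data_set and free_data_set are made up, and giving every symbol the same numFeatures is an assumption taken from the question's test loop.

```cpp
#include <cassert>
#include <cstdlib>

struct Symbol   { int numFeatures; float* levelNums; };
struct Data_Set { int numSymbols;  Symbol* symbols;  };

// Allocates numSymbols symbols, each with its own levelNums buffer, and fills
// every buffer with 0, 1, 2, ... as in the question's test loop.
Data_Set make_data_set(int numSymbols, int numFeatures) {
    Data_Set ds;
    ds.numSymbols = numSymbols;
    ds.symbols = (Symbol*)std::malloc(numSymbols * sizeof(Symbol));
    assert(ds.symbols != nullptr);
    for (int s = 0; s < numSymbols; ++s) {
        ds.symbols[s].numFeatures = numFeatures;
        ds.symbols[s].levelNums = (float*)std::malloc(numFeatures * sizeof(float));
        assert(ds.symbols[s].levelNums != nullptr);
        for (int f = 0; f < numFeatures; ++f)
            ds.symbols[s].levelNums[f] = (float)f;
    }
    return ds;
}

// Releases every per-symbol buffer, then the symbol array itself.
void free_data_set(Data_Set& ds) {
    for (int s = 0; s < ds.numSymbols; ++s)
        std::free(ds.symbols[s].levelNums);
    std::free(ds.symbols);
    ds.symbols = nullptr;
    ds.numSymbols = 0;
}
```

sizeof(Symbol) is fixed because the struct holds only an int and a pointer, not the floats themselves, which is why the first malloc works before any levelNums size is known. If a single contiguous block is needed for the GPU, one flat float array of numSymbols * numFeatures elements, indexed as symbol * numFeatures + feature, would be an alternative layout.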

String manipulation in Linux kernel module

I am having a hard time manipulating strings while writing a module for Linux. My problem is that I have an int Array[10] with different values in it. I need to produce a string to send to the buffer in my my_read procedure. If my array is {0,1,112,20,4,0,0,0,0,0}
then my output should be:
0:(0)
1:-(1)
2:-------------------------------------------------------------------------------------------------------(112)
3:--------------------(20)
4:----(4)
5:(0)
6:(0)
7:(0)
8:(0)
9:(0)
When I try to place the above strings in char[] arrays, somehow weird characters end up there.
Here is the code:
int my_read (char *page, char **start, off_t off, int count, int *eof, void *data)
{
    int len;
    if (off > 0){
        *eof =1;
        return 0;
    }
    /* get process tree */
    int task_dep=0; /* depth of a task from INIT*/
    get_task_tree(&init_task,task_dep);
    char tmp[1024];
    char A[ProcPerDepth[0]],B[ProcPerDepth[1]],C[ProcPerDepth[2]],D[ProcPerDepth[3]],E[ProcPerDepth[4]],F[ProcPerDepth[5]],G[ProcPerDepth[6]],H[ProcPerDepth[7]],I[ProcPerDepth[8]],J[ProcPerDepth[9]];
    int i=0;
    for (i=0;i<1024;i++){ tmp[i]='\0';}
    memset(A, '\0', sizeof(A));memset(B, '\0', sizeof(B));memset(C, '\0', sizeof(C));
    memset(D, '\0', sizeof(D));memset(E, '\0', sizeof(E));memset(F, '\0', sizeof(F));
    memset(G, '\0', sizeof(G));memset(H, '\0', sizeof(H));memset(I, '\0', sizeof(I));memset(J, '\0', sizeof(J));
    printk("A:%s\nB:%s\nC:%s\nD:%s\nE:%s\nF:%s\nG:%s\nH:%s\nI:%s\nJ:%s\n",A,B,C,D,E,F,G,H,I,J);
    memset(A,'-',sizeof(A));
    memset(B,'-',sizeof(B));
    memset(C,'-',sizeof(C));
    memset(D,'-',sizeof(D));
    memset(E,'-',sizeof(E));
    memset(F,'-',sizeof(F));
    memset(G,'-',sizeof(G));
    memset(H,'-',sizeof(H));
    memset(I,'-',sizeof(I));
    memset(J,'-',sizeof(J));
    printk("A:%s\nB:%s\nC:%s\nD:%s\nE:%s\nF:%s\nG:%s\nH:%s\nI:%s\nJ:%\n",A,B,C,D,E,F,G,H,I,J);
    len = sprintf(page,"0:%s(%d)\n1:%s(%d)\n2:%s(%d)\n3:%s(%d)\n4:%s(%d)\n5:%s(%d)\n6:%s(%d)\n7:%s(%d)\n8:%s(%d)\n9:%s(%d)\n",A,ProcPerDepth[0],B,ProcPerDepth[1],C,ProcPerDepth[2],D,ProcPerDepth[3],E,ProcPerDepth[4],F,ProcPerDepth[5],G,ProcPerDepth[6],H,ProcPerDepth[7],I,ProcPerDepth[8],J,ProcPerDepth[9]);
    return len;
}
It worked out with this:
char s[500];
memset(s,'-',498);
for (i=len=0;i<10;++i){
    len+=sprintf(page+len,"%d:%.*s(%d)\n",i,ProcPerDepth[i],s,ProcPerDepth[i]);
}
I wonder if there is an easy flag to repeat a character in sprintf. Thanks.
Here are some issues:
You have entirely filled the A, B, C ... arrays with characters. Then, you pass them to an I/O routine that is expecting null-terminated strings. Because your strings are not null-terminated, printk() will keep printing whatever is in stack memory after your object until it finds a null by luck.
Multi-threaded kernels like Linux have strict and relatively small constraints regarding stack allocations. All instances in the kernel call chain must fit into a specific size or something will be overwritten. You may not get any detection of this error, just some kind of downstream crash as memory corruption leads to a panic or a wedge. Allocating large and variable arrays on a kernel stack is just not a good idea.
If you are going to write the tmp[] array and properly nul-terminate it, there is no reason to also initialize it. But if you were going to initialize it, you could do so with compiler-generated code by just saying: char tmp[1024] = { 0 }; (C99 requires that a partial initialization of an aggregate initialize the entire aggregate.) A similar observation applies to the other arrays.
How about getting rid of most of those arrays and most of that code and just doing something along the lines of:
for(i = j = 0; i < n; ++i)
    j += sprintf(page + j, "...", ...);
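One way to fill in that loop, as a user-space sketch: printf-style functions have no flag for repeating a character, but the "%.*s" precision prints at most that many characters of a string, so a single buffer of dashes covers every row. The function name format_histogram, the 128-dash cap and the cap parameter are assumptions made for this example. Kernel code would call the kernel's own snprintf or scnprintf rather than the C library one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Writes one "i:----(n)" line per entry into page (capacity cap) and returns
// the number of bytes written, not counting the terminating '\0'.
int format_histogram(char* page, std::size_t cap, const int* counts, int n) {
    assert(cap > 0);
    char dashes[128];
    std::memset(dashes, '-', sizeof(dashes) - 1);
    dashes[sizeof(dashes) - 1] = '\0';  // terminated, so "%.*s" cannot overrun
    page[0] = '\0';
    std::size_t len = 0;
    for (int i = 0; i < n; ++i) {
        int bars = counts[i] < 0 ? 0 : counts[i];  // a negative precision would print everything
        int written = std::snprintf(page + len, cap - len, "%d:%.*s(%d)\n",
                                    i, bars, dashes, counts[i]);
        if (written < 0) break;                      // encoding error
        if ((std::size_t)written >= cap - len) {     // output was truncated
            len = cap - 1;
            break;
        }
        len += (std::size_t)written;
    }
    return (int)len;
}
```

Rows with more than 127 entries are capped at 127 dashes here. The single small fixed buffer also avoids the large variable-length stack arrays criticized above.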

Resources