unsigned integer division rounding AVR GCC - rounding

I am struggling to understand how to divide an unsigned integer by a factor of 10 accounting for rounding like a float would round.
uint16_t val = 331 / 10; // two decimal to one decimal places 3.31 to 3.3
uint16_t val = 329 / 10; // two decimal to one decimal places 3.29 to 3.3 (not 3.2)
I would like both of these divisions to round to 33 (the equivalent of 3.3 with one decimal place).
I'm sure the answer would be simple if I were more knowledgeable about how processors perform integer division.

Since integer division truncates the result toward zero, just add half of the divisor, whatever it is, before dividing. I.e.:
uint16_t val = ((uint32_t)x + 5) / 10; // convert to 32-bit to avoid overflow
or in more general form:
static inline uint16_t divideandround(uint16_t dividend, uint16_t divisor) {
    return ((uint32_t)dividend + (divisor >> 1)) / divisor;
}
If you are sure there will be no 16-bit overflow (i.e. the values will never exceed 65530), you can speed up the calculation by keeping the values 16 bits wide:
uint16_t val = (x + 5) / 10;
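To make the overflow remark concrete, here is a minimal sketch comparing the widened and the 16-bit-only variants. The function names are hypothetical. The narrow version casts the sum back to uint16_t so that it wraps the same way on a desktop compiler as it would on AVR, where int is only 16 bits wide.

```c
#include <stdint.h>

/* Round-to-nearest division; the sum is formed in 32 bits so it cannot wrap.
   Hypothetical helper name; d must be non-zero. */
static inline uint16_t div_round_wide(uint16_t n, uint16_t d)
{
    return (uint16_t)(((uint32_t)n + (d >> 1)) / d);
}

/* Same idea, but the sum is truncated to 16 bits before dividing.
   This models 16-bit arithmetic and wraps when n + d/2 exceeds 65535. */
static inline uint16_t div_round_narrow(uint16_t n, uint16_t d)
{
    return (uint16_t)((uint16_t)(n + (d >> 1)) / d);
}
```

For a divisor of 10, both variants agree up to 65530. Above that the narrow one wraps: 65535 + 5 becomes 4, and 4 / 10 is 0.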

I think I have worked it out. This seems to give me the right answer; please let me know if I am actually wrong and it fails.
uint16_t val = 329;
if (val % 10 >= 5)
    val = (val + 5) / 10;
else
    val = val / 10;

You can do it with just one 16-bit divmod operation:
#include <stdint.h>
uint16_t udiv10_round (uint16_t n)
{
    uint16_t q = n / 10;
    uint16_t r = n % 10;
    return r >= 5 ? q + 1 : q;
}
When you are optimizing for size (-Os), avr-gcc will compute both quotient and remainder by means of one library call to __udivmodhi4.
When you are optimizing for speed (-O2), avr-gcc might avoid [1] __udivmodhi4 altogether and instead perform a 16×16=32 multiplication with a factor of 0xcccd, so that the quotient and remainder are easy to compute from the high part of that product.
[1] This happens if the MCU you are compiling for supports MUL. If MUL is not supported, avr-gcc still uses the divmod operation, as a 16×16=32 multiplication would not gain anything.
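The multiplication trick can be sketched in portable C. This only illustrates the arithmetic and is not the exact instruction sequence avr-gcc emits; the function names are hypothetical. It relies on 0xCCCD being 2^19 / 10 rounded up, which makes the shifted product equal to n / 10 for every 16-bit n.

```c
#include <stdint.h>

/* n / 10 via multiply-and-shift: 0xCCCD = ceil(2^19 / 10), and the
   rounding error is small enough to be exact for all 16-bit inputs. */
static inline uint16_t udiv10_mul(uint16_t n)
{
    return (uint16_t)(((uint32_t)n * 0xCCCDu) >> 19);
}

/* Round-to-nearest division by 10 built on the quotient above. */
static inline uint16_t udiv10_round_mul(uint16_t n)
{
    uint16_t q = udiv10_mul(n);
    uint16_t r = (uint16_t)(n - q * 10u);  /* remainder, always 0..9 */
    return r >= 5 ? (uint16_t)(q + 1) : q;
}
```

Because the input range is only 65536 values, the identity is easy to check exhaustively against the plain / operator.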


GMP setting last digit to zero

I'm looking for the fastest way to set the last digit of a positive number l, declared as mpz_t, to zero. I didn't find a function that does this already. For example, 6531489321483 should be changed to 6531489321480.
It appears that subtraction and modulo is the superior method for zeroing out the last digit with mpz_t types. As @MarkDickinson and @MarcGlisse pointed out, the asymptotic behavior greatly favors using mpz_tdiv_r_ui (or mpz_fdiv_r_ui) over mpz_tdiv_q_ui followed by mpz_mul_ui. My original benchmarks were on relatively small numbers (25 digits). I retested on a 175-digit number and the sub_mod method was nearly 40% faster.
Test value: 1234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789
Result with div_mul: 1234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
Result with sub_mod: 1234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
time with division followed by multiplication: 6.145656
time with subtraction and modulo: 4.413998
And with a 350 digit number we see that sub_mod is around 85% faster:
Test value: 12345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789
Result with div_mul: 12345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
Result with sub_mod: 12345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
time with division followed by multiplication: 10.256122
time with subtraction and modulo: 5.522990
It should be noted that whether we use mpz_tdiv_r_ui or mpz_fdiv_r_ui, the results were almost identical.
Since the sub_mod method was only marginally slower with smaller numbers, it seems reasonable to only use this method for all cases.
It would be nice to test this on different compilers. I'm currently using clang 5.0.1.
The original benchmarks on my machine, using a 25-digit number, showed that division followed by multiplication was faster than finding the remainder via the modulo operator and subtracting.
#include <stdio.h>
#include <time.h>
#include <gmp.h>

void div_mul(mpz_t x) {
    mpz_tdiv_q_ui(x, x, 10u);
    mpz_mul_ui(x, x, 10u);
}

// Maybe this could be simpler?
void sub_mod(mpz_t x, mpz_t y) {
    // N.B. mpz_mod_ui is equivalent to mpz_fdiv_r_ui. Changed to
    // mpz_tdiv_r_ui for consistency with div_mul.
    mpz_tdiv_r_ui(y, x, 10u);
    mpz_sub(x, x, y);
}

int main() {
    mpz_t testVal;
    mpz_init(testVal);
    mpz_set_str(testVal, "1234567898765432123456789", 10);
    gmp_printf("Test value: %Zd\n", testVal);

    // x holds the working copy, y is scratch space for the remainder.
    mpz_t x;
    mpz_t y;
    mpz_init(x);
    mpz_init(y);

    mpz_set(x, testVal);
    div_mul(x);
    gmp_printf("Result with div_mul: %Zd\n", x);

    mpz_set(x, testVal);
    sub_mod(x, y);
    gmp_printf("Result with sub_mod: %Zd\n", x);

    const int limit = 100000000;

    const double checkPoint0 = (double) clock() / CLOCKS_PER_SEC;
    for (int i = 0; i < limit; ++i) {
        mpz_set(x, testVal);
        div_mul(x);
    }
    const double checkPoint1 = (double) clock() / CLOCKS_PER_SEC;
    const double time_div_mul = checkPoint1 - checkPoint0;
    printf("time with division followed by multiplication: %f\n", time_div_mul);

    const double checkPoint2 = (double) clock() / CLOCKS_PER_SEC;
    for (int i = 0; i < limit; ++i) {
        mpz_set(x, testVal);
        sub_mod(x, y);
    }
    const double checkPoint3 = (double) clock() / CLOCKS_PER_SEC;
    const double time_sub_mod = checkPoint3 - checkPoint2;
    printf("time with subtraction and modulo: %f\n", time_sub_mod);

    mpz_clear(testVal);
    mpz_clear(x);
    mpz_clear(y);
    return 0;
}
Test value: 1234567898765432123456789
Result with div_mul: 1234567898765432123456780
Result with sub_mod: 1234567898765432123456780
time with division followed by multiplication: 2.941251
time with subtraction and modulo: 3.171949
I suspect that one of the reasons the latter method is slightly slower is that two variables are needed, as compound operations on a single line are not available in the C API. If we could use gmpxx, we could write x - x % 10.
Another thought as to why the first method is faster is that div_mul involves two operations with unsigned integers, while the sub_mod method involves an operation with an unsigned integer followed by an operation with an mpz_t.
I tried to get this reproduced on ideone.com but could not get gmp.h loaded. I opted to implement a similar benchmark with type long long int just for fun. You will note the presence of volatile and that the limit is one billion instead of one hundred million as seen above. The volatile was needed to keep the for loop from being optimized away.
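The long long benchmark itself is not reproduced here. As a minimal sketch of the two formulations being compared, here they are on plain unsigned long long instead of mpz_t. The function names are hypothetical and this is not the author's benchmark code.

```c
/* Zero the last decimal digit by dividing and multiplying back. */
static inline unsigned long long zero_last_div_mul(unsigned long long x)
{
    return (x / 10u) * 10u;
}

/* Zero the last decimal digit by subtracting the remainder. */
static inline unsigned long long zero_last_sub_mod(unsigned long long x)
{
    return x - x % 10u;
}
```

For native integers, a compiler will often reduce both forms to very similar machine code, so any timing difference seen with mpz_t does not necessarily carry over.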
Wouldn't converting the number to a string and changing the last character be the fastest way?

How to concatenate bounded-length strings in SIMD/AVX2 code

I have 32 length-1-to-4 strings stored in AVX2 uint8x32 registers, one register for each of length, byte0, byte1, byte2, byte3. I'd like to concatenate all the strings and write them densely to memory. If all the strings were equal length this would be straightforward: I'd shuffle the bytes to their target positions using pshufb and use some blend calls to mix the byte0/byte1/byte2/byte3 registers together. (Alternatively perhaps I could use vpunpck* instructions. Not yet figured out...)
However, the variable-length aspect makes this harder: where each output byte comes from is now a nontrivial function of the lengths. I can't figure out how to implement this efficiently in AVX2 code. Help?
Bottom line: I'd like a reimplementation of the following function, written in (as fast as possible) vector code rather than scalar code:
(godbolt link)
int concat_strings(char* dst, __m256i len_v, __m256i byte0_v, __m256i byte1_v, __m256i byte2_v, __m256i byte3_v) {
    // The aligned stores below require 32-byte aligned buffers.
    alignas(32) char len[32];
    alignas(32) char byte0[32];
    alignas(32) char byte1[32];
    alignas(32) char byte2[32];
    alignas(32) char byte3[32];
    _mm256_store_si256(reinterpret_cast<__m256i*>(len), len_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte0), byte0_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte1), byte1_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte2), byte2_v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(byte3), byte3_v);
    int pos = 0;
    for (int i = 0; i < 32; ++i) {
        // Always write 4 bytes; the unused tail is overwritten by the next
        // string, so dst needs up to 3 bytes of slack after the final output.
        dst[pos + 0] = byte0[i];
        dst[pos + 1] = byte1[i];
        dst[pos + 2] = byte2[i];
        dst[pos + 3] = byte3[i];
        pos += len[i];
    }
    return pos;
}

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudocode, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
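__kernel void reduction_vector(__global float4* data, __local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    // Make every work-item's load visible before the reduction starts.
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        // All additions of this pass must finish before the next one reads them.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}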
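The two-stage structure (one partial sum per work-group, then a final sum on the host) can be modelled on the CPU in plain C. This is only a sequential sketch with hypothetical function names. It assumes the group size is a power of two and that the data length is a multiple of the group size, and it uses scalar floats instead of float4.

```c
#include <stddef.h>

/* Tree-reduce one work-group's slice in place, mimicking the kernel loop.
   group_size is assumed to be a power of two. */
static float reduce_group(float *scratch, size_t group_size)
{
    for (size_t i = group_size / 2; i > 0; i >>= 1) {
        /* On the device these additions run in parallel, then hit a barrier. */
        for (size_t lid = 0; lid < i; ++lid)
            scratch[lid] += scratch[lid + i];
    }
    return scratch[0];
}

/* Host-side model: one partial sum per group, then the final sum.
   n is assumed to be a multiple of group_size. Overwrites data. */
static float reduce_all(float *data, size_t n, size_t group_size)
{
    float total = 0.0f;
    for (size_t g = 0; g < n / group_size; ++g)
        total += reduce_group(data + g * group_size, group_size);
    return total;
}
```

The outer loop in reduce_all plays the role of the host code that adds up the output buffer once the kernel has finished.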
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank conflicts by reading from local memory essentially sequentially. I would love confirmation on that from anyone who knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId+AOffset];
    // Wait until the whole work-group has copied its data into local memory.
    barrier(CLK_LOCAL_MEM_FENCE);
    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize>0) {
        if (localId<halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize*2)<blockSize) { // uneven block division
                if (localId==0) { // when localID==0
                    target[localId] += target[localId + (blockSize-1)];
                }
            }
        }
        // Finish this fold before the next iteration reads the results.
        barrier(CLK_LOCAL_MEM_FENCE);
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId==0) {
        output[get_group_id(0)] = target[0];
    }
}
You can use the new work_group_reduce_add() function for a sum reduction inside a single work-group if you have support for OpenCL C 2.0 features.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops about log2(totalDataSize) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
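To see how the "m" and "n" bookkeeping copes with odd sizes, the same fold can be run sequentially in plain C. This is a CPU-side sketch with a hypothetical function name, where the inner loop stands in for one kernel launch with m work-items.

```c
#include <stddef.h>

/* Sum v[0..total) in place by repeatedly folding the top half onto the
   bottom half. Overwrites v; returns 0 for an empty array. */
static float fold_sum(float *v, size_t total)
{
    size_t t = total;
    while (t > 1) {
        size_t m = t / 2;        /* number of work-items launched */
        size_t n = (t + 1) / 2;  /* offset, and size left after this pass */
        for (size_t gid = 0; gid < m; ++gid)
            v[gid] += v[gid + n];
        t = n;
    }
    return total ? v[0] : 0.0f;
}
```

With five elements the passes use (m, n) = (2, 3), (1, 2) and (1, 1). When t is odd, the middle element is simply left in place and gets picked up by a later pass.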

very interesting behaviour using CUDA 4.2 and driver 295.41

I witnessed a very interesting behaviour when using CUDA 4.2 and driver 295.41 on Linux.
The code itself is nothing more than finding the maximum value of a random matrix and labelling the position to be 1.
#include <stdio.h>
#include <stdlib.h>
const int MAX = 8;
static __global__ void position(int* d, int len) {
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx < len)
        d[idx] = (d[idx] == MAX) ? 1 : 0;
}

int main(int argc, const char** argv) {
    int colNum = 16*512, rowNum = 1024;
    int len = rowNum * colNum;
    int* h = (int*)malloc(len*sizeof(int));
    int* d = NULL;
    cudaMalloc((void**)&d, len*sizeof(int));
    // get a random matrix
    for (int i = 0; i < len; i++) {
        h[i] = rand()%(MAX+1);
    }
    // launch kernel
    int threads = 128;
    cudaMemcpy(d, h, len*sizeof(int), cudaMemcpyHostToDevice);
    position<<<(len-1)/threads+1, threads>>>(d, len);
    cudaMemcpy(h, d, len*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);
    return 0;
}
When I set the rowNum = 1024, the code does not work at all as if the kernel has never been launched.
If rowNum = 1023, everything works fine.
This rowNum threshold is somehow tied to the block size (128 in this example): if I change the block size to 512, the behaviour changes between rowNum = 4095 and 4096.
I'm not quite sure whether this is a bug or whether I missed something.
You should always check for errors after calling CUDA functions. For example, in your code the invalid configuration argument error occurs during kernel launch.
This usually means that the grid or block dimensions are invalid.
With colNum = 16*512, rowNum = 1024 you are attempting to run 65536 blocks x 128 threads, exceeding the maximum grid dimension (which is 65535 blocks for GPUs with compute capability 1.x and 2.x, not sure about 3.x).
If you need to run more threads, you can either increase the block size (you have already tried this and it had some effect) or use a 2D/3D grid (3D is available only for devices with compute capability 2.0 or higher).
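The arithmetic behind the limit, and one way to split the launch into a 2D grid, can be sketched in plain C. The names below are hypothetical, the struct is only a stand-in for CUDA's dim3, and the 65535 limit is the per-dimension value quoted above for compute capability 1.x and 2.x.

```c
#define MAX_GRID_DIM 65535u  /* per-dimension limit quoted for CC 1.x/2.x */

typedef struct { unsigned x, y; } grid2d;  /* hypothetical stand-in for dim3 */

/* Number of blocks needed to cover len elements with `threads` per block. */
static unsigned long blocks_needed(unsigned long len, unsigned threads)
{
    return (len + threads - 1) / threads;
}

/* Use a 1D grid when it fits, otherwise spread the blocks over two dimensions. */
static grid2d choose_grid(unsigned long len, unsigned threads)
{
    unsigned long blocks = blocks_needed(len, threads);
    grid2d g;
    if (blocks <= MAX_GRID_DIM) {
        g.x = (unsigned)blocks;
        g.y = 1;
    } else {
        g.x = MAX_GRID_DIM;
        g.y = (unsigned)((blocks + MAX_GRID_DIM - 1) / MAX_GRID_DIM);
    }
    return g;
}
```

With a 2D grid the kernel would have to compute its linear block index as blockIdx.y * gridDim.x + blockIdx.x. The existing idx < len check then makes the surplus blocks in the last row harmless.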

Need to do 64 bit multiplication on a machine with 32 bit longs

I'm working on a small embedded system that has 32 bit long ints. For one calculation I need output the time since 1970 in ms. I can get the time in 32 bit unsigned long seconds since 1970, but how can I represent this as a 64 bit no. of ms if my biggest int is only 32bits? I'm sure stackoverflow will have a cunning answer! I am using Dynamic C, close to standard C. I have some sample code from another system which has a 64 bit long long data type:
long long T = (long long)(SampleTime * 1000.0 + 0.5);
data.TimeLower = (unsigned int)(T & 0xffffffff);
data.TimeUpper = (unsigned short)((T >> 32) & 0xffff);
Since you are only multiplying by 1000 (seconds -> millis), you can do it with two 16-bit multiplies, one add and a bit of bit fiddling. I have used your putative data type to store the result below:
uint32_t time32 = time();
uint32_t t1 = (time32 & 0xffff) * 1000;
uint32_t t2 = ((time32 >> 16) * 1000) + (t1 >> 16);
data.TimeLower = (uint32_t) ((t2 & 0xffff) << 16) | (t1 & 0xffff);
data.TimeUpper = (uint32_t) (t2 >> 16);
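The same steps can be wrapped in a small function. This is a minimal sketch: the function name and the output parameters are hypothetical, with the two outputs standing in for data.TimeLower and data.TimeUpper. It needs nothing wider than 32 bits, because each partial product is a 16-bit value times 1000 and therefore stays below 2^26.

```c
#include <stdint.h>

/* Seconds -> milliseconds as a 48-bit value, returned as the low 32 bits
   and the high 16 bits, using only 32-bit arithmetic. */
static void seconds_to_ms(uint32_t sec, uint32_t *lower, uint16_t *upper)
{
    uint32_t t1 = (sec & 0xffffu) * 1000u;           /* low half * 1000 */
    uint32_t t2 = (sec >> 16) * 1000u + (t1 >> 16);  /* high half * 1000 + carry */
    *lower = ((t2 & 0xffffu) << 16) | (t1 & 0xffffu);
    *upper = (uint16_t)(t2 >> 16);
}
```

On a development machine that does have a 64-bit type, the result can be compared against a plain 64-bit multiplication before moving the code to the target.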
The standard approach, assuming you have a 16x16->32 multiply available, would be to split both numbers into 16-bit high and low parts, compute four partial products, and add the results. If you don't have a 16x16->32 primitive which is faster than a 32x32->32 primitive, though, I'm not sure what the best approach would be. I would think that a 32x32->32 multiply should be more useful than a 16x16->32, but I can't think how one would use it.
Personally, I wish there were a standard primitive to return the top half of a NxN multiply (32x32, certainly; also 16x16 for smaller machines and 64x64 for larger ones).
It might be helpful if you were more specific about what kinds of calculations you need to do. 64-bit multiplication implemented with 32-bit operations is quite slow, and you may have the additional overhead of 64-bit division (to convert back to seconds and milliseconds), which is even slower.
Without knowing more about what exactly you need to do, it seems to me that it would be more efficient to use a struct, containing a 32-bit unsigned int for the number of seconds and a 16-bit int for the number of milliseconds (the "remainder"). (Or use a 32-bit int for the milliseconds if 64-bit alignment is more important than saving a couple of bytes.)
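As a rough illustration of the struct idea, here is a sketch with hypothetical type and function names. It keeps seconds and milliseconds separate and carries from one into the other with 32-bit arithmetic only.

```c
#include <stdint.h>

typedef struct {
    uint32_t seconds;  /* whole seconds since 1970 */
    uint16_t millis;   /* remainder in milliseconds, 0..999 */
} timestamp_ms;

/* Add a millisecond delta, carrying into seconds. */
static timestamp_ms ts_add_ms(timestamp_ms t, uint32_t delta_ms)
{
    uint32_t ms = t.millis + delta_ms % 1000u;  /* at most 999 + 999 */
    t.seconds += delta_ms / 1000u + ms / 1000u;
    t.millis = (uint16_t)(ms % 1000u);
    return t;
}
```

For example, adding 2250 ms to a timestamp of 10 s and 900 ms gives 13 s and 150 ms.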
