Is the bash function $RANDOM supposed to have a uniform distribution?

I understand that the bash function $RANDOM generates a random integer within a range, but are these numbers supposed to follow (or approximate) a discrete uniform distribution?

I just printed $RANDOM a million times, turned it into a histogram, and viewed it with gnumeric, and the graph shows a very Normal distribution!
for n in `seq 1 1000000`; do echo $RANDOM ; done > random.txt
gawk '{b=int($1/100);a[b]++};END{for (n in a) {print n","a[n]}}' random.txt > hist.csv
gnumeric hist.csv
So, if you want an approximately linear distribution, use $(( $RANDOM % $MAXIMUM )) and don't use it with $MAXIMUM larger than 16383, or 8192 to be safe. You could concatenate $RANDOM % 1000 several times if you want really large numbers, as long as you take care of leading zeros.
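If you instead want values spread over a range starting at $MINIMUM, use $(( $RANGE * $RANDOM / 32767 + $MINIMUM )). Note that scaling like this keeps the distribution flat rather than making it normal, and remember this is only integer math.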
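The modulo advice above can be checked by counting. This is only a sketch in C (the function name is made up for illustration), assuming $RANDOM yields each of the 32768 values 0..32767 equally often, as the documentation suggests:

```c
/* Number of possible $RANDOM outputs (0..32767); an assumption taken
   from the documented range, not from measuring bash itself. */
#define RANDOM_OUTPUTS 32768L

/* How many of the 32768 equally likely inputs land on `residue` after
   reduction modulo `maximum`. Assumes 0 <= residue < maximum <= 32768. */
long hits_for_residue(long maximum, long residue)
{
    long full_cycles = RANDOM_OUTPUTS / maximum;  /* every residue gets these */
    long leftover    = RANDOM_OUTPUTS % maximum;  /* the first `leftover` residues get one more */
    return full_cycles + (residue < leftover ? 1 : 0);
}
```

With a maximum of 1000, residues 0 to 767 are hit 33 times and the rest 32 times, so the low values come up about 3% more often. When the maximum divides 32768 evenly (for example 8192), every residue is hit the same number of times.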

The Bash documentation doesn't actually say so:
Each time this parameter is referenced, a random integer between 0 and 32767 is generated.
Assigning a value to this variable seeds the random number generator.
Reading that, I would certainly assume that it's intended to be linear; it wouldn't make much sense IMHO for it to be anything else.
And looking at the bash source code, the implementation of $RANDOM is indeed intended to produce a linear distribution (this is from variables.c in the bash 4.2 source):
/* The random number seed. You can change this by setting RANDOM. */
static unsigned long rseed = 1;
static int last_random_value;
static int seeded_subshell = 0;
/* A linear congruential random number generator based on the example
one in the ANSI C standard. This one isn't very good, but a more
complicated one is overkill. */
/* Returns a pseudo-random number between 0 and 32767. */
static int
brand ()
{
  /* From "Random number generators: good ones are hard to find",
     Park and Miller, Communications of the ACM, vol. 31, no. 10,
     October 1988, p. 1195. filtered through FreeBSD */
  long h, l;

  /* Can't seed with 0. */
  if (rseed == 0)
    rseed = 123459876;
  h = rseed / 127773;
  l = rseed % 127773;
  rseed = 16807 * l - 2836 * h;
#if 0
  if (rseed < 0)
    rseed += 0x7fffffff;
#endif
  return ((unsigned int)(rseed & 32767)); /* was % 32768 */
}
As the comments imply, if you want good random numbers, use something else.
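For comparison, here is a sketch of the textbook Park-Miller "minimal standard" step that the comments refer to. This is not the bash code: the function names are mine, and unlike the listing (which has the wrap-around under `#if 0`) it applies the correction when the intermediate result is not positive.

```c
#include <stdint.h>

/* One step of the Park-Miller "minimal standard" generator,
   seed' = (16807 * seed) mod (2^31 - 1), using Schrage's decomposition
   so every intermediate value fits in a signed 32-bit integer.
   Assumes 0 < seed < 2147483647. */
int32_t minstd_next(int32_t seed)
{
    int32_t hi = seed / 127773;
    int32_t lo = seed % 127773;
    int32_t next = 16807 * lo - 2836 * hi;
    if (next <= 0)
        next += 2147483647;  /* wrap back into 1..2^31-2 */
    return next;
}

/* Reduce a generator state to 0..32767 by keeping the low 15 bits,
   the same masking the quoted return statement uses. */
int low15(int32_t state)
{
    return (int)(state & 32767);
}
```

Starting from seed 1, ten thousand steps should end at 1043618065, which is the check value Park and Miller give for a correct implementation.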


Unsigned integer division rounding AVR GCC

I am struggling to understand how to divide an unsigned integer by a factor of 10 accounting for rounding like a float would round.
uint16_t val = 331 / 10; // two decimal to one decimal places 3.31 to 3.3
uint16_t val = 329 / 10; // two decimal to one decimal places 3.29 to 3.3 (not 3.2)
I would like both of these sums to round to 33 (3.3 in decimal equivalent of 1 decimal place)
I'm sure the answer is simple, if I were more knowledgeable than I am about how processors perform integer division.
Since integer division rounds the result toward zero, just add half of the divisor, whatever it is, to the dividend before dividing. I.e.:
uint16_t val = ((uint32_t)x + 5) / 10; // convert to 32-bit to avoid overflow
or in more general form:
static inline uint16_t divideandround(uint16_t dividend, uint16_t divisor) {
    return ((uint32_t)dividend + (divisor >> 1)) / divisor;
}
If you are sure there will be no 16-bit overflow (i.e. values will never exceed 65530), you can speed up the calculation by keeping the values 16-bit:
uint16_t val = (x + 5) / 10;
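To see where the 65530 limit comes from, here is a small sketch (function names are mine) that can be run on a PC. On a desktop compiler `x + 5` is computed in a wider int, so the narrow version truncates the sum explicitly to imitate a target whose int is 16 bits, as on AVR:

```c
#include <stdint.h>

/* Rounded division by 10 with the sum widened to 32 bits first. */
uint16_t div10_round_wide(uint16_t x)
{
    return (uint16_t)(((uint32_t)x + 5u) / 10u);
}

/* Same formula with the sum truncated to 16 bits, emulating 16-bit
   arithmetic: the sum wraps around once x exceeds 65530. */
uint16_t div10_round_narrow(uint16_t x)
{
    uint16_t sum = (uint16_t)(x + 5u);  /* wraps past 65535 */
    return (uint16_t)(sum / 10u);
}
```

For 331 and 329 both versions give 33. For 65535 the wide version gives 6554, while the narrow one wraps to 4 before dividing and returns 0.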
I think I have worked it out. This seems to give me the right answer; please let me know if I am actually wrong and it fails.
uint16_t val = 329;
if (val%10>=5)
    val = (val+5)/10;
else
    val = val/10;
You can do it with just one 16-bit divmod operation:
#include <stdint.h>
uint16_t udiv10_round (uint16_t n)
{
    uint16_t q = n / 10;
    uint16_t r = n % 10;
    return r >= 5 ? q + 1 : q;
}
When you are optimizing for size (-Os), avr-gcc will compute both quotient and remainder by means of one library call to __udivmodhi4.
When you are optimizing for speed (-O2), avr-gcc might avoid [1] __udivmodhi4 altogether and instead perform a 16×16=32 multiplication with a factor of 0xcccd, so that the quotient and remainder are easy to compute from the high part of that product.
[1] This happens if the MCU you are compiling for supports MUL. If MUL is not supported, avr-gcc still uses the divmod operation, as a 16×16=32 multiplication would not gain anything.
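The multiplication trick can be written out by hand. This is a sketch of the arithmetic only (function names are mine); I am not claiming avr-gcc emits exactly this sequence. It relies on 0xCCCD * 10 being 2^19 + 2, which makes the shifted product equal to n / 10 for every 16-bit n:

```c
#include <stdint.h>

/* n / 10 for any 16-bit n, via a 16x16=32 multiplication by 0xCCCD
   followed by a right shift of 19 bits. */
uint16_t udiv10_mul(uint16_t n)
{
    uint32_t product = (uint32_t)n * 0xCCCDu;
    return (uint16_t)(product >> 19);
}

/* Rounded division by 10 built on top of it: the remainder is
   recovered from the quotient with one small multiplication. */
uint16_t udiv10_round_mul(uint16_t n)
{
    uint16_t q = udiv10_mul(n);
    uint16_t r = (uint16_t)(n - q * 10u);
    return (uint16_t)(r >= 5 ? q + 1 : q);
}
```

Since there are only 65536 inputs, it is cheap to compare both functions against plain `/` for every value on a PC before trusting them on the target.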

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE); // all elements staged before folding
    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE); // finish this step before the next
    }
    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else passed to another kernel to do this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank conflicts by reading from local memory essentially sequentially. I would love confirmation on that from anyone who knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId+AOffset];
    barrier(CLK_LOCAL_MEM_FENCE); // local copy complete before folding
    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize>0) {
        if (localId<halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize*2)<blockSize) { // uneven block division
                if (localId==0) { // when localID==0
                    target[localId] += target[localId + (blockSize-1)];
                }
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE); // finish this fold before the next
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId==0) {
        output[get_group_id(0)] = target[0];
    }
}
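Since debugging on the GPU is awkward, the odd-size problem is easier to see in a plain C emulation on the CPU. This is only a sketch with made-up function names: each inner loop stands in for the work-items of one folding step, and the sizes are assumed to be at least 1.

```c
#include <stddef.h>

/* Naive halving: correct only when every intermediate size is even. */
float fold_naive(float *v, size_t size)
{
    for (size_t half = size / 2; half > 0; half >>= 1)
        for (size_t i = 0; i < half; i++)
            v[i] += v[i + half];
    return v[0];
}

/* Halving with the odd-size fix-up used in the kernel above: when the
   current block has an odd length, element 0 also absorbs the last one. */
float fold_with_fixup(float *v, size_t size)
{
    size_t block = size;
    size_t half = block / 2;
    while (half > 0) {
        for (size_t i = 0; i < half; i++)
            v[i] += v[i + half];
        if (half * 2 < block)
            v[0] += v[block - 1];
        block = half;
        half = block / 2;
    }
    return v[0];
}
```

With the six values 1 to 6, the naive version returns 12 because the element left over at size 3 is never added, while the fixed-up version returns the expected 21.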
You can use the new work_group_reduce_add() function for sum reduction inside a single work-group if you have support for OpenCL C 2.0 features.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem,;
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
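To convince yourself that the "m" and "n" fiddle really covers non-power-of-two sizes, the same loop can be emulated sequentially in plain C. This is a CPU sketch with made-up function names, not OpenCL host code; `fold_pass` stands in for one launch of foldKernel with m work-items.

```c
#include <stddef.h>

/* One pass: what m work-items of foldKernel would do with the given offset. */
void fold_pass(float *ar, size_t m, size_t offset)
{
    for (size_t gid = 0; gid < m; gid++)
        ar[gid] = ar[gid] + ar[gid + offset];
}

/* Repeats the pass until one element is left; assumes total >= 1.
   For an odd t the middle element is skipped by this pass (n = m + 1)
   and is picked up by a later pass. */
float fold_sum(float *ar, size_t total)
{
    size_t t = total;
    while (t > 1) {
        size_t m = t / 2;        /* number of additions in this pass */
        size_t n = (t + 1) / 2;  /* offset, and size after this pass */
        fold_pass(ar, m, n);
        t = n;
    }
    return ar[0];
}
```

The writes go to indices below m and the reads come from index n upward, so running the work-items one after another gives the same result as running them in parallel.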

How to interpret the field 'data' of an XImage

I am trying to understand how the data obtained from XGetImage is laid out in memory:
XImage *img = XGetImage(display, root, 0, 0, width, height, AllPlanes, ZPixmap);
Now suppose I want to decompose each pixel value into red, green, and blue channels. How can I do this in a portable way? The following is an example, but it depends on a particular configuration of the X server and does not work in every case:
for (int x = 0; x < width; x++)
    for (int y = 0; y < height; y++) {
        unsigned long pixel = XGetPixel(img, x, y);
        unsigned char blue = pixel & blue_mask;
        unsigned char green = (pixel & green_mask) >> 8;
        unsigned char red = (pixel & red_mask) >> 16;
    }
In the above example I am assuming a particular order of the RGB channels in the pixel and also that pixels have 24-bit depth: in fact, I have img->depth=24 and img->bits_per_pixel=32 (the screen is also 24-bit depth). But this is not the general case.
As a second step I want to get rid of XGetPixel and read img->data directly. The first thing I need to know is whether there is anything in Xlib that gives me all the information I need to interpret how the image is laid out starting from the img->data field, namely:
the order of R,G,B channels in each pixel;
the number of bits for each pixel;
the number of bits for each channel;
if possible, a corresponding FOURCC
The shift is a simple function of the mask:
int get_shift (unsigned long mask) {
    int shift = 0;
    while (mask) {
        if (mask & 1) break;
        shift++;
        mask >>= 1;
    }
    return shift;
}
Number of bits in each channel is just the number of 1 bits in its mask (count them). The channel order is determined by the shifts (if red shift is 0, then the first channel is R, etc).
I think the valid values for bits_per_pixel are 1, 2, 4, 8, 15, 16, 24 and 32 (15 and 16 bits are the same 2 bytes per pixel format, but the former has 1 bit unused). I don't think it's worth anyone's time to support anything but 24 and 32 bpp.
X11 is not concerned with media files, so no 4CC code.
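Putting the shift and the bit count together, a channel can be pulled out of a pixel value without hard-coding 0, 8 and 16. This is a sketch with made-up function names; it works on plain integers (no Xlib calls) and assumes each mask is a contiguous run of at most 16 bits.

```c
/* Position of the lowest set bit in mask (0 if mask is 0). */
int mask_shift(unsigned long mask)
{
    int shift = 0;
    while (mask != 0 && (mask & 1ul) == 0) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Number of 1 bits in mask, i.e. the width of the channel. */
int mask_bits(unsigned long mask)
{
    int bits = 0;
    for (; mask != 0; mask >>= 1)
        bits += (int)(mask & 1ul);
    return bits;
}

/* Extract one channel and rescale it to the range 0..255. */
unsigned char channel_8bit(unsigned long pixel, unsigned long mask)
{
    int bits = mask_bits(mask);
    if (bits == 0)
        return 0;  /* no such channel in this visual */
    unsigned long raw = (pixel & mask) >> mask_shift(mask);
    unsigned long max = (1ul << bits) - 1ul;
    return (unsigned char)(raw * 255ul / max);
}
```

With a real image you would pass the value returned by XGetPixel together with img->red_mask, img->green_mask or img->blue_mask. The rescaling step only matters for formats like 16 bpp, where a channel has 5 or 6 bits.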
This can be read from the XImage structure itself.
the order of R,G,B channels in each pixel;
This is contained in this field of the XImage structure:
int byte_order; /* data byte order, LSBFirst, MSBFirst */
which tells you whether it's RGB or BGR (because it only depends on the endianness of the machine).
the number of bits for each pixel;
can be obtained from this field:
int bits_per_pixel; /* bits per pixel (ZPixmap) */
which is basically the number of bits set in each of the channel masks:
unsigned long red_mask; /* bits in z arrangement */
unsigned long green_mask;
unsigned long blue_mask;
the number of bits for each channel;
See above, or you can use the code from @n.m.'s answer to count the bits yourself.
Yeah, it would be great if they put the bit shift constants in that structure too, but apparently they decided not to, since the pixels are aligned to bytes anyway, in "standard order" (RGB). Xlib makes sure to convert it to that order for you when it retrieves the data from the X server, even if they are stored internally in a different format server-side. So it's always in RGB format, byte-aligned, but depending on the endianness of the machine, the bytes inside an unsigned long can appear in a reverse order, hence the byte_order field to tell you about that.
So in order to extract these channels, just use the 0, 8 and 16 shifts after masking with red_mask, green_mask and blue_mask, just make sure you shift the right bytes depending on the byte_order and it should work fine.
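For the second step in the question (skipping XGetPixel), reading a pixel straight out of the data buffer could look roughly like the sketch below. The struct is a hypothetical stand-in that mirrors only the few XImage fields used here, the enum constants are my own names, and only the 32 bits-per-pixel case is handled, so treat it as an illustration rather than a replacement for XGetPixel.

```c
#include <stddef.h>

/* Hypothetical byte-order constants for this sketch. */
enum { SKETCH_LSB_FIRST = 0, SKETCH_MSB_FIRST = 1 };

/* Hypothetical stand-in for the XImage fields this sketch needs. */
struct sketch_image {
    const unsigned char *data;  /* raw pixel bytes */
    int bytes_per_line;         /* stride of one scanline in bytes */
    int bits_per_pixel;         /* only 32 is handled here */
    int byte_order;             /* SKETCH_LSB_FIRST or SKETCH_MSB_FIRST */
};

/* Assemble the 32-bit pixel value at (x, y) from its four bytes,
   honouring the image's byte order. */
unsigned long read_pixel32(const struct sketch_image *img, int x, int y)
{
    const unsigned char *p =
        img->data + (size_t)y * (size_t)img->bytes_per_line + (size_t)x * 4u;

    if (img->byte_order == SKETCH_MSB_FIRST)
        return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
               ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];

    return ((unsigned long)p[3] << 24) | ((unsigned long)p[2] << 16) |
           ((unsigned long)p[1] << 8)  |  (unsigned long)p[0];
}
```

The returned value can then be masked with red_mask, green_mask and blue_mask in the same way as a value obtained from XGetPixel.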

Need to do 64 bit multiplication on a machine with 32 bit longs

I'm working on a small embedded system that has 32 bit long ints. For one calculation I need to output the time since 1970 in ms. I can get the time in 32 bit unsigned long seconds since 1970, but how can I represent this as a 64 bit no. of ms if my biggest int is only 32 bits? I'm sure stackoverflow will have a cunning answer! I am using Dynamic C, close to standard C. I have some sample code from another system which has a 64 bit long long data type:
long long T = (long long)(SampleTime * 1000.0 + 0.5);
data.TimeLower = (unsigned int)(T & 0xffffffff);
data.TimeUpper = (unsigned short)((T >> 32) & 0xffff);
Since you are only multiplying by 1000 (seconds -> millis), you can do it with two 16 bit multiplies, one add, and a bit of bit fiddling. I have used your putative data type to store the result below:
uint32_t time32 = time();
uint32_t t1 = (time32 & 0xffff) * 1000;
uint32_t t2 = ((time32 >> 16) * 1000) + (t1 >> 16);
data.TimeLower = (uint32_t) ((t2 & 0xffff) << 16) | (t1 & 0xffff);
data.TimeUpper = (uint32_t) (t2 >> 16);
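The same steps wrapped in a function make it easy to check on a PC against a real 64-bit multiply before moving it to the target. This is a sketch; the function name and the output parameters are mine, and they stand in for the data.TimeUpper and data.TimeLower fields.

```c
#include <stdint.h>

/* seconds * 1000 as a 64-bit result split into two 32-bit halves,
   using only 32-bit arithmetic. */
void seconds_to_ms(uint32_t seconds, uint32_t *upper, uint32_t *lower)
{
    uint32_t t1 = (seconds & 0xFFFFu) * 1000u;           /* low 16 bits times 1000 */
    uint32_t t2 = (seconds >> 16) * 1000u + (t1 >> 16);  /* high 16 bits times 1000, plus carry */

    *lower = ((t2 & 0xFFFFu) << 16) | (t1 & 0xFFFFu);
    *upper = t2 >> 16;
}
```

Neither intermediate can overflow 32 bits: each 16-bit half times 1000 is below 2^26, and adding the carry keeps t2 well under 2^32. For the largest input, 0xFFFFFFFF seconds, the result is upper = 0x3E7 and lower = 0xFFFFFC18.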
The standard approach, assuming you have a 16x16->32 multiply available, would be to split both numbers into 16-bit high and low parts, compute four partial products, and add the results. If you don't have a 16x16->32 primitive which is faster than a 32x32->32 primitive, though, I'm not sure what the best approach would be. I would think that a 32x32->32 multiply should be more useful than a 16x16->32, but I can't think how one would use it.
Personally, I wish there were a standard primitive to return the top half of a NxN multiply (32x32, certainly; also 16x16 for smaller machines and 64x64 for larger ones).
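For completeness, the four-partial-product approach described above might look like this. It is a sketch with a made-up function name, written with 32-bit operations only; each product of two 16-bit halves fits in 32 bits.

```c
#include <stdint.h>

/* Full 32x32 -> 64 multiplication from four 16x16 -> 32 partial products. */
void mul_32x32_to_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;  /* weight 2^0  */
    uint32_t p1 = a_lo * b_hi;  /* weight 2^16 */
    uint32_t p2 = a_hi * b_lo;  /* weight 2^16 */
    uint32_t p3 = a_hi * b_hi;  /* weight 2^32 */

    /* Bits 16..31 of the result plus the carry out of them.
       Three 16-bit quantities are added, so this cannot overflow. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);

    *lo = (mid << 16) | (p0 & 0xFFFFu);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}
```

For the millisecond conversion one operand is the constant 1000, whose high half is zero, so two of the four products vanish; that is why the shorter code in the earlier answer is enough for this particular case.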
It might be helpful if you were more specific about what kinds of calculations you need to do. 64-bit multiplication implemented with 32-bit operations is quite slow, and you may have the additional overhead of 64-bit division (to convert back to seconds and milliseconds), which is even slower.
Without knowing more about what exactly you need to do, it seems to me that it would be more efficient to use a struct, containing a 32-bit unsigned int for the number of seconds and a 16-bit int for the number of milliseconds (the "remainder"). (Or use a 32-bit int for the milliseconds if 64-bit alignment is more important than saving a couple of bytes.)

In Perl module Proc::ProcessTable, why does pctcpu sometimes return 'inf', 'nan', or a value greater than 100?

The Perl module Proc::ProcessTable occasionally reports the pctcpu attribute as 'inf', 'nan', or a value greater than 100. Why does it do this? And are there any guidelines on how to deal with this kind of information?
We have observed this on various platforms including Linux 2.4 running on 8 logical processors.
I would guess that 'inf' or 'nan' is the result of some impossibly large value or a divide by zero.
For values greater than 100, could this possibly mean that more than one processor was used?
And for dealing with this information, is the best practice merely marking the data point as untrustworthy and normalizing to 100%?
I do not know why that happens and I cannot stress test the module right now trying to generate such cases.
However, a principle I have followed throughout my research is not to replace data I know to be nonsense with something that looks reasonable. You basically have missing observations and you should treat them as such. I would not attach a numerical value at all, so as not to pretend I have information when in fact I do not.
Then, your statistics for the non-missing points will be meaningful and you can look at any patterns in the missing observations separately.
UPDATE: Looking at the calc_prec() function in the source code:
/* calc_prec()
 * calculate the two cpu/memory precentage values
 */
static void calc_prec(char *format_str, struct procstat *prs, struct obstack *mem_pool)
{
    float pctcpu = 100.0f * (prs->utime / 1e6) / (time(NULL) - prs->start_time);
    /* calculate pctcpu - NOTE: This assumes the cpu time is in microsecond units! */
    sprintf(prs->pctcpu, "%3.2f", pctcpu);
    field_enable(format_str, F_PCTCPU);
    /* calculate pctmem */
    if (system_memory > 0) {
        sprintf(prs->pctmem, "%3.2f", (float) prs->rss / system_memory * 100.f);
        field_enable(format_str, F_PCTMEM);
    }
}
First, IMHO, it would be better to just divide by 1e4 rather than multiplying by 100.0f after the division. Second, it is possible (if polled immediately after process start) for the time delta to be 0. Third, I would have just done the whole thing in double.
As an aside, this function looks like a good example of why you should not have comments in code.
#include <stdio.h>
#include <time.h>
volatile float calc_percent(
    unsigned long utime,
    time_t now,
    time_t start
) {
    return 100.0f * ( utime / 1e6) / (now - start);
}
int main(void) {
    printf("%3.2f\n", calc_percent(1e6, time(NULL), time(NULL)));
    printf("%3.2f\n", calc_percent(0, time(NULL), time(NULL)));
    return 0;
}
This outputs inf in the first case and nan in the second case when compiled with Cygwin gcc-4 on Windows. I do not know if this behavior is standard or just what happens with this particular combination of OS+compiler.
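If you control the calculation yourself, one way to follow the "treat it as missing" principle is to refuse to produce a number when no time has elapsed. This is a sketch with a made-up function name, not a patch for the module; it assumes, like the comment in the source, that utime is in microseconds.

```c
#include <time.h>

/* Returns 1 and stores the percentage when it is well defined.
   Returns 0 (a missing observation) and leaves *percent untouched
   when no time has elapsed or the clock went backwards. */
int cpu_percent(unsigned long utime_us, time_t now, time_t start, double *percent)
{
    double elapsed = difftime(now, start);  /* seconds */
    if (elapsed <= 0.0)
        return 0;
    /* utime_us / 1e6 * 100 collapses to a single division by 1e4. */
    *percent = (double)utime_us / 1e4 / elapsed;
    return 1;
}
```

The caller can then record the sample as missing instead of printing inf or nan. Values above 100 are not filtered here; whether those are meaningful is a separate question.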
