Floating point bug in OpenCL via ssh - Linux

I found a problem with floating point arithmetic in OpenCL. This is my kernel:
__kernel void MyKernel(__global const float4* _pInput, __global float4* _pOutput)
{
    int IndexOfRow      = get_global_id(0);
    int NumberOfRows    = get_global_size(0);
    int IndexOfColumn   = get_global_id(1);
    int NumberOfColumns = get_global_size(1);
    ...
    _pOutput[0] = 1.9f * 100.0f; // constant float return value
}
After the kernel execution and downloading the output buffer, the result is always 100 on different clients connected over ssh. If I execute the program locally, the result is 190. It seems that the digits after the decimal point are cut off.
The operating system is openSUSE Linux with AMD OpenCL 1.2.
What's the problem?

I just found the solution. It depends on your environment setting for LANG: it has to be en_US.UTF-8. You can check it with env | grep LANG.
That's probably a JIT compiler bug: in German locales, floating-point numbers are written with a "," instead of a ".", so with a German LANG forwarded by the ssh session the literal 1.9f is apparently truncated at the decimal point.
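If you cannot change the environment of the ssh session itself, here is a minimal sketch of a host-side workaround (ForceNumericLocale is a name I made up; the idea is simply to pin the numeric locale before the OpenCL program is built with clBuildProgram):
#include <cstdlib>    // setenv()
#include <clocale>    // setlocale()

// Hypothetical helper: call this in the host program before clBuildProgram(),
// so the runtime's JIT parses "1.9f" with '.' as the decimal separator
// even when ssh forwarded a German LANG such as de_DE.UTF-8.
static void ForceNumericLocale()
{
    setenv("LANG", "en_US.UTF-8", 1);   // 1 = overwrite an existing value
    setenv("LC_NUMERIC", "C", 1);
    setlocale(LC_NUMERIC, "C");         // also fix the in-process C locale
}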

Related

How does BPF calculate the number of CPUs for a PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out the proper way for a program that reads BPF maps to determine the number of per-CPU elements in a PERCPU_ARRAY on a given system.
Failing that, I think the second-best approach is to pick a read buffer that is "large enough." Here the problem is similar: what is that number, and is there a way to learn it at runtime?
The question comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
    int cpus = libbpf_num_possible_cpus();

    if (cpus < 0) {
        p_err("Can't get # of possible cpus: %s", strerror(-cpus));
        exit(-1);
    }
    return cpus;
}

int libbpf_num_possible_cpus(void)
{
    static const char *fcpu = "/sys/devices/system/cpu/possible";
    static int cpus;
    int err, n, i, tmp_cpus;
    bool *mask;
    /* ---8<--- snip */
}
So that's how they do it!
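For the original sizing problem, a minimal sketch using the same libbpf helper (map_fd, the key, and the u64 value type are assumptions here; the kernel hands back one value slot per possible CPU, each slot padded to 8 bytes):
#include <bpf/bpf.h>       // bpf_map_lookup_elem()
#include <bpf/libbpf.h>    // libbpf_num_possible_cpus()
#include <linux/types.h>   // __u32, __u64
#include <vector>

// Read one key of a PERCPU_ARRAY whose value type is a u64 counter.
// map_fd is an already-opened map file descriptor (assumption).
static int read_percpu_u64(int map_fd, __u32 key, std::vector<__u64> &out)
{
    int ncpus = libbpf_num_possible_cpus();   // parses /sys/devices/system/cpu/possible
    if (ncpus < 0)
        return ncpus;                          // negative errno on failure

    // Size the buffer by *possible* CPUs, not get_nprocs_conf(),
    // otherwise the kernel writes past the end of the buffer.
    out.assign(ncpus, 0);
    return bpf_map_lookup_elem(map_fd, &key, out.data());
}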

Is there a more efficient way of texturing a circle?

I'm trying to create a randomly generated "planet" (circle), and I want the areas of water, land and foliage to be decided by Perlin noise, or something similar. Currently I have this (pseudo)code:
for (int radius = 0; radius < circleRadius; radius++) {
    for (float theta = 0; theta < TWO_PI; theta += 0.1) {
        float x = radius * cosine(theta);
        float y = radius * sine(theta);
        int colour = whateverFunctionIMake(x, y);
        setPixel(x, y, colour);
    }
}
Not only does this not work (there are "gaps" in the circle because of precision issues), it's also incredibly slow. Even if I increase the resolution by changing the increment to 0.01, it still has missing pixels and gets even slower: I get 10 fps on my mediocre computer using Java (I know, not the best) with an increment of 0.01, which is certainly not acceptable for a game.
How might I achieve a similar result whilst being much less computationally expensive?
Thanks in advance.
Why not use:
(x-x0)^2 + (y-y0)^2 <= r^2
so simply:
int x0=?, y0=?, r=?; // your planet position and size
int x, y, xx, rr, col;
for (rr=r*r, x=-r; x<=r; x++)
    for (xx=x*x, y=-r; y<=r; y++)
        if (xx + (y*y) <= rr)
        {
            col = whateverFunctionIMake(x, y);
            setPixel(x0+x, y0+y, col);
        }
All on integers, no floating point or other slow operations, and no gaps ... Do not forget to use a random seed for the coloring function ...
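A sketch of such a seeded coloring function (plain integer hash noise rather than real Perlin noise; names and thresholds are mine, swap in your noise of choice):
// Deterministic, seed-based stand-in for whateverFunctionIMake():
// the same seed and the same (x,y) always give the same color, so the
// planet is reproducible. This is white-noise hashing, not Perlin noise.
unsigned int planetSeed = 12345;        // pick once per planet

int colourAt(int x, int y)
{
    unsigned int h = planetSeed;
    h ^= (unsigned int)x * 374761393u + (unsigned int)y * 668265263u;
    h = (h ^ (h >> 13)) * 1274126177u;
    h ^= h >> 16;
    if ((h & 0xFF) < 100) return 0x001E90FF;    // water
    if ((h & 0xFF) < 180) return 0x00C2B280;    // land
    return 0x00228B22;                          // foliage
}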
[Edit1] some more stuff
Now if you want speed, you need direct pixel access (on most platforms Pixels, SetPixel, PutPixel etc. are slow, because they do a lot of extra work like range checking and color conversions ...). Once you have direct pixel access, or render into your own array/image, you need to add clipping against the screen (so you do not have to check for each pixel whether it is inside the screen) to avoid access violations when your circle overlaps the screen edge.
As mentioned in the comments, you can get rid of the x*x and y*y inside the loop by updating them from the previous value (as both x and y are only incrementing). For more info about it see:
32bit SQRT in 16T without multiplication
The math is like this:
(x+1)^2 = (x+1)*(x+1) = x^2 + 2x + 1
so instead of xx = x*x we just do xx += x+x+1 if x has not been incremented yet, or xx += x+x-1 if x has already been incremented.
Putting it all together, I got this:
void circle(int x,int y,int r,DWORD c)
{
    // my Pixel access
    int **Pixels=Main->pyx;            // Pixels[y][x]
    int xs=Main->xs;                   // resolution
    int ys=Main->ys;
    // circle
    int sx,sy,sx0,sx1,sy0,sy1;         // [screen]
    int cx,cy,cx0,cy0;                 // [circle]
    int rr=r*r,cxx,cyy,cxx0,cyy0;      // [circle^2]
    // BBOX + screen clip
    sx0=x-r; if (sx0>=xs) return; if (sx0< 0) sx0=0;
    sy0=y-r; if (sy0>=ys) return; if (sy0< 0) sy0=0;
    sx1=x+r; if (sx1< 0) return; if (sx1>=xs) sx1=xs-1;
    sy1=y+r; if (sy1< 0) return; if (sy1>=ys) sy1=ys-1;
    cx0=sx0-x; cxx0=cx0*cx0;
    cy0=sy0-y; cyy0=cy0*cy0;
    // render
    for (cxx=cxx0,cx=cx0,sx=sx0;sx<=sx1;sx++,cxx+=cx,cx++,cxx+=cx)
        for (cyy=cyy0,cy=cy0,sy=sy0;sy<=sy1;sy++,cyy+=cy,cy++,cyy+=cy)
            if (cxx+cyy<=rr)
                Pixels[sy][sx]=c;
}
This renders a circle with radius 512 px in ~35 ms, so about 23.5 Mpx/s fill rate on my setup (AMD A8-5500 3.2GHz, Win7 64-bit, single thread, VCL/GDI 32-bit app built with BDS2006 C++). Just change the direct pixel access to the style/API you use ...
[Edit2]
To measure speed on x86/x64 you can use the RDTSC asm instruction. Here is some ancient C++ code I used ages ago (in a 32-bit environment without native 64-bit types):
double _rdtsc()
{
    LARGE_INTEGER x;   // unsigned 64bit integer variable from windows.h I think
    DWORD l,h;         // standard unsigned 32 bit variables
    asm {
        rdtsc
        mov l,eax
        mov h,edx
    }
    x.LowPart=l;
    x.HighPart=h;
    return double(x.QuadPart);
}
It returns the number of clock ticks your CPU has elapsed since power up. Beware that you should account for overflows, as on fast machines a 32-bit counter overflows within seconds. Also, each core has a separate counter, so set the affinity to a single CPU. On a variable-speed clock, heat up the CPU with some computation before measuring. To convert ticks to time, just divide by the CPU clock frequency; to obtain it, just do this:
t0=_rdtsc();
sleep(250);         // 250 ms
t1=_rdtsc();
fcpu = (t1-t0)*4;   // ticks per second
and measurement:
t0=_rdtsc();
// ... measured stuff ...
t1=_rdtsc();
time = (t1-t0)/fcpu;
If t1 < t0 you overflowed, and you need to add a constant to the result or measure again. Also, the measured process must take less time than the overflow period. To enhance precision, keep the OS timer granularity in mind. For more info see:
Measuring Cache Latencies
Cache size estimation on your system? (setting affinity example)
Negative clock cycle measurements with back-to-back rdtsc?
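For completeness, here is a rough modern-compiler equivalent of the snippets above, as a sketch only, using the __rdtsc() intrinsic (GCC/Clang: <x86intrin.h>, MSVC: <intrin.h>) and the same 250 ms calibration idea:
#include <x86intrin.h>   // __rdtsc() on GCC/Clang (use <intrin.h> on MSVC)
#include <chrono>
#include <thread>
#include <cstdio>

int main()
{
    // Calibrate: count TSC ticks over ~250 ms, scale to ticks per second.
    unsigned long long t0 = __rdtsc();
    std::this_thread::sleep_for(std::chrono::milliseconds(250));
    unsigned long long t1 = __rdtsc();
    double fcpu = double(t1 - t0) * 4.0;   // approx. ticks per second

    // Measure the code of interest.
    t0 = __rdtsc();
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; i++) sink += i * 0.5;   // measured stuff
    t1 = __rdtsc();

    std::printf("elapsed: %f s\n", double(t1 - t0) / fcpu);
    return 0;
}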

On Linux 64-bit, can ptrace() return a double?

Assuming addr is the address of a local variable on the stack, are the following correct ways of retrieving the values of the variables (ChildPid is the tracee's pid)?
double data = (double) ptrace(PTRACE_PEEKDATA, ChildPid, addr, 0);
float data = (float) ptrace(PTRACE_PEEKDATA, ChildPid, addr, 0);
Thanks.
The documentation says that PTRACE_PEEKDATA returns a word. It also says
The size of a "word" is determined by the operating-system variant (e.g., for 32-bit Linux it is 32 bits).
So you can't reliably use a single ptrace() call to get at a double on a 32-bit system, just half of it. The other half's address probably depends on whether the stack grows up or down. On a 64-bit system you'd have to figure out which half of the returned word holds the float.
So... it's all very system dependent on what you have to do.
Casting long to double won't get you the desired result. Casting numbers converts the numeric value; it doesn't copy bits. What you need is something like:
long pt = ptrace(PTRACE_PEEKDATA, ChildPid, addr, 0);
double result;
static_assert(sizeof(pt) == sizeof(result), "Oops, wrong word size!");
memcpy(&result, &pt, sizeof(result));
To get a float, you need to know which half of the word it occupies (normally you shouldn't use an addr that is not aligned to a word boundary). Thus you need something like the following:
long pt = ptrace(PTRACE_PEEKDATA, ChildPid, addr, 0);
float result;
static_assert(sizeof(pt) == 2*sizeof(result), "Oops, wrong word size!");
// either this (for the lower half of the word)
memcpy(&result, &pt, sizeof(result));
// or this (for the upper half of the word)
memcpy(&result, ((char*)&pt)+sizeof(result), sizeof(result));
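One more caveat worth a sketch: PTRACE_PEEKDATA returns the data in the return value, so -1 is also a legitimate word; clear errno before the call and check it afterwards. A minimal sketch (peek_double is my own name; ChildPid/addr as in the question):
#include <sys/ptrace.h>
#include <sys/types.h>
#include <cerrno>
#include <cstring>

// Read a double from the tracee's memory at addr.
// Returns false only if ptrace itself failed.
static bool peek_double(pid_t ChildPid, void *addr, double &result)
{
    errno = 0;
    long pt = ptrace(PTRACE_PEEKDATA, ChildPid, addr, 0);
    if (pt == -1 && errno != 0)            // -1 alone can be valid data
        return false;
    static_assert(sizeof(pt) == sizeof(result), "need a 64-bit word");
    std::memcpy(&result, &pt, sizeof(result));
    return true;
}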

Remove input driver bound to the HID interface

I'm playing with some driver code for a special kind of keyboard. This keyboard has special modes, and according to the specification those modes can only be enabled by sending and receiving feature reports.
I'm using the 'hid.c' file from user mode to send HID reports, but both 'hid_read' and 'hid_get_feature_report' fail with error number -1.
I already tried detaching the keyboard from the kernel drivers using libusb, but when I do that, 'hid_open' fails. I guess this is because the HID interface is already in use by 'input' or some other kernel driver. So maybe I should not unbind the kernel hidraw driver, but instead unbind the keyboard ('input') driver that sits on top of the 'hidraw' driver. Am I correct?
Any idea how I could do that? And how can I find out which drivers are using which other drivers, and which low-level driver is bound to which driver?
I found the answer to this myself.
The answer is to dig into the hidapi project and find its HID implementation on top of libusb.
Or you could receive the report directly:
int HID_API_EXPORT hid_get_feature_report(hid_device *dev, unsigned char *data, size_t length)
{
    int res = -1;
    int skipped_report_id = 0;
    int report_number = data[0];

    if (report_number == 0x0) {
        /* Offset the return buffer by 1, so that the report ID
           will remain in byte 0. */
        data++;
        length--;
        skipped_report_id = 1;
    }
    res = libusb_control_transfer(dev->device_handle,
            LIBUSB_REQUEST_TYPE_CLASS|LIBUSB_RECIPIENT_INTERFACE|LIBUSB_ENDPOINT_IN,
            0x01/*HID get_report*/,
            (3/*HID feature*/ << 8) | report_number,
            dev->interface,
            (unsigned char *)data, length,
            1000/*timeout millis*/);

    if (res < 0)
        return -1;

    if (skipped_report_id)
        res++;

    return res;
}
I'm sorry I can't post my actual code due to some legal reasons. However, the code above is from the hidapi implementation.
So even if you work with an old kernel, you still have a chance to make your driver work.
This answers this question too: https://stackoverflow.com/questions/30565999/kernel-version-2-6-32-does-not-support-hidiocgfeature
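If you take the raw libusb route from the answer above, the usual way past the kernel's usbhid/input binding is to detach it per interface before claiming it. A sketch only (handle and interface_number are assumptions; all the calls are standard libusb-1.0):
#include <libusb-1.0/libusb.h>

// Detach whatever kernel driver (usbhid, with the input layer on top of it)
// currently owns the interface, then claim it for this process.
static int claim_hid_interface(libusb_device_handle *handle, int interface_number)
{
    // Let libusb detach/re-attach the kernel driver around claim/release.
    libusb_set_auto_detach_kernel_driver(handle, 1);

    if (libusb_kernel_driver_active(handle, interface_number) == 1) {
        int rc = libusb_detach_kernel_driver(handle, interface_number);
        if (rc != 0 && rc != LIBUSB_ERROR_NOT_FOUND)
            return rc;
    }
    return libusb_claim_interface(handle, interface_number);
}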

Variant type storage and alignment issues

I've made a variant type to use instead of boost::variant. Mine works by storing the index of the current type in a list of the possible types, and storing the data in a byte array with enough space for the biggest type.
unsigned char data[my_types::max_size];
int type;
Now, the trouble comes when I write a value to this variant type. I use the following:
template<typename T>
void set(T a) {
    int t = type_index(T);
    if (t != -1) {
        type = t;
        puts("writing atom data");
        *((T *) data) = a; // THIS PART CRASHES!!!!
        puts("did it!");
    } else {
        throw atom_bad_assignment;
    }
}
The line that crashes is the one that stores data to the internal buffer. As you can see, I just cast the byte array directly to a pointer of the desired type. This gives me bad address signals and bus errors when trying to write some values.
I'm using GCC on a 64-bit system. How do I set the alignment for the byte array to make sure the address of the array is 64-bit aligned? (or properly aligned for any architecture I might port this project to).
EDIT: Thank you all, but the mistake was somewhere else. Apparently, Intel doesn't really care about alignment: aligned access is faster but not mandatory, and the program works fine this way. My problem was that I didn't clear the data buffer before writing, and this caused trouble with the constructors of some types. I will not, however, mark the question as answered, so more people can give me tips on alignment ;)
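Side note on the crash itself, independent of alignment: *((T *) data) = a assigns to memory where no T has been constructed yet, which breaks for types with non-trivial constructors or assignment. The usual fix is placement new. A sketch against the set() above (destroy_current() is a hypothetical helper that destroys the previously stored value):
#include <new>   // placement new

template<typename T>
void set(T a) {
    int t = type_index(T);          // as in the original set()
    if (t == -1)
        throw atom_bad_assignment;
    destroy_current();              // hypothetical: run the old value's destructor
    type = t;
    new (static_cast<void*>(data)) T(a);   // construct T directly in the raw buffer
}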
See http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Variable-Attributes.html
unsigned char data[my_types::max_size] __attribute__ ((aligned));
int type;
I believe
#pragma pack(64)
will work on all modern compilers; it definitely works on GCC.
A more correct solution (that doesn't mess with packing globally) would be:
#pragma pack(push, 64)
// define union here
#pragma pack(pop)
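Note that __attribute__((aligned)) and C++11 alignas raise the alignment of the storage, whereas #pragma pack only caps member alignment, so it may not help here. A portable sketch (my_types::max_size as in the question):
#include <cstddef>   // std::max_align_t

// Raw storage aligned for any scalar type it may ever hold.
alignas(std::max_align_t) unsigned char data[my_types::max_size];
int type;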
