OpenCL kernel does not work as expected (pyopencl) - python-3.x

I wrote an OpenCL kernel to increment 64-bit floating-point values in an array, but the results differ between the CPU and the GPU.
import numpy as np
import pyopencl as cl

CL_INC = '''
__kernel void inc_f64(__global const double *a_g, __global double *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + 1.0;
}
'''

def test(dev_type):
    ctx = cl.Context(dev_type=dev_type)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    prg = cl.Program(ctx, CL_INC).build()
    in_py = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    out_py = np.empty_like(in_py)
    in_cl = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_py)
    out_cl = cl.Buffer(ctx, mf.WRITE_ONLY, in_py.nbytes)
    prg.inc_f64(queue, in_py.shape, None, in_cl, out_cl)
    cl.enqueue_copy(queue, out_py, out_cl)
    queue.finish()
    return out_py

print('Run inc_f64() on CPU: ', end='')
print(test(cl.device_type.CPU))
print('Run inc_f64() on GPU: ', end='')
print(test(cl.device_type.GPU))
Output:
Run inc_f64() on CPU: [2. 3. 4. 5. 6.]
Run inc_f64() on GPU: [2.40000038e+001 3.20000076e+001 5.26354425e-315 0.00000000e+000
0.00000000e+000]
Hardware information:
[0] Apple / OpenCL 1.2 (Oct 31 2017 18:30:00)
|- [0:0] CPU / OpenCL 1.2 / Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz
|- [0:1] GPU / OpenCL 1.2 / HD Graphics 4000
Is it a hardware limitation or just a bug in the source code?

Your GPU probably doesn't support double-precision floating-point numbers. Have you checked whether it supports the cl_khr_fp64 extension?
Your kernel must also declare that it requires the extension:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
For more details, see the cl_khr_fp64 extension documentation.
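For example, you can query each device's extension string from pyopencl and prepend the pragma to the kernel source. This is a minimal sketch based on the code above (device attributes as exposed by pyopencl):

import pyopencl as cl

# Sketch: check every device for double-precision support (cl_khr_fp64)
# and prepend the required pragma to the kernel source from the question.
CL_INC = '#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n' + '''
__kernel void inc_f64(__global const double *a_g, __global double *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + 1.0;
}
'''

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name, 'cl_khr_fp64 supported:', 'cl_khr_fp64' in device.extensions)

If the HD Graphics 4000 does not list cl_khr_fp64, the garbage output is expected: the kernel simply cannot be run correctly with doubles on that device.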

Related

opencl speed and OUT_OF_RESOURCES

I am very new to OpenCL and trying my first program. I implemented a simple sinc filtering of waveforms. The code works, however I have two questions:
1. Once I increase the size of the input matrix (numrows needs to go up to 100 000) I get clEnqueueReadBuffer failed: OUT_OF_RESOURCES, even though the matrix is relatively small (a few MB). This is to some extent related to the work group size, I think, but could someone elaborate on how I could fix this issue? Could it be a driver issue?
UPDATE:
Leaving the group size as None crashes.
Adjusting the group size for the GPU (1,600) and the Intel HD (1,50) lets me go up to some 6400 rows. However, for larger sizes it crashes on the GPU, and the Intel HD just freezes and does nothing (0% on the resource monitor).
2. I have an Intel HD4600 and an Nvidia K1100M GPU available, however the Intel is ~2 times faster. I understand this is partially because I don't need to copy my arrays to separate memory for the Intel GPU, as I do for the external GPU, but I expected a marginal difference. Is this normal, or should my code be better optimized for the GPU? (resolved)
Thanks for your help!!
from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import os
os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'
import matplotlib.pyplot as plt

def resample_opencl(y, key='GPU'):
    #
    # selecting to run on GPU or CPU
    #
    newlen = 1200
    my_platform = cl.get_platforms()[0]
    device = my_platform.get_devices()[0]
    for found_platform in cl.get_platforms():
        if (key == 'GPU') and (found_platform.name == 'NVIDIA CUDA'):
            my_platform = found_platform
            device = my_platform.get_devices()[0]
            print("using GPU")
    #
    # Create context for GPU/CPU
    #
    ctx = cl.Context([device])
    #
    # Create queue for each kernel execution
    #
    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    # queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, """
    __kernel void resample(
        int M,
        __global const float *y_g,
        __global float *res_g)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);
        int gs = get_global_size(1);
        __private float tmp, tmp2, x;
        __private float t;
        t = (float)(col)/2 + 1;
        tmp = 0;
        tmp2 = 0;
        for (int i = 0; i < M; i++)
        {
            x = (float)(i+1);
            tmp2 = (t - x)*3.14159;
            if (t == x) {
                tmp += y_g[row*M + i];
            }
            else
                tmp += y_g[row*M + i] * sin(tmp2)/tmp2;
        }
        res_g[row*gs + col] = tmp;
    }
    """).build()
    mf = cl.mem_flags
    y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
    res = np.zeros((np.shape(y)[0], newlen)).astype(np.float32)
    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)
    M = np.array(600).astype(np.int32)
    prg.resample(queue, res.shape, (1, 200), M, y_g, res_g)
    event = cl.enqueue_copy(queue, res, res_g)
    print("success")
    event.wait()
    return res, event

if __name__ == "__main__":
    #
    # this is the number I need to increase (up to some 100 000)
    numrows = 2000
    Gaussian = lambda t: 10 * np.exp(-(t - 50)**2 / (2. * 2**2))
    x = np.linspace(1, 101, 600, endpoint=False).astype(np.float32)
    t = np.linspace(1, 101, 1200, endpoint=False).astype(np.float32)
    y = np.zeros((numrows, np.size(x)))
    y[:] = Gaussian(x).astype(np.float32)
    y = y.astype(np.float32)
    res, event = resample_opencl(y, 'GPU')
    print("OpenCl GPU profiler", (event.profile.end - event.profile.start)*1e-9)
    #
    # test plot if it worked
    #
    plt.figure()
    plt.plot(x, y[1, :], '+')
    plt.plot(t, res[1, :])
Re 1.
Your newlen has to be divisible by 200 because that is what you set as the local dimensions (1, 200). I increased it to 9600 and that still worked fine.
Update
After your update I would suggest not specifying the local dimensions and letting the implementation decide:
prg.resample(queue, res.shape, None, M, y_g, res_g)
It may also improve performance if newlen and numrows were multiples of 16.
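If you do want to keep an explicit local size, a common pattern (a sketch, not from the original answer; the kernel would then need a bounds check on col for the padded work-items) is to round the global size up to the next multiple of the local size:

def round_up(size, multiple):
    # Round `size` up to the next multiple of `multiple`, e.g. round_up(1201, 200) -> 1400.
    return ((size + multiple - 1) // multiple) * multiple

global_cols = round_up(1200, 200)  # 1200, already divisible by the local size of 200
padded_cols = round_up(1201, 16)   # 1216, padded up to a multiple of 16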
It is not a rule that an Nvidia GPU must perform better than an Intel GPU, especially since, according to Wikipedia, there is not a big difference in GFLOPS between them (549.89 vs 288–432). That GFLOPS comparison should be taken with a grain of salt, as one algorithm may be more suitable for one GPU than the other. In other words, going by these numbers you may expect one GPU to typically be faster than the other, but that can vary from algorithm to algorithm.
The kernel for 100000 rows requires:
y_g: 100000 * 600 * 4 = 240,000,000 bytes ≈ 229 MB
res_g: 100000 * 1200 * 4 = 480,000,000 bytes ≈ 457.8 MB
The Quadro K1100M has 2 GB of global memory, which should be sufficient for processing 100000 rows. The Intel HD 4600, from what I found, is limited only by system memory, so I suspect that shouldn't be a problem either.
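As a quick sanity check (a sketch, not part of the original answer; attribute names as exposed by pyopencl), you can compare those buffer sizes with what each device reports:

import numpy as np
import pyopencl as cl

numrows, M, newlen = 100000, 600, 1200
y_bytes = numrows * M * np.dtype(np.float32).itemsize         # ~229 MB
res_bytes = numrows * newlen * np.dtype(np.float32).itemsize  # ~458 MB

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name,
              '| global mem:', device.global_mem_size // 2**20, 'MB',
              '| max single allocation:', device.max_mem_alloc_size // 2**20, 'MB',
              '| buffers fit:', y_bytes + res_bytes <= device.global_mem_size)

Note that CL_DEVICE_MAX_MEM_ALLOC_SIZE is typically smaller than the total global memory, so it is worth checking both values.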
Re 2.
The time is not measured correctly: instead of the kernel execution time, the time of copying the data back to the host is being measured, so it's no surprise that this number is lower for the CPU. To measure the kernel execution time do:
event = prg.resample(queue, res.shape, (1,200),M, y_g, res_g)
event.wait()
print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)
I don't know how to measure the whole thing, including copying the data back to the host, using OpenCL profiling events in pyopencl, but using plain Python timing gives similar results:
import time

start = time.time()
...  # code to be measured
end = time.time()
print(end - start)
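One option (a sketch I have not verified on the poster's hardware) is to profile the kernel event and the copy event separately and add the two durations; this requires the queue to be created with PROFILING_ENABLE, as in the code above:

# Profile the kernel and the device-to-host copy separately, then sum them.
evt_kernel = prg.resample(queue, res.shape, None, M, y_g, res_g)
evt_copy = cl.enqueue_copy(queue, res, res_g)
evt_copy.wait()

kernel_s = (evt_kernel.profile.end - evt_kernel.profile.start) * 1e-9
copy_s = (evt_copy.profile.end - evt_copy.profile.start) * 1e-9
print("kernel:", kernel_s, "s, copy:", copy_s, "s, total:", kernel_s + copy_s, "s")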
I think I figured out the issue:
Intel HD: turning off profiling fixes everything; I can run the code without any issues.
The K1100M GPU still crashes, but I suspect this might be the timeout issue, as I am using the same video card for my display.

Halide AOT for OpenCL works fine as static library but not as shared object

I am trying to compile the code below both to a static library and to an object file:
Halide::Func f("f");
Halide::Var x("x");
f(x) = x;
f.gpu_tile(x, 4);
f.bound(x, 0, 16);
Halide::Target target = Halide::get_target_from_environment();
target.set_feature(Halide::Target::OpenCL);
target.set_feature(Halide::Target::Debug);
// f.compile_to_static_library("mylib", {}, "f", target);
// f.compile_to_file("mylib", {}, "f", target);
With static linking everything works fine and the output is correct:
Halide::Buffer<int> output(16);
f(output.raw_buffer());
output.copy_to_host();
std::cout << output(10) << std::endl;
But when I link the object file into a shared object,
gcc -shared -pthread mylib.o -o mylib.so
and open it from code (Ubuntu 16.04),
void* handle = dlopen("mylib.so", RTLD_NOW);
int (*func)(halide_buffer_t*);
*(void**)(&func) = dlsym(handle, "f");
func(output.raw_buffer());
I receive a CL_INVALID_MEM_OBJECT error. Here is the debug log:
CL: halide_opencl_init_kernels (user_context: 0x0, state_ptr: 0x7f1266b5a4e0, program: 0x7f1266957480, size: 1577
load_libopencl (user_context: 0x0)
Loaded OpenCL runtime library: libOpenCL.so
create_opencl_context (user_context: 0x0)
Got platform 'Intel(R) OpenCL', about to create context (t=6249430)
Multiple CL devices detected. Selecting the one with the most cores.
Device 0 has 20 cores
Device 1 has 4 cores
Selected device 0
device name: Intel(R) HD Graphics
device vendor: Intel(R) Corporation
device profile: FULL_PROFILE
global mem size: 1630 MB
max mem alloc size: 815 MB
local mem size: 65536
max compute units: 20
max workgroup size: 256
max work item dimensions: 3
max work item sizes: 256x256x256x0
clCreateContext -> 0x1899af0
clCreateCommandQueue 0x1a26a80
clCreateProgramWithSource -> 0x1a26ab0
clBuildProgram 0x1a26ab0 -D MAX_CONSTANT_BUFFER_SIZE=854799155 -D MAX_CONSTANT_ARGS=8
Time: 1.015832e+02 ms
CL: halide_opencl_run (user_context: 0x0, entry: kernel_f_s0_x___deprecated_block_id_x___block_id_x, blocks: 4x1x1, threads: 4x1x1, shmem: 0
clCreateKernel kernel_f_s0_x___deprecated_block_id_x___block_id_x -> Time: 1.361700e-02 ms
clSetKernelArg 0 4 [0x2e00010000000000 ...] 0
clSetKernelArg 1 8 [0x2149040 ...] 1
Mapped dev handle is: 0x2149040
Error: CL: clSetKernelArg failed: CL_INVALID_MEM_OBJECT
Aborted (core dumped)
Thank you very much for your help! Commit state c7375fa. I'm happy to provide extra information if necessary.
Solution: in this case the runtime is duplicated. Load the shared object with the RTLD_DEEPBIND flag.
void* handle = dlopen("mylib.so", RTLD_NOW | RTLD_DEEPBIND);
RTLD_DEEPBIND (since glibc 2.3.4)
Place the lookup scope of the symbols in this library ahead of the global scope. This means that a self-contained library will use its own symbols in preference to global symbols with the same name contained in libraries that have already been loaded. This flag is not specified in POSIX.1-2001.
https://linux.die.net/man/3/dlopen
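For reference, the same RTLD_DEEPBIND trick can be applied from a Python host process via ctypes (a sketch; mylib.so and the exported symbol f are the ones from the question, and this only works on Linux/glibc):

import ctypes
import os

# Load the Halide-generated shared object with RTLD_DEEPBIND so it resolves
# its own (embedded) runtime symbols instead of ones already loaded globally.
lib = ctypes.CDLL('./mylib.so', mode=os.RTLD_NOW | os.RTLD_DEEPBIND)
f = lib.f  # expects a halide_buffer_t*; set f.argtypes/restype before calling it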

texelFetch works on NVIDIA driver but not on Mesa

I have a 3D texture uploaded to a shader (see code below). I need to perform some bitwise operations on that texture to get bit-by-bit data. The fragment shader I wrote works under Linux with NVIDIA drivers, reported as
OpenGL version string: 4.5.0 NVIDIA 367.57
but does not work on another computer with an Intel integrated GPU and Mesa drivers, whose version information is:
OpenGL version string: 3.0 Mesa 11.2.0
OpenGL shading language version string: 1.30
What is the reason for this not to work on that system?
I know it supports version 130, and the compilation yields no errors.
What could be wrong, or, alternatively, how can I change this shader to NOT require version 130?
Here's the code:
// Fragment Shader
#version 130

in vec4 texcoord;

uniform uint width;
uniform uint height;
uniform usampler3D textureA;

void main() {
    uint x = uint(texcoord.x * float(width));
    uint y = uint(texcoord.y * float(height));
    uint shift = x % 8u;
    uint mask = 1u << shift;
    uint octet = texelFetch(textureA, ivec3(x / 8u, y % 256u, y / 256u), 0).r;
    uint value = (octet & mask) >> shift;
    if (value > 0u)
        gl_FragColor = vec4(1.0, 1.0, 1.0, 1.0);
    else
        gl_FragColor = vec4(0.0, 0.0, 0.0, 0.0);
}

Invoke kernel failure through cuda-gdb?

Is there a way to invoke a kernel failure using cuda-gdb? I've tried stepping through the kernel code and setting invalid index positions and odd values for variables, but I'm unable to trigger a "kernel Execution Failed" after continuing from an erroneous setting.
Does anyone know of a proper way to do this through cuda-gdb? I've read through the cuda-gdb documentation twice but might have missed some clues on how to achieve this if it is at all possible. If anyone knows of any tools/techniques that would be most appreciated, thanks.
I'm on CentOS 7 and my device's compute capability is 2.1. See below for the output of the uname -a command.
Linux john 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Is there a way to invoke kernel failure using cuda-gdb?
Yes, it's possible. Here is a fully worked example:
$ cat t678.cu
#include <stdio.h>

__global__ void kernel(int *data){
  int idx = 0;     // line 4
  idx += data[0];
  int tval = data[idx];
  data[1] = tval;
}

int main(){
  int *d_data;
  cudaMalloc(&d_data, 32*sizeof(int));
  cudaMemset(d_data, 0, 32*sizeof(int));
  kernel<<<1,1>>>(d_data);
  cudaDeviceSynchronize();
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) printf("kernel fail %s\n", cudaGetErrorString(err));
}
$ nvcc -g -G -o t678 t678.cu
$ cuda-gdb ./t678
NVIDIA (R) CUDA Debugger
7.5 release
Portions Copyright (C) 2007-2015 NVIDIA Corporation
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/user2/misc/t678...done.
(cuda-gdb) break t678.cu:4
Breakpoint 1 at 0x4026d5: file t678.cu, line 4.
(cuda-gdb) run
Starting program: /home/user2/misc/./t678
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff700a700 (LWP 8693)]
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 14, warp 2, lane 0]
Breakpoint 1, kernel<<<(1,1,1),(1,1,1)>>> (data=0x13047a0000) at t678.cu:4
4 int idx = 0; // line 4
(cuda-gdb) step
5 idx += data[0];
(cuda-gdb) print idx
$1 = 0
(cuda-gdb) set idx=1000000
(cuda-gdb) step
6 int tval = data[idx];
(cuda-gdb) print idx
$2 = 1000000
(cuda-gdb) step
CUDA Exception: Device Illegal Address
The exception was triggered in device 0.
Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
kernel<<<(1,1,1),(1,1,1)>>> (data=0x13047a0000) at t678.cu:7
7 data[1] = tval;
(cuda-gdb)
In the above cuda-gdb output, you can see that after setting the idx variable to a large value, it results in an index-out-of-bounds (illegal address) error when executing the following line in the debugger:
int tval = data[idx];

Floating Point Bug in OpenCL via ssh

I found a problem with floating point arithmetic in OpenCL. This is my kernel:
__kernel void MyKernel(__global const float4* _pInput, __global float4* _pOutput)
{
    int IndexOfRow = get_global_id(0);
    int NumberOfRows = get_global_size(0);
    int IndexOfColumn = get_global_id(1);
    int NumberOfColumns = get_global_size(1);
    ...
    _pOutput[0] = 1.9f * 100.0f; // constant float return value
}
After the kernel execution and the download of the output buffer, the result is always 100 on different clients connected via SSH. If I execute the program locally, the result is 190. It seems that the digits after the decimal point are cut off.
The operating system is openSUSE Linux with AMD OpenCL 1.2.
What's the problem?
I just found the solution. It depends on the LANG environment variable: it has to be en_US.UTF-8. You can check it with env | grep LANG.
That's probably a JIT compiler bug. In Germany, floating-point numbers are written with a comma (",") instead of a period ("."), which would explain why "1.9f" ends up being parsed as 1 and the result becomes 100.
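If you cannot change the system-wide locale, one workaround (a sketch with an illustrative program name, not from the original answer) is to force an English locale just for the process that builds the OpenCL program:

import os
import subprocess

# Launch the OpenCL host program with LANG/LC_ALL forced to en_US.UTF-8 so the
# runtime compiler parses "1.9f" with a '.' decimal separator.
env = dict(os.environ, LANG='en_US.UTF-8', LC_ALL='en_US.UTF-8')
subprocess.run(['./my_opencl_program'], env=env, check=True)  # placeholder binary name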
