Conditional Compilation of CUDA Function - visual-c++

I created a CUDA function for calculating the sum of an image using its histogram.
I'm trying to compile the kernel and the wrapper function for multiple compute capabilities.
Kernel:
__global__ void calc_hist(unsigned char* pSrc, int* hist, int width, int height, int pitch)
{
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
#if __CUDA_ARCH__ > 110 //Shared Memory For Devices Above Compute 1.1
__shared__ int shared_hist[256];
#endif
int global_tid = yIndex * pitch + xIndex;
int block_tid = threadIdx.y * blockDim.x + threadIdx.x;
if(xIndex>=width || yIndex>=height) return;
#if __CUDA_ARCH__ == 110 //Calculate Histogram In Global Memory For Compute 1.1
atomicAdd(&hist[pSrc[global_tid]],1); /*< Atomic Add In Global Memory */
#elif __CUDA_ARCH__ > 110 //Calculate Histogram In Shared Memory For Compute Above 1.1
shared_hist[block_tid] = 0; /*< Clear Shared Memory */
__syncthreads();
atomicAdd(&shared_hist[pSrc[global_tid]],1); /*< Atomic Add In Shared Memory */
__syncthreads();
if(shared_hist[block_tid] > 0) /* Only Write Non Zero Bins Into Global Memory */
atomicAdd(&(hist[block_tid]),shared_hist[block_tid]);
#else
return; //Do Nothing For Devices Of Compute Capabilty 1.0
#endif
}
Wrapper Function:
int sum_8u_c1(unsigned char* pSrc, double* sum, int width, int height, int pitch, cudaStream_t stream = NULL)
{
#if __CUDA_ARCH__ == 100
printf("Compute Capability Not Supported\n");
return 0;
#else
int *hHist,*dHist;
cudaMalloc(&dHist,256*sizeof(int));
cudaHostAlloc(&hHist,256 * sizeof(int),cudaHostAllocDefault);
cudaMemsetAsync(dHist,0,256 * sizeof(int),stream);
dim3 Block(16,16);
dim3 Grid;
Grid.x = (width + Block.x - 1)/Block.x;
Grid.y = (height + Block.y - 1)/Block.y;
calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);
cudaMemcpyAsync(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost,stream);
cudaStreamSynchronize(stream);
(*sum) = 0.0;
for(int i=1; i<256; i++)
(*sum) += (hHist[i] * i);
printf("sum = %f\n",(*sum));
cudaFree(dHist);
cudaFreeHost(hHist);
return 1;
#endif
}
Question 1:
When compiling for sm_10, the wrapper and the kernel shouldn't execute. But that is not what happens. The whole wrapper function executes. The output shows sum = 0.0.
I expected the output to be Compute Capability Not Supported as I have added the printf statement in the start of the wrapper function.
How can I prevent the wrapper function from executing on sm_10? I don't want to add any run-time checks like if statements etc. Can it be achieved through template meta programming?
Question 2:
When compiling for greater than sm_10, the program executes correctly only if I add cudaStreamSynchronize after the kernel call. But if I do not synchronize, the output is sum = 0.0. Why is it happening? I want the function to be asynchronous w.r.t the host as much as possible. Is it possible to shift the only loop inside the kernel?
I am using GTX460M, CUDA 5.0, Visual Studio 2008 on Windows 8.

Ad. Question 1
As already Robert explained in the comments - __CUDA_ARCH__ is defined only when compiling device code. To clarify: when you invoke nvcc, the code is parsed and compiled twice - once for CPU and once for GPU. The existence of __CUDA_ARCH__ can be used to check which of those two passes occurs, and then for the device code - as you do in the kernel - it can be checked which GPU are you targetting.
However, for the host side it is not all lost. While you don't have __CUDA_ARCH__, you can call API function cudaGetDeviceProperties which returns lots of information about your GPU. In particular, you can be interested in fields major and minor which indicate the Compute Capability. Note - this is done at run-time, not a preprocessing stage, so the same CPU code will work on all GPUs.
Ad. Question 2
Kernel calls and cudaMemoryAsync are asynchronous. It means that if you don't call cudaStreamSynchronize (or alike) the followup CPU code will continue running even if your GPU hasn't finished your work. This means, that the data you copy from dHist to hHist might not be there yet when you begin operating on hHist in the loop. If you want to work on the output from a kernel you have to wait till the kernel finishes.
Note that cudaMemcpy (without Async) has an implicit synchronization inside.

Related

How to achieve opencl kernel pipeline

I am working on my project using OpenCl. In order to improve the performance of my algorithm, is it possible to pipeline a single kernel? If a kernel consists of many steps, lets say A,B,C, I want A accept new data as soon as it finish its part and pass it to B. I can create channels between them, but I dont know how to do it in detail.
Can I write A,B,C(3 kernels) in a .cl file ? but how to enqueueNDRange?
I am using Altera SDK for FPGA HPC development.
Thanks.
Pipeline can be achieved by using several kernels connected with channels. With all kernels running concurrently, data is transferred from one to another:
Pipeline example from Intel FPGA OpenCL SDK Programming Guide
Very basic example of such pipeline would be:
channel int foo_bar_channel;
channel float bar_baz_channel;
__kernel void foo(__global int* in) {
for (int i = 0; i < 1024; ++i) {
int value = in[i];
value = clamp(value, 0, 255); // do some work
write_channel_altera(foo_bar_channel, value); // send data to the next kernel
}
}
__kernel void bar() {
for (int i = 0; i < 1024; ++i) {
int value = read_channel_altera(foo_bar_channel); // take data from foo
float fvalue = (float) value;
write_channel_altera(bar_baz_channel, value); // send data to the next kernel
}
}
__kernel void baz(__global int* out) {
for (int i = 0; i < 1024; ++i) {n
float value = read_channel_altera(bar_baz_channel);
float s = sin(value);
out[i] = s; // write result in the end
}
}
You can write all kernels in the a single .cl file, or use different files and then #include them into a main .cl file.
We want all our kernels run concurrently, so they can accept data from each other. Since only in-order command queues are supported, we have to use different queue for each kernel:
cl_queue foo_queue = clCreateCommandQueue(...);
cl_queue bar_queue = clCreateCommandQueue(...);
cl_queue baz_queue = clCreateCommandQueue(...);
clEnqueueTask(foo_queue, foo_kernel);
clEnqueueTask(bar_queue, bar_kernel);
clEnqueueTask(baz_queue, baz_kernel);
clFinish(baz_queue); // last kernel in our pipeline
Unlike OpenCL programming for GPU, we rely on a data pipelining, so NDRange kernels would not give us any benefit. Single work-item kernels are used instead of NDRange kernels, so we enqueue them using clEnqueueTask function. Additional kernel attribute (reqd_work_group_size) can be used to mark a single work-item kernel, to give the compiler some room for optimizations.
Check the Intel FPGA SDK for OpenCL Programming Guide for more information about channels and kernel attributes (specifically, section 1.6.4 Implementing the Intel FPGA SDK for OpenCL Channels Extension):
https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf

CUDA Programming: Compilation Error

I am making a CUDA program that implements the data parallel prefix sum calculation operating upon N numbers. My code is also supposed to generate the numbers on the host using a random number generator. However, I seem to always run into a "unrecognized token" and "expected a declaration" error on the ending bracket of int main when attempting to compile. I am running the code on Linux.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <math.h>
__global__ void gpu_cal(int *a,int i, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid>=i && tid < n) {
a[tid] = a[tid]+a[tid-i];
}
}
int main(void)
{
int key;
int *dev_a;
int N=10;//size of 1D array
int B=1;//blocks in the grid
int T=10;//threads in a block
do{
printf ("Some limitations:\n");
printf (" Maximum number of threads per block = 1024\n");
printf (" Maximum sizes of x-dimension of thread block = 1024\n");
printf (" Maximum size of each dimension of grid of thread blocks = 65535\n");
printf (" N<=B*T\n");
do{
printf("Enter size of array in one dimension, currently %d\n",N);
scanf("%d",&N);
printf("Enter size of blocks in the grid, currently %d\n",B);
scanf("%d",&B);
printf("Enter size of threads in a block, currently %d\n",T);
scanf("%d",&T);
if(N>B*T)
printf("N>B*T, this will result in an incorrect result generated by GPU, please try again\n");
if(T>1024)
printf("T>1024, this will result in an incorrect result generated by GPU, please try again\n");
}while((N>B*T)||(T>1024));
cudaEvent_t start, stop; // using cuda events to measure time
float elapsed_time_ms1, elapsed_time_ms3;
int a[N],gpu_result[N];//for result generated by GPU
int cpu_result[N];//CPU result
cudaMalloc((void**)&dev_a,N * sizeof(int));//allocate memory on GPU
int i,j;
srand(1); //initialize random number generator
for (i=0; i < N; i++) // load array with some numbers
a[i] = (int)rand() ;
cudaMemcpy(dev_a, a , N*sizeof(int),cudaMemcpyHostToDevice);//load data from host to device
cudaEventCreate(&start); // instrument code to measure start time
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
//GPU computation
for(j=0;j<log(N)/log(2);j++){
gpu_cal<<<B,T>>>(dev_a,pow(2,j),N);
cudaThreadSynchronize();
}
cudaMemcpy(gpu_result,dev_a,N*sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0); // instrument code to measue end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms1, start, stop );
printf("\n\n\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1); // print out execution time
//CPU computation
cudaEventRecord(start, 0);
for(i=0;i<N;i++)
{
cpu_result[i]=0;
for(j=0;j<=i;j++)
{
cpu_result[i]=cpu_result[i]+a[j];
}
}
cudaEventRecord(stop, 0); // instrument code to measue end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms3, start, stop );
printf("Time to calculate results on CPU: %f ms.\n\n", elapsed_time_ms3); // print out execution time
//Error check
for(i=0;i < N;i++) {
if (gpu_result[i] != cpu_result[i] ) {
printf("ERROR!!! CPU and GPU create different answers\n");
break;
}
}
//Calculate speedup
printf("Speedup on GPU compared to CPU= %f\n", (float) elapsed_time_ms3 / (float) elapsed_time_ms1);
printf("\nN=%d",N);
printf("\nB=%d",B);
printf("\nT=%d",T);
printf("\n\n\nEnter '1' to repeat, or other integer to terminate\n");
scanf("%d",&key);
}while(key == 1);
cudaFree(dev_a);//deallocation
return 0;
}​
The very last } in your code is a Unicode character. If you delete this entire line, and retype the }, the error will be gone.
There are two compile errors in your code.
First, Last ending bracket is a unicode character, so you should resave your code as unicode or delete and rewrite the last ending bracket.
Second, int type variable N which used at this line - int a[N],gpu_result[N];//for result generated by GPU
was declared int type, but it's not allowed in c or c++ compiler, so you should change the N declaration as const int N.

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N
You can use new work_group_reduce_add() function for sum reduction inside single work group if you have support for OpenCL C 2.0 features
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"

What difference between VC++ 2010 Express and Borland C++ 3.1 in compiling simple c++ code file?

I already don't know what to think or what to do. Next code compiles fine in both IDEs, but in VC++ case it causes weird heap corruptions messages like:
"Windows has triggered a breakpoint in Lab4.exe.
This may be due to a corruption of the heap, which indicates a bug in Lab4.exe or any of the DLLs it has loaded.
This may also be due to the user pressing F12 while Lab4.exe has focus.
The output window may have more diagnostic information."
It happens when executing Task1_DeleteMaxElement function and i leave comments there.
Nothing like that happens if compiled in Borland C++ 3.1 and everything work as expected.
So... what's wrong with my code or VC++?
#include <conio.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <memory.h>
void PrintArray(int *arr, int arr_length);
int Task1_DeleteMaxElement(int *arr, int arr_length);
int main()
{
int *arr = NULL;
int arr_length = 0;
printf("Input the array size: ");
scanf("%i", &arr_length);
arr = (int*)realloc(NULL, arr_length * sizeof(int));
srand(time(NULL));
for (int i = 0; i < arr_length; i++)
arr[i] = rand() % 100 - 50;
PrintArray(arr, arr_length);
arr_length = Task1_DeleteMaxElement(arr, arr_length);
PrintArray(arr, arr_length);
getch();
return 0;
}
void PrintArray(int *arr, int arr_length)
{
printf("Printing array elements\n");
for (int i = 0; i < arr_length; i++)
printf("%i\t", arr[i]);
printf("\n");
}
int Task1_DeleteMaxElement(int *arr, int arr_length)
{
printf("Looking for max element for deletion...");
int current_max = arr[0];
for (int i = 0; i < arr_length; i++)
if (arr[i] > current_max)
current_max = arr[i];
int *temp_arr = NULL;
int temp_arr_length = 0;
for (int j = 0; j < arr_length; j++)
if (arr[j] < current_max)
{
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int)); //if initial array size more then 4, breakpoint activates here
temp_arr[temp_arr_length] = arr[j];
temp_arr_length++;
}
arr = (int*)realloc(arr, temp_arr_length * sizeof(int));
memcpy(arr, temp_arr, temp_arr_length);
realloc(temp_arr, 0); //if initial array size is less or 4, breakpoint activates at this line execution
return temp_arr_length;
}
My guess is VC++2010 is rightly detecting memory corruption, which is ignored by Borland C++ 3.1.
How does it work?
For example, when allocating memory for you, VC++2010's realloc could well "mark" the memory around it with some special value. If you write over those values, realloc detects the corruption, and then crashes.
The fact it works with Borland C++ 3.1 is pure luck. This is a very very old compiler (20 years!), and thus, would be more tolerant/ignorant of this kind of memory corruption (until some random, apparently unrelated crash occurred in your app).
What's the problem with your code?
The source of your error:
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int))
For the following temp_arr_length values, in 32-bit, the allocation will be of:
0 : 4 bytes = 1 int when you expect 1 (Ok)
1 : 5 bytes = 1.25 int when you expect 2 (Error!)
2 : 6 bytes = 1.5 int when you expect 3 (Error!)
You got your priotities wrong. As you can see:
temp_arr_length + 1 * sizeof(int)
should be instead
(temp_arr_length + 1) * sizeof(int)
You allocated too little memory,and thus wrote well beyond what was allocated for you.
Edit (2012-05-18)
Hans Passant commented on allocator diagnostics. I took the liberty of copying them here until he writes his own answer (I've already seen coments disappear on SO):
It is Windows that reminds you that you have heap corruption bugs, not VS. BC3 uses its own heap allocator so Windows can't see your code mis-behaving. Not noticing these bugs before is pretty remarkable but not entirely impossible.
[...] The feature is not available on XP and earlier. And sure, one of the reasons everybody bitched about Vista. Blaming the OS for what actually were bugs in the program. Win7 is perceived as a 'better' OS in no small part because Vista forced programmers to fix their bugs. And no, the Microsoft CRT has implemented malloc/new with HeapAlloc for a long time. Borland had a history of writing their own, beating Microsoft for a while until Windows caught up.
[...] the CRT uses a debug allocator like you describe, but it generates different diagnostics. Roughly, the debug allocator catches small mistakes, Windows catches gross ones.
I found the following links explaining what is done to memory by Windows/CRT allocators before and after allocation/deallocation:
http://www.codeguru.com/cpp/w-p/win32/tutorials/article.php/c9535/Inside-CRT-Debug-Heap-Management.htm
https://stackoverflow.com/a/127404/14089
http://www.nobugs.org/developer/win32/debug_crt_heap.html#table
The last link contains a table I printed and always have near me at work (this was this table I was searching for when finding the first two links... :- ...).
If it is crashing in realloc, then you are over stepping, the book keeping memory of malloc & free.
The incorrect code is as below:
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int));
should be
temp_arr = (int*)realloc(temp_arr, (temp_arr_length + 1) * sizeof(int));
Due to operator precedence of * over +, in the next run of the loop when you are expecting realloc to passed 8 bytes, it might be passing only 5 bytes. So, in your second iteration, you will be writing into 3 bytes someone else's memory, which leads to memory corruption and eventual crash.
Also
memcpy(arr, temp_arr, temp_arr_length);
should be
memcpy(arr, temp_arr, temp_arr_length * sizeof(int) );

how to generate a delay

i'm new to kernel programming and i'm trying to understand some basics of OS. I am trying to generate a delay using a technique which i've implemented successfully in a 20Mhz microcontroller.
I know this is a totally different environment as i'm using linux centOS in my 2 GHz Core 2 duo processor.
I've tried the following code but i'm not getting a delay.
#include<linux/kernel.h>
#include<linux/module.h>
int init_module (void)
{
unsigned long int i, j, k, l;
for (l = 0; l < 100; l ++)
{
for (i = 0; i < 10000; i ++)
{
for ( j = 0; j < 10000; j ++)
{
for ( k = 0; k < 10000; k ++);
}
}
}
printk ("\nhello\n");
return 0;
}
void cleanup_module (void)
{
printk ("bye");
}
When i dmesg after inserting the module as quickly as possile for me, the string "hello" is already there. If my calculation is right, the above code should give me atleast 10 seconds delay.
Why is it not working? Is there anything related to threading? How could a 20 Ghz processor execute the above code instantly without any noticable delay?
The compiler is optimizing your loop away since it has no side effects.
To actually get a 10 second (non-busy) delay, you can do something like this:
#include <linux/sched.h>
//...
unsigned long to = jiffies + (10 * HZ); /* current time + 10 seconds */
while (time_before(jiffies, to))
{
schedule();
}
or better yet:
#include <linux/delay.h>
//...
msleep(10 * 1000);
for short delays you may use mdelay, ndelay and udelay
I suggest you read Linux Device Drivers 3rd edition chapter 7.3, which deals with delays for more information
To answer the question directly, it's likely your compiler seeing that these loops don't do anything and "optimizing" them away.
As for this technique, what it looks like you're trying to do is use all of the processor to create a delay. While this may work, an OS should be designed to maximize processor time. This will just waste it.
I understand it's experimental, but just the heads up.

Resources