I am implementing a simple OpenMP parallel for loop in Visual Studio 2012.
The implementation file is a .cu file compiled by nvcc.
With OpenMP, the for loop itself gets faster, but the other parts of the function get slower.
I couldn't find an answer even after reading through many related questions.
The code follows:
void Test()
{
    unsigned char* pbDest = (unsigned char*)malloc(1000000);
    unsigned char* pbSrc = (unsigned char*)malloc(3000000);

    #pragma omp parallel for shared(pbDest, pbSrc)
    for (int i = 0; i < 1000000; i++)
    {
        pbDest[i] = (unsigned char)((299 * pbSrc[3 * i] + 587 * pbSrc[3 * i + 1] + 114 * pbSrc[3 * i + 2]) / 1000);
    }

    ... // other part

    free(pbDest);
    free(pbSrc);
}
This Test function executes in about 100 ms without OpenMP, but in about 120 ms with it.
At first I suspected the OpenMP for loop itself, but that part is optimized correctly with OpenMP, going from about 50 ms to 20 ms.
What is the problem?
Any help would be appreciated.
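One way to narrow this down is to time the parallel loop and the remaining part separately, for example with omp_get_wtime. The sketch below is illustrative only; the function name and the placement of the timing calls are not from the original code.

#include <cstdio>
#include <omp.h>

// Rough sketch: time the parallel loop and the rest of the work separately,
// to see which part accounts for the extra ~20 ms.
void TestTimed(unsigned char* pbDest, unsigned char* pbSrc)
{
    double t0 = omp_get_wtime();

    #pragma omp parallel for shared(pbDest, pbSrc)
    for (int i = 0; i < 1000000; i++)
        pbDest[i] = (unsigned char)((299 * pbSrc[3 * i] + 587 * pbSrc[3 * i + 1]
                                     + 114 * pbSrc[3 * i + 2]) / 1000);

    double t1 = omp_get_wtime();

    // ... other part ...

    double t2 = omp_get_wtime();
    printf("loop: %.1f ms, rest: %.1f ms\n",
           (t1 - t0) * 1000.0, (t2 - t1) * 1000.0);
}

If the loop still takes about 20 ms with OpenMP enabled, the extra time is being spent outside the parallel region and can be investigated there.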
I am currently working on my own implementation of the Buddhabrot. So far I am using the std::thread class from C++11 to work through the following iteration concurrently:
void iterate(float *res){
    //generate starting point
    std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(-1.5,1.5);
    double ReC,ImC;
    double ReS,ImS,ReS_;
    unsigned int steps;
    unsigned int visitedPos[maxCalcIter];
    unsigned int succSamples(0);
    //iterate over it
    while(succSamples < samplesPerThread){
        steps = 0;
        ReC = distribution(generator)-0.4;
        ImC = distribution(generator);
        double p(sqrt((ReC-0.25)*(ReC-0.25) + ImC*ImC));
        while (( ((ReC+1)*(ReC+1) + ImC*ImC) < 0.0625) || (ReC < p - 2*p*p + 0.25)){
            ReC = distribution(generator)-0.4;
            ImC = distribution(generator);
            p = sqrt((ReC-0.25)*(ReC-0.25) + ImC*ImC);
        }
        ReS = ReC;
        ImS = ImC;
        for (unsigned int j = maxCalcIter; (ReS*ReS + ImS*ImS < 4)&&(j--); ){
            ReS_ = ReS;
            ReS *= ReS;
            ReS += ReC - ImS*ImS;
            ImS *= 2*ReS_;
            ImS += ImC;
            if ((ReS+0.5)*(ReS+0.5) + ImS*ImS < 4){
                visitedPos[steps] = int((ReS+2.5)*0.25*outputSize)*outputSize + int((ImS+2)*0.25*outputSize);
            }
            steps++;
        }
        if ((steps > minCalcIter)&&(ReS*ReS + ImS*ImS > 4)){
            succSamples++;
            for (int j = steps; j--;){
                //std::cout << visitedPos[j] << std::endl;
                res[visitedPos[j]]++;
            }
        }
    }
}
So basically each thread keeps working until it has generated enough trajectories of sufficient length, which in expectation takes the same time in every thread.
But I really have the feeling that this function might be quite unoptimized, since its code is still so readable. Can anybody come up with some fancy optimizations? When it comes to compiling I just use:
g++ -O4 -std=c++11 -I/usr/include/OpenEXR/ -L/usr/lib64/ -lHalf -lIlmImf -lm buddha_cpu.cpp -o buddha_cpu
So any hints on crunching some more numbers per second would be really appreciated. Also, any links to further literature are totally welcome.
Did you check that -O4 is actually faster than -O2? Above -O2, it is not guaranteed.
Also, if this compilation is only for you, try -march=native. This will take advantage of your specific CPU architecture, but the resulting binary might crash with SIGSEGV on older/different machines.
You did not show how you launch your threads, if I see correctly. Make sure your threads do not write to memory locations in the same cache line. Writing to the same cache line from different threads forces the CPU cores to synchronize their caches, which is a huge performance degradation (false sharing).
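To illustrate the cache-line point: one common pattern is to give every thread its own accumulation buffer and merge the buffers once at the end, so no two threads ever write to the same cache line during the hot loop. The sketch below is illustrative only; the function name, buffer layout, and thread launch are not taken from the question's code.

#include <cstddef>
#include <thread>
#include <vector>

// Sketch: per-thread accumulation buffers avoid false sharing on the shared
// result array; the buffers are merged single-threaded once all workers join.
void accumulate_without_false_sharing(std::size_t bins, unsigned numThreads,
                                      std::vector<float>& result)
{
    std::vector<std::vector<float>> perThread(numThreads,
                                              std::vector<float>(bins, 0.0f));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
        workers.emplace_back([&, t] {
            // ... run the per-thread iteration here, writing hits into
            // perThread[t] instead of a shared array ...
            perThread[t][0] += 1.0f; // placeholder write
        });
    for (auto& w : workers) w.join();

    // Cheap single-threaded merge compared to the iteration itself.
    result.assign(bins, 0.0f);
    for (const auto& buf : perThread)
        for (std::size_t i = 0; i < bins; ++i)
            result[i] += buf[i];
}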
I created a CUDA function for calculating the sum of an image's pixel values using its histogram.
I'm trying to compile the kernel and the wrapper function for multiple compute capabilities.
Kernel:
__global__ void calc_hist(unsigned char* pSrc, int* hist, int width, int height, int pitch)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

#if __CUDA_ARCH__ > 110 // Shared memory for devices above compute 1.1
    __shared__ int shared_hist[256];
#endif

    int global_tid = yIndex * pitch + xIndex;
    int block_tid  = threadIdx.y * blockDim.x + threadIdx.x;

    if (xIndex >= width || yIndex >= height) return;

#if __CUDA_ARCH__ == 110 // Calculate histogram in global memory for compute 1.1
    atomicAdd(&hist[pSrc[global_tid]], 1);          /*< Atomic add in global memory */
#elif __CUDA_ARCH__ > 110 // Calculate histogram in shared memory for compute above 1.1
    shared_hist[block_tid] = 0;                     /*< Clear shared memory */
    __syncthreads();

    atomicAdd(&shared_hist[pSrc[global_tid]], 1);   /*< Atomic add in shared memory */
    __syncthreads();

    if (shared_hist[block_tid] > 0)                 /* Only write non-zero bins into global memory */
        atomicAdd(&(hist[block_tid]), shared_hist[block_tid]);
#else
    return; // Do nothing for devices of compute capability 1.0
#endif
}
Wrapper Function:
int sum_8u_c1(unsigned char* pSrc, double* sum, int width, int height, int pitch, cudaStream_t stream = NULL)
{
#if __CUDA_ARCH__ == 100
    printf("Compute Capability Not Supported\n");
    return 0;
#else
    int *hHist, *dHist;
    cudaMalloc(&dHist, 256 * sizeof(int));
    cudaHostAlloc(&hHist, 256 * sizeof(int), cudaHostAllocDefault);

    cudaMemsetAsync(dHist, 0, 256 * sizeof(int), stream);

    dim3 Block(16,16);
    dim3 Grid;
    Grid.x = (width + Block.x - 1)/Block.x;
    Grid.y = (height + Block.y - 1)/Block.y;

    calc_hist<<<Grid,Block,0,stream>>>(pSrc, dHist, width, height, pitch);

    cudaMemcpyAsync(hHist, dHist, 256 * sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    (*sum) = 0.0;
    for (int i = 1; i < 256; i++)
        (*sum) += (hHist[i] * i);

    printf("sum = %f\n", (*sum));

    cudaFree(dHist);
    cudaFreeHost(hHist);

    return 1;
#endif
}
Question 1:
When compiling for sm_10, the wrapper and the kernel shouldn't execute. But that is not what happens: the whole wrapper function executes, and the output shows sum = 0.0.
I expected the output to be Compute Capability Not Supported, as I have added the printf statement at the start of the wrapper function.
How can I prevent the wrapper function from executing on sm_10? I don't want to add any run-time checks like if statements. Can it be achieved through template metaprogramming?
Question 2:
When compiling for anything greater than sm_10, the program executes correctly only if I add cudaStreamSynchronize after the kernel call. But if I do not synchronize, the output is sum = 0.0. Why is this happening? I want the function to be asynchronous with respect to the host as much as possible. Is it possible to move the only loop (the final summation) into the kernel?
I am using GTX460M, CUDA 5.0, Visual Studio 2008 on Windows 8.
Ad. Question 1
As Robert already explained in the comments, __CUDA_ARCH__ is defined only when compiling device code. To clarify: when you invoke nvcc, the code is parsed and compiled twice, once for the CPU and once for the GPU. The presence of __CUDA_ARCH__ can be used to check which of those two passes is occurring, and then, within device code, as you do in the kernel, it can be used to check which GPU you are targeting.
However, for the host side all is not lost. While you don't have __CUDA_ARCH__, you can call the API function cudaGetDeviceProperties, which returns a lot of information about your GPU. In particular, you may be interested in the fields major and minor, which indicate the compute capability. Note that this is done at run time, not at the preprocessing stage, so the same CPU code will work on all GPUs.
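For example, a minimal host-side sketch of that run-time check might look like the following; the helper name is made up, and device index 0 is an assumption:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query the compute capability of the current device at run time and
// refuse to run on sm_10, mirroring the printf in the wrapper above.
bool device_supported()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // assumes device 0 is the one in use

    if (prop.major == 1 && prop.minor == 0) {
        printf("Compute Capability Not Supported\n");
        return false;
    }
    return true;
}

The wrapper could then simply start with if (!device_supported()) return 0; instead of relying on the preprocessor.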
Ad. Question 2
Kernel calls and cudaMemcpyAsync are asynchronous. This means that if you don't call cudaStreamSynchronize (or the like), the subsequent CPU code will continue running even if your GPU hasn't finished its work. It also means that the data you copy from dHist to hHist might not be there yet when you begin operating on hHist in the loop. If you want to work on the output of a kernel, you have to wait until the kernel finishes.
Note that cudaMemcpy (without Async) has an implicit synchronization inside.
I no longer know what to think or what to do. The following code compiles fine in both IDEs, but in the VC++ case it causes weird heap corruption messages like:
"Windows has triggered a breakpoint in Lab4.exe.
This may be due to a corruption of the heap, which indicates a bug in Lab4.exe or any of the DLLs it has loaded.
This may also be due to the user pressing F12 while Lab4.exe has focus.
The output window may have more diagnostic information."
It happens when executing the Task1_DeleteMaxElement function; I have left comments in the code marking where.
Nothing like that happens when the code is compiled with Borland C++ 3.1, and everything works as expected.
So... what's wrong with my code or VC++?
#include <conio.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <memory.h>

void PrintArray(int *arr, int arr_length);
int Task1_DeleteMaxElement(int *arr, int arr_length);

int main()
{
    int *arr = NULL;
    int arr_length = 0;

    printf("Input the array size: ");
    scanf("%i", &arr_length);

    arr = (int*)realloc(NULL, arr_length * sizeof(int));

    srand(time(NULL));
    for (int i = 0; i < arr_length; i++)
        arr[i] = rand() % 100 - 50;

    PrintArray(arr, arr_length);

    arr_length = Task1_DeleteMaxElement(arr, arr_length);

    PrintArray(arr, arr_length);

    getch();
    return 0;
}

void PrintArray(int *arr, int arr_length)
{
    printf("Printing array elements\n");
    for (int i = 0; i < arr_length; i++)
        printf("%i\t", arr[i]);
    printf("\n");
}

int Task1_DeleteMaxElement(int *arr, int arr_length)
{
    printf("Looking for max element for deletion...");

    int current_max = arr[0];
    for (int i = 0; i < arr_length; i++)
        if (arr[i] > current_max)
            current_max = arr[i];

    int *temp_arr = NULL;
    int temp_arr_length = 0;
    for (int j = 0; j < arr_length; j++)
        if (arr[j] < current_max)
        {
            temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int)); // if the initial array size is more than 4, the breakpoint triggers here
            temp_arr[temp_arr_length] = arr[j];
            temp_arr_length++;
        }

    arr = (int*)realloc(arr, temp_arr_length * sizeof(int));
    memcpy(arr, temp_arr, temp_arr_length);

    realloc(temp_arr, 0); // if the initial array size is 4 or less, the breakpoint triggers at this line

    return temp_arr_length;
}
My guess is VC++2010 is rightly detecting memory corruption, which is ignored by Borland C++ 3.1.
How does it work?
For example, when allocating memory for you, VC++2010's realloc could well "mark" the memory around it with some special value. If you write over those values, realloc detects the corruption, and then crashes.
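As an illustration of that mechanism, here is a small sketch using the MSVC debug CRT; _CrtCheckMemory only performs this checking in debug builds, and the specific overrun below is made up for demonstration:

#include <crtdbg.h>
#include <stdlib.h>

int main(void)
{
    int *p = (int*)malloc(2 * sizeof(int));
    p[2] = 42;            /* writes one element past the allocation */
    _CrtCheckMemory();    /* the debug CRT reports the damaged guard bytes */
    free(p);              /* the corruption may also be reported here */
    return 0;
}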
The fact it works with Borland C++ 3.1 is pure luck. This is a very very old compiler (20 years!), and thus, would be more tolerant/ignorant of this kind of memory corruption (until some random, apparently unrelated crash occurred in your app).
What's the problem with your code?
The source of your error:
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int))
For the following temp_arr_length values, on a 32-bit build, the allocation will be:
0 : 4 bytes = 1 int when you expect 1 (OK)
1 : 5 bytes = 1.25 ints when you expect 2 (Error!)
2 : 6 bytes = 1.5 ints when you expect 3 (Error!)
You got your operator precedence wrong. As you can see:
temp_arr_length + 1 * sizeof(int)
should be instead
(temp_arr_length + 1) * sizeof(int)
You allocated too little memory, and thus wrote well beyond what was allocated for you.
Edit (2012-05-18)
Hans Passant commented on the allocator diagnostics. I took the liberty of copying his comments here until he writes his own answer (I've already seen comments disappear on SO):
It is Windows that reminds you that you have heap corruption bugs, not VS. BC3 uses its own heap allocator so Windows can't see your code mis-behaving. Not noticing these bugs before is pretty remarkable but not entirely impossible.
[...] The feature is not available on XP and earlier. And sure, one of the reasons everybody bitched about Vista. Blaming the OS for what actually were bugs in the program. Win7 is perceived as a 'better' OS in no small part because Vista forced programmers to fix their bugs. And no, the Microsoft CRT has implemented malloc/new with HeapAlloc for a long time. Borland had a history of writing their own, beating Microsoft for a while until Windows caught up.
[...] the CRT uses a debug allocator like you describe, but it generates different diagnostics. Roughly, the debug allocator catches small mistakes, Windows catches gross ones.
I found the following links explaining what is done to memory by Windows/CRT allocators before and after allocation/deallocation:
http://www.codeguru.com/cpp/w-p/win32/tutorials/article.php/c9535/Inside-CRT-Debug-Heap-Management.htm
https://stackoverflow.com/a/127404/14089
http://www.nobugs.org/developer/win32/debug_crt_heap.html#table
The last link contains a table I printed and always have near me at work (it was this table I was searching for when I found the first two links).
If it is crashing in realloc, then you are overstepping into the bookkeeping memory of malloc and free.
The incorrect code is as below:
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int));
should be
temp_arr = (int*)realloc(temp_arr, (temp_arr_length + 1) * sizeof(int));
Due to the operator precedence of * over +, in the next run of the loop, when you expect realloc to be passed 8 bytes, it is passed only 5. So in your second iteration you are writing 3 bytes into someone else's memory, which leads to memory corruption and an eventual crash.
Also
memcpy(arr, temp_arr, temp_arr_length);
should be
memcpy(arr, temp_arr, temp_arr_length * sizeof(int) );
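Putting the two size fixes together, a sketch of the corrected function could look like this; only the sizes change, and everything else (including the shrinking realloc of arr) is left as in the question:

#include <stdlib.h>
#include <string.h>

int Task1_DeleteMaxElement(int *arr, int arr_length)
{
    int current_max = arr[0];
    for (int i = 0; i < arr_length; i++)
        if (arr[i] > current_max)
            current_max = arr[i];

    int *temp_arr = NULL;
    int temp_arr_length = 0;
    for (int j = 0; j < arr_length; j++)
        if (arr[j] < current_max)
        {
            /* (temp_arr_length + 1) elements, each of sizeof(int) bytes */
            temp_arr = (int*)realloc(temp_arr, (temp_arr_length + 1) * sizeof(int));
            temp_arr[temp_arr_length] = arr[j];
            temp_arr_length++;
        }

    arr = (int*)realloc(arr, temp_arr_length * sizeof(int));
    memcpy(arr, temp_arr, temp_arr_length * sizeof(int)); /* copy bytes, not element count */
    free(temp_arr); /* clearer than realloc(temp_arr, 0) for releasing the buffer */

    return temp_arr_length;
}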
I am trying to generate a comprehensive callgraph (complete with low-level calls into Linux, the runtime, the lot).
I have statically compiled my source files with "-fdump-rtl-expand" and created RTL files, which I passed to a Perl script called egypt (which I believe uses Graphviz/Dot), and generated a PDF file of the callgraph. This works perfectly, no problems at all.
Except, there are calls being made into some libraries that are shown as <built-in>. Is there a way for the callgraph not to print <built-in> and instead show the real calls made into the libraries?
Please let me know if the question is unclear.
http://i.imgur.com/sp58v.jpg
Basically, I am trying to keep the callgraph from generating <built-in> nodes.
Is there a way to do that ?
-------- CODE ---------
#include <cilk/cilk.h>
#include <stdio.h>
#include <stdlib.h>

unsigned long int t0, t5;
unsigned int NOSPAWN_THRESHOLD = 32;

int fib_nospawn(int n)
{
    if (n < 2)
        return n;
    else
    {
        int x = fib_nospawn(n-1);
        int y = fib_nospawn(n-2);
        return x + y;
    }
}

// spawning fibonacci function
int fib(long int n)
{
    long int x, y;
    if (n < 2)
        return n;
    else if (n <= NOSPAWN_THRESHOLD)
    {
        x = fib_nospawn(n-1);
        y = fib_nospawn(n-2);
        return x + y;
    }
    else
    {
        x = cilk_spawn fib(n-1);
        y = cilk_spawn fib(n-2);
        cilk_sync;
        return x + y;
    }
}

int main(int argc, char *argv[])
{
    int n;
    long int result;
    long int exec_time;

    n = atoi(argv[1]);
    NOSPAWN_THRESHOLD = atoi(argv[2]);

    result = fib(n);
    printf("%ld\n", result);

    return 0;
}
I compiled the Cilk Library from source.
I might have found a partial solution to the problem:
You need to pass the following option to egypt
--include-external
This produced a slightly more comprehensive callgraph, although the <built-in> node is still visible:
http://i.imgur.com/GWPJO.jpg?1
Can anyone suggest how I can get more depth in the callgraph?
You can use the GCC VCG Plugin: a gcc plugin which can be loaded when debugging gcc to show its internal structures graphically.
gcc -fplugin=/path/to/vcg_plugin.so -fplugin-arg-vcg_plugin-cgraph foo.c
The call-graph is the place to store data needed for inter-procedural optimization. All data structures are divided into three components: local_info that is produced while analyzing the function, global_info that is the result of global walking of the call-graph at the end of compilation, and rtl_info used by the RTL back-end to propagate data from already compiled functions to their callers.
I'm new to kernel programming and I'm trying to understand some basics of operating systems. I am trying to generate a delay using a technique which I've implemented successfully on a 20 MHz microcontroller.
I know this is a totally different environment, as I'm using Linux (CentOS) on my 2 GHz Core 2 Duo processor.
I've tried the following code, but I'm not getting a delay.
#include <linux/kernel.h>
#include <linux/module.h>

int init_module (void)
{
    unsigned long int i, j, k, l;

    for (l = 0; l < 100; l++)
    {
        for (i = 0; i < 10000; i++)
        {
            for (j = 0; j < 10000; j++)
            {
                for (k = 0; k < 10000; k++);
            }
        }
    }

    printk ("\nhello\n");
    return 0;
}

void cleanup_module (void)
{
    printk ("bye");
}
When I run dmesg as quickly as I can after inserting the module, the string "hello" is already there. If my calculation is right, the above code should give me a delay of at least 10 seconds.
Why is it not working? Is it something related to threading? How could a 2 GHz processor execute the above code instantly, without any noticeable delay?
The compiler is optimizing your loop away since it has no side effects.
To actually get a 10 second (non-busy) delay, you can do something like this:
#include <linux/sched.h>
//...
unsigned long to = jiffies + (10 * HZ); /* current time + 10 seconds */
while (time_before(jiffies, to))
{
    schedule();
}
or better yet:
#include <linux/delay.h>
//...
msleep(10 * 1000);
For short delays you may use mdelay, ndelay, and udelay (sketched below).
I suggest you read Linux Device Drivers, 3rd edition, chapter 7.3, which deals with delays, for more information.
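As a rough sketch of those short delays (they busy-wait, keeping the CPU spinning for the whole interval, so they are only appropriate for very short waits):

#include <linux/delay.h>
//...
/* Busy-wait delays: the CPU spins for the whole interval. */
ndelay(500);   /* about 500 nanoseconds  */
udelay(100);   /* about 100 microseconds */
mdelay(5);     /* about 5 milliseconds   */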
To answer the question directly: it's likely that your compiler sees that these loops don't do anything and "optimizes" them away.
As for the technique itself, it looks like you're trying to use up the processor to create a delay. While this may work, an OS is designed to make the most of processor time, and this just wastes it.
I understand it's experimental, but just a heads up.