Why does the time to execute function f1() changes from one run to another in debug mode? Why it's always zero in release mode?
I didn't include stdio.h nor cstdio and the code compiled. How ?
#include <iostream>
#include <ctime>
void f1()
{
for( int i = 0; i < 10000; i++ );
}
int main()
{
clock_t start, finish;
start = clock();
for( int i = 0; i < 100000; i++ ) f1();
finish = clock();
double duration = (double)(finish - start) / CLOCKS_PER_SEC;
printf( "Duration = %6.2f seconds\n", duration);
}
Possible the machine you're running your test code from is too fast. Try increasing the loop count to a really huge number.
Other things to try is to test with the sleep() function.
This should confirm the behavior of your clock() measurements.
I believe the reason you are seeing zero runtime for f1() in release mode is because the compiler is optimizing the function. Since your for loop doesn't have a code block, it can effectively be pulled out during compilation.
I'm guessing that this optimization is not performed in debug mode, which would explain why you see a longer execution time. It varies between runs simply because your OS scheduler (almost certainly) does not guarantee a fixed time slot for processes.
As for why you can use printf() when you have not explicitly included <cstdio>, it's because of the <iostream> include.
From looking my headers at C:\Program Files\Microsoft Visual Studio 10.0\VC\include, I can see that iostream includes istream and ostream, both of which include ios, which includes xlocnum, which includes both cstdlib and cstdio.
Related
I have developed a distributed memory MPI application which involves processing of a grid. Now i want to apply shared memory techniques (essentially making it a hybrid - parallel program), with OpenMP, to see if it can become any faster, or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application involves printing the grid to the screen every half a second, but when i parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
//#pragma omp parallel
updateScreen();
int y, x;
//#pragma omp parallel shared(grid) // garbage
//#pragma omp parallel private(y) // garbage output!
//#pragma omp for
for (y = 0; y < nrows; y++) {
//#pragma omp parallel shared(grid) // nothing?
//#pragma omp parallel private(x) // 10 times slower!
for (x = 0; x < ncolumns; x++) {
printf("%c ", grid[y][x]);
}
printf("\n");
}
printf("Gen #%d\n", ngen);
fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see i have tried many approaches and one is worse than the other. Best case, i get semi proper output every 2 seconds (because it refreshes very slowly). Worst case i get garbage output.
I would appreciate any help. Is there a place where i can find more information to proper parallelize loops with OpenMP? Thanks in advance.
The function is executed by only one process, which makes it a perfect target for thread parallelization.
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else, you're going to obtain a garbage result as your experienced (since the elements of your grid are going to be printed in an order that means nothing). Actually, your attempts at parallelizing were pretty good, the problem comes from the fact that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.
When you try to parallelize some work that is intrinsically sequential, you lose more time synchronizing your workers (your openmp threads), than you gain by parallizing the work itself (which is why you obtain crap time when your result gets better).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
An example :
Here is a very basic example of parallizing a program with openmp that achieves speedup. Here a dummy (yet heavy) computation is realized for each value of the i variable. The computations in each loop are completely independent, and the different threads can achieve their computations independently. The calls to printf can be achieved in whatever order since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i,j;
double x=0;
for(i=0; i < 100; i++)
{
x = 100000 * fabs(cos(i*i));
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Computed i=%2d [%g]\n",i,x);
}
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
int i,j;
double x=0;
#pragma omp parallel for
for(i=0; i < 100; i++)
{
/* Dummy heavy computation */
x = 100000 * fabs(cos(i*i));
#pragma omp parallel for reduction(+: x)
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
}
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/
I've reduced a crash to the following toy code:
// DLLwithOMP.cpp : build into a dll *with* /openmp
#include <tchar.h>
extern "C"
{
__declspec(dllexport) void funcOMP()
{
#pragma omp parallel for
for (int i = 0; i < 100; i++)
_tprintf(_T("Please fondle my buttocks\n"));
}
}
_
// ConsoleApplication1.cpp : build into an executable *without* /openmp
#include <windows.h>
#include <stdio.h>
#include <tchar.h>
typedef void(*tDllFunc) ();
int main()
{
HMODULE hDLL = LoadLibrary(_T("DLLwithOMP.dll"));
tDllFunc pDllFunc = (tDllFunc)GetProcAddress(hDLL, "funcOMP");
pDllFunc();
FreeLibrary(hDLL);
// At this point the omp runtime vcomp140[d].dll refcount is zero
// and windows unloads it, but the omp thread team remains active.
// A crash usually ensues.
return 0;
}
Is this an MS bug? Is there some OMP thread-cleanup API I missed (probably not, but maybe)? I don't have other compilers under hand. Do they treat this scenario differently? (again, probably not) Does the OMP standard has anything to say on such a scenario?
I got an answer from Eric Brumer # MS Connect. Re-posting it here in case it is of interest to anyone in the future:
for optimal performance, the openmp threadpool spin waits for about a
second prior to shutting down in case more work becomes available. If
you unload a DLL that's in the process of spin-waiting, it will crash
in the manner you see (most of the time).
You can tell openmp not to spin-wait and the threads will immediately
block after the loop finishes. Just set OMP_WAIT_POLICY=passive in
your environment, or call SetEnvironmentVariable(L"OMP_WAIT_POLICY",
L"passive"); in your function before loading the dll. The default is
"active" which tells the threadpool to spin wait. Use the environment
variable, or just wait a few seconds before calling FreeLibrary.
Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}
At first you need a loop, where the distribution makes a difference. The loop has 100 iterations, so the OpenMP schedule will only 100 times decide what is the next iteration for a thread what takes no mensurable time. The output with printf takes very long so in your code it makes no difference which schedule is used. Its better to make a loop without console output and a very high loop count like
#pragma omp parallel
{
#pragma omp for schedule(static) private(value)
for (int i = 0; i < 100000000; i++) {
value = ...
}
}
At last you have to write code in the loop which "result" is used after the loop with a printf for example. If not the body could be deleted by the compiler because of optimize the code (it is not used later so its not needed). You can concentrate the time measurings on the parallel pool without the output of the results.
If your iterations nearly takes the same time, then a static distribution should be faster. If they differ very much the dynamic and guided schedules should dominate your measurings.
quick question;
I'm using Ubuntu as my coding environment, and I am trying to write a C program for Windows for school.
The assignment says I have to do something using the system clock, and I decided to make a quick benchmarking program. Here it is:
#include <stdio.h>
#include <unistd.h>
#include <time.h>
int main () {
int i = 0;
int p = (int) getpid();
int n = 0;
clock_t cstart = clock();
clock_t cend = 0;
for (i=0; i<100000000; i++) {
long f = (((i+9)*99)%4)+(8+i*999);
if (i % p == 0)
printf("i=%d, f=%ld\n", i, f);
}
cend = clock();
printf ("%.3f cpu sec\n", ((double)cend - (double)cstart)* 1.0e-6);
return 0;
}
When I cross compile from Ubuntu to Windows using mingw32, it's fine. However, when I run the program in Windows, two issues happen:
The benchmark runs as expected, and takes roughly 5 seconds, yet the timer says it took 0.03 seconds (this doesnt happen when testing in my Ubuntu VM. If the benchmark takes 5 seconds in real time, the timer will say 5 seconds. So obviously, this is an issue.)
Then, once the program is done, the Windows terminal will close immediately.
How do I make the program stay open so you can look at your time for more than like 10 milliseconds, and how can I make the runtime of the benchmark reflect it's score like it does when I test in Ubuntu?
Thanks!
i'm new to kernel programming and i'm trying to understand some basics of OS. I am trying to generate a delay using a technique which i've implemented successfully in a 20Mhz microcontroller.
I know this is a totally different environment as i'm using linux centOS in my 2 GHz Core 2 duo processor.
I've tried the following code but i'm not getting a delay.
#include<linux/kernel.h>
#include<linux/module.h>
int init_module (void)
{
unsigned long int i, j, k, l;
for (l = 0; l < 100; l ++)
{
for (i = 0; i < 10000; i ++)
{
for ( j = 0; j < 10000; j ++)
{
for ( k = 0; k < 10000; k ++);
}
}
}
printk ("\nhello\n");
return 0;
}
void cleanup_module (void)
{
printk ("bye");
}
When i dmesg after inserting the module as quickly as possile for me, the string "hello" is already there. If my calculation is right, the above code should give me atleast 10 seconds delay.
Why is it not working? Is there anything related to threading? How could a 20 Ghz processor execute the above code instantly without any noticable delay?
The compiler is optimizing your loop away since it has no side effects.
To actually get a 10 second (non-busy) delay, you can do something like this:
#include <linux/sched.h>
//...
unsigned long to = jiffies + (10 * HZ); /* current time + 10 seconds */
while (time_before(jiffies, to))
{
schedule();
}
or better yet:
#include <linux/delay.h>
//...
msleep(10 * 1000);
for short delays you may use mdelay, ndelay and udelay
I suggest you read Linux Device Drivers 3rd edition chapter 7.3, which deals with delays for more information
To answer the question directly, it's likely your compiler seeing that these loops don't do anything and "optimizing" them away.
As for this technique, what it looks like you're trying to do is use all of the processor to create a delay. While this may work, an OS should be designed to maximize processor time. This will just waste it.
I understand it's experimental, but just the heads up.