I'm using prange for parallelizing a loop in my cython code. As this loop takes a long time, I want to print its progress as it goes. A progress bar would be nice but anything that shows the progress would do. I tried to add some form of logging in the following fashion:
for index in prange(n, nogil = True, num_threads = 5):
    if index == (index//1000)*1000:
        printf("%d percent done", index*100/n)
This should print the progress whenever index % 1000 == 0.
The output of this is kind of random.
0 percent done
80 percent done
20 percent done
100 percent done
60 percent done
I assume that this is because prange does not assign threads to indexes starting from 0 and going up.
Is there any way to implement such a thing in Cython?
Thanks!
There are probably ways to achieve what you want in pure Cython. However, I find it easier to use OpenMP from C/C++ and to call that functionality from Cython (thanks to C-verbatim-code, available since Cython 0.28, this is quite convenient).
One thing is clear: to achieve your goal you need some kind of synchronization between threads, which might impact performance. My strategy is simple: every thread reports "done", all finished tasks are counted, and their number is reported from time to time. The counter of reported tasks is protected with a mutex/lock:
%%cython -c=/openmp --link-args=/openmp
cdef extern from * nogil:
    r"""
    #include <omp.h>
    #include <stdio.h>

    static omp_lock_t cnt_lock;
    static int cnt = 0;

    void reset(){
        omp_init_lock(&cnt_lock);
        cnt = 0;
    }

    void destroy(){
        omp_destroy_lock(&cnt_lock);
    }

    void report(int mod){
        omp_set_lock(&cnt_lock);
        // start protected code:
        cnt++;
        if(cnt%mod == 0){
            printf("done: %d\n", cnt);
        }
        // end protected code block
        omp_unset_lock(&cnt_lock);
    }
    """
    void reset()
    void destroy()
    void report(int mod)

from cython.parallel import prange

def prange_with_report(int n):
    reset()  # reset counter and init lock
    cdef int index
    for index in prange(n, nogil = True, num_threads = 5):
        report(n//10)
    destroy()  # release lock
Now calling: prange_with_report(100) would print done: 10\n, done: 20\n, ..., done: 100\n.
As a slight optimization, report could be called not for every index but only for example for index%100==0 - there will be less impact on the performance but also the reporting will be less precise.
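For comparison, here is a minimal C sketch of the same counting idea that uses an OpenMP atomic update instead of a lock. It is not the code from the answer above: run_with_progress and its parameters are hypothetical names, and the loop body is only a placeholder for the real work.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: runs n placeholder iterations in parallel and prints a
 * line every time `step` more iterations have completed.
 * Returns the number of completed iterations. */
long run_with_progress(long n, long step)
{
    assert(n >= 0 && step > 0);
    long done = 0;

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        /* ... the real work for index i would go here ... */

        long mine;
        /* Atomically increment the shared counter and capture the new value,
         * so no two threads ever see the same count. */
        #pragma omp atomic capture
        mine = ++done;

        if (mine % step == 0) {
            printf("done: %ld of %ld\n", mine, n);
        }
    }
    return done;
}
```

Build with OpenMP enabled (for example -fopenmp with gcc). The counter is the number of finished iterations, not the loop index, which is why the reported values increase even though the indexes are processed out of order.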
I have developed a distributed-memory MPI application which involves processing a grid. Now I want to apply shared-memory techniques with OpenMP (essentially making it a hybrid parallel program) to see if it can become any faster or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application prints the grid to the screen every half a second, but when I parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random or unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
    //#pragma omp parallel
    updateScreen();
    int y, x;
    //#pragma omp parallel shared(grid) // garbage
    //#pragma omp parallel private(y) // garbage output!
    //#pragma omp for
    for (y = 0; y < nrows; y++) {
        //#pragma omp parallel shared(grid) // nothing?
        //#pragma omp parallel private(x) // 10 times slower!
        for (x = 0; x < ncolumns; x++) {
            printf("%c ", grid[y][x]);
        }
        printf("\n");
    }
    printf("Gen #%d\n", ngen);
    fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see, I have tried many approaches, and each is worse than the last. Best case, I get semi-proper output every 2 seconds (because it refreshes very slowly). Worst case, I get garbage output.
I would appreciate any help. Is there a place where I can find more information on how to properly parallelize loops with OpenMP? Thanks in advance.
"The function is executed by only one process, which makes it a perfect target for thread parallelization."
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else you obtain a garbage result, as you experienced (since the elements of your grid are printed in an order that means nothing). Your attempts at parallelizing were actually pretty good; the problem is that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from distributing the workload across multiple cores. To do that with maximum efficiency, the workloads need to be independent, or at least share as little state as possible. That is not the case here, since the calls to printf need to happen in a specific order.
When you try to parallelize work that is intrinsically sequential, you lose more time synchronizing your workers (your OpenMP threads) than you gain by parallelizing the work itself (which is why your timings get worse as your output gets more correct).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
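One way to keep the output ordered while still using several threads is to split formatting from printing. The sketch below is only an illustration under assumptions, not part of the original code: format_grid is a hypothetical helper that writes each row into its own fixed slice of a caller-provided buffer (so rows are independent), after which a single thread emits the whole buffer with one call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats the grid as "c c c \n" rows into `out`.
 * Every row occupies exactly 2*ncolumns + 1 bytes, so each thread writes to a
 * disjoint slice and no synchronization is needed.
 * `out` must hold at least nrows * (2*ncolumns + 1) bytes.
 * Returns the number of bytes written (the buffer is not NUL-terminated). */
size_t format_grid(char **grid, int nrows, int ncolumns, char *out)
{
    assert(nrows >= 0 && ncolumns >= 0);
    size_t row_len = 2 * (size_t)ncolumns + 1;

    #pragma omp parallel for
    for (int y = 0; y < nrows; y++) {
        char *p = out + (size_t)y * row_len;  /* start of this row's slice */
        for (int x = 0; x < ncolumns; x++) {
            *p++ = grid[y][x];
            *p++ = ' ';
        }
        *p = '\n';
    }
    return (size_t)nrows * row_len;
}
```

The caller would then print sequentially, for example with fwrite(out, 1, len, stdout) followed by fflush(stdout). For a grid that fits on a console screen the formatting is so cheap that this is unlikely to beat the plain sequential loop; the point is only that the ordered part (the write) stays on one thread.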
An example:
Here is a very basic example of parallelizing a program with OpenMP that achieves speedup. A dummy (yet heavy) computation is performed for each value of the i variable. The computations in each iteration are completely independent, so the different threads can carry them out independently. The calls to printf can happen in any order, since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i,j;
    double x=0;
    for(i=0; i < 100; i++)
    {
        x = 100000 * fabs(cos(i*i));
        for(j=0;j<100+i*20000;j++)
            x += sqrt(sqrt(543*j)*fabs(sin(j)));
        printf("Computed i=%2d [%g]\n",i,x);
    }
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main()
{
    int i,j;
    double x=0;
    /* x must be private: otherwise all threads write the same shared variable (a data race) */
    #pragma omp parallel for private(x)
    for(i=0; i < 100; i++)
    {
        /* Dummy heavy computation */
        x = 100000 * fabs(cos(i*i));
        /* Nested region: typically runs on one thread unless nested parallelism is enabled */
        #pragma omp parallel for reduction(+: x)
        for(j=0;j<100+i*20000;j++)
            x += sqrt(sqrt(543*j)*fabs(sin(j)));
        printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
    }
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/
I have to calculate the sum of the elements in a bidimensional matrix, using a separate thread to calculate the sum of each row. Then the main thread adds up these sums printing the final result.
Can you guys see what's wrong?
(I'm all new to the threads stuff)
#include <pthread.h>
#include <stdio.h>

void sumR(void* _a,int m,int n,int sum)
{
    int i;
    int (*a)[m]=_a;
    for(i=1;i<=n;i++)
        sum+=a[n][i];
}

int main()
{
    int a[20][20],sum1,sum;
    int m=3,n=3,k=3,i,j;
    for(i=1;i<=m;i++)
    {
        k=k+3;
        for(j=1;j<=n;j++)
            a[i][j]=k;
    }
    sum1=0;
    for(i=1;i<=m;i++)
    {
        sum=0;
        pthread_t th;
        pthread_create(&th,NULL,&sumR,&a,&m,&n,&sum);
        sum1+=sum;
        pthread_join(&th,NULL);
    }
    printf("Sum of the matrix is: %d",sum1);
    return 0;
}
One problem I see is that your loop does essentially this:
    for each row
        start thread
        add thread's sum to total
        wait for thread to exit
That's not going to work because you're adding the thread's sum before the thread is done calculating it. You need to wait for the thread to finish:
    start thread
    wait for thread to exit
    add thread's sum to total
However, that model doesn't take advantage of multiple threads. You only have one thread running at a time.
What you need to do is create all of the threads and store them in an array. Then wait for each thread to exit and add its sum to the total. Something like:
    for i = 0 to num_threads-1
        pthread_create(&threads[i], NULL, start_routine, &sums[i])
And then
    for i = 0 to num_threads-1
        pthread_join(threads[i], NULL);
        sum += sums[i];
That way, all of your threads are running at the same time, and you harvest the result only when the thread is done.
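The create-all-then-join-all pattern described above could look roughly like this in C. This is a sketch under assumptions rather than a drop-in fix for the question's code: struct row_task, sum_row, matrix_sum and the fixed ROWS/COLS sizes are names made up for the illustration, and rows are indexed from 0.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ROWS 3
#define COLS 3

/* Hypothetical per-thread argument: which row to sum, and a slot for the result. */
struct row_task {
    const int *row;
    int ncols;
    int sum;
};

/* Thread start routine: pthread_create requires the signature void *(*)(void *). */
static void *sum_row(void *arg)
{
    struct row_task *task = arg;
    task->sum = 0;
    for (int j = 0; j < task->ncols; j++)
        task->sum += task->row[j];
    return NULL;
}

/* Sums all elements using one thread per row; returns 0 on success, -1 on error. */
int matrix_sum(int a[ROWS][COLS], int *total)
{
    pthread_t threads[ROWS];
    struct row_task tasks[ROWS];
    int started = 0, status = 0;

    assert(total != NULL);

    /* First start every thread... */
    for (int i = 0; i < ROWS; i++) {
        tasks[i].row = a[i];
        tasks[i].ncols = COLS;
        tasks[i].sum = 0;
        if (pthread_create(&threads[i], NULL, sum_row, &tasks[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* ...then join each one and only afterwards read its result. */
    *total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        *total += tasks[i].sum;
    }
    return status;
}
```

Each thread gets its own struct row_task, so no two threads write to the same memory, and the main thread reads a sum only after the corresponding pthread_join has returned.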
Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>

int main() {
    double start_time, run_time;
    omp_set_num_threads(4);
    start_time = omp_get_wtime();
    #pragma omp parallel
    #pragma omp for schedule(static)
    for (int n = 0; n < 100; n++){
        printf("square of %d=%d\n", n, n*n);
        printf("cube of %d=%d\n", n, n*n*n);
        int ID = omp_get_thread_num();
        printf("Thread(%d) \n", ID);
    }
    run_time = omp_get_wtime() - start_time;
    printf("Time Elapsed (%f)", run_time);
    getchar();
}
First, you need a loop where the distribution makes a difference. Your loop has only 100 iterations, so the OpenMP scheduler decides only 100 times which iteration a thread gets next, and that takes no measurable time. The printf output, by contrast, takes very long, so in your code it makes no difference which schedule is used. It is better to write a loop without console output and with a very high iteration count, like
#pragma omp parallel
{
    #pragma omp for schedule(static) private(value)
    for (int i = 0; i < 100000000; i++) {
        value = ...
    }
}
Finally, the loop body must compute a result that is used after the loop, for example in a printf. Otherwise the compiler may optimize the body away, because a value that is never used later is not needed. You can then restrict the time measurements to the parallel region, without the output of the results.
If your iterations all take nearly the same time, a static distribution should be fastest. If they differ a lot, the dynamic and guided schedules should come out ahead in your measurements.
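A minimal sketch of such a measurement is shown below. It is an illustration under assumptions, not a reference solution to the exercise: work() is a made-up stand-in for the "delay" with a deliberately uneven cost, and run_static, run_dynamic, run_guided and time_schedule are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds with OpenMP; falls back to CPU time when built without it. */
static double now(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical stand-in for the delay: the cost depends on i, so iterations are uneven. */
static long long work(int i)
{
    long long s = 0;
    for (int k = 0; k < i % 1000; k++)
        s += (long long)k * i % 7;
    return s;
}

long long run_static(int n)
{
    long long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    return total;
}

long long run_dynamic(int n)
{
    long long total = 0;
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    return total;
}

long long run_guided(int n)
{
    long long total = 0;
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    return total;
}

/* Times one variant; printing the result keeps the loop from being optimized away. */
void time_schedule(const char *name, long long (*run)(int), int n)
{
    assert(n > 0);
    double start = now();
    long long result = run(n);
    printf("%-8s n=%d result=%lld time=%f s\n", name, n, result, now() - start);
}
```

A driver could call time_schedule("static", run_static, n), and likewise for the other two, for several values of n after omp_set_num_threads(4). The three variants must return the same total, which is a cheap sanity check that the schedules differ only in timing.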
I'm optimizing some instrumentation for my project (Linux,ICC,pthreads), and would like some feedback on this technique to assign a unique index to a thread, so I can use it to index into an array of per-thread data.
The old technique uses a std::map based on pthread id, but I'd like to avoid locks and a map lookup if possible (it is creating a significant amount of overhead).
Here is my new technique:
static PerThreadInfo info[MAX_THREADS]; // shared, each index is per thread
// Allow each thread a unique sequential index, used for indexing into per
// thread data.
1:static size_t GetThreadIndex()
2:{
3: static size_t threadCount = 0;
4: __thread static size_t myThreadIndex = threadCount++;
5: return myThreadIndex;
6:}
later in the code:
// add some info per thread, so it can be aggregated globally
info[ GetThreadIndex() ] = MyNewInfo();
So:
1) It looks like line 4 could be a race condition if two threads were created at exactly the same time. If so, how can I avoid this (preferably without locks)? I can't see how an atomic increment would help here.
2) Is there a better way to create a per-thread index somehow? Maybe by pre-generating the TLS index on thread creation somehow?
1) An atomic increment would actually help here. The possible race is two threads reading the same counter value and assigning the same ID to themselves, so the race disappears once the increment (read number, add 1, store number) happens atomically and each thread uses the value handed back by that atomic operation. On Intel a lock-prefixed xadd does the trick (a plain "lock; inc" increments atomically but does not return the value), or use whatever your platform offers (like InterlockedIncrement() on Windows, for example).
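As an illustration of point 1), here is a minimal sketch in C11 that uses atomic_fetch_add from <stdatomic.h> together with a thread-local variable. It is a C analogue of the idea, not the asker's C++ code; get_thread_index, thread_count and the sentinel value are choices made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NO_INDEX ((size_t)-1)  /* sentinel: this thread has no index yet */

/* Shared counter of indexes handed out so far. */
static atomic_size_t thread_count;

/* Returns a unique, sequential index for the calling thread.
 * The first call in each thread claims the next index; later calls reuse it. */
size_t get_thread_index(void)
{
    /* C requires a constant initializer for thread-local objects,
     * so the index is assigned lazily on first use. */
    static _Thread_local size_t my_index = NO_INDEX;

    if (my_index == NO_INDEX) {
        /* fetch_add returns the previous value atomically, so two threads
         * can never receive the same index. */
        my_index = atomic_fetch_add(&thread_count, 1);
        assert(my_index != NO_INDEX);
    }
    return my_index;
}
```

The only shared state is the counter, and it is touched once per thread, so the steady-state cost is a thread-local read and a comparison.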
2) Well, you could actually make the whole info thread-local ("__thread static PerThreadInfo info;"), provided your only aim is to be able to access the data per-thread easily and under a common name. If you actually want it to be a globally accessible array, then saving the index as you do using TLS is a very straightforward and efficient way to do this. You could also pre-compute the indexes and pass them along as arguments at thread creation, as Kromey noted in his post.
Why so averse to using locks? Solving race conditions is exactly what they're designed for...
At any rate, you can use the 4th argument of pthread_create() to pass an argument to your threads' start routine. This way, your master thread can generate an incrementing counter as it launches the threads and pass the current value into each thread as it is created, giving each thread its unique index.
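A minimal sketch of that approach follows. The names struct worker_arg, worker, run_workers and MAX_THREADS are assumptions made for this illustration, and the "per-thread info" is reduced to a single size_t per slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

/* Hypothetical argument block handed to each thread at creation. */
struct worker_arg {
    size_t index;  /* unique index chosen by the launching thread */
    size_t *info;  /* shared array with one slot per index */
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    /* Each thread writes only its own slot, so no locking is needed. */
    arg->info[arg->index] = arg->index * 10;
    return NULL;
}

/* Starts n threads with indexes 0..n-1 and waits for all of them.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_workers(size_t *info, size_t n)
{
    pthread_t threads[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    size_t started = 0;
    int status = 0;

    assert(info != NULL && n <= MAX_THREADS);

    for (size_t i = 0; i < n; i++) {
        args[i].index = i;  /* the counter lives in the launching thread only */
        args[i].info = info;
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (size_t i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return status;
}
```

Because only the launching thread increments the counter, there is no race at all. The args array must stay alive until the threads are done with it, which is why it is joined before run_workers returns.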
I know you tagged this [pthreads], but you also mentioned the "old technique" of using std::map. This leads me to believe that you're programming in C++. In C++11 you have std::thread, and you can pass out unique indexes (id's) to your threads at thread creation time through an ordinary function parameter.
Below is an example HelloWorld that creates N threads, assigning each an index of 0 through N-1. Each thread does nothing but say "hi" and give its index:
#include <iostream>
#include <thread>
#include <mutex>
#include <vector>

inline void sub_print() {}

template <class A0, class ...Args>
void
sub_print(const A0& a0, const Args& ...args)
{
    std::cout << a0;
    sub_print(args...);
}

std::mutex&
cout_mut()
{
    static std::mutex m;
    return m;
}

template <class ...Args>
void
print(const Args& ...args)
{
    std::lock_guard<std::mutex> _(cout_mut());
    sub_print(args...);
}

void f(int id)
{
    print("This is thread ", id, "\n");
}

int main()
{
    const int N = 10;
    std::vector<std::thread> threads;
    for (int i = 0; i < N; ++i)
        threads.push_back(std::thread(f, i));
    for (auto i = threads.begin(), e = threads.end(); i != e; ++i)
        i->join();
}
My output:
This is thread 0
This is thread 1
This is thread 4
This is thread 3
This is thread 5
This is thread 7
This is thread 6
This is thread 2
This is thread 9
This is thread 8
I'm trying to write something which very quickly calculates random numbers and can be applied on multiple threads. My current code is:
/* Approximating PI using a Monte-Carlo method. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#define N 1000000000 /* As large as possible for increased accuracy */

double random_function(void);

int main(void)
{
    int i = 0;
    double X, Y;
    double count_inside_temp = 0.0, count_inside = 0.0;
    unsigned int th_id = omp_get_thread_num();
    #pragma omp parallel private(i, X, Y) firstprivate(count_inside_temp)
    {
        srand(th_id);
        #pragma omp for schedule(static)
        for (i = 0; i <= N; i++) {
            X = 2.0 * random_function() - 1.0;
            Y = 2.0 * random_function() - 1.0;
            if ((X * X) + (Y * Y) < 1.0) {
                count_inside_temp += 1.0;
            }
        }
        #pragma omp atomic
        count_inside += count_inside_temp;
    }
    printf("Approximation to PI is = %.10lf\n", (count_inside * 4.0)/ N);
    return 0;
}

double random_function(void)
{
    return ((double) rand() / (double) RAND_MAX);
}
This works, but from observing a resource manager I know it's not using all the threads. Does rand() work for multithreaded code? And if not, is there a good alternative? Many thanks. Jack
Is rand() thread safe? Maybe, maybe not:
"The rand() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe."
One test and good learning exercise would be to replace the call to rand() with, say, a fixed integer and see what happens.
The way I think of pseudo-random number generators is as a black box which takes an integer as input and returns an integer as output. For any given input the output is always the same, but there is no pattern in the sequence of numbers and the sequence is uniformly distributed over the range of possible outputs. (This model isn't entirely accurate, but it'll do.) The way you use this black box is to choose a starting number (the seed), then use each output value both in your application and as the input for the next call to the random number generator. There are two common approaches to designing an API:
Two functions, one to set the initial seed (e.g. srand(seed)) and one to retrieve the next value from the sequence (e.g. rand()). The state of the PRNG is stored internally in some sort of global variable. Generating a new random number either will not be thread safe (hard to tell, but the output stream won't be reproducible) or will be slow in multithreaded code (you end up with some serialization around the state value).
An interface where the PRNG state is exposed to the application programmer. Here you typically have three functions: init_prng(seed), which returns some opaque representation of the PRNG state; get_prng(state), which returns a random number and changes the state variable; and destroy_prng(state), which just cleans up allocated memory and so on. PRNGs with this type of API should all be thread safe and run in parallel with no locking (because you are in charge of managing the, now thread-local, state variable).
I generally write in Fortran and use Ladd's implementation of the Mersenne Twister PRNG (that link is worth reading). There are lots of suitable PRNG's in C which expose the state to your control. PRNG looks good and using this (with initialization and destroy calls inside the parallel region and private state variables) should give you a decent speedup.
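To make the second API style concrete, here is a minimal sketch of the Monte-Carlo loop with one private generator state per thread. The generator is a small xorshift64* function written out for the illustration; it is not the Mersenne Twister and not any particular library's API, and next_uniform, estimate_pi and the seeding constant are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Illustrative xorshift64* step: advances *state and returns a double in [0, 1).
 * The state must never be zero. */
static double next_uniform(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    /* Keep the top 53 bits and scale by 2^-53. */
    return (double)((x * 0x2545F4914F6CDD1DULL) >> 11) / 9007199254740992.0;
}

/* Estimates pi from n random points in the square [-1, 1] x [-1, 1]. */
double estimate_pi(long n)
{
    assert(n > 0);
    long inside = 0;

    #pragma omp parallel reduction(+:inside)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();  /* queried inside the parallel region */
#endif
        /* Private state: nonzero and different for every thread. */
        uint64_t state = 0x9E3779B97F4A7C15ULL * (uint64_t)(tid + 1);

        #pragma omp for schedule(static)
        for (long i = 0; i < n; i++) {
            double x = 2.0 * next_uniform(&state) - 1.0;
            double y = 2.0 * next_uniform(&state) - 1.0;
            if (x * x + y * y < 1.0)
                inside++;
        }
    }
    return 4.0 * (double)inside / (double)n;
}
```

Since each thread owns its state, no call ever touches shared data, and the reduction clause replaces the manual atomic update. Seeding by thread number is the simplest choice for a sketch; for serious work a generator designed for independent parallel streams would be a safer pick.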
Finally, it's often the case that PRNGs can be made to perform better if you ask for a whole sequence of random numbers in one go (e.g. the compiler can vectorize the PRNG internals). Because of this, libraries often have something like get_prng_array(state) functions which give you back an array full of random numbers, as if you had put get_prng in a loop filling the array elements; they just do it more quickly. This would be a second optimization (and would need an added for loop inside the parallel for loop). Obviously, you don't want to run out of per-thread stack space doing this!