OpenMP: poor performance of heap arrays (stack arrays work fine) - multithreading

I am a fairly experienced OpenMP user, but I have just run into a puzzling problem, and I am hopeful that someone here could help. The problem is that a simple hashing algorithm performs well for stack-allocated arrays, but poorly for arrays on the heap.
Example below uses i%M (i modulus M) to count every M-th integer in respective array element. For simplicity, imagine N=1000000, M=10. If N%M==0, then the result should be that every element of bins[] is equal to N/M:
#pragma omp for
for (int i=0; i<N; i++)
bins[ i%M ]++;
Array bins[] is private to each thread (I sum results of all threads in a critical section afterwards).
When bins[] is allocated on the stack, the program works great, with performance scaling proportionally to the number of cores.
However, if bins[] is on the heap (pointer to bins[] is on the stack), performance drops drastically. And that is a major problem!
I want parallelize binning (hashing) of certain data into heap arrays with OpenMP, and this is a major performance hit.
It is definitely not something silly like all threads trying to write into the same area of memory.
That is because each thread has its own bins[] array, results are correct with both heap- and stack-allocated bins, and there is no difference in performance for single-thread runs.
I reproduced the problem on different hardware (Intel Xeon and AMD Opteron), with GCC and Intel C++ compilers. All tests were on Linux (Ubuntu and RedHat).
There seems no reason why good performance of OpenMP should be limited to stack arrays.
Any guesses? Maybe access of threads to the heap goes through some kind of shared gateway on Linux? How do I fix that?
Complete program to play around with is below:
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int main(const int argc, const char* argv[])
{
const int N=1024*1024*1024;
const int M=4;
double t1, t2;
int checksum=0;
printf("OpenMP threads: %d\n", omp_get_max_threads());
//////////////////////////////////////////////////////////////////
// Case 1: stack-allocated array
t1=omp_get_wtime();
checksum=0;
#pragma omp parallel
{ // Each openmp thread should have a private copy of
// bins_thread_stack on the stack:
int bins_thread_stack[M];
for (int j=0; j<M; j++) bins_thread_stack[j]=0;
#pragma omp for
for (int i=0; i<N; i++)
{ // Accumulating every M-th number in respective array element
const int j=i%M;
bins_thread_stack[j]++;
}
#pragma omp critical
for (int j=0; j<M; j++) checksum+=bins_thread_stack[j];
}
t2=omp_get_wtime();
printf("Time with stack array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
//////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////
// Case 2: heap-allocated array
t1=omp_get_wtime();
checksum=0;
#pragma omp parallel
{ // Each openmp thread should have a private copy of
// bins_thread_heap on the heap:
int* bins_thread_heap=(int*)malloc(sizeof(int)*M);
for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
for (int i=0; i<N; i++)
{ // Accumulating every M-th number in respective array element
const int j=i%M;
bins_thread_heap[j]++;
}
#pragma omp critical
for (int j=0; j<M; j++) checksum+=bins_thread_heap[j];
free(bins_thread_heap);
}
t2=omp_get_wtime();
printf("Time with heap array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
//////////////////////////////////////////////////////////////////
return 0;
}
Sample outputs of the program are below:
for OMP_NUM_THREADS=1
OpenMP threads: 1
Time with stack array: 2.973 sec, checksum=1073741824 (must be 1073741824).
Time with heap array: 3.091 sec, checksum=1073741824 (must be 1073741824).
and for OMP_NUM_THREADS=10
OpenMP threads: 10
Time with stack array: 0.329 sec, checksum=1073741824 (must be 1073741824).
Time with heap array: 2.150 sec, checksum=1073741824 (must be 1073741824).
I would very much appreciate any help!

This is a cute problem: with the code as above (gcc4.4, Intel i7) with 4 threads I get
OpenMP threads: 4
Time with stack array: 1.696 sec, checksum=1073741824 (must be 1073741824).
Time with heap array: 5.413 sec, checksum=1073741824 (must be 1073741824).
but if I change the malloc line to
int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
(Update: or even
int* bins_thread_heap=(int*)malloc(sizeof(int)*16);
)
then I get
OpenMP threads: 4
Time with stack array: 1.578 sec, checksum=1073741824 (must be 1073741824).
Time with heap array: 1.574 sec, checksum=1073741824 (must be 1073741824).
The problem here is false sharing. The default malloc is being very (space-) efficient, and putting the requested small allocations all in one block of memory, next to each other; but since the allocations are so small that multiple fit in the same cache line, that means every time one thread updates its values, it dirties the cache line of the values in neighbouring threads. By making the requested memory large enough, this is no longer an issue.
Incidentally, it should be clear why the stack-allocated case does not see this problem; different threads - different stacks - memory far enough appart that false sharing isn't an issue.
As a side point -- it doesn't really matter for M of the size you're using here, but if your M (or number of threads) were larger, the omp critical would be a big serial bottleneck; you can use OpenMP reductions to sum the checksum more efficiently
#pragma omp parallel reduction(+:checksum)
{ // Each openmp thread should have a private copy of
// bins_thread_heap on the heap:
int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
for (int i=0; i<N; i++)
{ // Accumulating every M-th number in respective array element
const int j=i%M;
bins_thread_heap[j]++;
}
for (int j=0; j<M; j++)
checksum+=bins_thread_heap[j];
free(bins_thread_heap);
}

The initial question implied that heap arrays are slower than stack arrays. Unfortunately the reason for this slowness related to a particular case of cache line clashes in multi-threaded applications. It does not justify the implication that in general heap arrays are slower than stack arrays.
For most cases, there is no significant difference in performance, especially where the arrays are much larger than cache line size. The opposite can often be the case, as the use of allocatable heap arrays, targeted to the size required can lead to performance advantages over larger fixed size arrays, which demand more memory transfers.

Related

How to make a thread wait another to finish using OpenMP threads?

I want to make the following loop that fills the A matrix parallel. For every A[i][j] element that is calculated I want the price in A[i-1][j], A[i-1][j -1] and A[i0][j-1] to have been calculated first. So my thread has to wait for the threads in these positions to have calculated their results. I've tried to achieve this like this:
#pragma omp parallel for num_threads(threadcnt) \
default(none) shared(A, lenx, leny) private(j)
for (i=1; i<lenx; i++)
{
for (j=1; j<leny; j++)
{
do
{
} while (A[i-1][j-1] == -1 || A[i-1][j] == -1 || A[i][j-1] == -1);
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
My A matrix is initialized in -1 so if A[][] equals to -1 the operation in this cell is not completed. It takes more time than the serial program though.. Any idea to avoid the while loop?
The waiting loop seems sub-optimal. Apart from burning cores that are spin-waiting, you will also need a plethora of well-placed flush directives to make this code work.
One alternative, especially in the context of a more general parallelization scheme would be to use tasks and task dependences to model the dependences between the different array elements:
#pragma omp parallel
#pragma omp single
for (i=1; i<lenx; i++) {
for (j=1; j<leny; j++) {
#pragma omp task depend(in:A[i-1][j-1],A[i-1][j],A[i][j-1]) depend(out:A[i][j])
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
You may want to think about block the matrix updates, so that each task receives a block of the matrix instead of a single element, but the general idea will remain the same.
Another useful OpenMP feature could be the ordered construct and it's ability to adhere to exactly this kind of data dependency:
#pragma omp parallel for
for (int i=1; i<lenx; i++) {
for (int j=1; j<leny; j++) {
#pragma omp ordered depend(source)
#pragma omp ordered depend(sink:i-1,j-1)
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
PS: The code above is untested, but it should get the rough idea across.
Your solution cannot work. As A is initialized to -1, and A[0][j] is never modified, if i==1, it will test A[1-1][j] and will always fail. BTW, if A is initiliazed to -1, how cannot you have anything but -1 with the max?
When you have dependencies problem, there are two solutions.
First one is to sequentialize your code. Openmp has the ordered directive to do that, but the drawback is that you loose parallelism (while still paying thread creation cost). Openmp 4.5 has a way to describe dependencies with the depend and sink/source statements, but I do not know how efficient can the compiler be to deal with that. And my compilers (gcc 7.3 or clang 6.0) do not support this feature.
Second solution is to modify your algorithm to suppress dependencies. Now, you are computing the maximum of all values that are at the left or above a given element. Lets turn it to a simpler problem. Compute the maximum of all values at the left of a given element. We can easily parallelize it by computing on the different rows, as there no interrow dependency.
// compute b[i][j]=max_k<j(a[i][k]
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=0;j<n;j++){
// max per row
if((j!=0)) b[i][j]=max(a[i][j],b[i][j-1]);
else b[i][j]=a[i][j]; // left column initialised to value of a
}
}
Consider another simple problem, to compute the prefix maximum on the different columns. It is again easy to parallelize, but this time on the inner loop, as there is not inter-column dependency.
// compute c[i][j]=max_i<k(b[k,j])
for(int i=0;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
if((i!=0)) c[i][j]=max(b[i][j],c[i-1][j]);
else c[i][j]=b[i][j]; // upper row initialised to value of b
}
}
Now you just have to chain these computations to get the expected result. Here is the final code (with a unique array used and some cleanup in the code).
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=1;j<n;j++){
// max per row
a[i][j]=max(a[i][j],a[i][j-1]);
}
}
for(int i=1;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
a[i][j]=max(a[i][j],a[i-1][j]);
}
}

Proper / Efficient parallelization of a for loop with OpenMP

I have developed a distributed memory MPI application which involves processing of a grid. Now i want to apply shared memory techniques (essentially making it a hybrid - parallel program), with OpenMP, to see if it can become any faster, or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application involves printing the grid to the screen every half a second, but when i parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
//#pragma omp parallel
updateScreen();
int y, x;
//#pragma omp parallel shared(grid) // garbage
//#pragma omp parallel private(y) // garbage output!
//#pragma omp for
for (y = 0; y < nrows; y++) {
//#pragma omp parallel shared(grid) // nothing?
//#pragma omp parallel private(x) // 10 times slower!
for (x = 0; x < ncolumns; x++) {
printf("%c ", grid[y][x]);
}
printf("\n");
}
printf("Gen #%d\n", ngen);
fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see i have tried many approaches and one is worse than the other. Best case, i get semi proper output every 2 seconds (because it refreshes very slowly). Worst case i get garbage output.
I would appreciate any help. Is there a place where i can find more information to proper parallelize loops with OpenMP? Thanks in advance.
The function is executed by only one process, which makes it a perfect target for thread parallelization.
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else, you're going to obtain a garbage result as your experienced (since the elements of your grid are going to be printed in an order that means nothing). Actually, your attempts at parallelizing were pretty good, the problem comes from the fact that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.
When you try to parallelize some work that is intrinsically sequential, you lose more time synchronizing your workers (your openmp threads), than you gain by parallizing the work itself (which is why you obtain crap time when your result gets better).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
An example :
Here is a very basic example of parallizing a program with openmp that achieves speedup. Here a dummy (yet heavy) computation is realized for each value of the i variable. The computations in each loop are completely independent, and the different threads can achieve their computations independently. The calls to printf can be achieved in whatever order since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i,j;
double x=0;
for(i=0; i < 100; i++)
{
x = 100000 * fabs(cos(i*i));
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Computed i=%2d [%g]\n",i,x);
}
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
int i,j;
double x=0;
#pragma omp parallel for
for(i=0; i < 100; i++)
{
/* Dummy heavy computation */
x = 100000 * fabs(cos(i*i));
#pragma omp parallel for reduction(+: x)
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
}
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/

Dividing a loop task among threads using OpenMP

Suppose I want to run the following loop between threats (say 4 threads) such that each thread is in charge of calculating (N/4) where N is the number of rows of the matrix.
#pragma omp parallel num_threads(4) private(i,j,M) shared(Matrix)
{
#pragma omp for schedule(static)
for(i=0; i<N; i++)
{
for(j=0; j<N; j++)
{
M[i][j]= Matrix[i][j] + (Matrix[i][j] * Matrix[j][i]);
}
}
}
My question is: Should I explicitly specify the beginning and the end of each chunk of the matrix that each thread will calculate or OpenMP will automatically distribute the job between the threads? The reason behind this question is that I've read somewhere that OpenMP will automatically distribute the job between the threads but when I implemented it, it gave me fault segmentation error.
Thank you.

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}
At first you need a loop, where the distribution makes a difference. The loop has 100 iterations, so the OpenMP schedule will only 100 times decide what is the next iteration for a thread what takes no mensurable time. The output with printf takes very long so in your code it makes no difference which schedule is used. Its better to make a loop without console output and a very high loop count like
#pragma omp parallel
{
#pragma omp for schedule(static) private(value)
for (int i = 0; i < 100000000; i++) {
value = ...
}
}
At last you have to write code in the loop which "result" is used after the loop with a printf for example. If not the body could be deleted by the compiler because of optimize the code (it is not used later so its not needed). You can concentrate the time measurings on the parallel pool without the output of the results.
If your iterations nearly takes the same time, then a static distribution should be faster. If they differ very much the dynamic and guided schedules should dominate your measurings.

Why would a VC++ program that is storing 5MB of data consume 64MB of system memory?

I have been working on trying to figure out why my program is consuming so much system RAM. I'm loading a file from disk into a vector of structs of several dynamically allocated arrays. A 16MB file ends up consuming 280MB of system RAM according to task manager. The types in the file are mostly chars with some shorts and a few longs. There are 331,000 records in the file containing on average about 5 fields. I converted the vector to a struct and that reduced the memory to about 255MB but that still seems very high. With the vector taking up so much memory the program is running out of memory so I need to find a way to get the memory usage more reasonable.
I wrote a simple program to just stuff a vector (or array) with 1,000,000 char pointers. I would expect it to allocate 4+1 bytes for each giving 5MB of memory required for storage, but in fact it is using 64MB (array version) or 67MB (vector version). When the program first starts up it only consumes 400K so why is there an additional 59MB for array or 62MB for vectors being allocated? This extra memory seems to be for each container, so if I create a size_check2 and copy everything and run it the program uses up 135MB for 10MB worth of pointers and data.
Thanks in advance,
size_check.h
#pragma once
#include <vector>
class size_check
{
public:
size_check(void);
~size_check(void);
typedef unsigned long size_type;
void stuff_me( unsigned int howMany );
private:
size_type** package;
// std::vector<size_type*> package;
size_type* me;
};
size_check.cpp
#include "size_check.h"
size_check::size_check(void)
{
}
size_check::~size_check(void)
{
}
void size_check::stuff_me( unsigned int howMany )
{
package = new size_type*[howMany];
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type;
*me = 33;
package[i] = me;
// package.push_back( me );
}
}
main.cpp
#include "size_check.h"
int main( int argc, char * argv[ ] )
{
const unsigned int buckets = 20;
const unsigned int size = 50000;
size_check* me[buckets];
for( unsigned int i = 0; i < buckets; ++i )
{
me[i] = new size_check();
me[i]->stuff_me( size );
}
printf( "done.\n" );
}
In my test using VS2010, a debug build had a working set size of 52,500KB. But a release build had a working set
size of 20,944KB.
Debug builds will usually use more memory than optimized builds due to the debug heap manager doing things like creating memory fences.
In release builds, I suspect that the heap manager reserves more memory than you are actually using as a performance optimization.
Memory Leak
package = new size_type[howMany]; // instantiate 50,000 size_type's
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type; // Leak: results in an extra 50k size_type's being instantiated
*me = 33;
package[i] = *me; // Set a non-pointer to what is at the address of pointer "me"
// Would package[i] = 33; not suffice?
}
Furthermore, make sure you've compiled in release mode
There might be a couple reasons why you're seeing such a large memory footprint from your test program. Inside your
void size_check::stuff_me( unsigned int howMany )
{
This method is always getting called with howMany = 50000.
package = new size_type[howMany];
Assuming this is on a 32-bit setup the above statement will allocate 50,000 * 4 bytes.
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type;
The above will allocate new storage on each iteration of the loop. Since this loops 50,000 and the allocation never gets deleted that effectively takes up another 50,000 * 4 bytes upon loop completion.
*me = 33;
package[i] = *me;
}
}
Lastly, since stuff_me() gets called 20 times from main() your program would have allocated at least ~8Mbytes upon completion. If this is on a 64-bit system than the footprint will likely double since sizeof(long) == 8bytes.
The increase in memory consumption could have something to do with the way VS implements dynamic allocation. For performance reasons, it's possible that due to the multiple calls to new your program is reserving extra memory so as to avoid hitting up the OS everytime it needs more.
FYI, when I ran your test program on mingw-gcc 4.5.2, the memory consumption was ~20Mbytes -- much lower than what you were seeing but still a substantial amount. If I changed the stuff_me method to this:
void size_check::stuff_me( unsigned int howMany )
{
package = new size_type[howMany];
size_type *me = new size_type;
for( unsigned int i = 0; i < howMany; ++i )
{
*me = 33;
package[i] = *me;
}
delete me;
}
memory consumption goes down quite a bit down to ~4-5mbytes.
I think I found the answer by delving into the new statement. In debug builds there are two items that are created when you do a new. One is _CrtMemBlockHeader which is 32 bytes in length. The other is noMansLand (a memory fence) with a size of 4 bytes which gives us an overhead of 36 bytes for each new. In my case each individual new for a char was costing me 37 bytes. In release builds the memory usage is reduced to about 1/2 but I can't tell exactly how much is allocated for each new as I can't get to the new/malloc routine.
So my work around is to allocate a large block of memory to hold the file in memory. Then parse the memory image filling in a vector of pointers to the beginning of each of the records. Then on demand, I build a record from the memory image using the pointer to the beginning of the selected record. Doing this reduced the memory footprint to <25MB.
Thanks for all your help and suggestions.

Resources