I'm running some tests with the simple code shown below.
The problem is that on a four-core machine I only get 75% load: the fourth core sits idle, doing nothing. The code has an omp parallel region, then an omp single inside of which the thread generates a task. That task in turn generates a number of grandchild tasks, then waits in a taskwait until all of its children (grandchildren of the thread in the single region) finish, while the thread executing the single region waits on another taskwait until its direct descendant task finishes. The problem is that the thread executing the single region never executes any of the grandchild tasks. Given the block size I'm using, I'm creating thousands of tasks, so it is not a lack of available parallelism.
Am I misunderstanding OpenMP tasking? Is this related to taskwait only waiting for direct children? If so, how can I get the idle thread to pick up the available work? Imagine I wanted to create tasks with dependencies, as in OpenMP 4.0: then I would not be able to exploit all the available threads. The taskwait in the parent task would still be needed, since I would not want to release subsequent tasks that depend on it until all of its children have finished.
#include <iostream>
#include <cstdlib>
#include <omp.h>
using namespace std;
#define VECSIZE 200000000
float* A;
float* B;
float* C;
void LoopDo(int start, int end)
{
    for (int i = start; i < end; i++)
    {
        C[i] += A[i]*B[i];
        A[i] *= (B[i]+C[i]);
        B[i] = C[i] + A[i];
        C[i] *= (A[i]*C[i]);
        C[i] += A[i]*B[i];
        C[i] += A[i]*B[i];
        // ... more statements of the same kind, elided ...
    }
}
void StartTasks(int bsize)
{
    int nthreads = omp_get_num_threads();
    cout << "bsize is: " << bsize << endl;
    cout << "nthreads is: " << nthreads << endl;
    #pragma omp task default(shared)
    {
        for (int i = 0; i < VECSIZE; i += bsize)
        {
            // Clamp the last block before spawning the task, so the
            // firstprivate copy of bsize cannot run past the array end.
            if (i + bsize >= VECSIZE) bsize = VECSIZE - i;
            #pragma omp task default(shared) firstprivate(i,bsize)
            LoopDo(i, i+bsize);
        }
        cerr << "Task creation ended" << endl;
        #pragma omp taskwait
    }
    #pragma omp taskwait
}
int main(int argc, char** argv)
{
    A = (float*)malloc(VECSIZE*sizeof(float));
    B = (float*)malloc(VECSIZE*sizeof(float));
    C = (float*)malloc(VECSIZE*sizeof(float));
    int bsize = atoi(argv[1]);
    for (int i = 0; i < VECSIZE; i++)
    {
        A[i] = i; B[i] = i; C[i] = i;
    }
    #pragma omp parallel
    {
        #pragma omp single
        {
            StartTasks(bsize);
        }
    }
    free(A);
    free(B);
    free(C);
    return 0;
}
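Regarding the taskwait-vs-descendants question above: OpenMP 4.0 also offers taskgroup, which waits for all descendant tasks rather than only direct children. A minimal sketch of the difference, for illustration only (it does not by itself explain the idle core):

#pragma omp parallel
#pragma omp single
{
    #pragma omp task          // child
    {
        #pragma omp task      // grandchild
        { /* work */ }
    }
    #pragma omp taskwait      // waits only for the child; the grandchild
                              // may still be running at this point

    #pragma omp taskgroup     // waits for the child AND all of its
    {                         // descendants before continuing
        #pragma omp task
        {
            #pragma omp task
            { /* work */ }
        }
    }
}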
EDIT:
I tested with ICC 15.0 and it uses all the cores of my machine, although ICC forks 5 threads instead of the 4 that GCC forks; the fifth ICC thread remains idle.
EDIT 2:
The following change, adding a loop that creates as many top-level tasks as there are threads, gets all threads fed with tasks. If there are fewer top-level tasks than threads, then in some runs the master thread won't execute any task and remains idle as before. ICC, as before, generates a binary that uses all cores.
for (int t = 0; t < nthreads; t++) // renamed from i to avoid shadowing the inner loop variable
{
    #pragma omp task default(shared)
    {
        for (int i = 0; i < VECSIZE; i += bsize)
        {
            if (i + bsize >= VECSIZE) bsize = VECSIZE - i; // clamp before spawning, as above
            #pragma omp task default(shared) firstprivate(i,bsize)
            LoopDo(i, i+bsize);
        }
        cerr << "Task creation ended" << endl;
        #pragma omp taskwait
    }
}
#pragma omp taskwait
I have a problem with OpenMP: I need to write a doacross loop. For example:
for (int i = 1; i < SIZE-2; i++) {
    for (int j = 2; j < SIZE-2; j++) {
        tab[i][j] = tab[i][j+2] + tab[i+2][j-2];
    }
}
Here I have dependencies on j-2, j+2 and i+2, and I don't know how to resolve them.
You can try something like the following, using doacross loops (introduced in OpenMP 4.5): depend(sink: ...) makes an iteration wait for the listed iterations, and depend(source) signals that the current iteration's result is available:
#pragma omp parallel for ordered(2)
for (int i = 1; i < SIZE-2; i++) {
    for (int j = 2; j < SIZE-2; j++) {
        #pragma omp ordered depend(sink:i,j+2) depend(sink:i+2,j-2)
        tab[i][j] = tab[i][j+2] + tab[i+2][j-2];
        #pragma omp ordered depend(source)
    }
}
I arrived at a working solution based on the answer by dreamcrash:
#pragma omp parallel for ordered(2)
for (int i = 1; i < N-2; i++) {
    for (int j = 1; j < N-2; j++) {
        #pragma omp ordered depend(sink:i,j-2) depend(sink:i-2,j+1)
        a[i][j] = a[i][j+2] + a[i+2][j-1];
        #pragma omp ordered depend(source)
    }
}
I compiled the code below with VS C++ 2017 using /openmp /O2 /arch:AVX.
When running with 8 threads the output is:
dt_loops = 1562 ms
dt_eigen = 26 ms
I expected A * B to be faster than my handwritten loops, but I did not expect such a large difference. Is there anything wrong with my code? And if not, how can Eigen3 do it so much faster?
I'm not very experienced with OpenMP or any other parallelization method. I tried different loop orders, but the one below is the fastest.
#include <iostream>
#include <chrono>
#include <Eigen/Dense>

int main() {
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2;
    int n = 1000;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd C = Eigen::MatrixXd::Zero(n, n);

    // Handwritten triple loop, parallelized over the outer loop.
    start1 = std::chrono::system_clock::now();
    int i, j, k;
    #pragma omp parallel for private(i, j, k)
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k) {
                C(i, j) += A(i, k) * B(k, j);
            }
        }
    }
    end1 = std::chrono::system_clock::now();
    std::cout << "dt_loops = " << std::chrono::duration_cast<std::chrono::milliseconds>(end1-start1).count() << " ms" << std::endl;

    // Eigen's built-in matrix product.
    Eigen::MatrixXd D = Eigen::MatrixXd::Zero(n, n);
    start2 = std::chrono::system_clock::now();
    D = A * B;
    end2 = std::chrono::system_clock::now();
    std::cout << "dt_eigen = " << std::chrono::duration_cast<std::chrono::milliseconds>(end2-start2).count() << " ms" << std::endl;
}
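Much of Eigen's advantage comes from cache blocking and vectorization rather than threading alone. As a point of comparison, here is a minimal cache-blocking sketch of the same triple loop; the tile size of 64 is an arbitrary illustrative choice, not a tuned value, and std::min requires <algorithm>:

// Blocked (tiled) multiply: work on BS x BS tiles so each tile's
// working set stays in cache before moving on.
const int BS = 64; // illustrative, not tuned
#pragma omp parallel for
for (int ii = 0; ii < n; ii += BS)
    for (int kk = 0; kk < n; kk += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < std::min(ii + BS, n); ++i)
                for (int k = kk; k < std::min(kk + BS, n); ++k) {
                    const double a = A(i, k); // reuse A(i,k) across the j tile
                    for (int j = jj; j < std::min(jj + BS, n); ++j)
                        C(i, j) += a * B(k, j);
                }

Each ii tile writes a disjoint set of rows of C, so the outer parallel for is race-free.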
I have a problem running an MPI program (written in C or C++) on a cluster of two nodes.
Details:
OS: Ubuntu 16.04
No. of nodes: 2 (master and slave)
Everything works well: when I run a simple mpi_hello program on the cluster with 12 as an argument (the number of processes), I see 4 mpi_hello instances running on the slave node (checked using top).
(Screenshot: output on the master node, plus the mpi_hello instances running on the second, slave, node.)
But when I try to run another program (for instance, a simple program calculating and printing prime numbers in a range), it runs on the master node and I don't see any instances of it on the slave node.
#include <stdio.h>
#include <time.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int N, i, j, isPrime;
    clock_t begin = clock();
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);
    printf("To print all prime numbers between 1 and N\n");
    printf("Enter the value of N\n");
    scanf("%d", &N);

    /* For every number between 2 and N, check
       whether it is a prime number or not */
    printf("Prime numbers between %d and %d\n", 1, N);
    for (i = 2; i <= N; i++) {
        isPrime = 0;
        /* If any number between 2 and i/2 divides i
           exactly, then i cannot be prime */
        for (j = 2; j <= i/2; j++) {
            if (i % j == 0) {
                isPrime = 1;
                break;
            }
        }
        if (isPrime == 0 && N != 1)
            printf("%d ", i);
    }

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("\nThe time spent by the program is %f\n", time_spent);

    /* Spin to keep the process alive (e.g. for inspection with top);
       note this means MPI_Finalize() is never reached. */
    while (1) {}

    MPI_Finalize();
    return 0;
}
What could be the possible reasons behind this?
Are there any other ways to check whether it is running on the slave node as well?
Thanks
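One thing worth checking in the code above: every rank calls scanf, but under most MPI launchers only rank 0 has stdin attached, so the other ranks can block forever waiting for input. A minimal sketch of the usual idiom, reading on rank 0 and broadcasting the value (a general MPI pattern, not a confirmed diagnosis of this particular setup):

/* Read N on rank 0 only, then broadcast it to every rank. */
if (myrank == 0) {
    printf("Enter the value of N\n");
    scanf("%d", &N);
}
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

MPI_Bcast is a collective call, so every rank must execute it.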
Okay, so here is the code I'm working with. A vector contains the first 500 integers. Now I want to divide them equally among 4 processes (i.e. each process gets 125 integers: the first process gets 1-125, the second 126-250, and so on). I tried to use MPI_Scatter(), but I don't see the data divided equally, or even divided at all. Do I have to use MPI_Recv()? (I have another piece of code which is functional and uses only scatter to divide the data equally.)
Could you point out any problems in the code? Thanks.
// Headers and aliases inferred from the code below; the original post
// did not show them. get_time and ms are presumably chrono aliases.
#include <iostream>
#include <vector>
#include <chrono>
#include <mpi.h>
using namespace std;
using get_time = chrono::steady_clock;
using ms = chrono::microseconds;

int main(int argc, char* argv[])
{
    int root = 0;
    MPI_Init(&argc, &argv);
    int myrank, nprocs;
    MPI_Status status;
    // variables for prime number calculation
    int num1, num2, count, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    char name[MPI_MAX_PROCESSOR_NAME + 1];
    int namelen;
    MPI_Get_processor_name(name, &namelen);

    cout << "Enter first number: ";
    cin >> num1;
    cout << "Enter second number: ";
    cin >> num2;

    int size = 500;
    int size1 = num2 / nprocs;
    cout << "The size of each small vector is " << size1 << endl;

    auto start = get_time::now(); // start measuring the time
    vector<int> sendbuffer(size), recbuffer(size1); // vectors/buffers involved in the processing
    cout << "The prime numbers between " << num1 << " and " << num2 << " are: " << endl;

    if (myrank == root)
    {
        for (unsigned int i = 1; i <= num2; ++i) // numbers from which to find primes
        {
            sendbuffer[i] = i;
        }
        cout << "Processor " << myrank << " initial data";
        for (int i = 1; i <= size; ++i)
        {
            cout << " " << sendbuffer[i];
        }
        cout << endl;
        MPI_Scatter(&sendbuffer.front(), 125, MPI_INT, &recbuffer.front(), 125, MPI_INT, root, MPI_COMM_WORLD);
    }

    cout << "Process " << myrank << " now has data ";
    for (int j = 1; j <= size1; ++j)
    {
        cout << " " << recbuffer[j];
    }
    cout << endl;

    auto end = get_time::now();
    auto diff = end - start;
    cout << "Elapsed time is : " << chrono::duration_cast<ms>(diff).count() << " microseconds " << endl;

    MPI_Finalize();
    return 0;
}
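For reference, two things stand out in the code above: MPI_Scatter is a collective call that must be executed by every rank, not only inside the if (myrank == root) branch, and the buffers are indexed from 1 up to their size, which runs one element past the end. A minimal sketch of the usual pattern (0-based indexing, chunk size derived from the communicator size):

// Every rank executes MPI_Scatter; the send buffer contents only
// matter on the root rank.
int chunk = size / nprocs;              // e.g. 500 / 4 = 125
vector<int> sendbuf(size), recvbuf(chunk);
if (myrank == root)
{
    for (int i = 0; i < size; ++i)
        sendbuf[i] = i + 1;             // values 1..500 stored 0-based
}
MPI_Scatter(sendbuf.data(), chunk, MPI_INT,
            recvbuf.data(), chunk, MPI_INT,
            root, MPI_COMM_WORLD);      // collective: called by all ranks
cout << "Process " << myrank << " now has data ";
for (int i = 0; i < chunk; ++i)
    cout << " " << recvbuf[i];
cout << endl;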
This is my first question. I have to write a simple program that asks the user to input an integer and then prints stars according to that input. Here is my attempt so far:
#include <iostream>
using namespace std;

int main()
{
    int n = 0;
    char star = '*';
    cout << "Enter number Desired " << endl;
    cin >> n;
    star = n;
    cout << ' \n' << star << endl;
    cout << ' \n' << star-1 << endl;
    cout << ' \n' << star-2 << endl;
    cout << ' \n' << star-3 << endl;
    cout << ' \n' << star-4 << endl;
    system("pause");
    return 0;
}
You should use a for loop to print the stars one by one.
An example is given below:
for (int i = 0; i < n; i++) {
    cout << "*" << endl;
}
To make the loop print fewer and fewer stars in each row, use nested for loops:
for (int i = 0; i < n; i++) {
    for (int j = i; j < n; j++) {
        cout << "*";
    }
    cout << endl;
}
This prints n star characters in the first row, n-1 in the second row, and so on.
For example, if n == 5, the output will be:
*****
****
***
**
*
This will print a descending number of stars, starting from the entered number:
#include <iostream>
using namespace std;

int main() {
    int n = 0;
    cout << "Enter number Desired " << endl;
    cin >> n;
    for (int i = 0; i < n; i++)
    {
        for (int j = i; j < n; j++)
        {
            cout << "*";
        }
        cout << endl;
    }
    system("pause");
    return 0;
}
I made a program in C that creates 10 threads; each thread adds up 10,000 random integers in [0, 100]. When a thread ends, it adds its partial sum to the total sum. It is unlikely that 2 threads will end at exactly the same time, but if they do, will there be a problem?
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <unistd.h> /* for sleep() */

pthread_t pid[10];
int i = 0;
int sum;

void* partial(void *arg)
{
    int partial = 0;
    pthread_t id = pthread_self();
    int k = 0;
    for (k = 0; k < 10000; k++) {
        int r = rand() % 101; /* note: rand() is not guaranteed thread-safe */
        partial += r;
    }
    sum += partial; /* unsynchronized update of the shared total */
    return NULL;
}

int main(void)
{
    srand(time(NULL));
    clock_t begin, end;
    double timeSpent;
    begin = clock();
    while (i < 10) {
        pthread_create(&(pid[i]), NULL, &partial, NULL);
        printf("\n Thread created successfully\n");
        i++;
    }
    sleep(10); /* crude wait; pthread_join would be the robust choice */
    end = clock();
    timeSpent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("\n Time taken: %f", timeSpent);
    printf("\n sum: %d \n", sum);
    return 0;
}
Yes. Without locking with a mutex there is an unlikely but real chance of a race condition.
If two threads finish at the same time, they will try to modify the shared resource (sum) simultaneously, and the shared resource may not be updated properly: both threads can race to read the old value of sum before either one writes back its increment in the statement sum += partial.
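A minimal sketch of the usual fix with a pthread mutex (variable names follow the code above):

#include <pthread.h>

pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
int sum;

/* Only the update of the shared total needs to be serialized;
   each thread's local accumulation stays outside the lock. */
void add_partial(int partial_sum)
{
    pthread_mutex_lock(&sum_lock);
    sum += partial_sum;
    pthread_mutex_unlock(&sum_lock);
}

Because the lock is held only for the single addition, the threads still compute their 10,000-element partial sums fully in parallel.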