Non collective write using in file view - io

When trying to write blocks to a file, with my blocks being unevenly distributed across my processes, one can use MPI_File_write_at with the good offset. As this function is not a collective operation, this works well.
Exemple :
#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>
int main(int argc, char* argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int global = 7; // prime helps have unbalanced procs
int local = (global/size) + (global%size>rank?1:0);
int strsize = 5;
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
for (int i=0; i<local; ++i)
{
size_t idx = i * size + rank;
std::string buffer = std::string(strsize, 'a' + idx);
size_t offset = buffer.size() * idx;
MPI_File_write_at(fh, offset, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);
}
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
However for more complexe write, particularly when writting multi dimensional data like raw images, one may want to create a view at the file with MPI_Type_create_subarray. However, when using this methods with simple MPI_File_write (which is suppose to be non collective) I run in deadlocks. Exemple :
#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>
int main(int argc, char* argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int global = 7; // prime helps have unbalanced procs
int local = (global/size) + (global%size>rank?1:0);
int strsize = 5;
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
for (int i=0; i<local; ++i)
{
size_t idx = i * size + rank;
std::string buffer = std::string(strsize, 'a' + idx);
int dim = 2;
int gsizes[2] = { buffer.size(), global };
int lsizes[2] = { buffer.size(), 1 };
int offset[2] = { 0, idx };
MPI_Datatype filetype;
MPI_Type_create_subarray(dim, gsizes, lsizes, offset, MPI_ORDER_C, MPI_CHAR, &filetype);
MPI_Type_commit(&filetype);
MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
MPI_File_write(fh, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);
}
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
How to avoid such a code to lock ? Keep in mind that by real code will really use the multidimensional capabilities of MPI_Type_create_subarray and cannot just use MPI_File_write_at
Also, it is difficult for me to know the maximum number of block in a process, so I'd like to avoid doing a reduce_all and then loop on the max number of block with empty writes when localnb <= id < maxnb

You don't use MPI_REDUCE when you have a variable number of blocks per node. You use MPI_SCAN or MPI_EXSCAN: MPI IO Writing a file when offset is not known
MPI_File_set_view is collective, so if 'local' is different on each processor, you'll find yourself calling a collective routine from less than all processors in the communicator. If you really really need to do so, open the file with MPI_COMM_SELF.
the MPI_SCAN approach means each process can set the file view as needed, and then blammo you can call the collective MPI_File_write_at_all (even if some processes have zero work -- they still need to participate) and take advantage of whatever clever optimizations your MPI-IO implementation provides.

Related

MPI_Recv takes a long and strange time to return

I am trying to compare the execution time of two programs:
-the first one uses non blocking functions MPI_Ssend and MPI_Irecv, which allows to do some calculations "while messages are being sent and received".
-the other one uses blocking OpenMPI functions.
I have no problem to assess the performances of the first program and they look "good".
My problem is that the second program, that uses MPI_Recv, often takes a very long and kind of "strange" time to finish : always a little bit more than 1 second : (for example 1.001033 seconds). As if the process had to do something that takes exactly 1 second.
I modified my initial code to show you an equivalent one.
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#define SIZE 5000
int main(int argc, char * argv[])
{
MPI_Request trash;
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int myVector[SIZE];
for(int i = 0 ; i < SIZE ; i++)
myVector[i] = 100;
int buff[SIZE];
double startTime = MPI_Wtime();
if(rank == 0)
MPI_Issend(myVector, SIZE, MPI_INT, 1, 1, MPI_COMM_WORLD, &trash);
if(rank == 1) {
MPI_Recv(buff, SIZE, MPI_INT, 0, 1, MPI_COMM_WORLD, NULL);
printf("Processus 1 finished in %lf seconds\n", MPI_Wtime()-startTime);
}
MPI_Finalize();
return 0;
}
Thank you.

pthreads code not scaling up

I wrote the following very simple pthread code to test how it scales up. I am running the code on a machine with 8 logical processors and at no time do I create more than 8 threads (to avoid context switching).
With increasing number of threads, each thread has to do lesser amount of work. Also, it is evident from the code that there are no shared Data structures between the threads which might be a bottleneck. But still, my performance degrades as I increase the number of threads.
Can somebody tell me what am I doing wrong here.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int NUM_THREADS = 3;
unsigned long int COUNTER = 10000000000000;
unsigned long int LOOP_INDEX;
void* addNum(void *data)
{
unsigned long int sum = 0;
for(unsigned long int i = 0; i < LOOP_INDEX; i++) {
sum += 100;
}
return NULL;
}
int main(int argc, char** argv)
{
NUM_THREADS = atoi(argv[1]);
pthread_t *threads = (pthread_t*)malloc(sizeof(pthread_t) * NUM_THREADS);
int rc;
clock_t start, diff;
LOOP_INDEX = COUNTER/NUM_THREADS;
start = clock();
for (int t = 0; t < NUM_THREADS; t++) {
rc = pthread_create((threads + t), NULL, addNum, NULL);
if (rc) {
printf("ERROR; return code from pthread_create() is %d", rc);
exit(-1);
}
}
void *status;
for (int t = 0; t < NUM_THREADS; t++) {
rc = pthread_join(threads[t], &status);
}
diff = clock() - start;
int sec = diff / CLOCKS_PER_SEC;
printf("%d",sec);
}
Note: All the answers I found online said that the overhead of creating the threads is more than the work they are doing. To test it, I commented out everything in the "addNum()" function. But then, after doing that no matter how many threads I create, the time taken by the code is 0 seconds. So there is no overhead as such, I think.
clock() counts CPU time used, across all threads. So all that's telling you is that you're using a little bit more total CPU time, which is exactly what you would expect.
It's the total wall clock elapsed time which should be going down if your parallelisation is effective. Measure that with clock_gettime() specifying the CLOCK_MONOTONIC clock instead of clock().

Pthread create function

I am having difficulty understanding the creation of a pthread.
This is the function I declared in the beginning of my code
void *mini(void *numbers); //Thread calls this function
Initialization of thread
pthread_t minThread;
pthread_create(&minThread, NULL, (void *) mini, NULL);
void *mini(void *numbers)
{
min = (numbers[0]);
for (i = 0; i < 8; i++)
{
if ( numbers[i] < min )
{
min = numbers[i];
}
}
pthread_exit(0);
}
numbers is an array of integers
int numbers[8];
Im not sure if I created the pthread correctly.
In the function, mini, I get the following error about setting min (declared as an int) equal to numbers[0]:
Assigning to 'int' from incompatible type 'void'
My objective is to compute the minimum value in numbers[ ] (min) in this thread and use that value later to pass it to another thread to display it. Thanks for any help I can get.
You need to pass 'numbers' as the last argument to pthread_create(). The new thread can then call 'mini' on its own stack with 'numbers' as the argument.
In 'mini', you shoudl cast the void* back to an integer array in order to dereference it correctly - you cannot dereference a void* directly - it does not point to anything:)
Also, it's very confusing to have multiple vars in different threads with the name 'numbers'.
There are some minor improprieties in this pgm but it illustrates basically what you want to do. You should play around, break and improve it.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void *mini(void *numbs)
{
int *numbers = (int *) numbs;
int *min = malloc(sizeof(int));
*min = (numbers[0]);
for (int i = 0; i < 8; i++)
if (numbers[i] < *min )
*min = numbers[i];
pthread_exit(min);
}
int main(int argc, char *argv[])
{
pthread_t minThread;
int *min;
int numbers[8] = {28, 47, 36, 45, 14, 23, 32, 16};
pthread_create(&minThread, NULL, (void *) mini, (void *) numbers);
pthread_join(minThread, (void *) &min);
printf("min: %d\n", *min);
free(min);
return(0);
}

Program to see the bytes from a file internally

Do you know if exist one program or method to see (secuences of)bytes from a text,html file?
Not to see characters, rather see the complete sequence of bytes.
recommendations?
yes, it is called hex editor... Hundreds of those exist out there.
Here are some: http://en.wikipedia.org/wiki/Comparison_of_hex_editors
A common hex editor allows you to view any file's byte sequence.
If you just want to see the existing bytes (without changing them) you can use a hex-dump program, which is much smaller and simpler than a hex editor. For example, here's one I wrote several years ago:
/* public domain by Jerry Coffin
*/
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
unsigned long offset = 0;
FILE *input;
int bytes, i, j;
unsigned char buffer[16];
char outbuffer[60];
if ( argc < 2 ) {
fprintf(stderr, "\nUsage: dump filename [filename...]");
return EXIT_FAILURE;
}
for (j=1;j<argc; ++j) {
if ( NULL ==(input=fopen(argv[j], "rb")))
continue;
printf("\n%s:\n", argv[j]);
while (0 < (bytes=fread(buffer, 1, 16, input))) {
sprintf(outbuffer, "%8.8lx: ", offset+=16);
for (i=0;i<bytes;i++) {
sprintf(outbuffer+10+3*i, "%2.2X ",buffer[i]);
if (!isprint(buffer[i]))
buffer[i] = '.';
}
printf("%-60s %*.*s\n", outbuffer, bytes, bytes, buffer);
}
fclose(input);
}
return 0;
}

Strange behaviour in OpenMP nested loop

In the following program I get different results (serial vs OpenMP), what is the reason? At the moment I can only think that perhaps the loop is too "large" for the threads and perhaps I should write it in some other way but I am not sure, any hints?
Compilation: g++-4.2 -fopenmp main.c functions.c -o main_elec_gcc.exe
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#define NRACK 64
#define NSTARS 1024
double mysumallatomic_serial(float rocks[NRACK][3],float moon[NSTARS][3],float qr[NRACK],float ql[NSTARS]) {
int j,i;
float temp_div=0.,temp_sqrt=0.;
float difx,dify,difz;
float mod2x, mod2y, mod2z;
double S2 = 0.;
for(j=0; j<NRACK; j++){
for(i=0; i<NSTARS;i++){
difx=rocks[j][0]-moon[i][0];
dify=rocks[j][1]-moon[i][1];
difz=rocks[j][2]-moon[i][2];
mod2x=difx*difx;
mod2y=dify*dify;
mod2z=difz*difz;
temp_sqrt=sqrt(mod2x+mod2y+mod2z);
temp_div=1/temp_sqrt;
S2 += ql[i]*temp_div*qr[j];
}
}
return S2;
}
double mysumallatomic(float rocks[NRACK][3],float moon[NSTARS][3],float qr[NRACK],float ql[NSTARS]) {
float temp_div=0.,temp_sqrt=0.;
float difx,dify,difz;
float mod2x, mod2y, mod2z;
double S2 = 0.;
#pragma omp parallel for shared(S2)
for(int j=0; j<NRACK; j++){
for(int i=0; i<NSTARS;i++){
difx=rocks[j][0]-moon[i][0];
dify=rocks[j][1]-moon[i][1];
difz=rocks[j][2]-moon[i][2];
mod2x=difx*difx;
mod2y=dify*dify;
mod2z=difz*difz;
temp_sqrt=sqrt(mod2x+mod2y+mod2z);
temp_div=1/temp_sqrt;
float myterm=ql[i]*temp_div*qr[j];
#pragma omp atomic
S2 += myterm;
}
}
return S2;
int main(int argc, char *argv[]) {
float rocks[NRACK][3], moon[NSTARS][3];
float qr[NRACK], ql[NSTARS];
int i,j;
for(j=0;j<NRACK;j++){
rocks[j][0]=j;
rocks[j][1]=j+1;
rocks[j][2]=j+2;
qr[j] = j*1e-4+1e-3;
//qr[j] = 1;
}
for(i=0;i<NSTARS;i++){
moon[i][0]=12000+i;
moon[i][1]=12000+i+1;
moon[i][2]=12000+i+2;
ql[i] = i*1e-3 +1e-2 ;
//ql[i] = 1 ;
}
printf(" serial: %f\n", mysumallatomic_serial(rocks,moon,qr,ql));
printf(" openmp: %f\n", mysumallatomic(rocks,moon,qr,ql));
return(0);
}
}
I think you should use reduction instead of shared variable and remove #pragma omp atomic, like:
#pragma omp parallel for reduction(+:S2)
And it should work faster, because there are no need for atomic operations which are quite painful in terms of performance and threads synchronization.
UPDATE
You can also have some difference in results because of the operations order:
\sum_1^100(x[i]) != \sum_1^50(x[i]) + \sum_51^100(x[i])
You have data races on most of the temporary variables you are using in the parallel region - difx, dify, difz, mod2x, mod2y, mod2z, temp_sqrt, and temp_div should all be private. You should make these variables private by using a private clause on the parallel for directive.

Resources