How to fix MPI_ERR_RMA_SHARED? - azure

I wrote an MPI program that uses shared memory through MPI_Win_allocate_shared, and I run it on an Azure virtual machine with 4 CPUs.
Everything works well with 1 or 2 processes, but it fails with 3 or 4.
I know that MPI_Win_allocate_shared works only if the processes are on the same node, so I thought the problem was related to that. I tried to solve it with a hostfile containing "AzureVM slots=4 max_slots=8", but I still get the error.
The error is reported below:
mpiexec -np 3 --hostfile my_host --oversubscribe tables
[AzureVM][[37487,1],1][btl_openib_component.c:652:init_one_port] ibv_query_gid failed (mlx4_0:1, 0)
[AzureVM][[37487,1],0][btl_openib_component.c:652:init_one_port] ibv_query_gid failed (mlx4_0:1, 0)
[AzureVM][[37487,1],2][btl_openib_component.c:652:init_one_port] ibv_query_gid failed (mlx4_0:1, 0)
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: AzureVM
Local device: mlx4_0
--------------------------------------------------------------------------
[AzureVM:01918] 2 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[AzureVM:01918] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[AzureVM:1930] *** An error occurred in MPI_Win_allocate_shared
[AzureVM:1930] *** reported by process [2456748033,2]
[AzureVM:1930] *** on communicator MPI_COMM_WORLD
[AzureVM:1930] *** MPI_ERR_RMA_SHARED: Memory cannot be shared
[AzureVM:1930] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[AzureVM:1930] *** and potentially your MPI job)
[AzureVM:01918] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
Makefile:54: recipe for target 'table' failed
make: *** [table] Error 71
Could someone please explain to me how to solve the problem? Thank you in advance!

Hi, have you solved the problem?
Consider adding these two lines (following the guide)
MPI_Comm nodecomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
After that, allocate the memory on that communicator with
// define alloc_length and mem (something like: int alloc_length = 10 * sizeof(int); int *mem;)
MPI_Win win;
MPI_Win_allocate_shared(alloc_length, 1, MPI_INFO_NULL, nodecomm, &mem, &win);
I had the same problem (a similar error log at least) and solved it exactly in the way I described above
To better understand, see this. I tested the code at the end of the answer chosen as the best one, and unfortunately, it didn't work for me. I modified it as follows:
#include <stdio.h>
#include <mpi.h>
#define ARRAY_LEN 32
int main() {
MPI_Init(NULL, NULL);
int * baseptr;
MPI_Comm nodecomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
MPI_INFO_NULL, &nodecomm);
int nodesize, noderank;
MPI_Comm_size(nodecomm, &nodesize);
MPI_Comm_rank(nodecomm, &noderank);
MPI_Win win;
int size = (noderank == 0)? ARRAY_LEN * sizeof(int) : 0;
MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL,
nodecomm, &baseptr, &win);
if (noderank != 0) {
MPI_Aint size;
int disp_unit;
MPI_Win_shared_query(win, 0, &size, &disp_unit, &baseptr);
}
for (int i = noderank; i < ARRAY_LEN; i += nodesize)
baseptr[i] = noderank;
MPI_Barrier(nodecomm);
if (noderank == 0) {
for (int i = 0; i < nodesize; i++)
printf("%4d", baseptr[i]);
printf("\n");
}
MPI_Win_free(&win);
MPI_Finalize();
}
Now, if you save the code above as test.cpp, then
mpic++ test.cpp && mpirun -n 8 ./a.out will output 0 1 2 3 4 5 6 7
I took some of the right tips from here
Good luck!

Related

openmp core assignation fails

My Centos 6 VM shows four cores when displaying the content of /proc/cpuinfo, and /sys/devices/system/cpu/online shows 0-3.
I am trying to run the following code on the core 2 and 3 using KMP_AFFINITY="explicit,proclist=[2-3]"
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
int main (int argc, char *argv[]) {
int nthreads, tid, cid;
#pragma omp parallel private(nthreads, tid)
{
tid = omp_get_thread_num();
cid = sched_getcpu();
printf("Hello from thread %d on core %d\n", tid, cid);
if (tid == 0) {
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
}
}
When compiled with icc (ICC) 16.0.1 20151021, it fails to detect the available cores and executes everything on core 0.
$ OMP_NUM_THREADS=4 ./a.out
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Whereas gcc (GCC) 4.4.7 20120313, with GOMP_CPU_AFFINITY="2-3", executes properly on cores 2 and 3, as set.
I used strace to check what's going on under the hood, and I noticed something strange:
[...]
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
read(3, "0-3\n", 8192) = 4
[...]
sched_getaffinity(0, 1048576, { 1 }) = 8
sched_setaffinity(0, 8, { 4521c26fbb1c38c1 }) = -1 EFAULT (Bad addres
[...]
Could this be an error from the implementation of OpenMP made by intel?
I cannot upgrade my compiler to fix it in this case. Is it possible to use the GCC OpenMP library instead of the Intel one when compiling with icc ?
Update:
I managed to compile the code with gcc and link it with iomp using the following command
gcc omp.c -L/opt/intel/compilers_and_libraries_2016/linux/lib/intel64_lin/ -liomp5
The execution outputs no warning, but the result is still not correct:
$ OMP_NUM_THREADS=4 ./a.out
Hello from thread 0 on core 0
Number of threads = 1
The same sched_setaffinity error as previously shown appears.

Spawning big amount of processes

I want to spawn a large number of processes, so I have a master process that does it.
int master(int argc, char* argv[]){
for (int i = 0; i < 50000; ++i) {
std::string name = std::to_string(i);
MSG_process_create(name.c_str(), slave, NULL, MSG_host_self());
}
return 0;
}
int slave(int argc, char* argv[]){
XBT_INFO("%s", MSG_process_get_name(MSG_process_self()));
return 0;
}
After I launch this program I have the following output:
....
....
[Master:32734:(32736) 0.000000] [master/INFO] 32734
[Master:32735:(32737) 0.000000] [master/INFO] 32735
[0.000000] /home/ken/Downloads/simgrid-master/src/simix/smx_context.cpp:187: [xbt/CRITICAL] Failed to protect stack: Cannot allocate memory
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
Then I was advised to use the contexts/stack-size parameter to change the stack size, because by default the previous program required 50000 * 8192 KiB.
I added the parameter --cfg=contexts/stack-size:10, but I got the same output:
...
...
[Master:32735:(32737) 0.000000] [master/INFO] 32735
[0.000000] /home/ken/Downloads/simgrid-master/src/simix/smx_context.cpp:187: [xbt/CRITICAL] Failed to protect stack: Cannot allocate memory
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
Or --cfg=contexts/stack-size:100000:
...
...
[Master:32734:(32736) 0.000000] [master/INFO] 32734
[0.000000] /home/ken/Downloads/simgrid-master/src/simix/smx_context.cpp:187: [xbt/CRITICAL] Failed to protect stack: Cannot allocate memory
It might seem that my program doesn't see this parameter, but that isn't the case, because setting the stack size to 5 gives me:
Finally, if nothing of the above applies, this can result from a stack overflow.
Try to increase stack size with --cfg=contexts/stack_size (current size is 1 KiB).
What did I do wrong?
Can you try increasing the maximum number of mappings allowed per process on your system?
You can do that with sudo sysctl -w vm.max_map_count=500000, which sets the maximum to 500000.
We saw recently that this was causing some issues on some SMPI runs, maybe it's the same on your end. The "Cannot allocate memory" message may indeed be misleading, as the ENOMEM error code is set for various reasons (and according to http://man7.org/linux/man-pages/man2/mprotect.2.html, one of them may be the number of mappings).

linux redirect 100GB stdout to file fails

I have this command that writes over 100GB of data to a file.
zfs send snap1 > file
Something appears to go wrong several hours into the process. E.g., if I run the job twice, the output is slightly different. If I try to process the file with
zfs receive snap2 < file
an error is reported after several hours.
For debugging purposes, I'm guessing that there's some low probability failure in the shell redirection. Has anyone else seen problems with redirecting massive amounts of data? Any suggestions about where to proceed?
Debugging this is tedious because small examples work, and running the large case takes over 3 hours each time.
Earlier I had tried pipes:
zfs send snap1| zfs receive snap2
However this always failed with much smaller examples, for which
zfs send snap1 > file; zfs receive snap2 < file
worked. (I posted a question about that, but got no useful responses.) This is another reason that I suspect the shell.
Thanks.
The probability that the failure is in the shell (or OS) is negligible compared to a bug in zfs or a problem in how you are using it.
It takes only a few minutes to test your hypothesis. Compile this simple program:
#include<unistd.h>
#include<string.h>
#define BUF (1<<20) /* parenthesized so the macro expands safely inside larger expressions */
#define INPUT 56
int main(int argc, char* argv[]) {
char buf[BUF], rbuf[BUF], *a, *b;
int len, i;
memset(buf, INPUT, sizeof(buf));
if (argc == 1)
{
while ((len = read(0, rbuf, sizeof(rbuf))) > 0)
{
a = buf; b = rbuf;
for (i = 0; i < len; ++i)
{
if (*a != *b)
return 1;
++a; ++b;
}
}
}
else
{
while (write(1, buf, sizeof(buf)) > 0);
}
return 0;
}
then try mkfifo a; ./a.out w > a in one shell and pv < a | ./a.out in another, and see how long it takes to get any bit flip.
It should reach the TiB range relatively fast...

What is the difference between VC++ 2010 Express and Borland C++ 3.1 when compiling a simple C++ code file?

I no longer know what to think or what to do. The following code compiles fine in both IDEs, but when built with VC++ it produces weird heap corruption messages like:
"Windows has triggered a breakpoint in Lab4.exe.
This may be due to a corruption of the heap, which indicates a bug in Lab4.exe or any of the DLLs it has loaded.
This may also be due to the user pressing F12 while Lab4.exe has focus.
The output window may have more diagnostic information."
It happens when executing the Task1_DeleteMaxElement function, and I left comments there.
Nothing like that happens when it is compiled with Borland C++ 3.1, and everything works as expected.
So... what's wrong with my code or VC++?
#include <conio.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <memory.h>
void PrintArray(int *arr, int arr_length);
int Task1_DeleteMaxElement(int *arr, int arr_length);
int main()
{
int *arr = NULL;
int arr_length = 0;
printf("Input the array size: ");
scanf("%i", &arr_length);
arr = (int*)realloc(NULL, arr_length * sizeof(int));
srand(time(NULL));
for (int i = 0; i < arr_length; i++)
arr[i] = rand() % 100 - 50;
PrintArray(arr, arr_length);
arr_length = Task1_DeleteMaxElement(arr, arr_length);
PrintArray(arr, arr_length);
getch();
return 0;
}
void PrintArray(int *arr, int arr_length)
{
printf("Printing array elements\n");
for (int i = 0; i < arr_length; i++)
printf("%i\t", arr[i]);
printf("\n");
}
int Task1_DeleteMaxElement(int *arr, int arr_length)
{
printf("Looking for max element for deletion...");
int current_max = arr[0];
for (int i = 0; i < arr_length; i++)
if (arr[i] > current_max)
current_max = arr[i];
int *temp_arr = NULL;
int temp_arr_length = 0;
for (int j = 0; j < arr_length; j++)
if (arr[j] < current_max)
{
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int)); //if initial array size more then 4, breakpoint activates here
temp_arr[temp_arr_length] = arr[j];
temp_arr_length++;
}
arr = (int*)realloc(arr, temp_arr_length * sizeof(int));
memcpy(arr, temp_arr, temp_arr_length);
realloc(temp_arr, 0); //if initial array size is less or 4, breakpoint activates at this line execution
return temp_arr_length;
}
My guess is VC++2010 is rightly detecting memory corruption, which is ignored by Borland C++ 3.1.
How does it work?
For example, when allocating memory for you, VC++2010's realloc could well "mark" the memory around it with some special value. If you write over those values, realloc detects the corruption, and then crashes.
The fact it works with Borland C++ 3.1 is pure luck. This is a very very old compiler (20 years!), and thus, would be more tolerant/ignorant of this kind of memory corruption (until some random, apparently unrelated crash occurred in your app).
What's the problem with your code?
The source of your error:
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int))
For the following temp_arr_length values, in 32-bit, the allocation will be of:
0 : 4 bytes = 1 int when you expect 1 (Ok)
1 : 5 bytes = 1.25 int when you expect 2 (Error!)
2 : 6 bytes = 1.5 int when you expect 3 (Error!)
You got your priorities wrong. As you can see:
temp_arr_length + 1 * sizeof(int)
should be instead
(temp_arr_length + 1) * sizeof(int)
You allocated too little memory, and thus wrote well beyond what was allocated for you.
Edit (2012-05-18)
Hans Passant commented on allocator diagnostics. I took the liberty of copying his comments here until he writes his own answer (I've already seen comments disappear on SO):
It is Windows that reminds you that you have heap corruption bugs, not VS. BC3 uses its own heap allocator so Windows can't see your code mis-behaving. Not noticing these bugs before is pretty remarkable but not entirely impossible.
[...] The feature is not available on XP and earlier. And sure, one of the reasons everybody bitched about Vista. Blaming the OS for what actually were bugs in the program. Win7 is perceived as a 'better' OS in no small part because Vista forced programmers to fix their bugs. And no, the Microsoft CRT has implemented malloc/new with HeapAlloc for a long time. Borland had a history of writing their own, beating Microsoft for a while until Windows caught up.
[...] the CRT uses a debug allocator like you describe, but it generates different diagnostics. Roughly, the debug allocator catches small mistakes, Windows catches gross ones.
I found the following links explaining what is done to memory by Windows/CRT allocators before and after allocation/deallocation:
http://www.codeguru.com/cpp/w-p/win32/tutorials/article.php/c9535/Inside-CRT-Debug-Heap-Management.htm
https://stackoverflow.com/a/127404/14089
http://www.nobugs.org/developer/win32/debug_crt_heap.html#table
The last link contains a table that I printed and always keep near me at work (it was this table I was searching for when I found the first two links).
If it is crashing in realloc, then you are overstepping into the bookkeeping memory of malloc and free.
The incorrect code is as below:
temp_arr = (int*)realloc(temp_arr, temp_arr_length + 1 * sizeof(int));
should be
temp_arr = (int*)realloc(temp_arr, (temp_arr_length + 1) * sizeof(int));
Because * has higher precedence than +, on the next iteration of the loop, when you expect realloc to be passed 8 bytes, it may be passed only 5 bytes. So, in your second iteration, you will write 3 bytes into someone else's memory, which leads to memory corruption and an eventual crash.
Also
memcpy(arr, temp_arr, temp_arr_length);
should be
memcpy(arr, temp_arr, temp_arr_length * sizeof(int) );

What is a good Linux exit error code strategy?

I have several independent executable Perl, PHP CLI scripts and C++ programs for which I need to develop an exit error code strategy. These programs are called by other programs using a wrapper class I created to use exec() in PHP. So, I will be able to get an error code back. Based on that error code, the calling script will need to do something.
I have done a little bit of research and it seems like anything in the 1-254 (or maybe just 1-127) range could be fair game to user-defined error codes.
I was just wondering how other people have approached error handling in this situation.
The only convention is that you return 0 for success, and something other than zero for an error. Most well-known unix programs document the various return codes that they can return, and so should you. It doesn't make a lot of sense to try to make a common list for all possible error codes that any arbitrary program could return, or else you end up with tens of thousands of them like some other OS's, and even then, it doesn't always cover the specific type of error you want to return.
So just be consistent, and be sure to document whatever scheme you decide to use.
1-127 is the available range. Anything over 127 is supposed to be "abnormal" exit - terminated by a signal.
While you're at it, consider using stdout rather than the exit code. By tradition the exit code indicates success, failure, and maybe one other state. Rather than using the exit code, try using stdout the way expr and wc use it. You can then use backticks or something similar in the caller to extract the result.
The Unix manifesto states:
Exit as soon and as loud as possible on error
or something like that
Don't try to encode too much meaning into the exit value: detailed statuses and error reports should go to stdout / stderr as Arkadiy suggests.
However, I have found it very useful to represent just a handful of states in the exit values, using binary digits to encode them. For example, suppose you have the following contrived meanings:
0000 : 0 (no error)
0001 : 1 (error)
0010 : 2 (I/O error)
0100 : 4 (user input error)
1000 : 8 (permission error)
Then, a user input error would have a return value of 5 (4 + 1), while a log file not having write permission might have a return value of 11 (8 + 2 + 1). As the different meanings are independently encoded in the return value, you can easily see what's happened by checking which bits are set.
As a special case, to see if there was an error you can AND the return code with 1.
By doing this, you can encode a couple of different things in the return code, in a clear and simple way. I use this only to make simple decisions such as "should the process be restarted", "do the return value and relevant logs need to be sent to an admin", that sort of thing. Any detailed diagnostic information should go to logs or to stdout / stderr.
The normal exit statuses run from 0 to 255 (see Exit codes bigger than 255 possible for a discussion of why). Normally, status 0 indicates success; anything else is an implementation-defined error. I do know of a program that reports the state of a DBMS server via the exit status; that is a special case of implementation-defined exit statuses. Note that you get to define the implementation of the statuses of your programs.
I couldn't fit this into 300 characters; otherwise it would have been a comment to @Arkadiy's answer.
Arkadiy is right that in one part of the exit status word, values other than zero indicate the signal that terminated the process and the 8th bit normally indicates a core dump, but that section of the exit status is different from the main 0..255 status. However, the shell (whichever shell it is) is presented with a problem when a process dies as a result of a signal. There are 16 bits of data to be presented in an 8-bit value, which is always tricky. What the shells seem to do is to take the signal number and add 128 to it. So, if a process dies as a result of an interrupt (signal number 2, SIGINT), the shell reports the exit status as 130. However, the kernel reported the status as 0x0002; the shell has modified what the kernel reports.
The following C code demonstrates this. There are two programs:
suicide, which kills itself using a signal of your choosing (interrupt by default).
exitstatus, which runs a command (such as suicide) and reports the kernel exit status.
Here's suicide.c:
/*
@(#)File: $RCSfile: suicide.c,v $
@(#)Version: $Revision: 1.2 $
@(#)Last changed: $Date: 2008/12/28 03:45:18 $
@(#)Purpose: Commit suicide using kill()
@(#)Author: J Leffler
@(#)Copyright: (C) JLSS 2008
@(#)Product: :PRODUCT:
*/
/*TABSTOP=4*/
#if __STDC_VERSION__ >= 199901L
#define _XOPEN_SOURCE 600
#else
#define _XOPEN_SOURCE 500
#endif /* __STDC_VERSION__ */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "stderr.h"
static const char usestr[] = "[-V][-s signal]";
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
extern const char jlss_id_suicide_c[];
const char jlss_id_suicide_c[] = "@(#)$Id: suicide.c,v 1.2 2008/12/28 03:45:18 jleffler Exp $";
#endif /* lint */
int main(int argc, char **argv)
{
int signum = SIGINT;
int opt;
char *end;
err_setarg0(argv[0]);
while ((opt = getopt(argc, argv, "Vs:")) != -1)
{
switch (opt)
{
case 's':
signum = strtol(optarg, &end, 0);
if (*end != '\0' || signum <= 0)
err_error("invalid signal number %s\n", optarg);
break;
case 'V':
err_version("SUICIDE", &"@(#)$Revision: 1.2 $ ($Date: 2008/12/28 03:45:18 $)"[4]);
break;
default:
err_usage(usestr);
break;
}
}
if (optind != argc)
err_usage(usestr);
kill(getpid(), signum);
return(0);
}
And here's exitstatus.c:
/*
@(#)File: $RCSfile: exitstatus.c,v $
@(#)Version: $Revision: 1.2 $
@(#)Last changed: $Date: 2008/12/28 03:45:18 $
@(#)Purpose: Run command and report 16-bit exit status
@(#)Author: J Leffler
@(#)Copyright: (C) JLSS 2008
@(#)Product: :PRODUCT:
*/
/*TABSTOP=4*/
#if __STDC_VERSION__ >= 199901L
#define _XOPEN_SOURCE 600
#else
#define _XOPEN_SOURCE 500
#endif /* __STDC_VERSION__ */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include "stderr.h"
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
extern const char jlss_id_exitstatus_c[];
const char jlss_id_exitstatus_c[] = "@(#)$Id: exitstatus.c,v 1.2 2008/12/28 03:45:18 jleffler Exp $";
#endif /* lint */
int main(int argc, char **argv)
{
pid_t pid;
err_setarg0(argv[0]);
if (argc < 2)
err_usage("cmd [args...]");
if ((pid = fork()) < 0)
err_syserr("fork() failed: ");
else if (pid == 0)
{
/* Child */
execvp(argv[1], &argv[1]);
return(1);
}
else
{
pid_t corpse;
int status;
corpse = waitpid(pid, &status, 0);
if (corpse != pid)
err_syserr("waitpid() failed: ");
printf("0x%04X\n", status);
}
return(0);
}
The missing code, stderr.c and stderr.h, can easily be found in essentially any of my published programs. If you need it urgently, get it from the program SQLCMD at the IIUG Software Archive; alternatively, contact me by email (see my profile).
