When I call MPI_Send or MPI_Recv more than once in a program, the executable hangs on both the worker nodes and the root: as soon as it reaches the second MPI_Send or MPI_Recv, communication blocks, while the binaries keep running at 100% CPU on every machine.
The code runs successfully on Windows 7 64-bit with OpenMPI 1.6.3 64-bit, but the same code does not work on Linux (CentOS 6.3 x86_64 with OpenMPI 1.6.3 64-bit). What am I doing wrong?
The code is posted below:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
MPI::Init();
int rank = MPI::COMM_WORLD.Get_rank();
int size = MPI::COMM_WORLD.Get_size();
char name[256] = { };
int len = 0;
MPI::Get_processor_name(name, len);
printf("Hi I'm %s:%d\n", name, rank);
if (rank == 0)
{
while (size >= 1)
{
int val, stat = 1;
MPI::Status status;
MPI::COMM_WORLD.Recv(&val, 1, MPI::INT, 1, 0, status);
int source = status.Get_source();
printf("%s:%d received %d from %d\n", name, rank, val, source);
MPI::COMM_WORLD.Send(&stat, 1, MPI::INT, 1, 2);
printf("%s:%d sent status %d\n", name, rank, stat);
size--;
}
} else
{
int val = rank + 10;
int stat = 0;
printf("%s:%d sending %d...\n", name, rank, val);
MPI::COMM_WORLD.Send(&val, 1, MPI::INT, 0, 0);
printf("%s:%d sent %d\n", name, rank, val);
MPI::Status status;
MPI::COMM_WORLD.Recv(&stat, 1, MPI::INT, 0, 2, status);
int source = status.Get_source();
printf("%s:%d received status %d from %d\n", name, rank, stat, source);
}
size = MPI::COMM_WORLD.Get_size();
if (rank == 0)
{
while (size >= 1)
{
int val, stat = 1;
MPI::Status status;
MPI::COMM_WORLD.Recv(&val, 1, MPI::INT, 1, 1, status);
int source = status.Get_source();
printf("%s:0 received %d from %d\n", name, val, source);
size--;
}
printf("all workers checked in!\n");
}
else
{
int val = rank + 10 + 5;
printf("%s:%d sending %d...\n", name, rank, val);
MPI::COMM_WORLD.Send(&val, 1, MPI::INT, 0, 1);
printf("%s:%d sent %d\n", name, rank, val);
}
MPI::Finalize();
return 0;
}
Hi Hristo, I have changed the source as you suggested. The updated code is posted below:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv)
{
int iNumProcess = 0, iRank = 0, iNameLen = 0, n;
char szNodeName[MPI_MAX_PROCESSOR_NAME] = {};
MPI_Status stMPIStatus;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &iNumProcess);
MPI_Comm_rank(MPI_COMM_WORLD, &iRank);
MPI_Get_processor_name(szNodeName, &iNameLen);
printf("Hi I'm %s:%d\n", szNodeName, iRank);
if (iRank == 0)
{
int iNode = 1;
while (iNumProcess > 1)
{
int iVal = 0, iStat = 1;
MPI_Recv(&iVal, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &stMPIStatus);
printf("%s:%d received %d\n", szNodeName, iRank, iVal);
MPI_Send(&iStat, 1, MPI_INT, iNode, 1, MPI_COMM_WORLD);
printf("%s:%d sent Status %d\n", szNodeName, iRank, iStat);
MPI_Recv(&iVal, 1, MPI_INT, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, &stMPIStatus);
printf("%s:%d received %d\n", szNodeName, iRank, iVal);
iNumProcess--;
iNode++;
}
printf("all workers checked in!\n");
}
else
{
int iVal = iRank + 10;
int iStat = 0;
printf("%s:%d sending %d...\n", szNodeName, iRank, iVal);
MPI_Send(&iVal, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
printf("%s:%d sent %d\n", szNodeName, iRank, iVal);
MPI_Recv(&iStat, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &stMPIStatus);
printf("%s:%d received status %d\n", szNodeName, iRank, iVal);
iVal = 20;
printf("%s:%d sending %d...\n", szNodeName, iRank, iVal);
MPI_Send(&iVal, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
printf("%s:%d sent %d\n", szNodeName, iRank, iVal);
}
MPI_Finalize();
return 0;
}
I got the following output. At the second send/receive, the root waits forever and the nodes run at 100% CPU utilisation:
Hi I'm N1433:1
N1433:1 sending 11...
Hi I'm N1425:0
N1425:0 received 11
N1425:0 sent Status 1
N1433:1 sent 11
N1433:1 received status 11
N1433:1 sending 20...
Here N1433 and N1425 are machine names. Please help
The code for the master is wrong. It always sends to and awaits messages from the same rank, rank 1, so the program only functions correctly when run as mpiexec -np 2 .... What you probably wanted is to use MPI_ANY_SOURCE as the source rank in the receive and then use the actual source rank, taken from the status, as the destination in the send operation. You also shouldn't use while (size >= 1): rank 0 does not talk to itself, so the number of communications is expected to be one less than size.
if (rank == 0)
{
while (size > 1)
// ^^^^^^^^
{
int val, stat = 1;
MPI::Status status;
MPI::COMM_WORLD.Recv(&val, 1, MPI::INT, MPI_ANY_SOURCE, 0, status);
// Use wildcard source here ------------^^^^^^^^^^^^^^
int source = status.Get_source();
printf("%s:%d received %d from %d\n", name, rank, val, source);
MPI::COMM_WORLD.Send(&stat, 1, MPI::INT, source, 2);
// Send back to the same process --------^^^^^^
printf("%s:%d sent status %d\n", name, rank, stat);
size--;
}
} else
Doing something like this in the worker is pointless:
MPI::Status status;
MPI::COMM_WORLD.Recv(&stat, 1, MPI::INT, 0, 2, status);
// Source rank is fixed here ------------^
int source = status.Get_source();
printf("%s:%d received status %d from %d\n", name, rank, stat, source);
You have already specified rank 0 as the source in the receive operation so it would only be able to receive messages from rank 0. There is no way that status.Get_source() would return any value other than 0, unless some communication error had occurred, in which case an exception would get thrown by MPI::COMM_WORLD.Recv().
The same is also true for the second loop in your code.
By the way, you are using what used to be the official standard C++ bindings. They were deprecated in MPI-2.2, and the latest version of the standard (MPI-3.0) removed them completely, as they are no longer supported by the MPI Forum. You should use the C bindings instead or rely on third-party C++ interfaces such as Boost.MPI.
After installing MPICH2 instead of OpenMPI, it worked successfully. I think there is some problem with using OpenMPI 1.6.3 on my cluster machines.
Is it possible to use select for packet sockets? In other words, does a socket need to be connection-based for the select function to work on it properly?
I'm seeing that the behavior of a socket which I obtained with the following socket function call:
int socket_fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
is not what I expect. For example, when a 60-byte ping packet is received and read into a 20-byte buffer, there is a wait of about a second between successive recv calls. Using recvfrom did not help. So I am asking whether it is correct to use select on a packet socket.
Update
Here is the code, for discussion:
int MakeBridge(const char *if1, const char *if2)
{
int sock[2];
sock[0] = OpenSocket(if1);
sock[1] = OpenSocket(if2);
fd_set readfds;
int activity;
char buf[20];
const int numSocks = 2;
int nfds = sock[0];
for (int i = 1; i < numSocks; i++)
if (sock[i] > nfds)
nfds = sock[i];
nfds++;
while (true)
{
FD_ZERO(&readfds);
for (int i = 0; i < numSocks; i++)
FD_SET(sock[i], &readfds);
activity = select(nfds, &readfds, NULL, NULL, NULL);
if (activity == -1) // sockets closed
break;
for (int i = 0; i < numSocks; i++)
if (FD_ISSET(sock[i], &readfds))
{
int len;
CHECK(ioctl(sock[i], FIONREAD, &len) != -1, "%s", strerror(errno));
printf("socket %d is set\n", i);
printf("total bytes available to read: %d\n", len);
CHECK(len > 0, "");
do
{
int n = min(len, sizeof(buf));
int nbr = recvfrom(sock[i], buf, n, 0, NULL, NULL);
printf("n %d nbr %d\n", n, nbr);
CHECK(n == nbr, "");
len -= n;
} while (len);
}
}
return 0;
}
Update 2
Reading each packet whole, without segmenting it, and using a large enough buffer gives the following code:
// This is a program which is going to make the bridge between two interfaces. brctl is going to be replaced by this program.
#define Uses_CHECK
#define Uses_close
#define Uses_errno
#define Uses_ETH_P_ALL
#define Uses_FD_SET
#define Uses_htons
#define Uses_ifreq
#define Uses_ioctl
#define Uses_printf
#define Uses_signal
#define Uses_sockaddr_ll
#define Uses_socket
#define Uses_strerror
#include <general.dep>
int _sock[2];
int _CtrlCHandler()
{
printf(" terminating...\n");
CHECK(close(_sock[0]) != -1, "");
CHECK(close(_sock[1]) != -1, "");
printf("all sockets closed successfully.\n");
return 0;
}
void CtrlCHandler(int dummy)
{
_CtrlCHandler();
}
int OpenSocket(const char *ifname)
{
// getting socket
int socket_fd = socket(PF_PACKET, SOCK_RAW/*|SOCK_NONBLOCK*/, htons(ETH_P_ALL));
CHECK(socket_fd != -1, "%s", strerror(errno));
// init interface options struct with the interface name
struct ifreq if_options;
memset(&if_options, 0, sizeof(struct ifreq));
strncpy(if_options.ifr_name, ifname, sizeof(if_options.ifr_name) - 1);
if_options.ifr_name[sizeof(if_options.ifr_name) - 1] = 0;
// enable promiscuous mode
CHECK(ioctl(socket_fd, SIOCGIFFLAGS, &if_options) != -1, "%s", strerror(errno));
if_options.ifr_flags |= IFF_PROMISC;
CHECK(ioctl(socket_fd, SIOCSIFFLAGS, &if_options) != -1, "%s", strerror(errno));
// get interface index
CHECK(ioctl(socket_fd, SIOCGIFINDEX, &if_options) != -1, "%s", strerror(errno));
// bind socket to the interface
struct sockaddr_ll my_addr;
memset(&my_addr, 0, sizeof(my_addr));
my_addr.sll_family = AF_PACKET;
my_addr.sll_ifindex = if_options.ifr_ifindex;
CHECK(bind(socket_fd, (struct sockaddr *)&my_addr, sizeof(my_addr)) != -1, "%s", strerror(errno));
// socket is ready
return socket_fd;
}
int MakeBridge(const char *if1, const char *if2)
{
_sock[0] = OpenSocket(if1);
CHECK_NO_MSG(_sock[0]);
_sock[1] = OpenSocket(if2);
CHECK_NO_MSG(_sock[1]);
printf("sockets %d and %d opened successfully\n", _sock[0], _sock[1]);
fd_set readfds, orig;
int activity;
char buf[1<<16];
signal(SIGINT, CtrlCHandler);
int packetNumber = 0;
const int numSocks = _countof(_sock);
int nfds = _sock[0];
for (int i = 1; i < numSocks; i++)
if (_sock[i] > nfds)
nfds = _sock[i];
nfds++;
FD_ZERO(&orig);
for (int i = 0; i < numSocks; i++)
FD_SET(_sock[i], &orig);
while (true)
{
readfds = orig;
activity = select(nfds, &readfds, NULL, NULL, NULL);
if (activity == -1) // sockets closed
break;
CHECK(activity > 0, "");
for (int i = 0; i < numSocks; i++)
if (FD_ISSET(_sock[i], &readfds))
{
int len = recvfrom(_sock[i], buf, sizeof(buf), MSG_TRUNC, NULL, NULL);
CHECK(len > 0, "");
CHECK(len <= sizeof(buf), "small size buffer");
printf("%10d %d %d\n", ++packetNumber, i, len);
CHECK(sendto(_sock[!i], buf, len, 0, NULL, 0) == len, "");
}
}
return 0;
}
int Usage(int argc, const char *argv[])
{
printf("Usage: %s <if1> <if2>\n", argv[0]);
printf("Bridges two interfaces with the names specified.\n");
return 0;
}
int main(int argc, const char *argv[])
{
if (argc != 3)
return Usage(argc, argv);
CHECK_NO_MSG(MakeBridge(argv[1], argv[2]));
return 0;
}
I am starting to learn MPI. Following a tutorial, I wrote two files in C. The first file compiles and runs fine, but when I do the same with the second file, it does not work. And since I hit the error, I can't run either of the files, even after recompiling them.
I could not find a solution to my problem anywhere on the web, including on Stack Overflow. This post was the closest I came across, but it did not provide a solution: Error occurred in MPI_Send on communicator MPI_COMM_WORLD MPI_ERR_RANK:invalid rank
First file :
#include <stdio.h>
#include <mpi.h>
int main (int argc, char **argv){
//Initialize the MPI environment
MPI_Init(NULL, NULL);
//find out rank, size
int world_rank;
MPI_Comm_rank (MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size (MPI_COMM_WORLD, &world_size);
int number;
if (world_rank == 0){
number = -1;
MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
}
else if (world_rank == 1){
MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Process 1 received number %d from process 0\n", number);
}
//Finalise the MPI environment
MPI_Finalize();
}
Second File :
#include <stdio.h>
#include <mpi.h>
int main (int argc, char **argv){
//Initialize the MPI environment
MPI_Init(NULL, NULL);
//find out rank, size
int world_rank;
MPI_Comm_rank (MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size (MPI_COMM_WORLD, &world_size);
int X, Y, Z;
if (world_rank == 0){
scanf("%d", &X);
scanf("%d", &Y);
MPI_Send(&X, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
MPI_Send(&Y, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
MPI_Recv(&Z, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Process 0 received number %d from process 1\n", Z);
}
else if (world_rank == 1){
MPI_Recv(&X, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Recv(&Y, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Z = X + Y;
MPI_Send(&Z, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
//Finalise the MPI environment
MPI_Finalize();
}
ERROR MESSAGES :
[ubuntu:2638] *** An error occurred in MPI_Send
[ubuntu:2638] *** reported by process [135921665,0]
[ubuntu:2638] *** on communicator MPI_COMM_WORLD
[ubuntu:2638] *** MPI_ERR_RANK: invalid rank
[ubuntu:2638] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ubuntu:2638] *** and potentially your MPI job)
UPDATE:
Here are the commands that I used:
mpicc -o 123 file1.c
mpirun 123
This was ok for the first time, but not after
mpicc -o 123 file2.c
mpirun 123
This was where I first encountered the error.
I am working on an MPI I/O problem. Rank 0 reads positions from a parameter file and sends them to ranks 1, 2 and 3. Each of these processes (1, 2, 3) reads text from the input file at the position rank 0 gave it and writes it to a different line of an output file. When I run the program on a single computer, everything is OK. But when I use 2 computers (still 4 processes, ranks 0 and 1 on the server, ranks 2 and 3 on the client), some random lines of the output file go missing! Here is my code:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
//define the message
#define MSG_MISSION_COMPLETE 78
#define MSG_EXIT 79
//define a structural message of MPI
int array_of_blocklengths[3] = { 1, 1, 1 };
MPI_Aint array_of_displacements[3] = { 0, sizeof(float), sizeof(float) + sizeof(int) };
//one type per struct member: float pause, int stand, int offset
MPI_Datatype array_of_types[3] = {MPI_FLOAT, MPI_INT, MPI_INT};
MPI_Datatype location;
int master();
int slave(MPI_File fhr, MPI_File fhw);
int main(int argc, char* argv[])
{
int rank;
MPI_File fhr, fhw;
char read[] = "./sharedReadSample1.txt";
char write[] = "./sharedWriteSample1.txt";
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("%d is speaking\n", rank);
MPI_File_open(MPI_COMM_WORLD, read, MPI_MODE_RDONLY, MPI_INFO_NULL, &fhr);
MPI_File_open(MPI_COMM_WORLD, write, MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fhw);
if (rank == 0)//rank 0, dispatch the tasks
master();
else//other processes
slave(fhr, fhw);
//close the files before finalizing: no MPI call may follow MPI_Finalize
MPI_File_close(&fhr);
MPI_File_close(&fhw);
MPI_Finalize();
printf("%d said byebye\n", rank);
return 0;
}
int master()//master, read the parameters, send them to other slave processes, get the message of task finishing, arrange next task to the slave who completed the task
{
int i, size, firstmsg, nslave;
int buf[256];
struct{
float pause;//pause time
int stand;//starting position in the file
int offset;//offset
}buf_str[10000] = { {0.0,0,0} };
MPI_Comm_size(MPI_COMM_WORLD, &size);
nslave = size - 1;//the number of slaves
FILE* fp;
FILE* fpm;//for log
fp = fopen("sharedAttributeSample1.txt", "rb");
if (fp == NULL)
{
printf("The file was not opened\n");
getchar();
//send a quit message to slaves, use the tag to tell them(>10000)
for (i = 10000; i < 10000 + nslave; i++)
{
buf[0] = MSG_EXIT;
MPI_Send(&buf[0], 1, MPI_INT, i - 10000 + 1, i, MPI_COMM_WORLD);
}
return 0;
}
else
printf("The file was opened\n");
fpm = fopen("./logs/log_master.txt","wb");
if (fpm == NULL)
printf("master log system failed to load!\n");
for (i = 0; i < 10000;i++)
{
fscanf(fp,"%f,%d,%d", &buf_str[i].pause, &buf_str[i].stand, &buf_str[i].offset);
}
MPI_Status status;
//MPI_Type_struct was removed in MPI-3; MPI_Type_create_struct takes the same arguments
MPI_Type_create_struct(3, array_of_blocklengths, array_of_displacements, array_of_types, &location);
MPI_Type_commit(&location);
for (i = 0; i < nslave; i++)
{
MPI_Send(&buf_str[i], 1, location, i+1, i, MPI_COMM_WORLD);
fprintf(fpm, "initial message %d sent\n",i);
}
for (i = nslave; i < 10000; i++)
{
MPI_Recv(buf, 256, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);//receive messages from slaves
fprintf(fpm, "task %d complete massage received\n",status.MPI_TAG);
if (buf[0] == MSG_MISSION_COMPLETE)//send next task
{
firstmsg = status.MPI_SOURCE;
fprintf(fpm, "task %d is sent to %d \n", i, firstmsg);
MPI_Send(&buf_str[i], 1, location, firstmsg, i, MPI_COMM_WORLD);
}
}
for (i = 10000; i < 10000+nslave; i++)//send quitting message
{
buf[0] = MSG_EXIT;
MPI_Send(&buf_str[0], 1, location, i-10000+1, i, MPI_COMM_WORLD);
}
fclose(fp);
fclose(fpm);
return 0;
}
int slave(MPI_File fhr, MPI_File fhw)
{
struct{
float pause;
int stand;
int offset;
}buf_str;
char buf[256];
int buf_s[256];
int rank, size, nslave, i=0;
char name[30];
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
nslave = size - 1;
FILE* fps[nslave];
//open their own logging pointers
for(i=0;i<nslave;i++)
{
if(i == rank-1)
{
sprintf(name,"./logs/logfile_slave%d",i+1);
fps[i] = fopen(name, "w");
if(fps[i] == NULL)
printf("failed to open logfile of slave %d\n", i+1);
break;
}
}
MPI_Status status;
MPI_Status status_read;
MPI_Status status_write;
//MPI_Type_struct was removed in MPI-3; MPI_Type_create_struct takes the same arguments
MPI_Type_create_struct(3, array_of_blocklengths, array_of_displacements, array_of_types, &location);
MPI_Type_commit(&location);
while (1)
{
//receive the message from master
MPI_Recv(&buf_str, 1, location, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
fprintf(fps[i], "process %d message %d received\n",rank,status.MPI_TAG);
if (status.MPI_TAG < 10000){//if it is a task
sleep(buf_str.pause);//sleep, to simulate a computing process
fprintf(fps[i], "process %d sleep for %f seconds\n", rank, buf_str.pause);
//read from the position given
MPI_File_read_at(fhr, buf_str.stand, buf, buf_str.offset, MPI_CHAR, &status_read);
buf[buf_str.offset] = '\n';//need a \n
MPI_File_write_at(fhw, status.MPI_TAG*(buf_str.offset+1), buf, buf_str.offset+1, MPI_CHAR, &status_write);
fprintf(fps[i], "%d has done task %d\n", rank, status.MPI_TAG);
//send task complete message to master
buf_s[0] = MSG_MISSION_COMPLETE;
MPI_Send(&buf_s, 1, MPI_INT, 0, status.MPI_TAG, MPI_COMM_WORLD);
}
else
break;
}
fclose(fps[i]);
return 0;
}
I am using MPI_Probe to send messages dynamically (where the receiver does not know the size of the message being sent). My code looks somewhat like this -
if (world_rank == 0) {
int *buffer = ...
int bufferSize = ...
MPI_Send(buffer, bufferSize, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
MPI_Status status;
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
int count = -1;
MPI_Get_count(&status, MPI_INT, &count);
int* buffer = (int*)malloc(sizeof(int) * count);
MPI_Recv(buffer, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
If I run this code in multiple threads, is there a chance that MPI_Probe gets called in one thread and MPI_Recv gets called in another thread because the scheduler interleaves the threads? In essence, is the above code thread-safe?
First of all, MPI is not thread-safe by default. You'll have to check whether your particular library has been compiled with thread support and then initialize MPI using MPI_Init_thread instead of MPI_Init.
Supposing that your MPI instance is initialized for thread-safe routines, your code is still not thread-safe because of the race condition you already identified.
The pairing of MPI_Probe and MPI_Recv in a multi-threaded environment is not thread-safe; this is a known problem in MPI-2: http://htor.inf.ethz.ch/publications/img/gregor-any_size-mpi3.pdf
There are at least two possible solutions. You can either use the MPI-3 routines MPI_Mprobe and MPI_Mrecv, or put a lock/mutex around the critical code. This could look as follows:
MPI-2 solution (using a mutex/lock):
int number_amount;
if (world_rank == 0) {
int *buffer = ...
int bufferSize = ...
MPI_Send(buffer, bufferSize, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
MPI_Status status;
int count = -1;
/* acquire mutex/lock */
MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &count);
int* buffer = (int*)malloc(sizeof(int) * count);
MPI_Recv(buffer, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* release mutex/lock */
}
MPI-3 solution:
int number_amount;
if (world_rank == 0) {
int *buffer = ...
int bufferSize = ...
MPI_Send(buffer, bufferSize, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
MPI_Status status;
MPI_Message msg;
int count = -1;
MPI_Mprobe(0, 0, MPI_COMM_WORLD, &msg, &status);
MPI_Get_count(&status, MPI_INT, &count);
int* buffer = (int*)malloc(sizeof(int) * count);
MPI_Mrecv(buffer, count, MPI_INT, &msg, MPI_STATUS_IGNORE);
}
We are using Linux 2.6.28 and a 2 Gb NAND flash in our system. After a number of power-cycle tests, we observe the following errors:
Volume operational found at volume id 3
read 21966848 bytes from volume 3 to 80400000(buf address)
UBI error: ubi_io_read: error -77 while reading 126976 bytes from PEB 1074:4096, read 126976 bytes
UBI: force data checking
UBI error: ubi_io_read: error -77 while reading 126976 bytes from PEB 1074:4096, read 126976 bytes
UBI warning: ubi_eba_read_leb: CRC error: calculated 0xa7cab743, must be 0x15716fce
read err ffffffb3
These do not appear to be hardware errors: if we remove the offending partition, the hardware boots fine. Maybe UBIFS is not correcting the bad UBI block.
Have any UBI patches been added to the latest kernels to address this issue? Thanks.
The error printed is a UBI error. Let's look at the source near line 177,
ubi_err("error %d while reading %d bytes from PEB %d:%d, "
"read %zd bytes", err, len, pnum, offset, read);
So, error '-77' (normally -EBADFD) was returned from the NAND flash driver when trying to read 'physical erase block' #1074 at offset 4096 (page index 2 for 2 KiB pages). UBI includes volume management pages, which are typically located at the beginning of a physical erase block (PEB for short).
Note that the latest mainline of io.c has the following comment and code,
/*
* Deliberately corrupt the buffer to improve robustness. Indeed, if we
* do not do this, the following may happen:
* 1. The buffer contains data from previous operation, e.g., read from
* another PEB previously. The data looks like expected, e.g., if we
* just do not read anything and return - the caller would not
* notice this. E.g., if we are reading a VID header, the buffer may
* contain a valid VID header from another PEB.
* 2. The driver is buggy and returns us success or -EBADMSG or
* -EUCLEAN, but it does not actually put any data to the buffer.
*
* This may confuse UBI or upper layers - they may think the buffer
* contains valid data while in fact it is just old data. This is
* especially possible because UBI (and UBIFS) relies on CRC, and
* treats data as correct even in case of ECC errors if the CRC is
* correct.
*
* Try to prevent this situation by changing the first byte of the
* buffer.
*/
*((uint8_t *)buf) ^= 0xFF;
The following code can be used to process a UBI/UbiFS dump and look for abnormalities,
/* -*- mode: c; compile-command: "gcc -Wall -g -o parse_ubi parse_ubi.c"; -*- */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <endian.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#define __packed __attribute__((packed))
#include "ubi-media.h"
#define bswap16 be16toh
#define bswap32 be32toh
#define bswap64 be64toh
static int dump_vid = 0;
#define CRCPOLY_LE 0xedb88320
static unsigned int crc32(unsigned int crc, void const *_p, size_t len)
{
unsigned char const *p = _p;
int i;
while (len--) {
crc ^= *p++;
for (i = 0; i < 8; i++)
crc = (crc >> 1) ^ ((crc & 1) ? CRCPOLY_LE : 0);
}
return crc;
}
#define ALEN(a) (sizeof(a)/sizeof(a[0]))
static void print_ec(struct ubi_ec_hdr *ec)
{
if(ec->version != UBI_VERSION || ec->magic != UBI_EC_HDR_MAGIC) {
printf(" Magic: %x\n", ec->magic);
printf(" Version: %d\n", (int)ec->version);
printf(" EC: %llx\n", ec->ec);
printf(" VID offset: %x\n", ec->vid_hdr_offset);
printf(" Data offset: %x\n", ec->data_offset);
printf(" Image seq: %x\n", ec->image_seq);
exit(-1);
}
}
static void read_ec(int fd, struct ubi_ec_hdr *ec)
{
int rval = read(fd, ec,sizeof(*ec));
if(rval == sizeof(*ec)) {
unsigned int crc;
crc = crc32(UBI_CRC32_INIT, ec, UBI_EC_HDR_SIZE_CRC);
ec->magic = bswap32(ec->magic);
ec->vid_hdr_offset = bswap32(ec->vid_hdr_offset);
ec->data_offset = bswap32(ec->data_offset);
ec->image_seq = bswap32(ec->image_seq);
ec->hdr_crc = bswap32(ec->hdr_crc);
ec->ec = bswap64(ec->ec);
if(crc != ec->hdr_crc)
printf("EC CRC: %x/%x\n", crc, ec->hdr_crc);
} else
memset(ec, 0, sizeof(*ec));
}
static void print_vid(int vid_num, struct ubi_vid_hdr *vid)
{
if(vid->magic != UBI_VID_HDR_MAGIC)
printf(" Magic: %x\n", vid->magic);
if(vid->version != UBI_VERSION)
printf(" Version: %d\n", (int)vid->version);
if(!dump_vid) return;
printf("VID %d\n", vid_num);
/* This is usually the same. */
if(vid->vol_id >= UBI_INTERNAL_VOL_START)
printf("Internal vol_id: %d\n", vid->vol_id - UBI_INTERNAL_VOL_START);
if(vid->vol_type != UBI_VID_DYNAMIC)
printf(" vol_type: %s\n",
vid->vol_type == UBI_VID_DYNAMIC ? "dynamic" : "static");
if(vid->used_ebs)
printf(" used_ebs: %d\n", vid->used_ebs);
if(vid->data_pad)
printf(" data_pad: %d\n", vid->data_pad);
if((vid->copy_flag != 1 && vid->data_size) ||
(vid->copy_flag == 0 && vid->data_size))
printf(" copy_flag: %d\n", (int)vid->copy_flag);
printf(" lnum: %d\n", vid->lnum);
if(vid->compat) {
const char *compat[] = {
[UBI_COMPAT_DELETE] = "delete",
[UBI_COMPAT_RO] = "ro",
[UBI_COMPAT_PRESERVE] = "preserve",
[UBI_COMPAT_REJECT] = "reject"
};
printf(" compat: %s\n", compat[vid->compat]);
}
printf(" data_size: %d\n", vid->data_size);
/* printf(" data_crc: %x\n", vid->data_crc); */
printf(" hdr_crc: %x\n", vid->hdr_crc);
printf(" sqnum: %lld\n", vid->sqnum);
}
static int read_vid(int fd, struct ubi_vid_hdr *vid)
{
int rval = read(fd, vid,sizeof(*vid));
if(rval == sizeof(*vid)) {
unsigned int crc;
crc = crc32(UBI_CRC32_INIT, vid, UBI_VID_HDR_SIZE_CRC);
vid->magic = bswap32(vid->magic);
vid->vol_id = bswap32(vid->vol_id);
vid->lnum = bswap32(vid->lnum);
vid->data_size = bswap32(vid->data_size);
vid->used_ebs = bswap32(vid->used_ebs);
vid->data_pad = bswap32(vid->data_pad);
vid->data_crc = bswap32(vid->data_crc);
vid->hdr_crc = bswap32(vid->hdr_crc);
vid->sqnum = bswap64(vid->sqnum);
if(crc != vid->hdr_crc && vid->magic == UBI_VID_HDR_MAGIC)
printf("VID CRC: %x/%x\n", crc, vid->hdr_crc);
} else
memset(vid, 0, sizeof(*vid));
return rval;
}
static void print_vtbl(struct ubi_vtbl_record *vtbl)
{
printf(" Found vtbl [%d] %s\n", vtbl->name_len, vtbl->name);
printf(" Reserved PEBs: %d\n", vtbl->reserved_pebs);
printf(" Align: %d\n", vtbl->alignment);
printf(" Pad: %d\n", vtbl->data_pad);
if(vtbl->vol_type != UBI_VID_DYNAMIC)
printf(" vol_type: %s\n",
vtbl->vol_type == UBI_VID_DYNAMIC ? "dynamic" : "static");
printf(" Update: %d\n", vtbl->upd_marker);
printf(" Flags: %d\n", (int)vtbl->flags);
}
static void read_vtbl(int fd, struct ubi_vtbl_record *vtbl)
{
int rval = read(fd, vtbl, sizeof(*vtbl));
if(rval == sizeof(*vtbl)) {
vtbl->reserved_pebs = bswap32(vtbl->reserved_pebs);
vtbl->alignment = bswap32(vtbl->alignment);
vtbl->data_pad = bswap32(vtbl->data_pad);
vtbl->crc = bswap32(vtbl->crc);
vtbl->name_len = bswap16(vtbl->name_len);
} else
memset(vtbl, 0, sizeof(*vtbl));
}
static void print_fm_sb(struct ubi_fm_sb *fm_sb)
{
int i;
if(fm_sb->magic != UBI_FM_SB_MAGIC)
printf(" Magic: %x\n", fm_sb->magic);
if(fm_sb->version != UBI_VERSION)
printf(" Version: %d\n", (int)fm_sb->version);
printf(" data_crc: %x\n", fm_sb->data_crc);
printf(" used_blocks: %x\n", fm_sb->used_blocks);
for(i = 0; i < fm_sb->used_blocks; i++)
printf(" block_loc[%d]: %d\n", i, fm_sb->block_loc[i]);
for(i=0; i < fm_sb->used_blocks; i++)
printf(" block_ec[%d]: %d\n", i, fm_sb->block_ec[i]);
printf(" sqnum: %lld\n", fm_sb->sqnum);
}
static void read_fm_sb(int fd, struct ubi_fm_sb *fm_sb)
{
int rval = read(fd, fm_sb, sizeof(*fm_sb));
if(rval == sizeof(*fm_sb)) {
int i;
fm_sb->magic = bswap32(fm_sb->magic);
fm_sb->data_crc = bswap32(fm_sb->data_crc);
fm_sb->used_blocks = bswap32(fm_sb->used_blocks);
for(i=0; i < UBI_FM_MAX_BLOCKS; i++)
fm_sb->block_loc[i] = bswap32(fm_sb->block_loc[i]);
for(i=0; i < UBI_FM_MAX_BLOCKS; i++)
fm_sb->block_ec[i] = bswap32(fm_sb->block_ec[i]);
fm_sb->sqnum = bswap64(fm_sb->sqnum);
} else
memset(fm_sb, 0, sizeof(*fm_sb));
}
/* Set logical block at physical. */
static int eba_map[1920];
static int pba_map[1920];
static void usage(char *name)
{
printf("Usage: %s -b [erase block size] -e -v <ubi file> \n", name);
printf("Where,\n -e is dump the logic to physical block map.\n");
printf(" -v is dump the VID headers.\n");
printf(" -b [size] sets the erase block size (flash dependent).\n");
}
typedef struct fastmap {
struct ubi_fm_sb fm_sb;
struct ubi_fm_hdr hdr;
struct ubi_fm_scan_pool pool1;
struct ubi_fm_scan_pool pool2;
/* Free, Used, Scrub and Erase */
struct ubi_fm_ec ec[0];
/* ... */
/* struct ubi_fm_volhdr vol; */
/* struct ubi_fm_eba eba[0]; */
} fastmap;
int main (int argc, char *argv[])
{
int fd, i, erase_block = 0, eba_flag = 0;
int c;
struct ubi_ec_hdr ec;
struct ubi_vid_hdr vid;
int erase_size = 0x20000;
int leb_size;
off_t cur_ec = 0;
int vidless_blocks = 0;
while ((c = getopt (argc, argv, "hveb:")) != -1)
switch (c)
{
case 'h': /* Help */
usage(argv[0]);
goto out;
case 'b':
erase_size = atoi(optarg);
break;
case 'e':
eba_flag = 1;
break;
case 'v':
dump_vid = 1;
break;
case '?':
if (optopt == 'b')
fprintf (stderr, "Option -%c requires an argument.\n", optopt);
else if (isprint (optopt))
fprintf (stderr, "Unknown option `-%c'.\n", optopt);
else
fprintf (stderr,
"Unknown option character `\\x%x'.\n",
optopt);
return 1;
default:
goto out;
}
if(optind >= argc) {
usage(argv[0]);
goto out;
}
fd = open(argv[optind], O_RDONLY);
if(fd < 0) {
printf("Bad file: %s\n", argv[optind]);
goto out;
}
memset(eba_map, -1, sizeof(eba_map));
memset(pba_map, -1, sizeof(pba_map));
/* Process each 'erase block'. */
read_ec(fd,&ec);
while(ec.magic == UBI_EC_HDR_MAGIC) {
leb_size = erase_size - ec.data_offset;
print_ec(&ec);
/* VID present? */
if(lseek(fd, ec.vid_hdr_offset-sizeof(ec), SEEK_CUR) == -1) {
printf("Seek error: %s\n", argv[1]);
goto out;
}
if(read_vid(fd,&vid) != sizeof(vid)) {
printf("File too small: %s\n", argv[1]);
goto out;
}
if(vid.magic == UBI_VID_HDR_MAGIC) {
print_vid(erase_block, &vid);
if(vid.vol_id == 3) {
if(eba_map[vid.lnum] != -1)
printf("EBA dup: %d %d\n", eba_map[vid.lnum], erase_block);
eba_map[vid.lnum] = erase_block;
}
pba_map[erase_block] = vid.lnum;
/* Read volume table. */
if(vid.vol_id == UBI_INTERNAL_VOL_START) {
/* Seek to PEB data offset. */
if(lseek(fd,
ec.data_offset - ec.vid_hdr_offset - sizeof(vid),
SEEK_CUR) == -1)
printf("Seek error: %s\n", argv[1]);
else {
int i;
struct ubi_vtbl_record vtbl;
for(i = 0; i < UBI_MAX_VOLUMES; i++) {
read_vtbl(fd, &vtbl);
if(vtbl.reserved_pebs ||
vtbl.name_len ||
strcmp((char*)vtbl.name, "") != 0) {
printf("VTBL %d\n", i);
print_vtbl(&vtbl);
}
}
}
} else if(vid.vol_id == UBI_FM_SB_VOLUME_ID) {
printf("Found Fastmap super block #PEB %d.\n", erase_block);
if(lseek(fd,
ec.data_offset - ec.vid_hdr_offset - sizeof(vid),
SEEK_CUR) == -1)
printf("Seek error: %s\n", argv[1]);
else {
void *data = alloca(leb_size);
struct ubi_fm_sb *fm_sb = data;
read_fm_sb(fd, data);
print_fm_sb(fm_sb);
}
} else if(vid.vol_id == UBI_FM_DATA_VOLUME_ID) {
printf("Found Fastmap data block #PEB %d.\n", erase_block);
printf("UNSUPPORTED!!!\n");
}
} else if(vid.magic != 0xffffffff){
printf("VID %d corrupt! %x\n", erase_block, vid.magic);
} else {
vidless_blocks++;
}
erase_block++;
cur_ec += erase_size;
cur_ec = lseek(fd, cur_ec, SEEK_SET);
/* Process Erase counter. */
read_ec(fd,&ec);
}
printf("Found %d vidless (free) blocks.\n", vidless_blocks);
if(eba_flag) {
printf("Logical to physical.\n");
for(i = 0; i < ALEN(eba_map); i+=16) /* 16 entries are printed per row */
printf("%4d: %4d %4d %4d %4d %4d %4d %4d %4d"
" %4d %4d %4d %4d %4d %4d %4d %4d\n", i,
eba_map[i], eba_map[i+1],
eba_map[i+2], eba_map[i+3],
eba_map[i+4], eba_map[i+5],
eba_map[i+6], eba_map[i+7],
eba_map[i+8], eba_map[i+9],
eba_map[i+10], eba_map[i+11],
eba_map[i+12], eba_map[i+13],
eba_map[i+14], eba_map[i+15]);
printf("Physical to logical.\n");
for(i = 0; i < ALEN(pba_map); i+=16) /* 16 entries are printed per row */
printf("%4d: %4d %4d %4d %4d %4d %4d %4d %4d"
" %4d %4d %4d %4d %4d %4d %4d %4d\n", i,
pba_map[i], pba_map[i+1],
pba_map[i+2], pba_map[i+3],
pba_map[i+4], pba_map[i+5],
pba_map[i+6], pba_map[i+7],
pba_map[i+8], pba_map[i+9],
pba_map[i+10], pba_map[i+11],
pba_map[i+12], pba_map[i+13],
pba_map[i+14], pba_map[i+15]);
}
out:
return 0;
}
To build, copy ubi-media.h from the UBI directory and run gcc -Wall -g -o parse_ubi parse_ubi.c. The code probably has issues on big-endian platforms; it is also not tested with 2.6.28, but I believe it should work as the UBI structures shouldn't change. You may have to remove some fastmap code if it doesn't compile. The code should give some indication of what is wrong with PEB#1074. Make a copy of the partition when it fails and use the code above to analyze the UBI layer.
It is quite possible that the MTD driver does something abnormal which prevents UBI from attaching to an MTD partition. This in turn prevents UbiFS from mounting. If you know which MTD NAND flash controller is being used, it would help others determine where the issue is.
It can be caused by MTD bugs and/or hardware bugs, or by UBI/UbiFS issues. If it is UBI/UbiFS, there are backport trees for 2.6.32 and for the newer 3.0. You can try to take the patches from the 2.6.32 tree; after applying all of them, add those from 3.0.
Again, the issue can be the MTD driver. Grab the MTD changes for your particular CPU/SoC's NAND flash controller. I do this from the mainline; some changes are bug fixes and others are infrastructure. You have to look at each patch individually.