I'm writing a C program for Linux.
The program can start a timer: both the main program and the timer callback can send and receive characters on a serial port.
I am trying to serialize access to the serial port with a mutex kept in a global structure, initialized when the port is opened:
if (pthread_mutex_init( &pED->lockSerial, NULL) != 0)
{
lwsl_err("lockSerial init failed\n");
}
I protected all the functions that send data on the port as follows:
ssize_t cmdFirmwareVersion(EngineData *pED)
{
    if (pED->fdSerialPort == -1)
        return -1;

    LOCK_SERIAL;
    unsigned char cmd[] = { 0x00, 0x00, 0x7F };
    write(pED->fdSerialPort, cmd, sizeof(cmd));
    int rx = read(pED->fdSerialPort, rxbuffer, sizeof rxbuffer);  /* rxbuffer: receive buffer declared elsewhere */
    dump(rxbuffer, rx);
    UNLOCK_SERIAL;
    return rx;
}
where
#define LOCK_SERIAL if (0!=pthread_mutex_lock(&pED->lockSerial)) {printf("Err lock");return 0;}
#define UNLOCK_SERIAL pthread_mutex_unlock(&pED->lockSerial);
Running the program and starting the timer, I see the requests go out regularly. But when I trigger one of these calls another way (from an RX websocket callback), the program hangs and I have to kill it.
Why does the entire program stop?
If a process hangs, it could be because of a circular wait on mutexes, or because a thread holds a mutex and tries to lock it again. Either causes a deadlock.
ps will show such a thread's state as D or S while it waits for the resource, making the process appear hung:
D uninterruptible sleep (usually IO)
S interruptible sleep (waiting for an event to complete)
To reproduce this, I made a thread hold a mutex and then try to lock it again.
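A minimal reproduction along those lines (a sketch, assuming the default non-recursive mutex type) could look like this:

#include <stdio.h>
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *child(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&m);   /* first lock succeeds */
    pthread_mutex_lock(&m);   /* default mutex is non-recursive: blocks forever */
    printf("never reached\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, child, NULL);
    pthread_join(t, NULL);    /* main thread sleeps inside pthread_join */
    return 0;
}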
ps and GDB show that both the main thread and the child thread are sleeping.
xxxx#virtualBox:~$ ps -eflT |grep a.out
0 S root 3982 3982 2265 0 80 0 - 22155 - 20:28 pts/0 00:00:00 ./a.out
1 S root 3982 3984 2265 0 80 0 - 22155 - 20:28 pts/0 00:00:00 ./a.out
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7fdf740 (LWP 4625) "a.out" 0x00007ffff7bbed2d in __GI___pthread_timedjoin_ex (
threadid=140737345505024, thread_return=0x0, abstime=0x0, block= <optimized out>) at pthread_join_common.c:89
2 Thread 0x7ffff77c4700 (LWP 4629) "a.out" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
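As a side note, one way to catch such a self-deadlock at runtime is an error-checking mutex, which makes the second lock fail with EDEADLK instead of hanging. A minimal sketch:

#include <stdio.h>
#include <string.h>
#include <pthread.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);
    int rc = pthread_mutex_lock(&m);            /* relock from the same thread */
    printf("second lock: %s\n", strerror(rc));  /* EDEADLK: "Resource deadlock avoided" */
    return 0;
}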
Before Sierra, I used to be able to initialize GLUT in the child process after forking the original process. With the latest version of Sierra this seems to have changed: the following program crashes. If I instead move all the GLUT calls to the parent process, everything works. Why is there a difference between using the parent and the child process?
#include <stdlib.h>
#include <unistd.h>   /* fork(), sleep() */
#include <GLUT/glut.h>

void pass(void) {
}

int main(int argc, char* argv[]) {
    pid_t childpid;

    childpid = fork();
    if (childpid == 0) {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB | GLUT_DEPTH);
        glutInitWindowSize(100, 100);
        glutCreateWindow("test");
        glutDisplayFunc(pass);
        glGetError();
        glutMainLoop();
    } else {
        sleep(5);
    }
    exit(1);
}
The crash I get:
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Illegal instruction: 4
Termination Reason: Namespace SIGNAL, Code 0x4
Terminating Process: exc handler [0]
Application Specific Information:
BUG IN CLIENT OF LIBDISPATCH: _dispatch_main_queue_callback_4CF called from the wrong thread
crashed on child side of fork pre-exec
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libdispatch.dylib 0x00007fffe8e7bd21 _dispatch_main_queue_callback_4CF + 1291
1 com.apple.CoreFoundation 0x00007fffd3c7bbe9 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__ + 9
2 com.apple.CoreFoundation 0x00007fffd3c3d00d __CFRunLoopRun + 2205
3 com.apple.CoreFoundation 0x00007fffd3c3c514 CFRunLoopRunSpecific + 420
4 com.apple.Foundation 0x00007fffd57e1c9b -[NSRunLoop(NSRunLoop) limitDateForMode:] + 196
5 com.apple.glut 0x0000000104f39e93 -[GLUTApplication run] + 321
6 com.apple.glut 0x0000000104f46b0e glutMainLoop + 279
7 a.out 0x0000000104f24ed9 main + 121 (main.c:18)
8 libdyld.dylib 0x00007fffe8ea4255 start + 1
fork() does not create a thread, it forks the process. After calling fork() you have two processes with nearly identical address space contents, but their address spaces are protected from each other. The general rule regarding fork() is that the only sensible thing to do in the child after a fork is to replace the process image with execve(); doing anything else requires a lot of foresight in the program's design.
Threads are created in a different way. I suggest you use the actual threading primitives offered by your programming language of choice.
That being said, many OSs and GUI libraries want a process's GUI parts to run in the main thread, so that could be part of the reason as well. Also be aware that OpenGL combined with multithreading is a little bit finicky.
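To illustrate the fork-then-exec rule, here is a minimal sketch, assuming a hypothetical helper binary ./glut_child that contains all the GLUT code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* child: replace the process image before touching any GUI state */
        execl("./glut_child", "glut_child", (char *)NULL);
        perror("execl");   /* only reached if exec fails */
        _exit(127);
    }
    sleep(5);              /* parent continues with its own work */
    return 0;
}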
I'm running a distributed process under Open MPI on Linux.
When one process dies, mpirun detects this and kills the other processes. But even though I get a core file from the process that died, I don't get core files from the processes killed by Open MPI.
Why am I not getting these other core files? How can I get them?
The other processes are just killed by Open MPI. They didn't segfault themselves. From an MPI perspective, their execution was erroneous, but from a C perspective, it was fine. As such, there's no reason for them to have dumped core.
OMPI's mpiexec kills the remaining ranks by first sending them SIGTERM and then SIGKILL (should any of them survive SIGTERM). None of those signals results in core being dumped. You could probably install a signal handler for SIGTERM that calls abort(3) in order to force core dumps on kill.
Here is some sample code that works with Open MPI 1.6.5:

#include <stdlib.h>
#include <signal.h>
#include <mpi.h>

void term_handler (int sig) {
    // Restore the default SIGABRT disposition
    signal(SIGABRT, SIG_DFL);
    // Abort (dumps core)
    abort();
}

int main (int argc, char **argv) {
    int rank;

    MPI_Init(&argc, &argv);

    // Override the SIGTERM handler AFTER the call to MPI_Init
    signal(SIGTERM, term_handler);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Cause a division-by-zero exception in rank 0
    rank = 1 / rank;

    // Make the other ranks wait for rank 0
    MPI_Bcast(&rank, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
Open MPI's MPI_Init installs special handlers for some known signals that either print useful debug information or generate backtrace files (.btr). That's why the SIGTERM handler has to be installed after the call to MPI_Init and the default action of SIGABRT (used by abort(3)) has to be restored before calling abort().
Note that the signal handler will appear at the top of the call stack in the core file:
(gdb) bt
#0 0x0000003bfd232925 in raise () from /lib64/libc.so.6
#1 0x0000003bfd234105 in abort () from /lib64/libc.so.6
#2 0x0000000000400dac in term_handler (sig=15) at test.c:8
#3 <signal handler called>
#4 0x00007fbac7ad0bc7 in mca_btl_sm_component_progress () from /path/libmpi.so.1
#5 0x00007fbac7c9fca7 in opal_progress () from /path/libmpi.so.1
...
I would rather recommend using a parallel debugger such as TotalView or DDT, if you have one at your disposal.
I am running my program as a daemon.
The parent process just waits for the child process; when the child dies unexpectedly, it forks and waits again.
for (; 1;) {
    if (fork() == 0) break;

    int sig = 0;
    for (; 1; usleep(10000)) {
        pid_t wpid = waitpid(g->pid[1], &sig, WNOHANG);
        if (wpid > 0) break;
        if (wpid < 0) print("wait error: %s\n", strerror(errno));
    }
}
But when the child process is killed with signal -9 (SIGKILL), it becomes a zombie.
waitpid should return the pid of the child process immediately!
But waitpid only returned the pid after about 90 seconds:
cube 28139 0.0 0.0 70576 900 ? Ss 04:24 0:07 ./daemon -d
cube 28140 9.3 0.0 0 0 ? Zl 04:24 106:19 [daemon] <defunct>
Here is the strace of the parent.
The parent is not stuck; wait4 is called continuously:
strace -p 28139
Process 28139 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
wait4(28140, 0x7fff08a2681c, WNOHANG, NULL) = 0
nanosleep({0, 10000000}, NULL) = 0
wait4(28140, 0x7fff08a2681c, WNOHANG, NULL) = 0
About 90 seconds later the parent got SIGCHLD and wait4 returned the pid of the dead child.
--- SIGCHLD (Child exited) # 0 (0) ---
restart_syscall(<... resuming interrupted call ...>) = 0
wait4(28140, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], WNOHANG, NULL) = 28140
Why doesn't the child process exit immediately? Instead, it unexpectedly turns into a zombie.
I finally found some fd leaks while tracing deeply with lsof.
After the fd leaks were fixed, the problem was gone.
You could simply use

for (;;) {
    pid_t wpid = waitpid(-1, &sig, 0);
    if (wpid > 0) break;
    if (wpid < 0) print("wait error: %s\n", strerror(errno));
}

instead of sleeping for a while and trying again.
It looks to me like waitpid is not returning the child pid immediately simply because that process is not available yet.
Furthermore, it looks like you actually want your code to do this, because you call waitpid() with the WNOHANG option, which prevents blocking, essentially allowing the parent to move on if the child pid is not available.
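A minimal sketch (hypothetical, not the asker's code) of the three possible outcomes of a WNOHANG wait:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        sleep(2);              /* child: pretend to do some work */
        _exit(0);
    }

    for (;;) {
        int status;
        pid_t wpid = waitpid(child, &status, WNOHANG);
        if (wpid == 0) {
            /* child still running: WNOHANG makes waitpid return at once */
            printf("child not done yet\n");
            sleep(1);
        } else if (wpid == child) {
            /* child changed state and has now been reaped (no zombie left) */
            printf("child reaped, exited normally: %d\n", WIFEXITED(status));
            break;
        } else {
            perror("waitpid");  /* wpid == -1, e.g. ECHILD */
            break;
        }
    }
    return 0;
}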
Maybe your process is using something you didn't expect? Can you trace its activity to see if you find the bottleneck?
Here is a pretty useful link that might help you:
http://infohost.nmt.edu/~eweiss/222_book/222_book/0201433079/ch08lev1sec6.html
In my test program, I start two threads, and each of them just does the following logic:
1) pthread_mutex_lock()
2) sleep(1)
3) pthread_mutex_unlock()
However, I find that after some time, one of the two threads blocks on pthread_mutex_lock() forever, while the other thread works normally. This is very strange behavior, and I think it may be a potentially serious issue. According to the Linux manual, sleep() is not prohibited while a pthread_mutex_t is held. So my question is: is this a real problem, or is there a bug in my code?
The following is the test program. In the code, the first thread's output is directed to stdout, while the second's is directed to stderr, so we can compare the two outputs to see whether a thread is blocked.
I have tested it on Linux kernels 2.6.31 and 2.6.9; the results are the same.
//======================= Test Program ===========================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strcmp() */
#include <unistd.h>   /* sleep() */
#include <errno.h>
#include <pthread.h>

#define THREAD_NUM 2

static int data[THREAD_NUM];
static int sleepFlag = 1;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void * threadFunc(void *arg)
{
    int* idx = (int*) arg;
    FILE* fd = NULL;

    if (*idx == 0)
        fd = stdout;
    else
        fd = stderr;

    while (1) {
        fprintf(fd, "\n[%d]Before pthread_mutex_lock is called\n", *idx);
        if (pthread_mutex_lock(&mutex) != 0) {
            exit(1);
        }
        fprintf(fd, "[%d]pthread_mutex_lock is finished. Sleep some time\n", *idx);
        if (sleepFlag == 1)
            sleep(1);
        fprintf(fd, "[%d]sleep done\n\n", *idx);
        fprintf(fd, "[%d]Before pthread_mutex_unlock is called\n", *idx);
        if (pthread_mutex_unlock(&mutex) != 0) {
            exit(1);
        }
        fprintf(fd, "[%d]pthread_mutex_unlock is finished.\n", *idx);
    }
}

// 1. compile
//    gcc -o pthread pthread.c -lpthread
// 2. run
//    1) ./pthread sleep 2> /tmp/error.log    # Each thread sleeps 1 second after it acquires the mutex
//       ==> We can find that /tmp/error.log does not grow.
//    or
//    2) ./pthread nosleep 2> /tmp/error.log  # No sleep after the mutex is acquired
//       ==> We can find that both stdout and /tmp/error.log grow.
int main(int argc, char *argv[]) {
    if ((argc == 2) && (strcmp(argv[1], "nosleep") == 0))
    {
        sleepFlag = 0;
    }

    pthread_t t[THREAD_NUM];
    int i;

    for (i = 0; i < THREAD_NUM; i++) {
        data[i] = i;
        int ret = pthread_create(&t[i], NULL, threadFunc, &data[i]);
        if (ret != 0) {
            perror("pthread_create error");
            exit(-1);
        }
    }
    for (i = 0; i < THREAD_NUM; i++) {
        int ret = pthread_join(t[i], (void*)0);
        if (ret != 0) {
            perror("pthread_join error");
            exit(-1);
        }
    }
    exit(0);
}
This is the output:
On the terminal where the program is started:
root#skyscribe:~# ./pthread sleep 2> /tmp/error.log
[0]Before pthread_mutex_lock is called
[0]pthread_mutex_lock is finished. Sleep some time
[0]sleep done
[0]Before pthread_mutex_unlock is called
[0]pthread_mutex_unlock is finished.
...
On another terminal to see the file /tmp/error.log
root#skyscribe:~# tail -f /tmp/error.log
[1]Before pthread_mutex_lock is called
And no new lines are output to /tmp/error.log.
This is a wrong way to use mutexes. A thread should not hold a mutex for more time than it spends not holding it, and in particular it should not sleep while holding the mutex. There is no FIFO guarantee for locking a mutex (for efficiency reasons).
More specifically, if thread 1 unlocks the mutex while thread 2 is waiting for it, it makes thread 2 runnable but this does not force the scheduler to preempt thread 1 or make thread 2 run immediately. Most likely, it will not because thread 1 has recently slept. When thread 1 subsequently reaches the pthread_mutex_lock() call, it will generally be allowed to lock the mutex immediately, even though there is a thread waiting (and the implementation can know it). When thread 2 wakes up after that, it will find the mutex already locked and go back to sleep.
The best solution is not to hold a mutex for that long. If that is not possible, consider moving the lock-needing operations to a single thread (removing the need for the lock) or waking up the correct thread using condition variables.
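To sketch the condition-variable approach (illustrative only, not from the original post): a FIFO "ticket" lock hands the mutex to waiters in arrival order, which the plain mutex does not guarantee.

#include <pthread.h>

/* A tiny FIFO ("ticket") lock built from a mutex and a condition variable. */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    unsigned long   next_ticket;   /* next ticket to hand out */
    unsigned long   now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

#define TICKET_LOCK_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

void ticket_lock(ticket_lock_t *t)
{
    pthread_mutex_lock(&t->m);
    unsigned long my = t->next_ticket++;
    while (my != t->now_serving)
        pthread_cond_wait(&t->cv, &t->m);   /* sleep until it is our turn */
    pthread_mutex_unlock(&t->m);
}

void ticket_unlock(ticket_lock_t *t)
{
    pthread_mutex_lock(&t->m);
    t->now_serving++;
    pthread_cond_broadcast(&t->cv);         /* wake waiters; next ticket proceeds */
    pthread_mutex_unlock(&t->m);
}

Each thread would then call ticket_lock()/ticket_unlock() around its critical section instead of pthread_mutex_lock()/pthread_mutex_unlock().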
There's neither a problem, nor a bug in your code, but a combination of buffering and scheduling effects. Add an fflush here:
fprintf (fd, "[%d]pthread_mutex_unlock is finisheded.\n", *idx);
fflush (fd);
and run
./a.out 1> 1.log 2> 2.log &
and you'll see rather equal progress made by the two threads.
EDIT: and as @jilles said above, a mutex is supposed to be a short-wait lock, as opposed to long waits like a condition variable wait, waiting for I/O, or sleeping. That is also why a mutex is not a cancellation point.
Does a call to MPI_Barrier affect every thread in an MPI process, or only the thread that makes the call?
For your information, my MPI application will run with MPI_THREAD_MULTIPLE.
Thanks.
The way to think of this is that MPI_Barrier (and other collectives) are blocking function calls, which block until all processes in the communicator have completed the function. That, I think, makes it a little easier to figure out what should happen; the function blocks, but other threads continue on their way unimpeded.
So consider the following chunk of code. (The shared done flag, flushed to communicate between threads, is not how you should be doing thread communication, so please don't use this as a template for anything. Furthermore, using a pointer to done works around this bug/optimization; see the end of comment 2.)
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int ierr, size, rank;
    int provided;
    volatile int done = 0;
    MPI_Comm comm;

    ierr = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided == MPI_THREAD_SINGLE) {
        fprintf(stderr, "Could not initialize with thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    comm = MPI_COMM_WORLD;
    ierr = MPI_Comm_size(comm, &size);
    ierr = MPI_Comm_rank(comm, &rank);

    if (rank == 1) sleep(10);

#pragma omp parallel num_threads(2) default(none) shared(rank,comm,done)
    {
#pragma omp single
        {
            /* spawn off one thread to do the barrier,... */
#pragma omp task
            {
                MPI_Barrier(comm);
                printf("%d -- thread done Barrier\n", rank);
                done = 1;
#pragma omp flush
            }

            /* and another to do some printing while we're waiting */
#pragma omp task
            {
                volatile int *p = &done;
                while (!(*p)) {
                    printf("%d -- thread waiting\n", rank);
                    sleep(1);
                }
            }
        }
    }

    MPI_Finalize();
    return 0;
}
Rank 1 sleeps for 10 seconds, and all ranks start a barrier in one thread. If you run this with mpirun -np 2, you'd expect the first of rank 0's threads to hit the barrier, and the other to cycle around printing and waiting -- and sure enough, that's what happens:
$ mpirun -np 2 ./threadbarrier
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
0 -- thread waiting
1 -- thread waiting
0 -- thread done Barrier
1 -- thread done Barrier