gnu.io RXTXPort.nativeDrain() blocks indefinitely

We're using jrxtx to communicate with a tty device on an embedded Linux device. I've recently discovered a deadlock condition when attempting to write to the port:
...
os.write(bb.array(), 0, bb.position());
os.flush();
The flush then blocks indefinitely:
java.lang.Thread.State: RUNNABLE
at gnu.io.RXTXPort.nativeDrain(Native Method)
at gnu.io.RXTXPort$SerialOutputStream.flush(RXTXPort.java:1188)
at org.openmuc.jrxtx.JRxTxPort$SerialOutputStream.flush(JRxTxPort.java:231)
....
This happens roughly once in 1,000 writes, so a hardware glitch probably causes the initial lock-up; however, restarting the JVM clears the issue, so the condition must be temporary.
The issue is that there are calls which potentially block indefinitely.
Is there any known way to wrap this invocation so that it can time out instead of blocking indefinitely?
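One possible shape for such a wrapper, as a minimal sketch rather than a tested fix for RXTX: run flush() on a helper thread and wait on the resulting Future with a timeout. The class name TimedFlush and its method are hypothetical. The sketch assumes that abandoning the helper thread is acceptable: a thread stuck inside a native method such as nativeDrain() generally does not react to interruption, so the caller regains control but the stuck thread (and the port) may stay blocked until the underlying condition clears.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounds the time a caller waits for OutputStream.flush(). */
public final class TimedFlush {
    // Daemon threads, so a helper stuck in native code cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-flush");
        t.setDaemon(true);
        return t;
    });

    private TimedFlush() {}

    /** Returns true if flush() completed within the timeout, false if it timed out. */
    public static boolean flush(OutputStream os, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Future<Void> pending = POOL.submit(() -> {
            os.flush();
            return null;
        });
        try {
            pending.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // Requests interruption; a thread blocked in native code may ignore it.
            pending.cancel(true);
            return false;
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException(cause);
        }
    }
}
```

A caller would replace os.flush() with something like TimedFlush.flush(os, 2, TimeUnit.SECONDS) and treat a false return as a signal to close and reopen the port or escalate.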

Related

How to detect if a linux thread is crashed

I have this problem: I need to determine whether a Linux thread has stopped running because of a crash rather than a normal exit. The goal is to restart the thread without resetting or restarting the whole system.
pthread_join() does not seem like a good option because I have several threads to monitor and the function returns only for one specific thread; it doesn't work in "parallel". At the moment I have a keep-alive signal from each thread to main, but I'm looking for some system call or thread attribute to understand the state.
Any suggestions?
Thread "crashes"
How to detect if a linux thread is crashed
if (0) //...
That is, the only way that a pthreads thread can terminate abnormally while other threads in the process continue to run is via thread cancellation,* which is not well described as a "crash". In particular, if a signal is received whose effect is abnormal termination then the whole process terminates, not just the thread that handled the signal. Other kinds of errors do not cause threads to terminate.
On the other hand, if by "crash" you mean normal termination in response to the thread detecting an error condition, then you have no limitation on what the thread can do prior to terminating to communicate about its state. For example,
it could update a shared object that tracks information about your threads
it could write to a pipe designated for the purpose
it could raise a signal
If you like, you can use pthread_cleanup_push() to register thread cleanup handlers to help with that.
On the third hand, if you're asking about detecting live threads that are failing to make progress -- because they are deadlocked, for example -- then your best bet is probably to implement some form of heartbeat monitor. That would involve each thread you want to monitor periodically updating a shared object that tracks the time of each thread's last update. If a thread goes too long between beats then you can guess that it may be stalled. This requires you to instrument all the threads you want to monitor.
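The heartbeat idea can be sketched as follows. This is an illustration in Java (the language of the code in the main question) rather than pthreads C, and the class and method names are hypothetical. Each monitored thread calls beat() periodically, and a supervisor calls stalled() to find threads that have gone too long without one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical heartbeat table shared between worker threads and a supervisor. */
public final class HeartbeatMonitor {
    // Maps a thread name to the monotonic timestamp (nanoseconds) of its last beat.
    private final Map<String, Long> lastBeatNanos = new ConcurrentHashMap<>();

    /** Called periodically by each monitored thread. */
    public void beat(String threadName) {
        beat(threadName, System.nanoTime());
    }

    /** Same as beat(name), with an explicit timestamp (useful for testing). */
    public void beat(String threadName, long nowNanos) {
        lastBeatNanos.put(threadName, nowNanos);
    }

    /** Returns the sorted names of threads whose last beat is older than maxAgeNanos. */
    public List<String> stalled(long nowNanos, long maxAgeNanos) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastBeatNanos.entrySet()) {
            if (nowNanos - entry.getValue() > maxAgeNanos) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

A stalled entry is only a hint that a thread may be stuck; the threshold has to be chosen with the longest legitimate gap between beats in mind.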
Thread cancellation
You should not use thread cancellation. But if you did, and if you include termination because of cancellation in your definition of "crash", then you still have all the options above available to you, but you must engage them by registering one or more cleanup handlers.
GNU-specific options
The main issues with using pthread_join() to check thread state are
it doesn't work for detached threads, and
pthread_join() blocks until the specified thread terminates.
For detached threads, you need one of the approaches already discussed, but for ordinary (joinable) threads on GNU/Linux, Glibc provides the non-standard pthread_tryjoin_np(), which performs a non-blocking attempt to join a thread, and also pthread_timedjoin_np(), which performs a join attempt with a timeout. If you are willing to rely on Glibc-specific functions then one of these might serve your purpose.
Linux-specific options
The Linux kernel makes per-process thread status information available via the /proc filesystem. See How to check the state of Linux threads?, for example. Do be aware, however, that the details vary a bit from one kernel version to another. And if you're planning to do this a lot, then also be aware that even though /proc is a virtual filesystem (so no physical disk is involved), you still access it via slow-ish I/O interfaces.
Any of the other alternatives is probably better than reading files in /proc. I mention it only for completeness.
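For completeness, here is a rough sketch of what reading thread states from /proc could look like, written in Java with a hypothetical class name. It assumes the layout described in proc(5), where each /proc/self/task/<tid>/stat line holds the thread ID, the command name in parentheses, and then a one-letter state (for example R, S or D). As noted above, details can vary between kernel versions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that reports the kernel state letter of each thread in this process. */
public final class ProcThreadStates {
    private ProcThreadStates() {}

    /** Extracts the state letter that follows the parenthesised command name. */
    public static char parseState(String statLine) {
        // The command name may itself contain ')' so search from the right.
        int close = statLine.lastIndexOf(')');
        return statLine.charAt(close + 2);
    }

    /** Maps thread ID to state letter for the current process (Linux only). */
    public static Map<String, Character> read() throws IOException {
        Map<String, Character> states = new TreeMap<>();
        try (DirectoryStream<Path> tasks =
                 Files.newDirectoryStream(Paths.get("/proc/self/task"))) {
            for (Path task : tasks) {
                try {
                    byte[] raw = Files.readAllBytes(task.resolve("stat"));
                    String stat = new String(raw, StandardCharsets.UTF_8);
                    states.put(task.getFileName().toString(), parseState(stat));
                } catch (IOException e) {
                    // The thread exited between listing and reading; skip it.
                }
            }
        }
        return states;
    }
}
```

A thread that stays in state D across repeated samples is a candidate for the uninterruptible-sleep situation discussed in the related question below.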
Overall
I'm looking for some system call or thread attribute to understand the state
The pthreads API does not provide a "have you terminated?" function or any other such state-inquiry function, unless you count pthread_join(). If you want that then you need to roll your own, which you can do by means of some of the facilities already discussed.
*Do not use thread cancellation.

Why linux process with status 'D' can be killed ? [duplicate]

Sometimes whenever I write a program in Linux and it crashes due to a bug of some sort, it will become an uninterruptible process and continue running forever until I restart my computer (even if I log out). My questions are:
What causes a process to become uninterruptible?
How do I stop that from happening?
This is probably a dumb question, but is there any way to interrupt it without restarting my computer?
An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.
To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This is a system call that can take a long time (seconds) since it can potentially involve spinning up a hard drive, or moving heads. During most of this time, the process will be sleeping, blocking on the hardware.
While the process is sleeping in the system call, it can receive a Unix asynchronous signal (say, SIGTERM). When that happens:
The system call exits prematurely, and is set up to return -EINTR to user space.
The signal handler is executed.
If the process is still running, it gets the return value from the system call, and it can make the same call again.
Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
On the other hand, some system calls are not allowed to be interrupted in this way. If the system call stalls for some reason, the process can remain in this unkillable state indefinitely.
LWN ran a nice article that touched on this topic in July.
To answer the original question:
How to prevent this from happening: figure out which driver is causing you trouble, and either stop using it, or become a kernel hacker and fix it.
How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective manner to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.
When a process is in user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks whether any signals are pending (including the ones used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).
When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the cpu to slow down a bit and sits in a loop — the idle loop).
If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:
TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a new variant.
This answers your first question. As to your second question: you can't avoid uninterruptible sleeps, they are a normal thing (it happens, for instance, every time a process reads/writes from/to the disk); however, they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (it is waiting for the server to recover; you can also use the "intr" option to avoid the problem).
Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).
Uninterruptible processes are USUALLY waiting for I/O following a page fault.
Consider this:
The thread tries to access a page which is not in core (either an executable which is demand-loaded, a page of anonymous memory which has been swapped out, or a mmap()'d file which is demand loaded, which are much the same thing)
The kernel is now (trying to) load it in
The process can't continue until the page is available.
The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.
When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread", which may or may not have an individual "thread group" entry in /proc.
In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.
To your 3rd question:
I think you can kill the uninterruptible processes by running
sudo kill -HUP 1
It will restart init without ending the running processes, and after running it, my uninterruptible processes were gone.
If you are talking about a "zombie" process (which is designated as "zombie" in ps output), then this is a harmless record in the process list waiting for someone to collect its return code and it could be safely ignored.
Could you please describe what an "uninterruptible process" is for you? Does it survive "kill -9" and happily chug along? If that is the case, then it's stuck in some syscall, which is stuck in some driver, and you are stuck with this process until you reboot (and sometimes it's better to reboot soon) or unload the relevant driver (which is unlikely to happen). You could try to use "strace" to find out where your process is stuck and avoid it in the future.

How to exit a program when using blocking calls

I need to do a project where the application monitors incoming connections and applies some rules defined in an XML document. The rules either filter (block or permit) connections or redirect traffic on a certain port. In order to do this, I use functions such as accept and recv (from Winsock). All of those functions are used on different threads. I'm wondering, though, how I am supposed to clean up the program before exiting, since all those blocking calls are in progress. Normally I'd either wait until the person exits the console through the X button or wait for the user to input a certain character in the main thread. The thing is, I'm not sure what happens if the application exits while there are still active threads, if memory is still allocated, or if sockets are in use. Are all destructors called? Are handles and sockets correctly closed? Or do I need to somehow do it myself?
Thanks
In general, I would say no. Do not try to explicitly clean up resources like sockets, fd's, handles, threads unless you are absolutely forced to.
Exact behaviour depends on OS and how you terminate your app.
All the common desktop OS will release resources allocated to a process by the OS when a process terminates. This includes sockets, file descriptors, memory.
On Windows/Linux, if you return from your C/C++ main() without any explicit cleanup, static dtors will get called by the crt code. Dtors for dynamically allocated objects in non-main threads are not run.
Executables written in other languages may behave differently.
If, instead of returning from main(), you call a 'ProcessExit()' API directly, static destructors will not get called because the OS has no concept of dtors - it has no idea, or interest, in what language was used to generate the executable.
In either case, the OS will be called to terminate your process. The OS does this, (simple 'Dummies' version:), by first changing the state of all process threads that are not running so that they never run again. Threads that are running on other cores are then stopped. Then OS resources like fd, sockets are closed, then released, then all process memory is freed, then OS kernel process/thread objects freed, then your process no longer exists.
If you absolutely need some, or all, C++/whatever dtors called when some thread needs to stop the app, you will have to explicitly signal other threads to stop so that dtors can be run. I tend to use a globally-accessible 'CloseRequested' bool that relevant threads check immediately after their blocking calls return. There remains the issue of persuading the blocking calls to return.
Some blocking calls can be coded up to wait on more than one signal, so allowing the call to return by a simple event/sema/condvar/whatever signal.
Some calls, like recv() and accept(), can be persuaded to return early by closing the fd/socket they are waiting on.
Some calls can be made to return by 'artificially' satisfying their wait condition - eg. creating a temp file just to make a folder-monitor call return so that the 'CloseRequested' bool can be checked.
If a blocking call is so annoyingly stubborn that it cannot be persuaded to return, you could redesign your app so that whatever the critical resource is that is released in the dtors can be released by another thread - maybe create the thing in another thread and pass it to the thread that blocks in a ctor parameter, something like that.
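The "close the socket to make the call return" technique can be illustrated with a small sketch. It is written in Java rather than Winsock C++, and the class name is hypothetical. It relies on the documented behaviour of java.net.ServerSocket.close(): any thread currently blocked in accept() throws a SocketException.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

/** Hypothetical worker that blocks in accept() until another thread closes the socket. */
public final class CloseToUnblock {
    private CloseToUnblock() {}

    /**
     * Accepts connections until the server socket is closed elsewhere.
     * Returns true if the loop ended because the socket was closed.
     */
    public static boolean acceptUntilClosed(ServerSocket server) {
        try {
            while (true) {
                try (Socket client = server.accept()) {
                    // Per-connection work would go here (omitted in this sketch).
                }
            }
        } catch (SocketException e) {
            // Thrown when close() is called while accept() is blocked.
            return server.isClosed();
        } catch (IOException e) {
            return false;
        }
    }
}
```

The thread that wants to shut down calls server.close() and then joins the worker, which can run its own cleanup once acceptUntilClosed() returns.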
NOTE WELL: Thread shutdown code bodges, as listed above, are extra code that does not add to the normal functionality of your app. You should restrict explicit thread shutdown to those threads that hold resources that absolutely must be released by explicit user code - DB connections, say. If the OS can release the resource, it should be allowed to do so. The OS is very good at stopping all process threads before releasing resources they are using, user code is not.
Where possible, use blocking calls that take a timeout value, and have your threads loop. That gives you a place to check for a shutdown condition and exit the thread gracefully. Handles will generally be cleaned up by the system when the process exits. It is polite to shut down sockets gracefully, but not absolutely mandatory. The downside of not doing so is it can take a while for the kernel to clean up exclusive resources. For example, if you just kill a thread waiting to accept(), and then your app re-launches, it won't be able to successfully accept() on the same port until the kernel cleans up the old socket.
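A minimal sketch of the timeout-and-loop pattern, again in Java with hypothetical names: the accept() call is given a timeout with ServerSocket.setSoTimeout(), so the loop wakes up regularly and can check a shutdown flag. The 200 ms interval is an arbitrary assumption.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical accept loop that polls a shutdown flag between timed accept() calls. */
public final class AcceptLoop implements Runnable {
    private final ServerSocket server;
    private final AtomicBoolean closeRequested = new AtomicBoolean(false);

    public AcceptLoop(ServerSocket server) throws SocketException {
        this.server = server;
        // accept() now throws SocketTimeoutException after 200 ms without a connection.
        server.setSoTimeout(200);
    }

    /** Asks the loop to stop; it exits within roughly one timeout interval. */
    public void requestClose() {
        closeRequested.set(true);
    }

    @Override
    public void run() {
        // try-with-resources closes the server socket when the loop exits.
        try (ServerSocket s = server) {
            while (!closeRequested.get()) {
                try (Socket client = s.accept()) {
                    // Per-connection work would go here (omitted in this sketch).
                } catch (SocketTimeoutException e) {
                    // No connection arrived in time; loop around and re-check the flag.
                }
            }
        } catch (IOException e) {
            // Unexpected I/O failure; the socket is closed by try-with-resources.
        }
    }
}
```

The trade-off is latency versus wake-ups: a shorter timeout makes shutdown faster but wakes the thread more often while idle.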

How to interrupt a thread performing a blocking socket connect?

I have some code that spawns a pthread that attempts to maintain a socket connection to a remote host. If the connection is ever lost, it attempts to reconnect using a blocking connect() call on its socket. Since the code runs in a separate thread, I don't really care about the fact that it uses the synchronous socket API.
That is, until it comes time for my application to exit. I would like to perform some semblance of an orderly shutdown, so I use thread synchronization primitives to wake up the thread and signal for it to exit, then perform a pthread_join() on the thread to wait for it to complete. This works great, unless the thread is in the middle of a connect() call when I command the shutdown. In that case, I have to wait for the connect to time out, which could be a long time. This makes the application appear to take a long time to shut down.
What I would like to do is to interrupt the call to connect() in some way. After the call returns, the thread will notice my exit signal and shut down cleanly. Since connect() is a system call, I thought that I might be able to intentionally interrupt it using a signal (thus making the call return EINTR), but I'm not sure if this is a robust method in a POSIX threads environment.
Does anyone have any recommendations on how to do this, either using signals or via some other method? As a note, the connect() call is down in some library code that I cannot modify, so changing to a non-blocking socket is not an option.
Try to close() the socket to interrupt the connect(). I'm not sure, but I think it will work at least on Linux. Of course, be careful to synchronize properly such that you only ever close() this socket once, or a second close() could theoretically close an unrelated file descriptor that was just opened.
EDIT: shutdown() might be more appropriate because it does not actually close the socket.
Alternatively, you might want to take a look at pthread_cancel() and pthread_kill(). However, I don't see a way to use these two without a race condition.
I advise that you abandon the multithreaded-server approach and instead go event-driven, for example by using epoll for event notification. This way you can avoid all these very basic problems that become very hard with threads, like proper shutdown. You are free to at any time do anything you want, e.g. safely close sockets and never hear from them again.
On the other hand, if in your worker thread you do a non-blocking connect() and get notified via epoll_pwait() (or ppoll() or pselect(); note the p), you may be able to avoid race conditions associated with signals.
