Tell if device driver timeout has occured with wait_event_interruptible_timeout - linux

Within my device driver, I am using wait_event_interruptible_timeout. How can I tell if a timeout has occurred? The macro only returns errorcode on interrupts but timeout is not an interrupt so "0" is returned.
Edit: not sure on how to tell if timeout occurred, but condition wont be set, so that sounds like the answer.

I ran into the same confusing issue a couple of weeks ago after reading the description of that function in Linux Device Drivers, Third Edition. Upon reading the comments for the various wait functions in a current kernel source tree, however, I found that the API has changed since the book was published. Newer kernels (at least 2.6.34+ and probably quite a bit further back than that), return the remaining number of jiffies to the timeout instead of an error code. So, a zero return value indicates the timeout occurred and any non-zero value should indicate a successful wakeup via the event condition. The comments in include/linux/wait.h provide a good description of the new API.

Related

How can I monitor stalled tasks?

I am running a Rust app with Tokio in prod. In the last version i had a bug, and some requests caused my code to go into an infinite loop.
What happened is while the task that got into the loop was stuck, all the other task continue to work well and processing requests, that happened until the number of stalling tasks was high enough to cause my program to be unresponsive.
My problem is took a lot of time to our monitoring systems to identify that something go wrong. For example, the task that answer to Kubernetes' health check works well and I wasn't able to identify that I have stalled tasks in my system.
So my question is if there's a way to identify and alert in such cases?
If i could find way to define timeout on task, and if it's not return to the scheduler after X seconds/millis to mark the task as stalled, that will be a good enough solution for me.
Using tracing might be an option here: following issue 2655 every tokio task should have a span. Alongside tracing-futures this means you should get a tracing event every time a task is entered or suspended (see this example), by adding the relevant data (e.g. task id / request id / ...) you should then be able to feed this information to an analysis tool in order to know:
that a task is blocked (was resumed then never suspended again)
if you add your own spans, that a "userland" span was never exited / closed, which might mean it's stuck in a non-blocking loop (which is also an issue though somewhat less so)
I think that's about the extent of it: as noted by issue 2510, tokio doesn't yet use the tracing information it generates and so provide no "built-in" introspection facilities.

epoll_wait and changing system time back

My app uses epoll_wait to perform a timed wait for IO events. If no event happens, epoll_wait is supposed to return after the timeout and my app continues.
During testing, someone turn the system clock back by a day and the part of my app that uses epoll_wait stopped working for 24 hours. Obviously, this is a problem.
I've rummaged around looking for something that might allow my app to know that the time has changed (for example, a signal) but I haven't found anything.
Is there any way to deal with abrupt time changes like this?
Use a timer based upon a monotonic time (e.g., timer_create(2)) to generate a signal and do a blocking epoll_wait, and check for -1 return code and errno set to EINTR.
The Linux kernel implementation of epoll_wait uses CLOCK_MONOTONIC so it should be immune to system clock changes. (I investigated this since I was seeing a similar problem and wanted to discount epoll_wait as a possible cause.) Before v2.6.37 it used to use jiffies which are also monotonic.
I've submitted a patch to the epoll_wait man page to clarify this.

Open MPI Virtual Timer Expired

I'm using Open MPI 1.8 on Gentoo 3.13 to manage the data transfer from one program to another via a server/client concept. Both the server and the clients are launched via mpiexec as separate processes. After some days (this is quite a heavy computation...), I sometimes receive the error
mpiexec noticed that process rank 0 with PID 17213 on node XXX exited on signal 26 (Virtual timer expired).
Unfortunately, the error is not reproducible in a reliable way, i.e., the error does not appear always and not always at the same point in the program flow. I also experienced this error on other machines. I already tracked the issue down to the ITIMER_VIRTUAL which, upon expiration, delivers SIGVTALRM (see, e.g., http://man7.org/linux/man-pages/man2/setitimer.2.html). In the BUGS section of the man page, it says that
Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.
I wonder if something similar might also hold for ITIMER_VIRTUAL? Did anyone experience similar problems and can confirm the error?
The only workaround I can think of is to invoke setitimer(...) and try to manipulate the timer myself. However, I hope there is another way since I can't always modify the clients' source code. Any suggestions?
Since this question has not been answered officially, I will do it on behalf of Hristo (#HristoIliev: I hope this is ok for you). As was pointed out in the first comment to my question, there is not a single hint in the Open MPI source code which can have caused the virtual timer expiration. Indeed, the timer problem was related to a third-party library which made the code crash after an unpredictable time (depending on the current loading of the machine).

What would cause WaitForSingleObject with finite timeout not to return?

The title says it all. I am using C++ Builder to submit a form to an Internet server using TIdHTTP->Post(), to get a response. Since that call can get stuck if there is a network problem or a server problem, I am trying to run it in a separate thread. When the Post() returns, I signal the Event that I am waiting for with WaitForSingleObject, using a timeout of 1000. At one point, I was processing messages after the timeouts, but now I am just repeating the WaitForSingleObject call with a timeout of 1000 again, until the event is signaled or my total timeout period (20 seconds) has elapsed. If the timeout elapses, I would call Disconnect() on the TIdHTTP and try again.
However, I have not been able to get this to work reliably, although it usually works. I am using CodeSite to log the progress, and I can see that, on occasion, WaitForSingleObject is called, but does not return (ever). Since WaitForSingleObject is being called on the main thread, the application is then unresponsive until it is killed.
While one must always think of memory corruption when a C++ program stalls, I don't think that is what is going on. The stall is always at the WaitForSingleObject call, and if it was a memory corruption issue, I would expect that, at least sometimes, something else would go wrong.
The MSDN page for WaitForSingleObject says that the timer does not count down while the computer is asleep, and the monitor does go blank after a while, but the computer continues to run, and in any case WaitForSingleObject does not return once the mouse is moved and the monitor comes back on.
So, again, my question. What could be causing WaitForSingleObject with a finite timeout (1000 msecs) to never return?
So, the answer to my question is "Something Else". In this case, I finally tracked it down to a library I was using that also used threads. It worked with a previous version of RAD Studio, but disabling that library fixed this issue. I am moving to the current version of that library and will re-test.
I had read about problems with WFSO causing blocks, and even where Sleep might never return if there were too many threads running (http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx), so I thought there might be something I didn't know about threads and WFSO causing this.
Thanks to everyone for your helpful comments which pointed me in the right direction, and especially to Remy Lebeau for his suggestions on managing Post timeouts with Indy, which he maintains.

VS2012 - Why is my main UI thread showing green debugging statements?

Edit : If you're seeing this same problem (and you're accustomed to NOT seeing this under VS2010) please comment below so I know it's not just me - but be sure to check Han's answer to make sure none of those scenarios appear...
I've been updating my app to run with .NET 4.5 in VS2012 RTM and noticing something that I don't quite understand and that is unexpectedly green highlighted statements (instead of yellow).
Now I'm well aware of what this is supposed to mean, and the IDE is even showing me a little explanation tooltip.
This is the next statement to execute when this thread returns from
the current function
However there's absolutely nothing asynchronous or thread based about this code. In this simple example I'm sure you'll agree that string.ToUpper() won't be off in another thread. I can step through the code no issue.
There's nothing else going on and I am on the main thread as you can see here.
I am using async and await and MVVM-Light (the above method is the result of a RelayCommand) but I still get this behavior even when the code path is directly off an event handler such as PreviewKeyDown.
If I create a new app I cannot duplicate this - the coloring is yellow as expected - even when using await.
Anybody got any idea? It's starting to drive me crazy!!
It is green when the current instruction pointer is not exactly at the start of the statement. Some common causes:
Common in threaded code, setting a breakpoint in one thread and switching context to another. The other thread will have been interrupted by the debugger at an entirely random location. Often in code that you don't have source code or debugging info for, like String.ToUpper(), the debugger can only show the "closest" source code
Using Debugger + Break All to break into the debugger. Same idea as above, the instruction pointer will be at a random address
Getting an exception in code you don't have debugging info for. The editor shows the last entry in the Call Stack that it does have source code for. You need the call stack window to see where the actual exception was raised. Or the Exception Assistant, its reason for being
Debugging optimized code. The jitter optimizer scrambles the code pretty heavily, making it likely that the debugger can't show the current location accurately
Having out-dated debugging info or editing the code while debugging
Debugging code generated by the x64 jitter, happens when the project's Target Platform setting is AnyCPU. The x64 jitter has a number of chronic bugs that are not getting fixed, generating incorrect debugging info is one of them. Problems that were not addressed until it was completely rewritten, done by the RyuJIT project and first available in .NET version 4.6. Targeting x86 in your EXE project is the workaround.
I understand that this is old post yet I would like to answer the question with my experience.
I have encountered same issue recently in one of my WCF application. After debugging and closely looking service logs and I find out that my code was giving this error because service was hitting max allowed limit for code execution and once the service hit max allowed time limit it was trying to offload the current debugging session.
ERROR IN GREEN STATEMENT: this is the next statement to execute when thread return
So avoiding this issue you can try to look any potential code(Code/Service Timeout or any other code block) which is trying to offload your currently executing code context and try to fix it, furthermore original explanation given by #Hans is still very much relevant for trouble shooting this issue.
Actually, I am also facing this issue. This is because I missed some layout component in landscape mode, So check all the Id's and components and Run, you will not get this error.

Resources