I'm using Open MPI 1.8 on Gentoo 3.13 to manage the data transfer from one program to another via a server/client concept. Both the server and the clients are launched via mpiexec as separate processes. After some days (this is quite a heavy computation...), I sometimes receive the error
mpiexec noticed that process rank 0 with PID 17213 on node XXX exited on signal 26 (Virtual timer expired).
Unfortunately, the error is not reproducible in a reliable way, i.e., it does not always appear, and when it does, not always at the same point in the program flow. I have also experienced this error on other machines. I have already tracked the issue down to ITIMER_VIRTUAL which, upon expiration, delivers SIGVTALRM (see, e.g., http://man7.org/linux/man-pages/man2/setitimer.2.html). The BUGS section of the man page says:
Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.
I wonder whether something similar might also hold for ITIMER_VIRTUAL. Has anyone experienced similar problems and can confirm this behavior?
The only workaround I can think of is to invoke setitimer(...) and try to manipulate the timer myself. However, I hope there is another way since I can't always modify the clients' source code. Any suggestions?
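For what it's worth, a minimal sketch (plain C, POSIX; not taken from Open MPI or any client code) of that setitimer workaround could look like the following: install a do-nothing handler for SIGVTALRM, or disarm ITIMER_VIRTUAL entirely, so an expiring virtual timer no longer terminates the process. The obvious drawback is that this has to live in the clients' own code, which is exactly what I can't always modify.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* Illustrative only: catch SIGVTALRM so an expiring ITIMER_VIRTUAL
 * no longer terminates the process with "Virtual timer expired". */
static void vtalrm_handler(int sig)
{
    (void)sig;  /* nothing to do; just survive the signal */
}

int main(void)
{
    struct sigaction sa;
    struct itimerval disarm;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = vtalrm_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGVTALRM, &sa, NULL) != 0)
        perror("sigaction");

    /* Alternatively, disarm the virtual timer entirely:
     * an all-zero itimerval stops ITIMER_VIRTUAL. */
    memset(&disarm, 0, sizeof disarm);
    if (setitimer(ITIMER_VIRTUAL, &disarm, NULL) != 0)
        perror("setitimer");

    /* ... rest of the client code ... */
    return 0;
}
```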
Since this question has not been answered officially, I will do it on behalf of Hristo (@HristoIliev: I hope this is OK with you). As was pointed out in the first comment to my question, there is nothing in the Open MPI source code that could have caused the virtual timer expiration. Indeed, the timer problem was related to a third-party library which made the code crash after an unpredictable time (depending on the current load of the machine).
Related
I am running a Rust app with Tokio in production. In the last version I had a bug, and some requests caused my code to go into an infinite loop.
What happened is that while the task that entered the loop was stuck, all the other tasks continued to work well and process requests; this went on until the number of stalled tasks was high enough to make my program unresponsive.
My problem is that it took our monitoring systems a long time to identify that something had gone wrong. For example, the task that answers Kubernetes' health checks kept working, so I wasn't able to tell that I had stalled tasks in my system.
So my question is: is there a way to identify and alert on such cases?
If I could find a way to define a timeout on a task, so that a task which does not return to the scheduler within X seconds/milliseconds is marked as stalled, that would be a good enough solution for me.
Using tracing might be an option here: following issue 2655, every tokio task should have a span. Alongside tracing-futures this means you should get a tracing event every time a task is entered or suspended (see this example). By adding the relevant data (e.g. task id / request id / ...) you should then be able to feed this information to an analysis tool in order to know:
that a task is blocked (was resumed then never suspended again)
if you add your own spans, that a "userland" span was never exited / closed, which might mean it's stuck in a non-blocking loop (which is also an issue though somewhat less so)
I think that's about the extent of it: as noted in issue 2510, tokio doesn't yet use the tracing information it generates and so provides no "built-in" introspection facilities.
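As a partial stopgap for the "timeout on a task" idea from the question, you could wrap each request handler in tokio::time::timeout so that an overrun is at least logged somewhere. A minimal sketch (assuming tokio 1.x with the "full" feature set; handle_request and the 5-second deadline are made up for illustration), with the big caveat that the timeout can only fire if the stuck future still yields; a loop that never awaits blocks the worker thread outright, which is exactly the hard case described here:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical request handler, for illustration only.
async fn handle_request(req_id: u64) {
    let _ = req_id;
    // ... the real work would go here ...
}

// Wrap the handler so that exceeding a deadline is at least observable.
async fn handle_with_deadline(req_id: u64) {
    match timeout(Duration::from_secs(5), handle_request(req_id)).await {
        Ok(()) => {}
        Err(_elapsed) => {
            // Caveat: this only fires if the inner future yields (.await) at
            // some point; a busy loop that never awaits starves the timer.
            eprintln!("request {req_id} exceeded its deadline - possibly stalled");
        }
    }
}

#[tokio::main]
async fn main() {
    handle_with_deadline(42).await;
}
```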
I created a Node.js CLI script that receives a set of input values and runs an automated process of checking them against an API. So far so good; the problem is that it often stops partway through and exits without any error or explanation. I have this problem more on the VPS than on my own machine.
I forgot to mention that this process runs in parallel: the entries are split into arrays and executed simultaneously, x at a time.
I thought it could be a memory leak, since the process is a bit heavy, but I have already tested with a smaller workload and the result is the same; besides, if it were a memory problem it would warn me. Worst of all, it only happens sometimes. Has anyone had a similar problem and can tell me how it was solved?
There is no code in the script that closes the program except after the loop finishes, and that wasn't even there before, yet it still happened.
I've been working on a project for a little while, and the first step is building a library of syscall traces for processes. Essentially, what I'm trying to do is have a system wherein every time a process requests an OS service via a syscall, relevant information about the event (calling process, time, syscall name) gets logged to a file.
Theoretically, this sounds like a simple enough thing to do; however, implementing it is becoming more of a pain as time goes on. I suppose the main thing causing issues for me is a general lack of knowing where to start the implementation.
Initially, I thought that this could all be handled by adding a few lines of code to the kernel entry point, but after digging through entry_64.S for a little while, I came to the conclusion that there must be an easier way. The next idea I had was to overwrite all the services pointed to by sys_call_table with my own service that did the logging and then called the original service. But, it turns out, there are some difficulties with this method on Linux kernel 5.4.18 because sys_call_table is no longer exported. And even when recompiling the kernel so that sys_call_table is exported, the table is in a memory-protected location. Lastly, I've been experimenting with auditd. Specifically, I followed this link, but it doesn't seem to be working (when I executed the kill command there was only a corresponding result in ausearch about 50% of the time, based on timestamps).
I'm getting a little burned out by all these dead-ends, and am really hoping to finally have this first stage in my project up and running. Does anyone have any pointers as to what I should try?
Solution: BPFTrace was exactly what I was looking for.
I used BPFTrace to log every time the kernel began execution of a syscall (excluding those initiated by BPFTrace itself).
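A minimal bpftrace sketch of that idea (illustrative only, not the exact script used here; the output fields and the filter are just one reasonable choice) looks roughly like this:

```
// syscall_trace.bt - log timestamp (ns), PID, process name and syscall number
// at every syscall entry, skipping events generated by bpftrace itself.
tracepoint:raw_syscalls:sys_enter
/comm != "bpftrace"/
{
    printf("%llu %d %s %d\n", nsecs, pid, comm, args->id);
}
```

Run it as root with bpftrace syscall_trace.bt > trace.log; the numeric syscall id can be mapped back to a name afterwards (for example with ausyscall --dump from the audit userspace tools).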
I have been asked to debug, and improve, a complex multithreaded app, written by someone I don't have access to, that uses concurrent queues (both GCD and NSOperationQueue). I don't have access to a plan of the multithreaded architecture, that's to say a high-level design document of what is supposed to happen when. I need to create such a plan in order to understand how the app works and what it's doing.
When running the code and debugging, I can see in Xcode's Debug Navigator the various threads that are running. Is there a way of identifying where in the source-code a particular thread was spawned? And is there a way of determining to which NSOperationQueue an NSOperation belongs?
For example, I can see in the Debug Navigator (or by using LLDB's "thread backtrace" command) a thread's stacktrace, but the 'earliest' user code I can view is the overridden (NSOperation*) start method - stepping back earlier in the stack than that just shows the assembly instructions for the framework that invokes that method (e.g. __block_global_6, _dispatch_call_block_and_release and so on).
I've investigated and sought various debugging methods but without success. The nearest I got was the idea of method swizzling, but I don't think that's going to work for, say, queued NSOperation threads. Forgive my vagueness please: I'm aware that having looked as hard as I have, I'm probably asking the wrong question, and probably therefore haven't formed the question quite clearly in my own mind, but I'm asking the community for help!
Thanks
The best I can think of is to put breakpoints on dispatch_async, -[NSOperation init], -[NSOperationQueue addOperation:] and so on. You could configure those breakpoints to log their stack trace, possibly some other info (like the block's address for dispatch_async, or the address of the queue and operation for addOperation:), and then continue running. You could then look through the logs when you're curious where a particular block came from and see what was invoked and from where. (It would still take some detective work.)
You could also accomplish something similar with dtrace if the breakpoints method is too slow.
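For example, a rough sketch of the LLDB side of that (the symbol names are the ones mentioned above; adjust them to whatever APIs the app actually uses) could be typed into the debugger console:

```
(lldb) breakpoint set -n dispatch_async
(lldb) breakpoint set -n "-[NSOperationQueue addOperation:]"
(lldb) breakpoint command add 1
bt
continue
DONE
```

After breakpoint command add, LLDB reads commands until DONE; here every hit of breakpoint 1 logs a backtrace and immediately continues (repeat for the other breakpoints). The same effect is available from Xcode's breakpoint editor by adding a "Debugger Command" action and checking "Automatically continue after evaluating actions".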
Edit: If you're seeing this same problem (and you're accustomed to NOT seeing this under VS2010), please comment below so I know it's not just me - but be sure to check Hans's answer to make sure none of those scenarios apply...
I've been updating my app to run with .NET 4.5 in VS2012 RTM and have noticed something that I don't quite understand: unexpectedly green-highlighted statements (instead of yellow).
Now I'm well aware of what this is supposed to mean, and the IDE is even showing me a little explanation tooltip.
This is the next statement to execute when this thread returns from
the current function
However, there's absolutely nothing asynchronous or thread-based about this code. In this simple example I'm sure you'll agree that string.ToUpper() won't be off in another thread. I can step through the code with no issue.
There's nothing else going on and I am on the main thread as you can see here.
I am using async and await and MVVM-Light (the above method is the result of a RelayCommand) but I still get this behavior even when the code path is directly off an event handler such as PreviewKeyDown.
If I create a new app I cannot duplicate this - the coloring is yellow as expected - even when using await.
Anybody got any idea? It's starting to drive me crazy!!
It is green when the current instruction pointer is not exactly at the start of the statement. Some common causes:
Common in threaded code: setting a breakpoint in one thread and switching context to another. The other thread will have been interrupted by the debugger at an entirely random location, often in code that you don't have source code or debugging info for, like String.ToUpper(), so the debugger can only show the "closest" source code
Using Debug + Break All to break into the debugger. Same idea as above: the instruction pointer will be at a random address
Getting an exception in code you don't have debugging info for. The editor shows the last entry in the Call Stack that it does have source code for. You need the call stack window to see where the actual exception was raised. Or the Exception Assistant, its reason for being
Debugging optimized code. The jitter optimizer scrambles the code pretty heavily, making it likely that the debugger can't show the current location accurately
Having outdated debugging info, or editing the code while debugging
Debugging code generated by the x64 jitter, which happens when the project's Target Platform setting is AnyCPU. The x64 jitter had a number of chronic bugs that were not getting fixed, and generating incorrect debugging info was one of them. These problems were not addressed until the jitter was completely rewritten by the RyuJIT project, first available in .NET 4.6. Targeting x86 in your EXE project is the workaround.
I understand that this is an old post, yet I would like to answer the question with my own experience.
I encountered the same issue recently in one of my WCF applications. After debugging and looking closely at the service logs, I found out that my code was showing this error because the service was hitting the maximum allowed time limit for code execution, and once the service hit that limit it tried to offload the current debugging session.
ERROR IN GREEN STATEMENT: this is the next statement to execute when the thread returns
So, to avoid this issue, look for any potential code (a code/service timeout or any other code block) that is trying to offload your currently executing code context and fix it. Furthermore, the original explanation given by @Hans is still very much relevant for troubleshooting this issue.
Actually, I am also facing this issue. In my case it was because I had missed a layout component in landscape mode, so check all the IDs and components and run again; you will not get this error.