How can I monitor stalled tasks? - Rust

I am running a Rust app with Tokio in production. The last version had a bug where some requests caused my code to go into an infinite loop.
While the task that entered the loop was stuck, all the other tasks continued to work and process requests; that went on until the number of stalled tasks was high enough to make my program unresponsive.
My problem is that it took our monitoring systems a long time to identify that something had gone wrong. For example, the task that answers Kubernetes' health check kept working, so I wasn't able to tell that I had stalled tasks in my system.
So my question is: is there a way to identify and alert on such cases?
If I could find a way to define a timeout on a task, and mark the task as stalled if it does not return to the scheduler within X seconds/milliseconds, that would be a good enough solution for me.

Using tracing might be an option here: following issue 2655, every Tokio task should have a span. Combined with tracing-futures, this means you should get a tracing event every time a task is entered or suspended (see this example). By adding the relevant data (e.g. task id / request id / ...) you should then be able to feed this information to an analysis tool in order to know:
that a task is blocked (was resumed then never suspended again)
if you add your own spans, that a "userland" span was never exited / closed, which might mean it's stuck in a non-blocking loop (which is also an issue though somewhat less so)
I think that's about the extent of it: as noted by issue 2510, Tokio doesn't yet use the tracing information it generates and so provides no "built-in" introspection facilities.
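As a rough sketch of how these pieces could fit together (combining the per-task span from the answer above with the timeout the question asks for), something along the following lines could turn a stalled task into an explicit event that monitoring can alert on. This is only a sketch: it assumes the tokio (with the time feature), tracing and tracing-subscriber crates, and handle_request, the request_id field and the 30-second deadline are placeholders, not anything Tokio provides out of the box.

use std::time::Duration;
use tracing::Instrument;

// Hypothetical request handler; stands in for whatever work a task does.
async fn handle_request(request_id: u64) {
    // ... real work here ...
    let _ = request_id;
}

// Spawn the handler inside its own span and under a deadline, so a task that
// never completes shows up as an error event instead of stalling silently.
fn spawn_watched(request_id: u64) {
    let span = tracing::info_span!("request", request_id = request_id);
    tokio::spawn(
        async move {
            match tokio::time::timeout(Duration::from_secs(30), handle_request(request_id)).await {
                Ok(()) => {}
                // The deadline elapsed: emit an event your monitoring can pick up.
                Err(_elapsed) => tracing::error!("task exceeded its deadline, possibly stalled"),
            }
        }
        .instrument(span),
    );
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    spawn_watched(1);
    // Give the spawned task a moment to run in this toy example.
    tokio::time::sleep(Duration::from_secs(1)).await;
}

One caveat: tokio::time::timeout can only fire when the wrapped future yields at an .await point, so it catches a task that is stuck waiting; a task spinning in a loop that never yields (the bug described in the question) will not be interrupted by it, which is why the "resumed but never suspended again" signal from the tracing data is the more general detector.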

Related

Sleep() Methods and OS - Scheduler (Camunda/Groovy)

I have a question for you and it's not as specific as usual, which could make it a little annoying to answer.
The tool I'm working with is Camunda in combination with Groovy scripts, and the goal is to reduce the maximum CPU load (or peak load). I'm doing this by "stretching" the workload over a certain time frame, since the platform seems to be unhappy with a huge workload arriving in a short amount of time; the resulting problem is that Camunda won't react smoothly when someone tries to operate it at the UI level.
So I wrote a small script which basically lets each individual process determine its own "time to sleep" before running, if a certain threshold is exceeded. This is based on how many processes are trying to run at the same time as that process.
It looks like:
Process wants to start -> Process asks how many other processes are running ->
waitingTime = numberOfProcesses * timeToSleep * iterationOfMeasures
[Image: CPU usage over time. Curves 1 and 3: without the script; curves 2 and 4: with the script.]
Testing it, I saw that I could stretch the workload and smooth things out at the UI level. But now I need to describe exactly why this works.
The questions are:
What does a sleep method do, exactly?
What does the sleep method do at the CPU level?
How does an OS scheduler react to a sleep method?
Namely: does the scheduler reschedule, or does it simply "wait" for the time given?
How can I recreate and test the questions given above?
The main goal is not for you to answer this, but could you give me a hint for finding the right literature to answer these questions? Maybe you remember a book which helped you understand this kind of thing, or a professor recommended something to you. (Mine won't answer, and I can't blame him.)
I'm grateful for hints and/or recommendations!
I'm sure you could use a timer event:
https://docs.camunda.org/manual/7.15/reference/bpmn20/events/timer-events/
It allows you to postpone the next task trigger for some time defined by an expression.
About sleep in Java/Groovy: https://www.javamex.com/tutorials/threads/sleep.shtml
Using sleep blocks the current thread in Groovy/Java/Camunda, so instead of doing something useful the thread is just blocked.

Best way to implement background “timer” functionality in Python/Django

I am trying to implement a Django web application (on Python 3.8.5) which allows a user to create “activities” where they define an activity duration and then set the activity status to “In progress”.
The POST action to the View writes the new status, the duration and the start time (end time, based on start time and duration is also possible to add here of course).
The back-end should then keep track of the duration and automatically change the status to “Finished”.
User actions can also change the status to “Finished” before the calculated end time (i.e. the timer no longer needs to be tracked).
I am fairly new to Python, so I need some advice on the smartest way to implement such a concept.
It needs to be efficient and scalable – I’m currently using a Heroku Free account so have limited system resources, but efficiency would also be important for future production implementations of course.
I have looked at the Python threading Timer, and this seems to work on a basic level, but I’ve not been able to determine what kind of constraints this places on the system – e.g. whether the spawned Timer thread might prevent the main thread from finishing and releasing resources (i.e. Heroku Dyno threads), etc.
I have read that persistence might be a problem (if the server goes down), and I haven’t found a way to cancel the timer from another process (the .cancel() method seems to rely on having the original object to cancel, and I’m not sure if this is achievable from another process).
I was also wondering about a more “background” approach, i.e. a single process which is constantly checking the database looking for activity records which have reached their end time and swapping the status.
But what would be the best way of implementing such a server?
Is it practical to read the database every second to find records with an end time of “now”? I need the status to change in real-time when the end time is reached.
Is something like Celery a good option, or is it overkill for a single process like this?
As I said I’m fairly new to these technologies, so I may be missing other obvious solutions – please feel free to enlighten me!
Thanks in advance.
To achieve this you need some kind of task scheduling functionality. For a quick, simpler implementation, a good solution is to use the Timer object from the threading module.
A more complete solution is to use Celery. If you are new to it, digging into it will give you good value: Celery works as a queue manager, distributing your work easily across several threads or processes.
You mentioned that you want it to be efficient and scalable, so I guess you will eventually want similar functionality that requires multiprocessing and scheduling; for that reason my recommendation is to use Celery.
You can integrate it into your Django application easily by following the documentation on integrating Django with Celery.

How does Erlang sleep (at night?)

I want to run a small clean up process every few hours on an Erlang server.
I know of the timer module. I saw an example in a tutorial that used chained timer:sleep calls to wait for an event that would occur multiple days later, which I found strange. I understand that Erlang processes are unique compared to those in other languages, but the idea of a process/thread sleeping for days, weeks, and even months at a time seemed odd.
So I set out to find out the details of what sleeping actually does. The closest I found was a blog post mentioning that sleep is implemented with a receive timeout, but that still left the question:
What do these sleep/sleep-like functions actually do?
Is my process taking up resources as it sleeps? Would thousands of sleeping processes use as many resources as, say, thousands of processes servicing a recursive call that did nothing? Is there any performance penalty from repeatedly sleeping within processes, or from sleeping for long periods of time? Is the VM constantly expending resources to check whether the conditions to end a process's sleep have been met?
And as a side note, I'd appreciate it if someone could comment on whether there is a better way than sleeping to pause for hours or days at a time.
That is the karma of any Erlang process: it waits or dies :o)
When a process is spawned, it starts executing until the last execution line, and then dies, returning the last evaluation.
To keep a process alive, there is no other solution than to loop recursively in a never-ending succession of calls.
Of course there are several conditions that make it stop or sleep:
End of the loop: the process received a message telling it to stop the recursion.
A receive block: the process will wait until a message matching one entry in the receive block is posted in its message queue.
The VM scheduler stops it temporarily to give other processes access to the CPU.
In the last two cases, execution will restart under the responsibility of the VM scheduler.
While waiting, the process uses no CPU, but it keeps the exact same memory footprint it had when it started waiting. Erlang/OTP offers a way to reduce this footprint to a minimum using the hibernate option (see the documentation of gen_server or gen_fsm), but in my mind it is for advanced usage only.
A simple way to create a "signal" that fires a process at a regular (or almost regular) interval is indeed to use a receive block with a timeout (the timeout value must fit in 32 bits, i.e. at most 16#FFFFFFFF ms), for example:
on_tick_sec(Module,Function,Arglist,Period) ->
    on_tick(Module,Function,Arglist,1000,Period,0).
on_tick_mn(Module,Function,Arglist,Period) ->
    on_tick(Module,Function,Arglist,60000,Period,0).
on_tick_hr(Module,Function,Arglist,Period) ->
    on_tick(Module,Function,Arglist,60000,Period*60,0).
on_tick(Module,Function,Arglist,TimeBase,Period,Period) ->
    apply(Module,Function,Arglist),
    on_tick(Module,Function,Arglist,TimeBase,Period,0);
on_tick(Module,Function,Arglist,TimeBase,Period,CountTimeBase) ->
    receive
        stop -> stopped
    after TimeBase ->
        on_tick(Module,Function,Arglist,TimeBase,Period,CountTimeBase+1)
    end.
and usage:
1> Pid = spawn(util,on_tick_sec,[io,format,["hello~n"],5]).
<0.40.0>
hello
hello
hello
hello
2> Pid ! stop.
stop
3>
[edit]
The timer module is a standard gen_server running in a separate process. All the functions in the timer module are public interfaces that execute a hidden gen_server:call or gen_server:cast to the timer server. This is a common pattern used to hide the internals of a server and allow it to evolve without impact on existing applications.
Internally, the server uses an ETS table to store all the actions it has to perform along with each timer reference, and it uses its own mechanism to be woken up when needed (in the end, the VM must take care of this?).
So you can hibernate a process without any effect on the timer server's behavior. The hibernation mechanism is tricky; see the documentation of erlang:hibernate/3, and you will see that you have to "rebuild" the context yourself, since everything was removed from the process context and a tuple {Module,Function,Arguments} is stored by the system to restart your process when needed.
It also costs some time in garbage collection and process restart.
That is why I said it is really an advanced feature that needs a good reason to be used.
There is also erlang:hibernate/3, which puts a process into "deep sleep", minimizing its memory usage.

Open MPI Virtual Timer Expired

I'm using Open MPI 1.8 on Gentoo 3.13 to manage the data transfer from one program to another via a server/client concept. Both the server and the clients are launched via mpiexec as separate processes. After some days (this is quite a heavy computation...), I sometimes receive the error
mpiexec noticed that process rank 0 with PID 17213 on node XXX exited on signal 26 (Virtual timer expired).
Unfortunately, the error is not reproducible in a reliable way, i.e., the error does not appear always and not always at the same point in the program flow. I also experienced this error on other machines. I already tracked the issue down to the ITIMER_VIRTUAL which, upon expiration, delivers SIGVTALRM (see, e.g., http://man7.org/linux/man-pages/man2/setitimer.2.html). In the BUGS section of the man page, it says that
Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.
I wonder if something similar might also hold for ITIMER_VIRTUAL? Did anyone experience similar problems and can confirm the error?
The only workaround I can think of is to invoke setitimer(...) and try to manipulate the timer myself. However, I hope there is another way since I can't always modify the clients' source code. Any suggestions?
Since this question has not been answered officially, I will do it on behalf of Hristo (@HristoIliev: I hope this is OK with you). As was pointed out in the first comment on my question, there is not a single hint in the Open MPI source code that could have caused the virtual timer expiration. Indeed, the timer problem was related to a third-party library which made the code crash after an unpredictable amount of time (depending on the current load of the machine).

Using TDD to drive out thread-safe code

What's a good way to leverage TDD to drive out thread-safe code? For example, say I have a factory method that utilizes lazy initialization to create only one instance of a class, and return it thereafter:
private TextLineEncoder textLineEncoder;
...
public ProtocolEncoder getEncoder() throws Exception {
    if (textLineEncoder == null)
        textLineEncoder = new TextLineEncoder();
    return textLineEncoder;
}
Now, I want to write a test in good TDD fashion that forces me to make this code thread-safe. Specifically, when two threads call this method at the same time, I don't want to create two instances and discard one. This is easily done, but how can I write a test that makes me do it?
I'm asking this in Java, but the answer should be more broadly applicable.
You could inject a "provider" (a really simple factory) that is responsible for just this line:
textLineEncoder = new TextLineEncoder();
Then your test would inject a really slow implementation of the provider, so that the two threads in the test could more easily collide. You could go as far as having the first thread wait on a Semaphore that would be released by the second thread; success of the test would then ensure that the waiting thread times out. By giving the first thread a head start, you can make sure that it is already waiting before the second one releases it.
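Since the question explicitly asks for an answer that is broadly applicable beyond Java, here is a minimal sketch of the same idea transposed to Rust (the language of the question at the top of this page), using a Barrier to line the two threads up rather than a semaphore with a head start. The type names, the 100 ms delay and the construction counter are placeholders invented for the illustration; the test is expected to fail against the naive factory below, which is exactly the pressure TDD wants, since fixing it forces you to hold the lock across construction (or use a once-initialization primitive).

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, Mutex};
use std::thread;
use std::time::Duration;

// Placeholder for the real encoder type from the question.
struct TextLineEncoder;

// Counts how many encoders were constructed, so the test can spot a duplicate.
static CREATED: AtomicUsize = AtomicUsize::new(0);

struct EncoderFactory {
    cached: Mutex<Option<Arc<TextLineEncoder>>>,
}

impl EncoderFactory {
    fn new() -> Self {
        EncoderFactory { cached: Mutex::new(None) }
    }

    // Deliberately naive lazy init: the check happens under the lock, but the
    // (slow) construction happens outside it, so two threads can both see
    // `None` and both build an encoder, which is the bug the test should expose.
    fn get_encoder(&self) -> Arc<TextLineEncoder> {
        if let Some(encoder) = self.cached.lock().unwrap().clone() {
            return encoder;
        }
        // The "really slow provider": the delay widens the race window.
        thread::sleep(Duration::from_millis(100));
        CREATED.fetch_add(1, Ordering::SeqCst);
        let encoder = Arc::new(TextLineEncoder);
        *self.cached.lock().unwrap() = Some(encoder.clone());
        encoder
    }
}

#[test]
fn only_one_encoder_is_created() {
    let factory = Arc::new(EncoderFactory::new());
    let barrier = Arc::new(Barrier::new(2));

    let handles: Vec<_> = (0..2)
        .map(|_| {
            let (factory, barrier) = (Arc::clone(&factory), Arc::clone(&barrier));
            thread::spawn(move || {
                barrier.wait(); // line both threads up on the racy call
                factory.get_encoder();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Fails against the naive factory above: both threads usually construct an
    // encoder and one of them is silently discarded.
    assert_eq!(CREATED.load(Ordering::SeqCst), 1);
}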
It's hard, though possible - possibly harder than it's worth. Known solutions involve instrumenting the code under test. The discussion here, "Extreme Programming Challenge Fourteen" is worth sifting through.
In the book Clean Code there are some tips on how to test concurrent code. One tip that has helped me to find concurrency bugs, is running concurrently more tests than the CPU has cores.
In my project, running the tests takes about 2 seconds on my quad core machine. When I want to test the concurrent parts (there are some tests for that), I hold down in IntelliJ IDEA the hotkey for running all tests, until I see in the status bar that 20, 50 or 100 test runs are in execution. I follow in Windows Task Manager the CPU and memory usage, to find out when all the test runs have finished executing (memory usage goes up by 1-2 GB when they all are running and then slowly goes back down).
Then I close one by one all the test run output dialogs, and check that there were no failures. Sometimes there are failed tests or tests which are in deadlock, and then I investigate them until I find the bug and have fixed it. That has helped me to find a couple of nasty concurrency bugs. The most important thing, when facing an exception/deadlock that should not have happened, is always assuming that the code is broken, and investigating the reason ruthlessly and fixing it. There are no cosmic rays which cause programs to crash randomly - bugs in code cause programs to crash.
There are also frameworks such as http://www.alphaworks.ibm.com/tech/contest which use bytecode manipulation to force the code to do more thread switching, thus increasing the probability of making concurrency bugs visible.
When I recently test-drove an implementation that needed to be thread-safe, I came up with the solution I provided as an answer to this question. Hope that helps even though there are no tests there. Hope the link is OK rather than duplicating the answer...
Chapter 12 of Java Concurrency in Practice is called "Testing Concurrent Programs". It documents testing for safety and liveness, but says this is a hard subject. I am not sure this problem is solvable by the tools of that chapter.
Just off the top of my head: could you compare the instances returned to see whether they are indeed the same instance or different ones? That's probably where I would start with C#; I would imagine you can do the same in Java.
