Explaining a surprising number of timers and finding their origin - memory-leaks

I have a CPU leak somewhere and am trying to find its origin (RAM usage increases a bit too, but not as fast).
I have collected some data through dotnet-counters and found that the number of timers in the process keeps increasing. For instance, when 46% of the CPU was in use, it was reporting around 450 timers (see the following data). Note that this count only ever increases, slowly but surely, just like my CPU usage. Also note that this was measured while the process was essentially idle, with no customers online.
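(The timer count here comes from the System.Runtime counters; something along these lines shows it live, with <pid> being the target process id:)
dotnet-counters monitor --process-id <pid> System.Runtime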
This leads to 2 questions:
1) Is this normal? Should I continue down this track? I find it odd to have this many timers.
2) In my own code, I never use the Timer class; I use the Stopwatch class once, but that's it.
So what would be the best way to find the class/library that uses those timers?

1) There was indeed something fishy, and it is definitely not normal.
I had a class spawning hundreds of threads and keeping them alive to listen for incoming calls.
Long story short, that many timers is not normal.
2) The timers were coming from somewhere deep inside a library; I couldn't find an easy way to track them down.
The best option is to use dotnet-trace collect. https://www.speedscope.app/ helped me a lot, and it is possible to generate the traces for speedscope directly from dotnet-trace.
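For example, something like the following writes a speedscope-compatible trace straight from the running process (<pid> being the target process id):
dotnet-trace collect --process-id <pid> --format Speedscope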

Related

Random slowdowns in node.js execution

I have an optimization algorithm written in node.js that uses cpu time (measured with performance.now()) as a heuristic.
However, I noticed that occasionally some trivial lines of code would cost much more than usual.
So I wrote a test program:
const { performance } = require('perf_hooks'); // global in recent Node versions, required here for clarity
const _ = require('lodash');                    // provides _.mean / _.max / _.min used below

const timings = [];
while (true) {
  const start = performance.now();
  // can add any trivial line of code here, or just nothing
  const end = performance.now();
  const dur = end - start;
  if (dur > 1) {
    throw [
      "dur > 1",
      {
        start,
        end,
        dur,
        timings,
        avg: _.mean(timings),
        max: _.max(timings),
        min: _.min(timings),
        last: timings.slice(-10),
      },
    ];
  }
  timings.push(dur);
}
The measurements showed an average of 0.00003ms and a peak >1ms (with the second highest <1ms but same order of magnitude).
The possible reasons I can think of are:
the average timing isn't the actual time for executing the code (some compiler optimization)
performance.now isn't accurate somehow
cpu scheduling related - process wasn't running normally but still counted in performance.now
occasionally node is doing something extra behind the scenes (GC etc)
something happening on the hardware/os level - caching / page faults etc
Is any of these a likely reason, or is it something else?
Whichever the cause is, is there a way to make a more accurate measurement for the algorithm to use?
The outliers are currently causing the algorithm to misbehave, and without knowing how to resolve this issue, the best option is to use a moving average of the cost as the heuristic, which has its own downsides.
Thanks in advance!
------- Edit
I appreciate that performance.now() will never be perfectly accurate, but I was a bit surprised that the error could span 3-4 orders of magnitude (as opposed to 2 orders of magnitude, or ideally 1).
Would anyone have any idea/pointers as to how performance.now() works and thus what's likely the major contributor to the error range?
It'd be nice to know if the cause is due to something node/v8 doesn't have control over (hardware/os level) vs something it does have control over (a node bug/options/gc related), so I can decide whether there's a way to reduce the error range before considering other tradeoffs with using an alternative heuristic.
------- Edit 2
Thanks to @jfriend00 I now realize performance.now() doesn't measure the actual CPU time the node process has consumed, but just the wall-clock time since the process started.
The question now is
if there's an existing way to get actual CPU time
is this a feature request for node/v8
unless the node process doesn't have enough information from the OS to provide this
You're unlikely to be able to accurately measure the time for one trivial line of code. In fact, the overhead in executing performance.now() is probably many times higher than the time to execute one trivial line of code. You have to be careful that what you're measuring takes substantially longer to execute than the uncertainty or overhead of the measurement itself. Measuring very small execution times is not going to be an accurate endeavor.
Items 1, 3 and 5 in your list are all also possibilities. You aren't guaranteed that your code gets a dedicated CPU core that is never interrupted to service some other thread in the system. On my Windows system, even when my nodejs app is the only "app" running, there are hundreds of other threads devoted to various OS services that may or may not request some time to run while my nodejs app is running, and they eventually get some time slice of the CPU core my nodejs app was using.
And, as best I know, performance.now() is just getting a high resolution timer from the OS that's relative to some epoch time. It has no idea when your thread is and isn't running on a CPU core and wouldn't have any way to adjust for that. It just gets a high resolution timestamp which you can compare to some other high resolution timestamp. The time elapsed is not CPU time for your thread. It's just clock time elapsed.
Is any of these a likely reason, or is it something else?
Yes, they all sound likely.
is there a way to make a more accurate measurement for the algorithm to use?
No, sub-millisecond time measurements are generally not reliable, and almost never a good idea. (Doesn't matter whether a timing API promises micro/nanosecond precision or whatever; chances are that (1) it doesn't hold up in practice, and (2) trying to rely on it creates more problems than it solves. You've just found an example of that.)
Even measuring milliseconds is fraught with peril. I once investigated a case of surprising performance, where it turned out that on that particular combination of hardware and OS, after 16ms of full load the CPU ~tripled its clock rate, which of course had nothing to do with the code that appeared to behave weirdly.
EDIT to reply to edited question:
The question now is
if there's an existing way to get actual CPU time
No.
is this a feature request for node/v8
No, because...
unless the node process doesn't have enough information from the OS to provide this
...yes.

Threading the right choice?

I'm new to threads, so I'm not sure if threads are the right way to approach this.
My program needs to perform a calculation a couple of times, with the same logic behind it but different parameters. The longer the calculation runs, the closer it gets to the perfect answer. The calculation duration can't be estimated beforehand (it ranges from a few seconds to a couple of minutes).
The user wants to have the results in order (from calculation 1 to X) at certain times. He is satisfied with a less-than-perfect solution as long as he gets a result. Once he has a solution, he is not interested in the ones before it (example: he has an imperfect answer from calculation 1 and now demands the answer from calculation 2; even if there is a better answer for calculation 1 by then, he is not interested in it).
Is threading the right way to do this?
Threading sounds like a good approach for this, as you can perform your long-running computation on a background thread while keeping your UI responsive.
In order to satisfy your requirement of having results in an order, you may need a way of stopping threads that are no longer needed. Either abort them (may be extreme), or just signal them to stop and/or return the current result.
Note you may want the threads to periodically check back in with the UI to report progress (% complete), check for any abort requests, etc. Although this depends entirely upon your application and is not necessarily required.
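To make the "signal them to stop and return the current result" idea concrete, here is a minimal sketch assuming .NET, since the question doesn't name a platform; RunAsync, Improve and the int parameter are made-up names purely for illustration:
using System;
using System.Threading;
using System.Threading.Tasks;

class Calculation
{
    // Runs one calculation on a background thread; it keeps refining its answer until cancelled.
    public Task<double> RunAsync(int parameter, IProgress<int> progress, CancellationToken token) =>
        Task.Run(() =>
        {
            double bestSoFar = 0;
            for (int i = 0; i < 1_000_000 && !token.IsCancellationRequested; i++)
            {
                bestSoFar = Improve(bestSoFar, parameter);  // placeholder for one refinement step
                if (i % 10_000 == 0)
                    progress?.Report(i / 10_000);           // occasional progress report for the UI
            }
            return bestSoFar;                               // best answer so far when told to stop
        });

    private static double Improve(double current, int parameter) => current + 1.0 / (parameter + 1);
}

// Caller: keep one CancellationTokenSource per calculation; when the user asks for the next
// result, call cts.Cancel() and await the running task to get its current best answer.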

Limiting work in progress of parallel operations of a streamed resource

I've found myself recently using the SemaphoreSlim class to limit the work in progress of a parallelisable operation on a (large) streamed resource:
// The below code is an example of the structure of the code; there are some
// omissions around handling of tasks that do not run to completion that should be in production code
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
foreach (var result in StreamResults())
{
    semaphore.Wait();
    var task = DoWorkAsync(result).ContinueWith(t => semaphore.Release());
    ...
}
This is to avoid bringing too many results into memory and the program being unable to cope (generally evidenced by an OutOfMemoryException). Though the code works and is reasonably performant, it still feels ungainly, notably the someMagicNumber multiplier, which, although tuned via profiling, may not be as optimal as it could be and isn't resilient to changes in the implementation of DoWorkAsync.
In the same way that thread pooling can overcome the obstacle of scheduling many things for execution, I would like something that can overcome the obstacle of scheduling many things to be loaded into memory based on the resources that are available.
Since it is deterministically impossible to decide whether an OutOfMemoryException will occur, I appreciate that what I'm looking for may only be achievable via statistical means or even not at all, but I hope that I'm missing something.
Here I'd say that you're probably overthinking this problem. The consequences of overshooting are rather high (the program crashes). The consequences of being too low are that the program might be slowed down. As long as you still have some buffer beyond a minimum value, further increases to the buffer will generally have little to no effect, unless the processing time of that task in the pipe is extraordinarily volatile.
If your buffer is constantly filling up, it generally means that the task before it in the pipe executes quite a bit quicker than the task that follows it, so even a fairly small buffer is likely to ensure the task following it always has some work. The buffer size needed to get 90% of the benefits of a buffer is usually going to be quite small (a few dozen items maybe), whereas the size needed to get an OOM error is something like 6+ orders of magnitude higher. As long as you're somewhere in between those two numbers (and that's a pretty big range to land in) you'll be just fine.
Just run your static tests, pick a static number, maybe add a few percent extra for "just in case" and you should be good. At most, I'd move some of the magic numbers to a config file so that they can be altered without a recompile in the event that the input data or the machine specs change radically.
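For what it's worth, the same bounded-concurrency structure can be written with WaitAsync and a try/finally so that the permit is released even when a task faults or is cancelled. This is only a sketch of the shape, reusing the question's own StreamResults, DoWorkAsync and someMagicNumber placeholders, and it assumes it sits inside an async method:
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
var tasks = new List<Task>();
foreach (var result in StreamResults())
{
    await semaphore.WaitAsync();           // throttle before materialising more of the stream
    tasks.Add(Task.Run(async () =>
    {
        try
        {
            await DoWorkAsync(result);
        }
        finally
        {
            semaphore.Release();           // released on success, fault or cancellation
        }
    }));
}
await Task.WhenAll(tasks);                 // surfaces any exceptions once the stream is drained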

Get more deterministic (short) sleeps

As a student project, we're building a robot which should run through a defined course and pick up a wooden cube. The core of it is a single-board computer running Debian with an ARM9 at 250MHz, so there is more than enough processing power for the controller. Additionally, it does some image processing (not when driving, only when it stops), which is why we don't use a simple microcontroller without an OS.
The whole thing works quite well at the moment, but there is one problem: the main controlling loop executes without any delays and achieves cycle frequencies of over 1kHz. This is more than enough; 100Hz would suffice as well. But every now and then there is a single cycle which takes 100ms or more, which can greatly disturb the controller.
I suspect that there are some other tasks which cause this delay, since the scheduler may detect that they haven't got any CPU time for an extended period of time.
So what I'd like most is the following: a short sleep of maybe 5ms in the controller's main loop which really does take only 5ms but gives some processor time to the rest of the system. I can include a delay of e.g. 500us using nanosleep, but this always takes more than 10ms to execute, so it is not really an alternative. I'd prefer something like a voluntary sleep that gives waiting tasks the opportunity to do something, but returns as quickly as possible.
The system is unloaded otherwise, so there is nothing which could really need a lot of processing for a long time.
Is there any way to do this in userspace, i.e. without having to stick to things like RTAI?
Thanks,
Philipp
I suggest that you stick to real-time interfaces when doing motor control; I've seen a 1000 kg truck slam into a wall during a demo because the OS decided to think about something else for once... :-)
If you want to stay away from RTAI (but you shouldn't), a (perhaps) quick fix is to slap an Arduino board in for the actual driving, and keep the Linux board for the high-level processing.
To fix the "wall problem", implement a watchdog in the driver board that stops the thing if no command has arrived for a while...
A real-time problem needs a real-time OS.
Because a real-time OS is predictable. You can set task priorities such that the ones that need to produce results at specified times don't get preempted by tasks that need processing power but are not time constrained.
OK, I've found a solution which makes it better, although not perfect. There is another thread which explains the sched_setscheduler() function. My init code now looks like this:
#include <sched.h>     // sched_setscheduler, sched_get_priority_max, SCHED_FIFO
#include <sys/mman.h>  // mlockall
#include <iostream>
using std::cerr;
using std::endl;

// Lock memory to reduce probability of high latency due to swapping
int rtres = mlockall(MCL_CURRENT | MCL_FUTURE);
if (rtres) {
    cerr << "WARNING: mlockall() failed: " << rtres << endl;
}

// Set real-time scheduler policy to get more time-deterministic behaviour
struct sched_param rtparams;
rtparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
rtres = sched_setscheduler(0, SCHED_FIFO, &rtparams);
if (rtres) {
    cerr << "WARNING: sched_setscheduler() failed: " << rtres << endl;
}
Additionally, I've removed the short sleeps from the mainloop. The process now eats all available processing power and is (obviously) unresponsive to any actions from the outside world (such as console keystrokes and such), but this is OK for the task at hand.
The mainloop stats show that most iterations take less than a millisecond to complete, but there are still a few (every 1000th or so) which need approx. 100ms. This is still not good, but at least there are no more delays which are even longer.
Since it is "only" a student project, we can live with that and note it as a candidate for further improvement.
Anyway, thanks for the suggestions. For next time, I'd know better how to cope with realtime requirements and take an RT OS from the beginning.
Regards,
Philipp
We are running 100 Hz loops on an ARM7 board at work, using a standard Linux with the RT patch, which allows (almost) all locks in the kernel to be preempted. Combining this with a high-priority thread gives us the necessary performance in both kernel and user space.
As the only thing you need to do is to apply the patch and configure the kernel to use full preemption, it's pretty easy to use also - no need to change anything in the software architecture, although I'm not familiar enough with Debian to say whether the patch will apply cleanly.

Progress bar and multiple threads, decoupling GUI and logic - which design pattern would be the best?

I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
The graph can be scrolled - in this case new data portions need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series, one for the dataset update and one for the graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that informs about the global progress. At the moment I can think of MVC and Observer/Observable, but it's a little bit blurry :) Maybe somebody could point me in the right direction, thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed, not just the "amount" of data but also the "type" of data and B) 2 of the steps scaled extremely well with increasing number of cpus, 2 steps ran in 2 threads and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on execution time of each step than number of cores.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function analyzed both the data sets under analysis and the number of cpus. Based on run-time data from my own 4 core machine, each function basically returned the number of milliseconds it was expected to take, on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now given this information, you can effectively know what percentage of the total execution time each step is supposed to take. Now if you say step1 is supposed to take 40% of the execution time, you basically need to find out how to emit 40 1% events from that algorithm. Say the for-loop is processing 100,000 items, you could probably do:
for (int i = 0; i < numItems; i++) {
    if (i % (numItems / percentageOfTotalForThisStep) == 0) emitProgressEvent();
    // .. do the actual processing ..
}
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratio (which is the only really important thing here), so you can see how your local hardware runs differently from the one I had.
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker, and in some cases - such as yours - numeric properties representing the progress and expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based; if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need to have some way of updating its owning Operation with its current progress through whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
Obviously there's a ton of other considerations/work that needs be done in actually implementing this, but I hope this gives you the gist of it.
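As a bare-bones sketch of that idea (in C#, since the question doesn't name a language; Operation, ExpectedWork and CompletedWork are illustrative names, not an existing framework API):
using System;
using System.Collections.Generic;
using System.Linq;

// Lives on the main/UI thread; each worker reports into its own Operation.
class Operation
{
    public double ExpectedWork { get; set; }           // estimated total (ms, items, calculations...)
    public double CompletedWork { get; private set; }  // updated as the worker reports progress

    // Call this via whatever mechanism marshals back to the main thread (callback, event, dispatcher).
    public void Report(double completed) => CompletedWork = completed;
}

class OverallProgress
{
    private readonly List<Operation> operations = new List<Operation>();

    public void Add(Operation operation) => operations.Add(operation);

    // Single 0..1 value for the summary progress bar.
    public double Fraction =>
        operations.Sum(o => o.CompletedWork) / Math.Max(1e-9, operations.Sum(o => o.ExpectedWork));
}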
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer such design questions, I first try to look at similar or analogous problems in other applications, and how they're solved. So I would suggest you do some research by considering other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.
