Estimating WCET of a task on Linux

I want to approximate the worst-case execution time (WCET) for a set of tasks on Linux. Most professional tools are either expensive (thousands of dollars) or don't support my processor architecture.
Since I don't need a tight bound, my plan is to:
disable frequency scaling
disable unnecessary background services and tasks
set the program's affinity so that it runs on a specified core
run the program 50,000 times with various inputs
profile each run and store the total number of cycles it took to execute
Given the largest cycle count and the core frequency, I can get an estimate.
Is this a sound, practical approach?
Secondly, to account for interference from other tasks, I will run the whole task set (40 tasks) in parallel, with each task randomly assigned to a core, and repeat the measurement 50,000 times.
Once I have the estimate, I will add a 10% safety margin to account for unforeseeable interference and untested paths. This 10% margin was suggested in the paper "Approximation of Worst Case Execution time in Preemptive Multitasking Systems" by Corti, Brega and Gross.
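The arithmetic described above can be sketched as follows. This is only an illustration of the calculation: the function name `estimateWcetSeconds`, the sample cycle counts and the 1.2 GHz frequency are made-up assumptions, and the cycle counts are assumed to have been collected already by whatever profiler is in use.

```javascript
// Hypothetical helper: convert measured cycle counts into a padded time estimate.
function estimateWcetSeconds(cycleCounts, coreFrequencyHz, safetyMargin = 0.10) {
  if (cycleCounts.length === 0) throw new Error("no measurements");
  // Plain loop instead of Math.max(...array), which can overflow the stack on large arrays.
  let maxCycles = 0;
  for (const c of cycleCounts) {
    if (c > maxCycles) maxCycles = c;
  }
  const observedWorstSeconds = maxCycles / coreFrequencyHz;
  // Pad the largest observed time by the safety margin (10% by default).
  return observedWorstSeconds * (1 + safetyMargin);
}

// Made-up sample data: cycle counts from four runs on a core fixed at 1.2 GHz.
const sampleCycles = [1200000, 1350000, 1180000, 2400000];
const wcetEstimate = estimateWcetSeconds(sampleCycles, 1.2e9);
console.log(`estimated WCET: ${wcetEstimate} s`);
```

The result is only as good as the largest run that happened to be observed, so it is an estimate rather than a guaranteed bound.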

Some comments:
1) Even attempting to compute worst case bounds in this way means making assumptions that there aren't uncommon inputs that cause tasks to take much more or even much less time. An extreme example would be a bug that causes one of the tasks to go into an infinite loop, or that causes the whole thing to deadlock. You need something like a code review to establish that the time taken will always be pretty much the same, regardless of input.
2) It is possible that the input data does influence the time taken to some extent. Even if this isn't apparent to you, it could happen because of the details of the implementation of some library function that you call. So you need to run your tests on a representative selection of real life data.
3) When you have got your 50K test results, I would draw some sort of probability plot - see e.g. http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm and links off it. I would be looking for isolated points that show that in a few cases some runs were suspiciously slow or suspiciously fast, because the code review from (1) said there shouldn't be runs like this. I would also want to check that adding 10% to the maximum seen takes me a good distance away from the points I have plotted. You could also plot time taken against different parameters from the input data to check that there wasn't any pattern there.
4) If you want to try a very sophisticated approach, you could try fitting a statistical distribution to the values you have found - see e.g. https://en.wikipedia.org/wiki/Generalized_Pareto_distribution. But plotting the data and looking at it is probably the most important thing to do.
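Before reaching for a plot or a fitted distribution, a quick numeric summary can already show whether the maximum is an isolated point. The sketch below is an assumption-laden illustration, not a substitute for plotting: the names `quantile` and `summarize`, the choice of the 99th percentile and the sample data are all made up.

```javascript
// Nearest-rank quantile on an ascending-sorted array (q in (0, 1]).
function quantile(sorted, q) {
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Summarize run times so that an isolated, suspiciously slow run stands out.
function summarize(times) {
  const sorted = [...times].sort((a, b) => a - b);
  const median = quantile(sorted, 0.5);
  const p99 = quantile(sorted, 0.99);
  const max = sorted[sorted.length - 1];
  return {
    median,
    p99,
    max,
    maxOverP99: max / p99,   // a large ratio hints at an isolated outlier
    paddedMax: max * 1.10,   // the 10% margin from the question
  };
}

// Made-up data: 100 runs around 1 ms with one suspiciously slow run.
const runTimes = Array.from({ length: 100 }, (_, i) => 1.0 + i * 0.001);
runTimes[42] = 3.5;
console.log(summarize(runTimes));
```

A `maxOverP99` far above 1 suggests that the slowest run deserves a closer look before it is trusted as the basis for the bound.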

Related

Random slowdowns in node.js execution

I have an optimization algorithm written in node.js that uses cpu time (measured with performance.now()) as a heuristic.
However, I noticed that occasionally some trivial lines of code would cost much more than usual.
So I wrote a test program:
const _ = require("lodash"); // provides mean/max/min used below
const { performance } = require("perf_hooks");

const timings = [];
while (true) {
    const start = performance.now();
    // can add any trivial line of code here, or just nothing
    const end = performance.now();
    const dur = end - start;
    if (dur > 1) {
        // stop at the first measurement above 1 ms and dump the statistics
        throw [
            "dur > 1",
            {
                start,
                end,
                dur,
                timings,
                avg: _.mean(timings),
                max: _.max(timings),
                min: _.min(timings),
                last: timings.slice(-10),
            },
        ];
    }
    timings.push(dur);
}
The measurements showed an average of 0.00003ms and a peak >1ms (with the second highest <1ms but same order of magnitude).
The possible reasons I can think of are:
the average timing isn't the actual time for executing the code (some compiler optimization)
performance.now isn't accurate somehow
cpu scheduling related - process wasn't running normally but still counted in performance.now
occasionally node is doing something extra behind the scenes (GC etc)
something happening on the hardware/os level - caching / page faults etc
Is any of these a likely reason, or is it something else?
Whichever the cause is, is there a way to make a more accurate measurement for the algorithm to use?
The outliers are currently causing the algorithm to misbehave. Without knowing how to resolve this issue, the best option is to use the moving average cost as a heuristic, but that has its downsides.
Thanks in advance!
------- Edit
I appreciate how performance.now() will never be accurate, but was a bit surprised that it could span 3-4 orders of magnitude (as opposed to 2 orders of magnitude or ideally 1.)
Would anyone have any idea/pointers as to how performance.now() works and thus what's likely the major contributor to the error range?
It'd be nice to know if the cause is due to something node/v8 doesn't have control over (hardware/os level) vs something it does have control over (a node bug/options/gc related), so I can decide whether there's a way to reduce the error range before considering other tradeoffs with using an alternative heuristic.
------- Edit 2
Thanks to @jfriend00 I now realize performance.now() doesn't measure the actual CPU time the node process spent executing, but just the time elapsed since the process started.
The question now is
if there's an existing way to get actual CPU time
is this a feature request for node/v8
unless the node process doesn't have enough information from the OS to provide this
You're unlikely to be able to accurately measure the time for one trivial line of code. In fact, the overhead of executing performance.now() is probably many times higher than the time to execute one trivial line of code. You have to be careful that what you're measuring takes substantially longer to execute than the uncertainty or overhead of the measurement itself. Measuring very small execution times is not going to be an accurate endeavor.
Items 1, 3 and 5 in your list are also all possibilities. You aren't guaranteed that your code gets a dedicated CPU core that is never interrupted to service some other thread in the system. On my Windows system, even when my nodejs process is the only "app" running, there are hundreds of other threads devoted to various OS services that may or may not request some time to run while my nodejs app is running, and that eventually get a time slice on the CPU core my nodejs app was using.
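One common way to make the measured work much longer than the timer overhead is to time a batch of calls and take the median over several batches. The sketch below assumes a recent Node.js where `performance` is available as a global; the names `median` and `timePerCall`, the batch sizes and the sample workload are illustrative assumptions.

```javascript
// Median is less sensitive than the mean to an occasional slow batch.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Time `callsPerSample` calls at once so the timer overhead is amortized,
// repeat for several samples, and report the median per-call cost in ms.
function timePerCall(fn, callsPerSample = 100000, samples = 15) {
  const perCall = [];
  for (let s = 0; s < samples; s++) {
    const start = performance.now();
    for (let i = 0; i < callsPerSample; i++) fn(i);
    perCall.push((performance.now() - start) / callsPerSample);
  }
  return median(perCall);
}

// Illustrative workload; `sink` keeps the work observable so it is less likely to be optimized away.
let sink = 0;
const costMs = timePerCall((i) => { sink += i % 7; });
console.log(`about ${costMs} ms per call`);
```

This still measures wall-clock time, so it reduces the influence of outliers rather than removing the causes listed in the question.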
And, as best I know, performance.now() is just getting a high resolution timer from the OS that's relative to some epoch time. It has no idea when your thread is and isn't running on a CPU core and wouldn't have any way to adjust for that. It just gets a high resolution timestamp which you can compare to some other high resolution timestamp. The time elapsed is not CPU time for your thread. It's just clock time elapsed.
> Is any of these a likely reason, or is it something else?
Yes, they all sound likely.
> is there a way to make a more accurate measurement for the algorithm to use?
No, sub-millisecond time measurements are generally not reliable, and almost never a good idea. (Doesn't matter whether a timing API promises micro/nanosecond precision or whatever; chances are that (1) it doesn't hold up in practice, and (2) trying to rely on it creates more problems than it solves. You've just found an example of that.)
Even measuring milliseconds is fraught with peril. I once investigated a case of surprising performance, where it turned out that on that particular combination of hardware and OS, after 16ms of full load the CPU ~tripled its clock rate, which of course had nothing to do with the code that appeared to behave weirdly.
EDIT to reply to edited question:
> The question now is
> if there's an existing way to get actual CPU time
No.
> is this a feature request for node/v8
No, because...
> unless the node process doesn't have enough information from the OS to provide this
...yes.
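For coarser intervals there is one relevant API: Node's `process.cpuUsage()` reports the user and system CPU time accumulated by the whole process, in microseconds. It covers every thread of the process, its real granularity depends on the OS, and it is not suited to timing a single trivial line. A minimal sketch comparing it with wall-clock time follows; the names `measure` and `busyWork` are illustrative, and a recent Node.js with a global `performance` is assumed.

```javascript
// Run fn once and report both wall-clock time and process CPU time, in ms.
function measure(fn) {
  const wallStart = performance.now();
  const cpuStart = process.cpuUsage();
  fn();
  const cpuDelta = process.cpuUsage(cpuStart); // usage since cpuStart, in microseconds
  const wallMs = performance.now() - wallStart;
  return { wallMs, cpuMs: (cpuDelta.user + cpuDelta.system) / 1000 };
}

// Illustrative CPU-bound workload, long enough to dwarf the timer granularity.
function busyWork() {
  let x = 0;
  for (let i = 0; i < 5e7; i++) x += Math.sqrt(i);
  return x;
}

const measured = measure(busyWork);
console.log(measured);
```

If the process was descheduled during the run, `wallMs` grows while `cpuMs` should not, which is the distinction the question is after.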

Measuring a feature's share of a web service's execution time

I have a piece of code that includes a specific feature that I can turn on and off. I want to know the execution time of the feature.
I need to measure this externally, i.e. by simply measuring execution time with a load test tool. Assume that I cannot track the feature's execution time internally.
Now, I execute two runs (on/off) and simply assume that the difference between the resulting execution times is my feature's execution time.
I know that it is not entirely correct to do this as I'm looking at two separate runs that may be influenced by networking, programmatic overhead, or the gravitational pull of the moon. Still, I hope I can assume that the result will still be viable if I have a sufficiently large number of requests.
Now for the real question. I do the above using the average response time. Which is not perfect, but more or less ok.
My question is, what if I now use a percentile (say, 95th) instead?
Would my imperfect subtract-A-from-B approach become significantly more imperfect when using percentiles?
I would stick with percentiles, because the average can mask problems. For example, if response times are very low during the initial phase of the test, when the load is low, and very high during the main phase, when the load is immense, the arithmetic mean will give you okay-ish values, whereas the 95th percentile tells you that 95% of requests completed in X or less (and the slowest 5% took longer).
More information: Understanding Your Reports: Part 3 - Key Statistics Performance Testers Need to Understand
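One thing to keep in mind when subtracting percentiles: unlike means, percentiles are not additive, so the difference between two 95th percentiles describes how the tail moved, not the feature's typical cost. A small sketch with made-up response times (in ms) illustrates this; the helper names and the data are assumptions for illustration only.

```javascript
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Nearest-rank percentile (p in (0, 100]).
function percentile(xs, p) {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up data: the feature usually adds 10 ms, but two requests in twenty are slow.
const featureOff = Array(20).fill(100);
const featureOn = [...Array(18).fill(110), 400, 400];

const meanDelta = mean(featureOn) - mean(featureOff);
const p95Delta = percentile(featureOn, 95) - percentile(featureOff, 95);
console.log({ meanDelta, p95Delta }); // { meanDelta: 39, p95Delta: 300 }
```

Neither number equals the typical 10 ms cost here: the mean is pulled up by the two slow requests, and the percentile difference reflects only those slow requests.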

JMeter - how to get a more randomized effect?

I need to simulate "real traffic" on a web farm. In other words, I need to generate high peaks, but also periods with fewer or even no HTTP requests (hits) at all. The reason is to test some automated mechanisms for adding and removing CPU and memory on the web servers themselves (that is another story). That is why I need "totally random" scenarios in which I have load, but also periods with little or no traffic (so I can add or reduce compute power).
This is what I get now. As you can see, I always have some average load: it is always around some number of hits, even if I change from 10 to 100 threads. The results always hover around some average value. There are no periods of lower or higher traffic separated by 10+ minutes or so, only by a few seconds.
Current situation
I would like to get "higher" variation in hits/requests, with some time breaks in between.
Situation that I want: i.stack.imgur.com/I4LhU.png
I tried several timers without success, and I do not want to use "Ultimate Thread Group" and similar components, because I want the test to be totally random and not predefined with time breaks and pause periods (thread delays). I would like a test that is totally randomized by itself, one that could, for example, generate from 1 to 100 users per XY time.
This is my current JMeter setup: i.stack.imgur.com/I4LhU.png
I do not know if I am missing some parameter in the current setup or whether there is a totally different way to do this.
Thanks a lot!
If this is something you really want (I strongly believe that a test needs to be repeatable, not random), I would suggest using the Constant Throughput Timer for this. Despite the word "Constant" in its name, you can use a function or a variable there, for instance __Random(), and you will get different, controllable "spikes" each iteration.
Moreover, you can put a __P() function there and change its value via the Beanshell Server while the test is running.
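As a sketch of what the timer's throughput field might contain (the numeric ranges and the property name `throughput` are made-up values for illustration, not recommendations):

```
Constant Throughput Timer
  Target throughput (in samples per minute): ${__Random(1,6000)}

or, driven by a JMeter property that can be changed while the test runs:
  Target throughput (in samples per minute): ${__P(throughput,600)}
```

How often the function is re-evaluated, and therefore how long each "spike" lasts, is worth checking on a short trial run before relying on it.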

Parallel MonteCarlo: reproducibility or real randomness?

I'm preparing for a college exam in parallel computing.
The main goal is to speed up, as much as possible, a Monte Carlo simulation of electron drift in the Earth's magnetic field.
I've already developed something with two layers of parallelization:
MPI, to make the code run on several machines
OpenMP, to run parallel simulations inside a single computer
Now comes the question: I would like to keep task execution on demand.
The fastest computer must be able to execute more work than the slower ones.
The problem is partitioned via a master-worker cycle, so achieving this is not really a struggle.
Since the number of tasks (a task is a block of n electrons to simulate) executed by a worker is not defined in advance, I have two roads to follow:
every thread in every worker has its own RNG, initialized with a randomly generated seed (different generation method). The load imbalance of the cluster will change the results, but with this approach the result is as random as possible.
every electron has its own seed, which guarantees reproducibility of the simulation regardless of which worker runs a given task. This requires a better RNG.
Let's poll about this. What's your suggestion?
Have fun
gf
What to poll about here?
Clearly, only approach #2 is feasible. Each source particle starts with its own stable seed. That makes the results reproducible AND debuggable (for lack of a better word).
The well-known Monte Carlo code MCNP5+ has used this scheme to good effect, and it runs on multi-core machines and over MPI. To implement it you'll need an RNG with a fast skip-ahead (a.k.a. leapfrog or discard) feature, and there are quite a few of them. They are based on fast exponentiation; see the paper by F. Brown, "Random Number Generation with Arbitrary Stride", Trans. Am. Nucl. Soc. (Nov. 1994). Basically, skip-ahead takes O(log N) time with Brown's approach.
The simplest version, which is about the same as the MCNP5 one, is here: https://github.com/Iwan-Zotow/LCG-PLE63
A more complicated (and slower, but higher-quality) RNG is here: http://www.pcg-random.org/
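To make the skip-ahead idea concrete, here is a minimal sketch of an O(log N) jump for a 64-bit linear congruential generator, using repeated squaring of the affine map x -> a*x + c. The multiplier and increment are Knuth's MMIX constants, chosen only for illustration; they are not the MCNP5 constants, and the stride value and the helper names (`next`, `skipAhead`, `particleSeed`) are likewise assumptions.

```javascript
// 64-bit LCG: x -> (A*x + C) mod 2^64, using BigInt for exact arithmetic.
const A = 6364136223846793005n; // Knuth MMIX multiplier (illustrative choice)
const C = 1442695040888963407n; // Knuth MMIX increment (illustrative choice)
const MASK = (1n << 64n) - 1n;

const next = (x) => (A * x + C) & MASK;

// Jump n steps ahead in O(log n) by repeatedly squaring the affine map.
function skipAhead(x, n) {
  let g = 1n, c = 0n; // accumulated map: x -> g*x + c (starts as identity)
  let h = A, f = C;   // map for the current power-of-two number of steps
  while (n > 0n) {
    if (n & 1n) {
      g = (g * h) & MASK;
      c = (c * h + f) & MASK;
    }
    f = (f * (h + 1n)) & MASK; // compose the map with itself: f' = f*(h+1)
    h = (h * h) & MASK;        //                              h' = h*h
    n >>= 1n;
  }
  return (g * x + c) & MASK;
}

// Each particle starts a fixed stride further along the same sequence,
// so its random numbers do not depend on which worker simulates it.
const STRIDE = 152917n; // numbers reserved per particle (illustrative)
const particleSeed = (baseSeed, particleIndex) =>
  skipAhead(baseSeed, STRIDE * BigInt(particleIndex));

console.log(particleSeed(12345n, 7));
```

With this, a worker that receives particle k can compute its starting state directly from the base seed, without generating the numbers consumed by particles 0 to k-1.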

Progress bar and multiple threads, decoupling GUI and logic - which design pattern would be the best?

I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
The graph can be scrolled, in which case new portions of data need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series: one for the dataset update and one for the graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that reports global progress. At the moment I can think of MVC and Observer/Observable, but it's a little bit blurry :) Maybe somebody could point me in the right direction, thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed, not just the "amount" of data but also the "type" of data, and B) the number of CPUs: 2 of the steps scaled extremely well with an increasing number of CPUs, 2 steps ran in 2 threads, and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on execution time of each step than number of cores.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function analyzed both the data sets under analysis and the number of cpus. Based on run-time data from my own 4 core machine, each function basically returned the number of milliseconds it was expected to take, on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now given this information, you can effectively know what percentage of the total execution time each step is supposed to take. Now if you say step1 is supposed to take 40% of the execution time, you basically need to find out how to emit 40 1% events from that algorithm. Say the for-loop is processing 100,000 items, you could probably do:
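// Emit roughly one progress event per percentage point allotted to this step.
int itemsPerEvent = numItems / percentageOfTotalForThisStep;
if (itemsPerEvent == 0) itemsPerEvent = 1; // fewer items than percent: avoid modulo by zero
for (int i = 0; i < numItems; i++) {
    // ... do the actual processing for item i ...
    if ((i + 1) % itemsPerEvent == 0) emitProgressEvent(); // report after the work is done
}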
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratio (which is the only really important thing here), so you can see how your local hardware runs differently from the one I had.
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker, and in some cases - such as yours - numeric properties representing the progress and expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based; if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need some way of updating its owning Operation with its current progress, using whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
Obviously there's a ton of other considerations/work that needs be done in actually implementing this, but I hope this gives you the gist of it.
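The gist of it might look like the sketch below. It is not tied to any particular GUI or threading framework: the class names `Operation` and `ProgressSummary`, the work units and the callback wiring are all assumptions for illustration, and in a real application `report()` would be called from whatever notification mechanism the worker threads use.

```javascript
// Represents one worker's task in the main thread.
class Operation {
  constructor(name, expectedWork) {
    this.name = name;
    this.expectedWork = expectedWork; // e.g. predicted milliseconds or item count
    this.completedWork = 0;
    this.listeners = [];
  }
  onProgress(listener) {
    this.listeners.push(listener);
  }
  // Called whenever the worker reports how much work it has finished so far.
  report(completedWork) {
    this.completedWork = Math.min(completedWork, this.expectedWork);
    for (const listener of this.listeners) listener(this);
  }
}

// Observes all operations and exposes one global fraction for a single progress bar.
class ProgressSummary {
  constructor(onChange) {
    this.operations = [];
    this.onChange = onChange;
  }
  add(operation) {
    this.operations.push(operation);
    operation.onProgress(() => this.onChange(this.fraction()));
  }
  fraction() {
    const total = this.operations.reduce((sum, op) => sum + op.expectedWork, 0);
    const done = this.operations.reduce((sum, op) => sum + op.completedWork, 0);
    return total === 0 ? 0 : done / total;
  }
}

// Usage: two series with different expected amounts of work.
const summary = new ProgressSummary((f) => console.log(`${Math.round(f * 100)}%`));
const datasetUpdate = new Operation("dataset update", 300);
const graphUpdate = new Operation("graph update", 100);
summary.add(datasetUpdate);
summary.add(graphUpdate);
datasetUpdate.report(150); // prints "38%"
graphUpdate.report(100);   // prints "63%"
```

Weighting each Operation by its expected work means a long-running series moves the bar proportionally more than a short one.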
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer such design questions, I first look at similar or analogous problems in other applications and at how they're solved. So I would suggest you do some research by considering other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.
