Get more deterministic (short) sleeps - linux

As a student project, we're building a robot which should run through a defined course and pick up a wooden cube. The core of it is a single-board computer running Debian with an ARM9 at 250MHz, so there is more than enough processing power for the controller. Additionally, it does some image processing (not when driving, only when it stops), which is why we don't use a simple microcontroller without an OS.
The whole thing works quite well at the moment, but there is one problem: the main control loop executes without any delays and achieves cycle frequencies of over 1kHz. This is more than enough; 100Hz would suffice as well. But every now and then there is a single cycle which takes 100ms or more, which can greatly disturb the controller.
I suspect that some other tasks cause this delay when the scheduler notices that they haven't had any CPU time for an extended period.
So what I'd like most is the following: a short sleep of maybe 5ms in the controller's main loop which really only takes 5ms but gives some processor time to the rest of the system. I can include a delay of e.g. 500us using nanosleep, but this always takes more than 10ms to execute, so it is not really an alternative. I'd prefer something like a voluntary sleep which gives waiting tasks the opportunity to do something but returns as quickly as possible.
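For illustration, the nanosleep() call described here boils down to something like this (a simplified sketch; the 500 us value is the one mentioned above, and on a stock kernel the actual delay can be far longer, which is exactly the problem):

#include <time.h>

// Request a ~500 us pause; on a non-realtime kernel the actual delay may be
// much longer (timer resolution, scheduling latency).
void short_pause()
{
    timespec req = { 0, 500 * 1000 };   // 0 s, 500,000 ns
    nanosleep(&req, nullptr);           // early wakeup by a signal is ignored here
}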
The system is unloaded otherwise, so there is nothing which could really need a lot of processing for a long time.
Is there any way to do this in userspace, i.e. without having to stick to things like RTAI?
Thanks,
Philipp

I suggest that you stick to real-time interfaces when doing motor control; I've seen a 1000 kg truck slam into a wall during a demo because the OS decided to think about something else for once... :-)
If you want to stay away from RTAI (but you shouldn't), a (perhaps) quick fix is to slap an Arduino board on for the actual driving and keep the Linux board for the high-level processing.
To fix the "wall problem", implement a watchdog on the driver board that stops the thing if no command has arrived for a while...
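A minimal sketch of such a watchdog in Arduino-style code (stopMotors() is a hypothetical placeholder for whatever the driver firmware uses to cut the outputs, and the 200 ms timeout is just an example):

// Watchdog sketch for the driver board (Arduino-style; names are placeholders).
const unsigned long COMMAND_TIMEOUT_MS = 200;   // assumed timeout
unsigned long lastCommandMillis = 0;            // refreshed whenever a drive command arrives

void stopMotors() { /* placeholder: set all motor outputs to neutral */ }

void setup() { /* initialise serial link and motor outputs as usual */ }

void loop()
{
    // ... read incoming commands here and update lastCommandMillis ...
    if (millis() - lastCommandMillis > COMMAND_TIMEOUT_MS) {
        stopMotors();   // no command for a while: assume the Linux board hung
    }
}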

A real-time problem needs a real-time OS, because a real-time OS is predictable. You can set task priorities such that the tasks that need to produce results at specified times don't get preempted by tasks that need processing power but are not time-constrained.

OK, I've found a solution which makes it better, although not perfect. There is another thread which explains the sched_setscheduler() function. My init code now looks like this:
// Requires <sys/mman.h> for mlockall(), <sched.h> for sched_setscheduler(),
// and <iostream> for cerr/endl.

// Lock memory to reduce the probability of high latency due to paging
int rtres = mlockall(MCL_CURRENT | MCL_FUTURE);
if (rtres) {
    cerr << "WARNING: mlockall() failed: " << rtres << endl;
}

// Set a real-time scheduling policy to get more time-deterministic behaviour
// (on failure, sched_setscheduler() returns -1 and sets errno)
struct sched_param rtparams;
rtparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
rtres = sched_setscheduler(0, SCHED_FIFO, &rtparams);   // 0 = this process
if (rtres) {
    cerr << "WARNING: sched_setscheduler() failed: " << rtres << endl;
}
Additionally, I've removed the short sleeps from the main loop. The process now eats all available processing power and is (obviously) unresponsive to any actions from the outside world, such as console keystrokes, but this is OK for the task at hand.
The mainloop stats show that most iterations take less than a millisecond to complete, but there are still a few (every 1000th or so) which need approx. 100ms. This is still not good, but at least there are no more delays which are even longer.
Since it is "only" a student project, we can live with that and note it as a candidate for futher improvement.
Anyway, thanks for the suggestions. For next time, I'd know better how to cope with realtime requirements and take an RT OS from the beginning.
Regards,
Philipp

We are running 100 Hz loops on an ARM7 board at work, using a standard Linux with the RT patch, which allows (almost) all locks in the kernel to be preempted. Combining this with a high-priority thread gives us the necessary performance in both kernel and user space.
As the only things you need to do are apply the patch and configure the kernel for full preemption, it's pretty easy to use as well: no need to change anything in the software architecture, although I'm not familiar enough with Debian to say whether the patch will apply cleanly to its kernel.
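For what it's worth, the user-space side of such a setup might look roughly like this (a sketch assuming a PREEMPT_RT or similarly configured kernel; the SCHED_FIFO priority and the 10 ms period are example values, and error handling is omitted):

#include <pthread.h>
#include <sched.h>
#include <time.h>

// Dedicated high-priority control thread: raise it to SCHED_FIFO and run a
// fixed-period loop that sleeps to an absolute deadline so jitter does not accumulate.
void* control_thread(void*)
{
    sched_param sp{};
    sp.sched_priority = 80;                    // example priority (1..99 for SCHED_FIFO)
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

    const long period_ns = 10 * 1000 * 1000;   // 10 ms = 100 Hz, as an example
    timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        // ... one controller iteration ...

        next.tv_nsec += period_ns;             // advance the absolute deadline
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, nullptr);
    }
    return nullptr;
}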

Related

Random slowdowns in node.js execution

I have an optimization algorithm written in node.js that uses cpu time (measured with performance.now()) as a heuristic.
However, I noticed that occasionally some trivial lines of code would cost much more than usual.
So I wrote a test program:
const { performance } = require("perf_hooks"); // also available as a global in recent Node versions
const _ = require("lodash");

const timings = [];
while (true) {
  const start = performance.now();
  // can add any trivial line of code here, or just nothing
  const end = performance.now();
  const dur = end - start;
  if (dur > 1) {
    throw [
      "dur > 1",
      {
        start,
        end,
        dur,
        timings,
        avg: _.mean(timings),
        max: _.max(timings),
        min: _.min(timings),
        last: timings.slice(-10),
      },
    ];
  }
  timings.push(dur);
}
The measurements showed an average of 0.00003ms and a peak >1ms (with the second highest <1ms but same order of magnitude).
The possible reasons I can think of are:
1. the average timing isn't the actual time for executing the code (some compiler optimization)
2. performance.now isn't accurate somehow
3. cpu scheduling related - process wasn't running normally but still counted in performance.now
4. occasionally node is doing something extra behind the scenes (GC etc)
5. something happening on the hardware/os level - caching / page faults etc
Is any of these a likely reason, or is it something else?
Whichever the cause is, is there a way to make a more accurate measurement for the algorithm to use?
The outliers are currently causing the algorithm to misbehave, and without knowing how to resolve this issue the best option is to use a moving-average cost as the heuristic, which has its own downsides.
Thanks in advance!
------- Edit
I appreciate how performance.now() will never be accurate, but was a bit surprised that it could span 3-4 orders of magnitude (as opposed to 2 orders of magnitude or ideally 1.)
Would anyone have any idea/pointers as to how performance.now() works and thus what's likely the major contributor to the error range?
It'd be nice to know if the cause is due to something node/v8 doesn't have control over (hardware/os level) vs something it does have control over (a node bug/options/gc related), so I can decide whether there's a way to reduce the error range before considering other tradeoffs with using an alternative heuristic.
------- Edit 2
Thanks to @jfriend00 I now realize performance.now() doesn't measure the actual CPU time the node process has executed, but just the elapsed time since the process started.
The question now is
if there's an existing way to get actual CPU time
is this a feature request for node/v8
unless the node process doesn't have enough information from the OS to provide this
You're unlikely to be able to accurately measure the time for one trivial line of code. In fact, the overhead in executing performance.now() is probably many times higher than the time to execute one trivial line of code. You have to be careful that what you're measuring takes substantially longer to execute than the uncertainty or overhead of the measurement itself. Measuring very small execution times is not going to be an accurate endeavor.
1, 3 and 5 in your list are also all possibilities. You aren't guaranteed that your code gets a dedicated CPU core that is never interrupted to service some other thread in the system. On my Windows system, even when my nodejs app is the only "app" running, there are hundreds of other threads devoted to various OS services that may or may not request some time to run while my nodejs app is running and eventually get some time slice of the CPU core my nodejs app was using.
And, as best I know, performance.now() is just getting a high resolution timer from the OS that's relative to some epoch time. It has no idea when your thread is and isn't running on a CPU core and wouldn't have any way to adjust for that. It just gets a high resolution timestamp which you can compare to some other high resolution timestamp. The time elapsed is not CPU time for your thread. It's just clock time elapsed.
Is any of these a likely reason, or is it something else?
Yes, they all sound likely.
is there a way to make a more accurate measurement for the algorithm to use?
No, sub-millisecond time measurements are generally not reliable, and almost never a good idea. (Doesn't matter whether a timing API promises micro/nanosecond precision or whatever; chances are that (1) it doesn't hold up in practice, and (2) trying to rely on it creates more problems than it solves. You've just found an example of that.)
Even measuring milliseconds is fraught with peril. I once investigated a case of surprising performance, where it turned out that on that particular combination of hardware and OS, after 16ms of full load the CPU ~tripled its clock rate, which of course had nothing to do with the code that appeared to behave weirdly.
EDIT to reply to edited question:
The question now is
if there's an existing way to get actual CPU time
No.
is this a feature request for node/v8
No, because...
unless the node process doesn't have enough information from the OS to provide this
...yes.

Is it possible to guarantee a thread wakes up and runs every one second?

while (true) {
    sleep(seconds(1));
    log(get_current_time());
}
This question isn't specific to any one language.
Is it possible to guarantee that this thread records log entries exactly one second apart? From what I understand, after 1 second of sleep, the thread will be woken up and marked as runnable in the kernel but depending on which other threads are runnable and their relative priorities (or perhaps other busy work that must be done), the kernel may not schedule the thread to run until a small amount of time after. This may cause the log entries to be 1.1 seconds apart.
Can anything be done in user-level code to reduce this variance? Besides just having as few other threads as possible.
Depends what you mean by "exactly." No clock can ever be "exact," and sleep(1) only guarantees to sleep for at least one second. It could sleep longer, but on modern Linux, probably only a tiny bit longer.
If what you really want is, for the long-term average number of iterations to be one per second, then you need to do this (pseudo-code because I don't have access to a real computer right now):
dueDate = get_current_time();
while (true) {
    sleepInterval = dueDate - get_current_time();
    if (sleepInterval > 0) {
        sleep(sleepInterval);
    }
    doWhateverNeedsToBeDone();
    dueDate += oneSecond;
}
This prevents errors in the timing from accumulating. The sleep() call might not wake at exactly the right time, but the next dueDate always is incremented by exactly one second.
It helps if your get_current_time()-like function and your sleep()-like function both work on a fine grain. Linux has clocks that return time in nanoseconds, and it has the ability to sleep for a given number of nanoseconds, but I forget the actual names of those functions. You can find them easily enough in Google.
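For reference, the functions being alluded to are clock_gettime() and clock_nanosleep(); a direct translation of the pseudo-code above using them might look like this (a sketch, using CLOCK_MONOTONIC so that wall-clock adjustments don't disturb the schedule):

#include <time.h>

void doWhateverNeedsToBeDone();   // as in the pseudo-code above

void run_every_second()
{
    timespec due;
    clock_gettime(CLOCK_MONOTONIC, &due);

    for (;;) {
        // Sleep until the absolute due date; if it is already in the past,
        // clock_nanosleep() returns immediately, so no explicit check is needed.
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &due, nullptr);

        doWhateverNeedsToBeDone();

        due.tv_sec += 1;          // the next due date is exactly one second later
    }
}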
If that trick doesn't give you enough accuracy using regular Linux system calls, then you may need to run a "real-time" enabled version of Linux, and use privileged system calls to enable real-time scheduling for your process.

Explain surprising number of timers and finding their origin

I have a CPU leak somewhere and am trying to find out the origin (the RAM increases a bit too but not as fast).
I have collected some data through dotnet-counters and found out that the number of timers in the given process keeps increasing. For instance, when 46% of the CPU was used, it was reporting around 450 timers. Note that this count only increases, slowly but surely, like my CPU usage. Also note that this was pretty much at a time when the service was idle, with no customers online.
This leads to 2 questions:
1) Is it normal? Should I continue on this track? I find it odd to have this many timers.
2) In my own code, I never use the Timer class; I use the Stopwatch class once, but that's it.
So what would be the best way to find the class/library that uses those timers?
1) There was indeed something fishy and it is definitely not normal.
I had a class spawning hundreds of threads and keeping them alive to listen to incoming calls.
Long story short, that many timers is not something normal.
2) The Timer instances were coming from somewhere deep inside a library. I couldn't find an easy way to locate them.
The best option is to use dotnet-trace collect. https://www.speedscope.app/ helped me a lot, and it is possible to generate the traces for speedscope directly from dotnet-trace.

How to measure multithreaded process time on a multitasking environment?

Since I am running performance evaluation tests of my multithreaded program on a (preemptive) multitasking, multicore environment, the process can get swapped out periodically. I want to compute the latency, i.e., only the duration when the process was active. This will allow me to extrapolate how the performance would be on a non-multitasking environment, i.e., where only one program is running (most of the time), or on different workloads.
Usually two kinds of time are measured:
The wall-clock time (i.e., the time since the process started) but this includes the time when the process was swapped out.
The processor time (i.e., sum total of CPU time used by all threads) but this is not useful to compute the latency of the process.
I believe what I need is the makespan of the individual threads' busy times, which can be different from the maximum CPU time used by any thread due to the task dependency structure among the threads. For example, in a process with 2 threads, thread 1 is heavily loaded in the first two-thirds of the runtime (for CPU time t) while thread 2 is loaded in the last two-thirds of the runtime of the process (again, for CPU time t). In this case:
wall-clock time would return 3t/2 + context switch time + time used by other processes in between,
max CPU time of all threads would return a value close to t, and
total CPU time is close to 2t.
What I hope to receive as output of measure is the makespan, i.e., 3t/2.
Furthermore, multi-threading brings indeterminacy of its own. This issue can probably be taken care of by running the test multiple times and summarizing the results.
Moreover, the latency also depends on how the OS schedules the threads; things get more complicated if some threads of a process wait for the CPU while others run. But let's forget about this.
Is there an efficient way to compute/approximate this makespan time? For providing code examples, please use any programming language, but preferably C or C++ on linux.
PS: I understand this definition of makespan is different from what is used in scheduling problems. The definition used in scheduling problems is similar to wall-clock time.
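For what it's worth, since C or C++ on Linux is preferred: each thread's consumed CPU time can be read from its per-thread CPU clock, and combining such readings with wall-clock timestamps of when each thread is busy is one way to approximate the makespan. A minimal sketch of that building block (thread_cpu_seconds() is just an illustrative helper name):

#include <pthread.h>
#include <time.h>

// Returns the CPU time (in seconds) consumed so far by the given thread,
// or a negative value if it cannot be read.
double thread_cpu_seconds(pthread_t thread)
{
    clockid_t cid;
    if (pthread_getcpuclockid(thread, &cid) != 0)
        return -1.0;

    timespec ts;
    if (clock_gettime(cid, &ts) != 0)
        return -1.0;

    return ts.tv_sec + ts.tv_nsec / 1e9;
}

// A thread can also sample its own CPU time directly with
// clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts).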
Reformulation of the Question
I have written a multi-threaded application which takes X seconds to execute on my K-core machine.
How do I estimate how long the program will take to run on a single-core computer?
Empirically
The obvious solution is to get a computer with one core, and run your application, and use Wall-Clock time and/or CPU time as you wish.
...Oh, wait, your computer already has one core (it also has some others, but we won't need to use them).
How to do this will depend on the Operating System, but one of the first results I found from Google explains a few approaches for Windows XP and Vista.
http://masolution.blogspot.com/2008/01/how-to-use-only-one-core-of-multi-core.html
Following that you could:
Assign your application's process to a single core's affinity (you can also do this in your code; see the sketch below).
Start your operating system only knowing about one of your cores. (and then switch back afterwards)
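On Linux, which the original question prefers, the in-code variant of the first option looks roughly like this (a sketch using sched_setaffinity; error handling omitted):

#define _GNU_SOURCE
#include <sched.h>

// Restrict the calling process (pid 0 = self) to CPU core 0.
// Threads created afterwards inherit this affinity mask.
bool pin_to_core_zero()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}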
Independent Parallelism
Estimating this analytically requires knowledge about your program, the method of parallelism, etc.
As a simple example, suppose I write a multi-threaded program that calculates the ten-billionth decimal digit of pi and the ten-billionth decimal digit of e.
My code looks like:
public static void Main()
{
    Task t1 = new Task(calculatePiDigit);
    Task t2 = new Task(calculateEDigit);
    t1.Start();
    t2.Start();
    Task.WaitAll(t1, t2);
}
And the happens-before graph consists of two completely independent chains, one per task; clearly these two tasks are independent.
In this case
Time calculatePiDigit() by itself.
Time calculateEDigit() by itself.
Add the times together.
2-Stage Pipeline
When the tasks are not independent, you won't be able to just add the individual times together.
In this next example, I create a multi-threaded application to take 10 images, convert them to grayscale, and then run a line detection algorithm. For some external reason, the images are not allowed to be processed out of order. Because of this, I create a pipeline pattern.
My code looks something like this:
ConcurrentQueue<Image> originalImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> grayscaledImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> completedImages = new ConcurrentQueue<Image>();
public static void Main()
{
    // PipeLineStage is illustrative: it drains its input queue, applies the
    // given function and pushes the result to its output queue.
    PipeLineStage p1 = new PipeLineStage(originalImages, grayScale, grayscaledImages);
    PipeLineStage p2 = new PipeLineStage(grayscaledImages, lineDetect, completedImages);
    p1.Start();
    p2.Start();
    originalImages.Enqueue(image1);
    originalImages.Enqueue(image2);
    //...
    originalImages.Enqueue(image10);
    originalImages.Enqueue(CancellationToken);   // sentinel marking the end of the input
    Task.WaitAll(p1, p2);
}
A data-centric happens-before graph would chain each image through the two pipeline stages in order.
If this program had been designed as a sequential program to begin with, it would be more efficient, for cache reasons, to take each image through both stages and move it to completed before moving on to the next image.
Anyway, we know that GrayScale() will be called 10 times and LineDetection() will be called 10 times, so we can just time each independently and then multiply them by 10.
But what about the costs of pushing/popping/polling the ConcurrentQueues?
Assuming the images are large, that time will be negligible.
If there are millions of small images, with many consumers at each stage, then you will probably find that the overhead of waiting on locks, mutexes, etc, is very small when a program is run sequentially (assuming that the amount of work performed in the critical sections is small, such as inside the concurrent queue).
Costs of Context Switching?
Take a look at this question:
How to estimate the thread context switching overhead?
Basically, you will have context switches in multi-core environments and in single-core environments.
The overhead to perform a context switch is quite small, but they also occur very many times per second.
The danger is that the cache gets fully disrupted between context switches.
For example, ideally:
image1 gets loaded into the cache as a result of doing GrayScale
LineDetection will run much faster on image1, since it is in the cache
However, this could happen:
image1 gets loaded into the cache as a result of doing GrayScale
image2 gets loaded into the cache as a result of doing GrayScale
now pipeline stage 2 runs LineDetection on image1, but image1 isn't in the cache anymore.
Conclusion
Nothing beats timing on the same environment it will be run in.
Next best is to simulate that environment as well as you can.
Regardless, understanding your program's design should give you an idea of what to expect in a new environment.

Speed Up with multithreading

I have a parse method in my program which first reads a file from disk and then parses the lines and creates an object for every line. For every file, a collection with the objects from its lines is saved afterwards. The files are about 300MB.
This takes about 2.5-3 minutes to complete.
My question: can I expect a significant speed-up if I split the tasks up so that one thread just reads the files from disk, another parses the lines and a third saves the collections? Or would this maybe slow down the process?
How long does it typically take a modern notebook hard disk to read 300MB? I think the bottleneck is the CPU in my task, because when I execute the method one CPU core is always at 100% while the disk is idle more than half the time.
greetings, rain
EDIT:
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
        {
            return null;
        }
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(line);
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has Hyper-Threading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the tools for performance profiling are not available in this edition (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I changed the code now to use threads. It looks now like this:
foreach (string str in checkedListBoxImport.CheckedItems)
{
    toImport.Add(str);
}

for (int i = 0; i < toImport.Count; i++)
{
    // copy the string so the thread lambda does not capture the loop variable
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called from importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I have some concurrency problems I still have to track down, but at least this is much faster than before.
Thank you for your advice.
It's unlikely that you are going to get consistent metrics for laptop hard disk performance, as we have no idea how old your laptop is nor whether it is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. This is of course ignoring your operating system splitting the process over multiple cores and other oddities. If you were getting 5% CPU usage instead, you'd most likely be bottlenecked at IO.
That said, your best bet would be to create a new task for each file you are processing and send it to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or, if memory is an issue (you did say you are dealing with 300MB files, after all), however many threads the RAM available to the process allows.
Finally, to answer why you don't want a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottlenecked on CPU processing and not IO. This means that if you split your application into separate threads, your read and write threads would be starved most of the time waiting for your processing thread to finish. Additionally, even if you made them process asynchronously, you have the very real risk of running out of memory as your read thread continues to consume data that your processing thread can't keep up with.
Thus, be careful not to start each thread immediately and let them instead be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than processing. This is of course assuming you don't crash first.
It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.
That long for only 300 MB is bad.
There are different things that could be impacting performance depending on the situation, but typically reading the hard disk is still likely to be the biggest bottleneck unless you have something intense going on during the parsing, which seems to be the case here, because it only takes several seconds to read 300MB from a hard disk (unless it's badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't use the multithreading to read and write at the same time; you'll likely slow things way down due to increased seeking.
Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
UPDATE:
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect an improvement, the best use of time may be profiling and improving the calculation itself. Using multiple threads for the calculation should then speed things up significantly.
Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I was doing this as a 'fresh' requirement, (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing the disk and, initially at least, one thread for the message object assembly. Using more than one thread for message assembly means the complication of resequencing the objects after processing to prevent the output file being written out-of-order.
I would define a nice disk-friendly sized chunk class of lines and message-object array instances, say 1024 of them, create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use and greatly reduces new/dispose/malloc/free (it looks like you have a lot of this at the moment!). It improves the efficiency of the disk r/w operations because only large reads and writes are performed (except for the last chunk, which will in general be only partly filled), provides inherent flow control (the read thread cannot 'run away', because the pool will run out of chunks and the read thread will block on it until the write thread returns some chunks), and inhibits excess context switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, wait on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued, and it will process them all without any further intervention.
The processing thread gets chunks from its input queue, splits the lines up into the message objects in the chunk, and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
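For reference, such a blocking producer-consumer queue is a standard pattern; a minimal bounded sketch (shown here in C++, though the same idea applies on any platform) might look like this:

#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal bounded blocking queue: push blocks when full, pop blocks when empty.
template <typename T>
class BlockingQueue {
public:
    explicit BlockingQueue(size_t capacity) : capacity_(capacity) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return item;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> q_;
    const size_t capacity_;
};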
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter so that the event handler knows which file has been completely written. I would probably do something similar with error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of each other during chunk disk I/O by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write-thread 'out-of-order'. You would maybe need a local list and perhaps some sort of sequence-number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..
Rgds,
Martin
