How accurate is Explicit Synchronization of Schedule Tables? - autosar

I am reading up on time synchronization in AUTOSAR. Specifically, how to use global/PTP time to actually do time-sensitive work on an ECU.
The way I understand it (from the OS spec "AUTOSAR_SWS_OS"), the way to do this is to put tasks in Schedule Tables, and then synchronize the tables either implicitly or explicitly.
Implicit synchronization I understand: lower level code/hardware sorts out the synchronization of a physical clock, and then the schedule tables just use a timer based on this clock.
I'm a bit puzzled by Explicit Synchronization however: It seems the way the table is synchronized is by periodic calls to SyncScheduleTable(). This tells the scheduler "the PTP time now is X".
But wouldn't the process of retrieving the current PTP time and then updating the table (in software...) introduce error in the time sync? I would think this would take at least a few microseconds?
Is the level of synchronization not expected to be sub-microsecond in AUTOSAR?

You will always have small offsets between SW modules.
Normally you receive the PTP global time from the bus.
The STBM then takes the global time value and runs an internal timer that starts counting from the last value received from the bus. After some time, an offset builds up between the internal time and the master clock on the bus.
This offset is corrected whenever a new value is received, but small offsets will always remain.
The internal timer keeps synchronizing the schedule table, and yes, a few nanoseconds will elapse between reading the time from the HW register and passing it to the OS for synchronization.
Even after the synchronization command, OS runnable events will fire with offsets due to CPU load.
In the end, would a few nanoseconds really hurt your design? In most projects I have seen, such small offsets are acceptable.
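To get a feel for the two error sources mentioned above (drift of the internal timer between corrections, and the delay between reading the time and applying it), here is a minimal toy model. It is only a sketch: the function name and all numbers (50 ppm drift, 125 ms sync period, 2 us software latency) are assumptions chosen for illustration, not values from any AUTOSAR specification.

```python
def simulate_sync(drift_ppm, sync_period_s, read_latency_s, duration_s):
    """Return the worst-case offset (seconds) of a local clock against the master.

    Toy model: the local timer gains drift_ppm microseconds per second, and
    every sync_period_s it is set to a master timestamp that is already
    read_latency_s old by the time it is applied.
    """
    offset = 0.0   # local time minus master time
    worst = 0.0
    elapsed = 0.0
    while elapsed < duration_s:
        # Free-running phase: drift accumulates until the next sync.
        offset += drift_ppm * 1e-6 * sync_period_s
        worst = max(worst, abs(offset))
        # Sync: the local clock is set to a slightly stale master value.
        offset = -read_latency_s
        worst = max(worst, abs(offset))
        elapsed += sync_period_s
    return worst

# Hypothetical figures, for illustration only.
worst = simulate_sync(drift_ppm=50, sync_period_s=0.125,
                      read_latency_s=2e-6, duration_s=10)
print(f"worst offset: {worst * 1e6:.2f} us")
```

In this model the software latency shows up as a roughly constant bias after each correction, while the drift between corrections depends on how often the table is synchronized.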

Related

Determining latency in threads that use sleep_for using SCHED_FIFO

I have a Linux embedded system built with PREEMPT_RT (real time patch) that creates multiple SCHED_FIFO threads, with a priority of 90 each. The goal is that they execute without being preempted, but with the same priority.
Each thread does a little bit of work, then goes to sleep using std::this_thread::sleep_for() for a few milliseconds, then gets scheduled back and executes the same amount of work.
Most of the time, each thread's latency is impeccable, but once every minute or so (not at an exact regular interval) all threads stall at the same time for one second or more (instead of the few milliseconds at which they usually get called).
I have made sure power management is disabled in the kernel config and I have called mlockall() to avoid memory getting paged out, to no avail.
I have tried to use ftrace with wakeup_rt as the tracer, but the highest latency recorded was around 5ms, not nearly enough time to be the cause of the issue.
I am not sure what tool would be best to identify where the latency is coming from. Does anyone have ideas please?
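One low-tech way to narrow this down is to instrument the loop itself and log every wake-up that arrives late, so the stalls can be correlated with trace timestamps. The sketch below is in Python purely for illustration; the helper name `measure_oversleep` and its parameters are hypothetical, and the same pattern maps onto `std::chrono::steady_clock` readings around `std::this_thread::sleep_for()`.

```python
import time

def measure_oversleep(period_s, iterations, threshold_s,
                      sleep=time.sleep, clock=time.monotonic):
    """Sleep repeatedly and return (iteration, overshoot_s) for each late wake-up."""
    late = []
    for i in range(iterations):
        start = clock()
        sleep(period_s)
        # Overshoot is how much longer than requested the sleep actually took.
        overshoot = clock() - start - period_s
        if overshoot > threshold_s:
            late.append((i, overshoot))
    return late

# Short demonstration run: 50 sleeps of 2 ms, reporting wake-ups more than 1 ms late.
for i, overshoot in measure_oversleep(0.002, 50, 0.001):
    print(f"iteration {i}: woke {overshoot * 1e3:.3f} ms late")
```

The `sleep` and `clock` parameters are injectable only so the logic can be checked without real delays.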

Creating real-time thread on OSX

I'm working on an OSX application that transmits data to a hardware device over USB serial. The hardware has a small serial buffer that is drained at a variable rate and should always stay non-empty.
We have a write loop in its own NSThread that checks if the hardware buffer is full, and if not, writes data until it is. The majority of loop iterations don't write anything and take almost no time, but they can occasionally take up to a couple milliseconds (as timed with CACurrentMediaTime). The thread sleeps for 100ns after each iteration. (I know that sleep time seems insanely short, but if we bump it up, the hardware starts getting data-starved.)
This works well much of the time. However, if the main thread or another application starts doing something processor-intensive, the write thread slows down and isn't able to stream data fast enough to keep the device's queue from emptying.
So, we'd like to make the serial write thread real-time. I read the Apple docs on requesting real-time scheduling through the Mach API, then tried to adapt the code snippet from SetPriorityRealtimeAudio(mach_port_t mach_thread_id) in the Chromium source.
However, this isn't working - the application remains just as susceptible to serial communication slowdowns. Any ideas? I'm not sure if I need to change the write thread's behavior, or if I'm passing in the wrong thread policy parameters, or both. I experimented with various period/computation/constraint values, and with forcing a more consistent duty cycle (write for 100ns max and then sleep for 100ns) but no luck.
A related question: How can I check the thread's priority directly, and/or tell if it's starting off as real-time and then being demoted vs not being promoted to begin with? Right now I'm just making inferences from the hardware performance, so it's hard to tell exactly what's going on.
My suggestion is to move the thread of execution that requires the highest priority into a separate process. Apple often does this for realtime processes such as driving the built-in camera. Depending on what versions of the OS you are targeting you can use Distributed Objects (predecessor to XPC) or XPC.
You can also roll your own RPC mechanism and use standard Unix fork techniques to create a separate child process. Since your main app is the owner of the child process, you should also be able to set the scheduling priority of the process in addition to the individual thread priority within the process.
As I edit this post, I have a WWDC video playing in the background and also started a QuickTime Movie Recording task. As you can see, the real-time aspects of both those apps are running in separate XPC processes:
ps -ax | grep Video
1933 ?? 0:00.08 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
2332 ?? 0:08.94 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
XPC Services at developer.apple.com
Distributed Objects at developer.apple.com

What is the difference between busy-wait and polling?

From the Wikipedia article on Polling
Polling, or polled operation, in computer science, refers to actively sampling the status of an external device by a client program as a synchronous activity. Polling is most often used in terms of input/output (I/O), and is also referred to as polled I/O or software driven I/O.
Polling is sometimes used synonymously with busy-wait polling (busy waiting). In this situation, when an I/O operation is required the computer does nothing other than check the status of the I/O device until it is ready, at which point the device is accessed. In other words the computer waits until the device is ready.
Polling also refers to the situation where a device is repeatedly checked for readiness, and if it is not the computer returns to a different task. Although not as wasteful of CPU cycles as busy-wait, this is generally not as efficient as the alternative to polling, interrupt driven I/O.
So, when a thread doesn't use the "condition variables", will it be called "polling" for the data change or "busy waiting"?
The difference between the two is what the application does between polls.
If a program polls a device, say, every second, and does something else in the meantime if no data is available (including possibly just sleeping, leaving the CPU available for others), it's polling.
If the program continuously polls the device (or resource or whatever) without doing anything in between checks, it's called a busy-wait.
This isn't directly related to synchronization. A program that blocks on a condition variable (that should signal when a device or resource is available) is neither polling nor busy-waiting. That's more like event-driven/interrupt-driven I/O.
(But for example a thread that loops around a try_lock is a form of polling, and possibly busy-waiting if the loop is tight.)
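The three behaviours described above can be put side by side in a minimal sketch. It uses Python's `threading.Event` as a stand-in for "the device or resource is ready"; the function names are made up for illustration.

```python
import threading
import time

def busy_wait(ready):
    # Busy-wait: check continuously, doing nothing else between checks.
    while not ready.is_set():
        pass

def poll(ready, interval_s=0.01):
    # Polling: check, then give up the CPU (or do other work) until the next check.
    checks = 0
    while not ready.is_set():
        checks += 1
        time.sleep(interval_s)
    return checks

def block(ready):
    # Neither: the thread sleeps until another thread signals the event.
    ready.wait()

# Demonstration: the "device" becomes ready after roughly 50 ms.
ready = threading.Event()
threading.Timer(0.05, ready.set).start()
print("unsuccessful polls before ready:", poll(ready))
```

The busy-wait version would notice the event soonest but occupies a core the whole time; the blocking version uses no CPU while waiting.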
Suppose one has a microprocessor or microcontroller which is supposed to perform some action when it notices that a button is pushed.
A first approach is to have the program enter a loop which does nothing except look to see if the button has changed yet and, once it has, perform the required action.
A second approach in some cases would be to program the hardware to trigger an interrupt when the button is pushed, assuming the button is wired to an input that's wired so it can cause an interrupt.
A third approach is to configure a timer to interrupt the processor at some rate (say, 1000x/second) and have the handler for that interrupt check the state of the button and act upon it.
The first approach uses a busy-wait. It can offer very good response time to one particular stimulus, at the expense of totally tuning out everything else. The second approach uses an event-triggered interrupt. It will often offer slightly slower response time than busy-waiting, but will allow the CPU to do other things while waiting for I/O. It may also allow the CPU to go into a low-power sleep mode until the button is pushed. The third approach will offer a response time that is far inferior to the other two, but will be usable even if the hardware would not allow an interrupt to be triggered by the button push.
In cases where rapid response is required, it will often be necessary to use either an event-triggered interrupt or a busy-wait. In many cases, however, a polled approach may be most practical. Hardware may not exist to support all the events one might be interested in, or the number of events one is interested in may substantially exceed the number of available interrupts. Further, it may be desirable for certain conditions to generate a delayed response. For example, suppose one wishes to count the number of times a switch is activated, subject to the following criteria:
Every legitimate switch event will consist of an interval from 0 to 900us (microseconds) during which the switch may arbitrarily close and reopen, followed by an interval of at least 1.1ms during which the switch will remain closed, followed by an interval from 0 to 900us during which the switch may arbitrarily open and reclose, followed by an interval of at least 1.1ms during which the switch will be open.
Software must ignore the state of the switch for 950us after any non-ignored switch opening or closure.
Software is allowed to arbitrarily count or ignore switch events which occur outside the above required blanking interval, but which last less than 1.1ms.
The software's reported count must be valid within 1.99ms of the time the switch is stable "closed".
The easiest way to enforce this requirement is to observe the state of the switch 1,000x/second; if it is seen "closed" when the previous state was "open", increment the counter. Very simple and easy; even if the switch opens and closes in all sorts of weird ways, during the 900us preceding and following a real event, software won't care.
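A minimal sketch of that polled counter follows. It assumes the samples have already been taken once per millisecond and are handed over as a list of booleans (True meaning "closed"); the function name and the sample data are made up for illustration.

```python
def count_closures(samples):
    """Count open->closed transitions in a sequence of 1 kHz switch samples."""
    count = 0
    previous = False  # assume the switch starts out open
    for closed in samples:
        # Only an "open" sample followed by a "closed" sample counts as an event.
        if closed and not previous:
            count += 1
        previous = closed
    return count

# Two presses, each seen as a run of "closed" samples.
samples = [False, False, True, True, True, False, False, True, True, False]
print(count_closures(samples))  # prints 2
```

Because bounce lasts at most 900us and samples are 1ms apart, at most one sample can land inside a bounce interval, and whichever way it reads, the next sample falls in the stable interval, so no event is counted twice.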
It would be possible to use a switch-input-triggered interrupt along with a timer to yield faster response to the switch input, while meeting the required blanking requirement. Initially, the input would be armed to trigger the next time the switch closes. Once the interrupt was triggered, software would disable it but set a timer to trigger an interrupt after 950us. Once that timer expired, it would trigger an interrupt which would arm the switch interrupt to fire the next time the switch is "open". That interrupt would in turn disable the switch interrupt and again set the timer for 950us, so the timer interrupt would again re-enable the switch interrupt. Sometimes this approach can be useful, but the software is a lot more complicated than the simple polled approach. When the timer-based approach will be sufficient, it is often preferable.
In systems that use a multitasking OS rather than direct interrupts, many of the same principles apply. Periodic I/O polling will waste some CPU time compared with having code which the OS won't run until certain events occur, but in many cases both the event response time and the amount of time wasted when no event occurs will be acceptable when using periodic polling. Indeed, in some buffered I/O situations, periodic polling might turn out to be quite efficient. For example, suppose one is receiving a large amount of data from a remote machine via serial port: at most 11,520 bytes will arrive per second, the device will send up to 2K of data ahead of the last acknowledged packet, and the serial port has a 4K input buffer. While one could process data using a "data received" event, it may be just as efficient to simply check the port 100x/second and process all packets received up to that point. Such polling would be a waste of time when the remote device wasn't sending data, but if incoming data was expected it may be more efficient to process it in chunks of roughly 115 bytes (11,520 bytes per second divided by 100 polls per second) than to process every little piece of incoming data as soon as it comes in.

How NOHZ=ON affects do_timer() in Linux kernel?

In a simple experiment I set NOHZ=OFF and used printk() to print how often the do_timer() function gets called. It gets called every 10 ms on my machine.
However, if NOHZ=ON then there is a lot of jitter in the way do_timer() gets called. Most of the time it does get called every 10 ms, but there are times when it completely misses the deadlines.
I have researched both do_timer() and NOHZ. do_timer() is the function responsible for updating the jiffies value and is also responsible for the round-robin scheduling of the processes.
The NOHZ feature switches off the hi-res timers on the system.
What I am unable to understand is how hi-res timers can affect do_timer(). Even if the hi-res hardware is in a sleep state, the persistent clock is more than capable of executing do_timer() every 10 ms. Secondly, if do_timer() is not executing when it should, that means some processes are not getting their timeshare when they should ideally be getting it. A lot of googling does show that for many people many applications start working much better when NOHZ=OFF.
To make a long story short, how does NOHZ=ON affect do_timer()?
Why does do_timer() miss its deadlines?
First, let's understand what a tickless kernel is (NOHZ=ON, i.e. CONFIG_NO_HZ set) and what motivated its introduction into the Linux kernel in 2.6.21.
From http://www.lesswatts.org/projects/tickless/index.php,
Traditionally, the Linux kernel used a periodic timer for each CPU.
This timer did a variety of things, such as process accounting,
scheduler load balancing, and maintaining per-CPU timer events. Older
Linux kernels used a timer with a frequency of 100Hz (100 timer events
per second or one event every 10ms), while newer kernels use 250Hz
(250 events per second or one event every 4ms) or 1000Hz (1000 events
per second or one event every 1ms).
This periodic timer event is often called "the timer tick". The timer
tick is simple in its design, but has a significant drawback: the
timer tick happens periodically, irrespective of the processor state,
whether it's idle or busy. If the processor is idle, it has to wake up
from its power saving sleep state every 1, 4, or 10 milliseconds. This
costs quite a bit of energy, consuming battery life in laptops and
causing unnecessary power consumption in servers.
With "tickless idle", the Linux kernel has eliminated this periodic
timer tick when the CPU is idle. This allows the CPU to remain in
power saving states for a longer period of time, reducing the overall
system power consumption.
So reducing power consumption was one of the main motivations for the tickless kernel. As is often the case, though, lower power consumption comes at some cost in performance. On desktop computers performance is the primary concern, which is why NOHZ=OFF works well for most of them.
In Ingo Molnar's own words:
The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer
interrupts: if there is no timer to be expired for say 1.5 seconds
when the system goes idle, then the system will stay totally idle for
1.5 seconds. This should bring cooler CPUs and power savings: on our (x86) testboxes we have measured the effective IRQ rate to go from HZ
to 1-2 timer interrupts per second.
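The difference described in the two quotes can be illustrated with a toy count of timer interrupts during one idle period. This is only an arithmetic sketch with made-up function names, not a model of the actual kernel code.

```python
import math

def periodic_ticks(idle_s, hz):
    """Timer interrupts taken during an idle period with a fixed periodic tick."""
    return math.floor(idle_s * hz)

def tickless_wakeups(idle_s, timer_expiries_s):
    """Interrupts taken when the tick is stopped and only pending timers wake the CPU."""
    return sum(1 for t in timer_expiries_s if 0 <= t <= idle_s)

idle = 1.5  # seconds of idle time, as in the quoted example
print(periodic_ticks(idle, 100))      # 150 ticks at HZ=100
print(tickless_wakeups(idle, [1.5]))  # 1 wake-up, for the timer due at 1.5 s
```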
Now, let's try to answer your questions:
What I am unable to understand is how can hi-res timers affect the
do_timer ?
If a system supports high-res timers, timer interrupts can occur more frequently than the usual 10 ms on most systems, i.e. these timers try to make the system more responsive by leveraging the hardware's capabilities and firing timer interrupts even faster, say every 100us. With the NOHZ option these timer interrupts are scaled back, and hence do_timer is executed less often.
Even if hi-res hardware is in sleep state the persistent clock is more
than capable to execute do_timer every 10ms
Yes, it is capable. But the intention of NOHZ is exactly the opposite: to prevent frequent timer interrupts!
Secondly if do_timer is not executing when it should that means some
processes are not getting their timeshare when they should ideally be
getting it
As caf noted in the comments, NOHZ does not cause processes to get scheduled less often, because it only kicks in when the CPU is idle - in other words, when no processes are schedulable. Only the process accounting stuff will be done at a delayed time.
Why does do_timer miss its deadlines?
As elaborated above, this is the intended design of NOHZ.
I suggest you go through the tick-sched.c kernel sources as a starting point. Search for CONFIG_NO_HZ and try to understand the new functionality added for the NOHZ feature.
Here is one test performed to measure the Impact of a Tickless Kernel

Timestamp generated by two threads

I have two threads in my code. One thread is a generator which creates messages. A timestamp is generated before a message is transmitted. The other thread is a receiver which accepts replies from multiple clients. A timestamp is created for each reply. The two threads run at the same time.
I find that the timestamp generated by the receiver is earlier than the timestamp generated by the generator. In the correct order, the receiver's timestamp should be later than the generator's.
If I give the generator thread a high priority, this problem does not occur. But this can also hurt performance.
Is there another way to guarantee the correct order with less effect on performance? Thanks.
Based on the comment thread in the question, this is likely the effect of the optimizer. This is really a problem with the design more than anything else - it assumes that the clocks between the producer and consumer are shared or tightly synchronized. This assumption seems reasonable until you need to distribute the processing between more than one computer.
Clocks are rarely (if ever) tightly synchronized between different computers. The common algorithm for synchronizing computers is the Network Time Protocol. You can achieve very close to millisecond synchronization on the local area network but even that is difficult.
There are two solutions to this problem that come to mind. The first is to have the producer's timestamp passed through the client and into the receiver. If the receiver receives a timestamp that is later than its notion of the current time, then it simply resets the timestamp to the current time. This type of normalization allows assumptions about time being a monotonically increasing sequence to continue to hold.
The other solution is to disable optimization and hope that the problem goes away. As you might expect, your mileage may vary considerably with this solution.
Depending on the problem that you are trying to solve you may be able to provide your own synchronized clock between the different threads. Use an atomically incrementing number instead of the wall time. java.util.concurrent.atomic.AtomicInteger or one of its relatives can be used to provide a single number that is incremented every time that a message is generated. This allows the producer and receiver to have a shared value to use as a clock of sorts.
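As a rough sketch of the counter idea, here is a shared sequence number standing in for wall-clock time. It is written in Python for illustration, with a lock playing the role that `AtomicInteger.incrementAndGet()` would play in Java; the class name `SequenceClock` is made up.

```python
import threading

class SequenceClock:
    """Shared, strictly increasing counter used in place of wall-clock timestamps."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        # Increment and read under one lock so no two callers get the same value.
        with self._lock:
            self._value += 1
            return self._value

clock = SequenceClock()
sent = clock.next()      # producer stamps the outgoing message
received = clock.next()  # receiver stamps the reply
print(sent < received)   # the ordering holds by construction
```

Such a counter only orders events; it says nothing about how much real time passed between them.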
In any case, clocks are really hard to use correctly especially for synchronization purposes. If you can find some way to remove assumptions about time from distributed systems, your architectures and solutions will be more resilient and more deterministic.

Resources