Fast periodic tasks in RT Linux

What is the shortest interval at which RT Linux can execute a (real-time) periodic task?
I'm investigating hardware vs. software solutions for a scientific data acquisition app. The requirements include real-time feedback control of physiological processes at approximately 40 kHz. There are hardware solutions (using programmable DSP chips), but I'm curious whether a real-time Linux task could handle the entire problem. The task is simple: read a sample from the A/D board, perform some simple arithmetic and write a sample to the A/D board. Can RT Linux schedule this task 40k times/second or is that an unreasonable speed?
If we can perform the periodic task on the CPU, we can write the app without a hardware dependency. If not, we'll have to use a hybrid CPU/DSP system. Obviously, I'm hoping for the former.

According to http://www.ibm.com/developerworks/linux/library/l-real-time-linux/, even non-RT Linux on a decent processor can deliver a 20 μs timer interval on average, which corresponds to 50 kHz. The same article mentions that high-resolution timers in the 2.6 kernel with some RT mods can deliver 1 μs intervals, or 1 MHz. A 40 kHz task needs a 25 μs period, so I don't think it is unreasonable to expect an RT kernel to deliver 40 kHz reliably.
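One way to check this on a given machine is to run a periodic loop against absolute deadlines and record the worst wake-up lateness. The following is only a minimal user-space sketch under assumptions not in the question: the helper names (`period_ns_for_rate`, `timespec_add_ns`, `run_periodic`) are made up for illustration, the A/D access is left as a comment, and the loop uses `clock_nanosleep` with `TIMER_ABSTIME` on `CLOCK_MONOTONIC` so that errors do not accumulate from one period to the next.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Period in nanoseconds for a rate in Hz (40000 Hz gives 25000 ns). */
static long period_ns_for_rate(long rate_hz)
{
    return NSEC_PER_SEC / rate_hz;
}

/* Advance an absolute deadline by ns, keeping tv_nsec below one second. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec += 1;
    }
}

/*
 * Hypothetical helper: run `iterations` periods at `rate_hz` and return the
 * worst observed wake-up lateness in nanoseconds, or -1 on error.
 */
static int64_t run_periodic(long rate_hz, long iterations)
{
    const long period = period_ns_for_rate(rate_hz);
    struct timespec next, now;
    int64_t worst = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (long i = 0; i < iterations; i++) {
        int rc;

        timespec_add_ns(&next, period);

        /* Sleep until the absolute deadline; retry if a signal interrupts. */
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0)
            return -1;

        /* Placeholder: read a sample, do the arithmetic, write a sample. */

        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;

        int64_t late = (int64_t)(now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                     + (now.tv_nsec - next.tv_nsec);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Calling `run_periodic(40000, 400000)` would exercise the loop for about ten seconds. If the returned worst-case lateness approaches the 25 μs period, deadlines are being missed on that setup.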

Related

Vtune: Accuracy of Intel sampling drivers when vtune measurement run on a machine running other tasks

I have a recent Coffee Lake machine which is primarily used as a storage server. The average workload on each core (4 cores) is around 5-10% when running the storage server alone.
I want to run VTune measurements of a workload on this machine using the Intel sampling drivers. However, I doubt the measurements will be accurate given that the storage server application is running concurrently.
But as Intel's documents suggest, the sampling drivers get installed in the Linux kernel, so is it really the case that the measurements will be inaccurate if run concurrently with other applications? In other words, how exactly do the Intel sampling drivers work? Are they able to distinguish between the workload process and other processes running on the system?
If VTune is like the Linux perf_events subsystem that perf uses, it basically saves/restores HW event counter registers on context switch, along with the regular register state. So events like instructions and uops_retired should be unaffected. And effects on other events will be due to actual impacts, like extra cache misses.
(The basic mechanism for HW performance events is that each logical core has its own programmable perf counters that increment every time some microarchitectural event happens. If one overflows, it raises an interrupt for the driver to collect the count. Or for perf record type of functionality, perf or VTune would program them to count down so they trigger an interrupt regularly, and sample the saved user-space RIP at that point. This produces some funky effects on a superscalar out-of-order CPU, like "blaming" the instruction waiting for data, not the cache-miss load itself, for example. But the key point is that the inside-the-core events are totally per-core. The uncore / L3 cache events count stuff about shared resources like L3 cache, so they are more easily disturbed by system load.)
Another point is that if you are running something on a CPU core, Linux isn't going to want to schedule other tasks there. So your background load will tend to avoid whichever core your test is running on, leaving it able to use 100% of a single core without a lot of context switches. (Although network / disk interrupts might still be handled on that core.)
So yes, you should be able to fairly accurately measure what's actually happening in your process while it runs on a system that's not totally idle. That might be a bit different from what would happen if it were run on a fully idle system, but probably not much different. Especially if it's single-threaded, or you can limit it to fewer than all of your cores, so there's at least one left for the OS to schedule other tasks onto.

What options do I have for running recurring events on a microsecond resolution from a kernel driver?

I want to create a simulation of an actual device on an x86 Linux kernel. Part of this will involve simulating timings as closely as I can. Based on some research it seems I will need at least microsecond resolution timing. I understand that on a non-realtime system it won't be possible to get perfect timing, but I don't need perfect, just as close as I can get, perhaps by hacking around with thread scheduling / preemption options.
What I actually want to do is perform an action every interval, i.e. run some code every X µs. I've been trying to research the best ways to do this from a kernel driver, and also whether it's possible to do this reasonably accurately from user mode (keeping the above paragraph in mind). One of the first things that caught my eye was the HPET timer, which can be programmed to generate interrupts based on programmable comparators. Unfortunately, it seems it has been rather buggy on many chipsets in the past, and there's not much information on using it for anything other than obtaining a timestamp or using it as the main clock source. The Linux kernel provides an HPET driver that in the past seemed to provide both kernel and user mode interfaces, but seems to provide only a barely documented user mode interface in more recent kernel versions. I've also read about various other kernel functions and interfaces such as the hrtimer interface and the various delay functions, though I'm having a bit of trouble understanding them and whether they are suited for my purpose.
Given my current use case, what are the best options I have for running recurring events at µs resolution from, say, a kernel driver? Obviously accuracy is my biggest criterion, but ease of use would be second.
Well, it's possible to achieve your accuracy in userspace: clock_nanosleep is one ideal option, and it has both a relative and an absolute mode. Since clock_nanosleep is based on hrtimer in kernel mode, you may want to use hrtimer if you'd like to implement it in kernel space.
However, to make the timer work accurately, there are two important things worth mentioning.
You should set the timer slack of your process (either by writing a nonzero value in ns to /proc/self/timerslack_ns or via prctl(PR_SET_TIMERSLACK,...)). This value is the 'tolerance' the kernel is allowed to add to the timer's expiry time.
CPU power management also matters here. The CPU has many different C-states, each of which has a different exit latency. So you need to configure your cpuidle module to not use C-states other than C0, e.g. for an Intel CPU you could simply write 1 to /sys/devices/system/cpu/cpu$c/cpuidle/state$s/disable to disable state $s of CPU $c. Or just add idle=poll to your kernel options to keep the CPU active (in C0) while the kernel is idle. NOTE that this significantly increases the power consumption of the computer and makes the cooling fans noisier.
You can get a timer with delays under 10 microseconds if the two things mentioned above are configured correctly. There is a trade-off between latency and power consumption that you have to make.
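The timer slack part can be sketched in a few lines of C. This is only an illustration under assumptions: the helper names (`set_timer_slack_ns`, `get_timer_slack_ns`, `sleep_overshoot_ns`) are invented here, and the overshoot it measures will still depend on the C-state configuration described above. As far as I know, timer slack is not applied to threads running under a real-time scheduling policy, so this matters mainly for ordinary threads.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/prctl.h>
#include <time.h>

/* Set the calling thread's timer slack in ns; 0 on success, -1 on error.
 * Passing 0 resets the slack to the thread's default value. */
static int set_timer_slack_ns(unsigned long slack_ns)
{
    return prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0);
}

/* Current timer slack of the calling thread in ns, or -1 on error. */
static int get_timer_slack_ns(void)
{
    return prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
}

/*
 * Hypothetical helper: sleep for request_ns (must be below one second)
 * using the relative mode of clock_nanosleep and return how much longer
 * than requested the sleep took, in ns, or -1 on error.
 */
static int64_t sleep_overshoot_ns(long request_ns)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = request_ns };
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    if (clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;

    int64_t elapsed = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                    + (t1.tv_nsec - t0.tv_nsec);
    return elapsed - request_ns;
}
```

Comparing `sleep_overshoot_ns(100000)` before and after `set_timer_slack_ns(1)` gives a rough idea of how much of the delay comes from timer slack on a particular machine.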

Real-time audio on multi-core Linux system

I'm working on an audio application on a multi-core (Debian) Linux machine with an RT kernel. The audio source generation takes a lot of processing time which can't be handled by a single core, so I have three different threads:
The main portaudio thread running on core 0
Source generation 1 running on core 1
Source generation 2 running on core 2
Threads 2 and 3 write to a ringbuffer, while thread 1 reads data from the ringbuffer and sums it into the portaudio buffer.
I've tried many buffer sizes and scheduling policies, my best result was FIFO policy with audio buffer size of 16 stereo samples and ringbuffer size of 576. This solution generates more than 13ms (576/44100*1000) latency, which is too much.
I'm sure that this latency can be reduced, but I'm not an expert in Linux scheduling. Any ideas?
As long as you keep the RT priority of your process above that of any other process on the core, the policy doesn't matter.
Make sure you kick any other application out of the cores you use for RT (e.g. with isolcpus= kernel cmdline parameter). Otherwise the low-prio processes can trigger I/O which will block your RT threads. You should also assign all the interrupts your application doesn't care about to the unused core. Actually I would suggest using core0 for normal tasks and cores 1,2,3 for RT in your case, because since core0 is the boot CPU it will have to perform some special housekeeping tasks.
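Inside the application, the matching step is to pin each thread to its core and give it a real-time priority. A minimal sketch, assuming the helper names `pin_self_to_cpu` and `make_self_fifo` (invented for this example) and a priority value chosen by you; requesting SCHED_FIFO normally needs root or CAP_SYS_NICE, so the call can fail with EPERM:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Pin the calling thread to a single CPU; 0 on success, -1 on error. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Switch the calling thread to SCHED_FIFO at the given priority (1..99).
 * Returns 0 on success or the errno value on failure, typically EPERM
 * when the process lacks the required privileges.
 */
static int make_self_fifo(int priority)
{
    struct sched_param sp = { .sched_priority = priority };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return errno;
    return 0;
}
```

Each source-generation thread would call `pin_self_to_cpu()` with its own core number at startup, followed by `make_self_fifo()` with a priority above anything else allowed on that core.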
Once you partition the system as described above try latency-measurement tools to figure out what is causing delays. Googling linux rt latency trace will give you a lot of useful links. This is the basic one: http://people.redhat.com/williams/latency-howto/rt-latency-howto.txt
If it turns out some kernel processing is blocking your app you may find a solution by looking at the description of kernel threads here: http://lxr.free-electrons.com/source/Documentation/kernel-per-CPU-kthreads.txt
You should definitely be able to go below 2ms.
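To relate that target to buffer sizes, the arithmetic from the question (576/44100*1000) can be turned around. A small sketch, with hypothetical helper names and the 44100 Hz sample rate taken from the question's formula:

```c
/* Latency in milliseconds contributed by `frames` frames at `rate_hz`. */
static double buffer_latency_ms(unsigned frames, unsigned rate_hz)
{
    return 1000.0 * frames / rate_hz;
}

/* Largest whole number of frames that fits in a latency budget. */
static unsigned max_frames_for_latency(double budget_ms, unsigned rate_hz)
{
    return (unsigned)(budget_ms * rate_hz / 1000.0);
}
```

With these, `buffer_latency_ms(576, 44100)` comes to about 13.06 ms, and `max_frames_for_latency(2.0, 44100)` is 88, so staying under 2 ms means the ringbuffer has to shrink to roughly 88 frames or fewer.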

How to calculate the maximum number of threads my kernel runs on in OpenCL

When I read the device info from an OpenCL device, how can I calculate how good its processing capability is?
To add more information, assume that I want to do a very simple task on the pixels of an image. As far as I know (which may not be right!), when I run my kernel on a GPU, OpenCL runs it in parallel on different processing units in the GPU, and I can think of the kernel as the thread body which would run in parallel.
If this is correct, then for my simple task, I need to find the device that has more processing units so my kernel runs on them and hence finishes faster. Am I wrong?
How to find a suitable device based on its processing power?
Counting the number of processors in an OpenCL device is not sufficient to know how it will perform, for many reasons:
Different processors can have very different frequencies (in MHz/GHz)
Different processors can have very different architectures, e.g. out-of-order, multi-scalar, functions implemented in hardware
Different OpenCL devices have different types of memory available to them, which can affect the overall performance to a large extent
OpenCL devices could be integrated with the main CPU, on a discrete peripheral board, or across a network. The latency and the need to synchronize or copy memory will affect the performance.
Different algorithms favor different architectures, so while one device may be faster than another for one algorithm, the same may not be true for a different algorithm.
I don't recommend using the number of processors as a measure of performance. The best way is to benchmark with a specific algorithm.

How to generate a square wave by Linux kernel

I need to develop a Linux driver that generates a square wave, with a cycle of about 1ms, using the MIPS platform (this is not i386).
I tried some methods, but none of them succeeded:
Use timer/hrtimer --> but the cycle is 12ms and unstable
Cannot use additional realtime packages such as RTLinux/RTAI, because these do not support MIPS
Use the kernel-thread with a forever loop and udelay function --> It takes too much of the CPU's resource --> Performance is not acceptable
Do you aid me? Or do you thwart me...? (Please help!)
Thank you.
The Unix way would be not doing that at all. Maybe in olden times on single-task machines you would have done it like this, but now, if you don't have a hardware circuit that gives you the proper frequency, you may never succeed, because hardware timers may not have the necessary resolution, and a task of more importance can always grab your CPU time.
As FrankH said, the best solution involves relying on hardware. You should check your processor's reference manual to see if it has a timer.
I'll add this: if it happens to have an Output Compare or PWM subsystem (I'm not familiar with MIPS, but it's not at all uncommon in embedded devices) you can just write a few registers to set it all up, and then you don't need any more processor time.
It might be possible, but to get this from within Linux, the hardware must have certain characteristics:
you need a programmable timer device that can create an interrupt at sufficiently high priority that other activity by the Linux kernel (such as scheduling or even other interrupts) won't preempt / block the interrupt handler, and at sufficient granularity/frequency to meet your signal stability constraints
the "square wave" electrical line must also be programmable and the operation (register write ? memory mapped register write ? special CPU instruction ? ... ?) which switches its state must be guaranteed faster than the shortest cycle time used with the timer above (or else you could get "frequency moire")
If that's the case then your special timer device driver can toggle the line from within its high prio interrupt handler and create the square wave. Since it's both interrupt driven and separate from the normal timer interrupt sources / consumers (i.e. not shared - no latency from possibly dispatching multiple timer events per interrupt), you've got a much better chance of sufficient precision.
Since all this (the existence of a separately-programmable timer device, to start with) is hardware-specific, you need to start with the specs of your CPU/SoC/board and find out if there are multiple independent timers available.

Resources