serial port: control, predict or at least optimize read() latency - linux

I am designing an application running on an ARM9 that communicates over a serial port using Modbus. As you may know, the Modbus protocol relies on timing.
Originally I opened the port in non-blocking mode and polled it with read(). Then I learned that, while this seems to work, it is not the best solution for this environment, or even a good one. I have seen "holes" in my thread's execution of up to 60 ms (yes, milliseconds), which is far too much. I do not know whether my measurements are correct - this is simply what I see on the screen, and it is not actually the question here.
I have learned that there are a number of ways to do the "high level" reading differently:
use another way of polling, e.g. epoll/epoll_wait;
open the serial port in blocking mode and, in another thread, measure the time while read() is waiting for data (e.g. a timer somehow connected to a signal).
However, as I was told, the nature of Linux by default is not real-time, and nothing is guaranteed.
I am looking for advice and information on whether there are any tricks to make read() deliver characters received by the UART through all the software layers as quickly as possible. Controlled delays of up to 1 ms would be acceptable (9600 baud).
For example, could I write the code in such a way that, when a loop is waiting for some condition, the CPU turns away to other threads, but as soon as this thread's condition is met (I have no idea how - an interrupt? a watcher?) it switches back to this thread as quickly as it can and proceeds with it?
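For what it's worth, a minimal sketch of the epoll approach mentioned above might look like the following (the device name /dev/ttyS0 and the raw 9600 8N1 setup are placeholders; this alone does not bound scheduling latency on a stock, non-real-time kernel):

    /* Sketch: open the UART non-blocking, configure raw mode, and block in
     * epoll_wait() so the thread is woken as soon as the driver has data. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <termios.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY | O_NONBLOCK);
        if (fd < 0) { perror("open"); return 1; }

        struct termios tio;
        tcgetattr(fd, &tio);
        cfmakeraw(&tio);                 /* raw mode: no line buffering in the tty layer */
        cfsetispeed(&tio, B9600);
        cfsetospeed(&tio, B9600);
        tio.c_cc[VMIN]  = 0;             /* read() returns immediately with whatever is there */
        tio.c_cc[VTIME] = 0;
        tcsetattr(fd, TCSANOW, &tio);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

        unsigned char buf[256];
        for (;;) {
            struct epoll_event out;
            int n = epoll_wait(ep, &out, 1, -1);   /* sleeps until the UART has bytes */
            if (n < 0 && errno == EINTR) continue;
            ssize_t got = read(fd, buf, sizeof buf);
            if (got > 0) {
                /* timestamp/process the bytes here, e.g. feed a Modbus frame parser */
            }
        }
    }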

Related

Can a single thread do everything that multiple threads can do?

My 1st Question: As per the title.
I am asking this because I came across a StackExchange question: What can multiple threads do that a single thread cannot?
One of the answers given at that link states that whatever multiple threads can do, a single thread can do as well.
However, I don't think this is true. My argument is this: suppose we build a simple chat program with socket programming and run it from the command console. If the chat program is single-threaded, it is effectively half-duplex, meaning we cannot listen and talk concurrently; at any given time one party talks and the other has to listen. For both parties to be able to send and receive messages concurrently, we have to implement it with multiple threads.
My 2nd Question: Is my argument correct? Or did I miss something here, meaning a single thread can still do everything that multiple threads can?
Let's consider the computer as a whole, and more precisely treat your chat application together with the kernel (or the whole OS) as one piece we would call "the software".
Now consider that this "software" runs on a single core (say an i386).
Then you can see that, even if you wrote your chat application using threads (which is probably overkill), the software as a whole runs on a single CPU core, which means that at any given moment it performs one single thing, even if there seem to be parallel things happening.
This is nothing more than a Turing machine (using a single tape): https://en.wikipedia.org/wiki/Turing_machine
The parallelism is an illusion created by the kernel, which switches between tasks fast enough. It is just like a film, which appears to be a continuous picture on screen when there are actually only 24 images per second, and this is enough to fool our brain.
So I would say that anything a multithreaded program does, a single-threaded one could do as well.
Nevertheless, we now all use multi-core CPUs, which can to some extent be seen as multiple computers running at the same time (parallel computing), so you can probably find software that works on multiple cores but would not run single-threaded.
A good example is device drivers (in the kernel). With a poor implementation on a non-preemptive kernel, you can create a busy loop that waits for an event indefinitely. This usually deadlocks on a single core (you prevent the kernel from scheduling another task, and thus prevent the event from ever being sent). But it can work on a multi-core system, because the event is usually, eventually, sent by the other thread running on another core (hopefully).
I want to amend the existing answer (+1):
You absolutely can run multiple parallel IOs on a single thread. An IO is nothing more than a kernel data structure. When you start the IO, the OS talks to the hardware and tells it to do something. Then the CPU is free to do whatever it wants. The hardware calls back into the OS when it's done: it issues an interrupt which hijacks a CPU core to process the completion notification.
This is called async IO, and all OSes provide it.
In fact this is how socket programs with many connections run. They use async IO to multiplex large numbers of connections onto a small pool of threads.
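As an illustration of that last point, here is a minimal sketch of one thread multiplexing several already-connected sockets with select(); the sock[] array and its setup are assumed to exist elsewhere:

    #include <string.h>
    #include <sys/select.h>
    #include <unistd.h>

    void serve(int *sock, int nsock)
    {
        char buf[512];
        for (;;) {
            fd_set rset;
            FD_ZERO(&rset);
            int maxfd = -1;
            for (int i = 0; i < nsock; i++) {
                FD_SET(sock[i], &rset);
                if (sock[i] > maxfd) maxfd = sock[i];
            }
            /* The single thread sleeps here; the kernel wakes it when any
             * socket becomes readable. */
            if (select(maxfd + 1, &rset, NULL, NULL, NULL) <= 0) continue;
            for (int i = 0; i < nsock; i++) {
                if (FD_ISSET(sock[i], &rset)) {
                    ssize_t n = read(sock[i], buf, sizeof buf);
                    if (n > 0) {
                        /* handle this connection's data, then return to the loop */
                    }
                }
            }
        }
    }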
The core reason why this argument is incorrect is subtle. While it's true that with only a single thread, or single core, or single network interface, that particular component can only be handling a send or a receive at any given time, if it's not the critical path, it does not make sense to describe the overall system as half duplex.
Consider a network link that is full-duplex and takes 1ms to move a chunk of data from one end to the other. Now imagine we have a device that puts data on the link or removes data from the link but cannot do both at the same time. So long as it takes much less than 1ms to process a send or a receive, this single file path that data in both directions must go through does not somehow make the link half-duplex. There will still be data moving in both directions at the same time.
In any realistic chat application, the CPU will not be the limiting factor, so its inability to do more than one thing at a time can't make the system half-duplex. There can still be data moving in both directions at the same time.
For a typical chat application under typical load, the behavior of the system will not be significantly different whether the implementation uses a single thread or multiple threads with infinite CPU resources. The CPU just won't be the limiting factor.

Creating real-time thread on OSX

I'm working on an OSX application that transmits data to a hardware device over USB serial. The hardware has a small serial buffer that is drained at a variable rate and should always stay non-empty.
We have a write loop in its own NSThread that checks if the hardware buffer is full, and if not, writes data until it is. The majority of loop iterations don't write anything and take almost no time, but they can occasionally take up to a couple milliseconds (as timed with CACurrentMediaTime). The thread sleeps for 100ns after each iteration. (I know that sleep time seems insanely short, but if we bump it up, the hardware starts getting data-starved.)
This works well much of the time. However, if the main thread or another application starts doing something processor-intensive, the write thread slows down and isn't able to stream data fast enough to keep the device's queue from emptying.
So, we'd like to make the serial write thread real-time. I read the Apple docs on requesting real-time scheduling through the Mach API, then tried to adapt the code snippet from SetPriorityRealtimeAudio(mach_port_t mach_thread_id) in the Chromium source.
However, this isn't working - the application remains just as susceptible to serial communication slowdowns. Any ideas? I'm not sure if I need to change the write thread's behavior, or if I'm passing in the wrong thread policy parameters, or both. I experimented with various period/computation/constraint values, and with forcing a more consistent duty cycle (write for 100ns max and then sleep for 100ns) but no luck.
A related question: How can I check the thread's priority directly, and/or tell if it's starting off as real-time and then being demoted vs not being promoted to begin with? Right now I'm just making inferences from the hardware performance, so it's hard to tell exactly what's going on.
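For reference, a Mach time-constraint request along the lines described above might look like the sketch below; the period/computation/constraint values are placeholders that need tuning, and thread_policy_get() is shown only as one way to read the current policy back and check whether it "stuck":

    #include <mach/mach.h>
    #include <mach/mach_time.h>
    #include <stdio.h>

    static void make_realtime(void)
    {
        mach_timebase_info_data_t tb;
        mach_timebase_info(&tb);
        /* Convert nanoseconds to Mach absolute-time units. */
        double ns_to_abs = (double)tb.denom / (double)tb.numer;

        thread_time_constraint_policy_data_t pol;
        pol.period      = (uint32_t)(500000 * ns_to_abs);  /* 0.5 ms cycle (placeholder)   */
        pol.computation = (uint32_t)(100000 * ns_to_abs);  /* 0.1 ms of work per cycle     */
        pol.constraint  = (uint32_t)(250000 * ns_to_abs);  /* must finish within 0.25 ms   */
        pol.preemptible = TRUE;

        kern_return_t kr = thread_policy_set(mach_thread_self(),
                                             THREAD_TIME_CONSTRAINT_POLICY,
                                             (thread_policy_t)&pol,
                                             THREAD_TIME_CONSTRAINT_POLICY_COUNT);
        if (kr != KERN_SUCCESS)
            fprintf(stderr, "thread_policy_set failed: %d\n", kr);

        /* Read the policy back to confirm it was applied (get_default == FALSE
         * asks for the thread's current setting, not the system default). */
        mach_msg_type_number_t count = THREAD_TIME_CONSTRAINT_POLICY_COUNT;
        boolean_t get_default = FALSE;
        thread_policy_get(mach_thread_self(), THREAD_TIME_CONSTRAINT_POLICY,
                          (thread_policy_t)&pol, &count, &get_default);
        printf("period=%u computation=%u constraint=%u\n",
               pol.period, pol.computation, pol.constraint);
    }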
My suggestion is to move the thread of execution that requires the highest priority into a separate process. Apple often does this for realtime processes such as driving the built-in camera. Depending on what versions of the OS you are targeting you can use Distributed Objects (predecessor to XPC) or XPC.
You can also roll your own RPC mechanism and use standard Unix fork techniques to create a separate child process. Since your main app is the owner of the child process, you should also be able to set the scheduling priority of the process in addition to the individual thread priority within the process.
As I edit this post, I have a WWDC video playing in the background and also started a QuickTime Movie Recording task. As you can see, the real-time aspects of both those apps are running in separate XPC processes:
ps -ax | grep Video
1933 ?? 0:00.08 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
2332 ?? 0:08.94 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
XPC Services at developer.apple.com
Distributed Objects at developer.apple.com

How is Wait() functionality Implemented

How are wait (Eg: WaitForSingleObject) functions implemented internally in Windows or any OS?
How is it any different from a spin lock?
Does the CPU/hardware provide special functionality to do this?
Hazy view of What's Going On follows... Focusing on IO mainly.
Unix / Linux / Posix
The Unix equivalents, select(), epoll(), and similar, have been implemented in various ways. In the early days the implementations were rubbish, and amounted to little more than busy polling loops that used up all your CPU time. Nowadays it's much better, and takes no CPU time whilst blocked.
They can do this I think because the device driver model for devices like Ethernet, serial and so forth has been designed to support the select() family of functions. Specifically the model must allow the kernel to tell the devices to raise an interrupt when something has happened. The kernel can then decide whether or not that will result in a select() unblocking, etc etc. The result is efficient process blocking.
Windows
In Windows, the WaitFor... functions, when applied to asynchronous IO, are completely different. You actually have to start a thread reading from an IO device, and when that read completes (note, not starts) you have that thread return something that wakes up the WaitFor. That gets dressed up in object.beginread(), etc., but it all boils down to that underneath.
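As a sketch of that pattern (illustrative only, not how any particular library implements it), a worker thread doing a synchronous ReadFile() can signal an event that another thread blocks on; hDevice is assumed to be opened elsewhere, and synchronisation of buffer reuse is omitted for brevity:

    #include <windows.h>

    static HANDLE hDevice;      /* opened elsewhere with CreateFile() */
    static HANDLE hDataReady;   /* auto-reset event */
    static char   buffer[256];
    static DWORD  bytesRead;

    static DWORD WINAPI reader_thread(LPVOID arg)
    {
        (void)arg;
        for (;;) {
            if (!ReadFile(hDevice, buffer, sizeof buffer, &bytesRead, NULL))
                break;                       /* error handling omitted */
            SetEvent(hDataReady);            /* wake whoever is WaitFor...-ing */
        }
        return 0;
    }

    void example(void)
    {
        hDataReady = CreateEvent(NULL, FALSE, FALSE, NULL);
        CreateThread(NULL, 0, reader_thread, NULL, 0, NULL);

        /* The "main" side blocks here until the worker has completed a read. */
        if (WaitForSingleObject(hDataReady, INFINITE) == WAIT_OBJECT_0) {
            /* consume buffer[0..bytesRead) */
        }
    }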
This means you can't replicate the select() functionality in Windows for serial, pipes, etc. But there is a select function call for sockets. Weird.
To me this suggests that the whole IO architecture and device driver model of the Windows kernel can drive devices only by asking them to perform an operation and blocking until the device has completed it. There would seem to be no truly asynchronous way for the device to notify the kernel of events, and the best that can be achieved is to have a separate thread doing the synchronous operation for you. I've no idea how they've done select for sockets, but I have my suspicions.
CYGWIN, Unix on Windows
When the cygwin guys came to implement their select() routine on Windows, they were horrified to discover that it was impossible to implement for anything other than sockets. What they did was spawn a thread for each file descriptor passed to select(). That thread would poll the device, pipe, or whatever, waiting for the available data count to become non-zero, etc., and would then notify the thread that is actually calling select() that something had happened. This is very reminiscent of select() implementations from the dark days of Unix, and is massively inefficient. But it does work.
I would bet a whole 5 new pence that that's how MS did select for sockets too.
My Experience Thus Far
Windows' WaitFors... are fine for operations that are guaranteed to complete or to proceed in nice fixed stages, but are very unpleasant for operations that aren't (like IO). Cancelling an asynchronous IO operation is deeply unpleasant. The only way I've found of doing it is to close the device, socket, pipe, etc., which is not always what you want to do.
Attempt to Answer the Question
The hardware's interrupt system supports the implementation of select() because it's a way for the devices to notify the CPU that something has happened without the CPU having to poll / spin on a register in the device.
Unix / Linux uses that interrupt system to provide select() / epoll() functionality, and also incorporates purely internal 'devices' (pipes, files, etc) into that functionality.
Windows' equivalent facility, WaitForMultipleObjects() fundamentally does not incorporate IO devices of any sort, which is why you have to have a separate thread doing the IO for you whilst you're waiting for that thread to complete. The interrupt system on the hardware is (I'm guessing) used solely to tell the device drivers when a read or write operation is complete. The exception is the select() function call in Windows which operates only on sockets, not anything else.
A big clue to the architectural differences between Unix/Linux and Windows is that a PC can run either, but you get a proper IO-centric select() only on Unix/Linux.
Guesswork
I'm guessing that the reason Windows has never done a select() properly is that the early device drivers for Windows could in no way support it, a bit like the early days of Linux.
However, Windows became very popular quite early on, and loads of device drivers got written against that (flawed?) device driver standard.
If at any point MS had thought "perhaps we'd better improve on that", they would have faced the problem of getting everyone to rewrite their device drivers, a massive undertaking. So they decided not to, and implemented the separate IO thread / WaitFor... model instead. This got promoted by MS as being somehow superior to the Unix way of doing things. And now that Windows has been that way for so long, I'm guessing that there's no one in MS who perceives that Things Could Be Improved.
==EDIT==
I have since stumbled across Named Pipes - Asynchronous Peeking. This is fascinating because it would seem (I'm very glad to say) to debunk pretty much everything I'd thought about Windows and IO. The article applies to pipes, though presumably it would also apply to any IO stream.
It seems to hinge on starting an asynchronous read operation to read zero bytes. The read will not return until there are some bytes available, but none of them are read from the stream. You can therefore use something like WaitForMultipleObjects() to wait for more than one such asynchronous operation to complete.
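If I've understood the technique correctly, the pattern would be roughly the following sketch (untested here; the hPipe[] handles are assumed to have been opened with FILE_FLAG_OVERLAPPED):

    #include <windows.h>

    void wait_for_any(HANDLE *hPipe, int n)
    {
        OVERLAPPED ov[MAXIMUM_WAIT_OBJECTS] = {0};
        HANDLE     ev[MAXIMUM_WAIT_OBJECTS];

        for (int i = 0; i < n; i++) {
            ev[i] = CreateEvent(NULL, TRUE, FALSE, NULL);  /* manual-reset event */
            ov[i].hEvent = ev[i];
            /* Zero-byte overlapped read: nothing is consumed from the stream,
             * but the operation (and so the event) should not complete until
             * at least one byte is available to read. */
            ReadFile(hPipe[i], NULL, 0, NULL, &ov[i]);
        }

        DWORD which = WaitForMultipleObjects(n, ev, FALSE, INFINITE);
        if (which >= WAIT_OBJECT_0 && which < WAIT_OBJECT_0 + (DWORD)n) {
            int i = (int)(which - WAIT_OBJECT_0);
            /* Data is now waiting on hPipe[i]; issue a real ReadFile() here. */
        }
    }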
As the comment below the accepted answer recognises, this is very non-obvious in all of the Microsoft documentation that I've ever read. I wonder whether it is an unintended but useful behaviour in the OS. I've been ploughing through Windows Internals by Mark Russinovich, but I've not found anything yet.
I've not yet had a chance to experiment with this in any way, but if it does work then that means one can implement something equivalent to Unix's select() on Windows, so it must therefore be supported all the way down to the device driver level and interrupts. Hence the extensive strikeouts above...

Would handling each TCP connection in a separate thread improve latency?

I have an FTP server, implemented on top of QTcpServer and QTcpSocket.
I take advantage of the signals and slots mechanism to support multiple TCP connections simultaneously, even though I have a single thread. My code returns as soon as possible to the event loop, it doesn't block (no wait functions), and it doesn't use nested event loops anywhere. That way I already have cooperative multitasking, like Win3.1 applications had.
But a lot of other FTP servers are multithreaded. Now I'm wondering if using a separate thread for handling each TCP connection would improve performance, and especially latency.
On one hand, threads add to latency because you need to start a new thread for each new connection, but on the other, with my cooperative multitasking, other TCP connections have to wait until I've returned to the main loop before their readyRead()/bytesWritten() signals can be handled.
In your current system, and ignoring file I/O time, one processor is always doing something useful if there's something useful to be done, and is waiting ready-to-go if there's nothing useful to be done. If this were a single-processor (single-core) system, you would have maximized throughput. This is often a very good design, particularly for an FTP server, where you don't usually have a human waiting on a packet-by-packet basis.
You have also minimized average latency (for a single processor system.) What you do not have is consistent latency. Measuring your system's performance is likely to show a lot of jitter -- a lot of variation in the time it takes to handle a packet. Again because this is FTP and not real-time process control or human interaction, jitter may not be a problem.
Now, however, consider that there is probably more than one processor available on your system and that it may be possible to overlap I/O time and processing time.
To take full advantage of a multi-processor(core) system you need some concurrency.
This normally translates to using multiple threads, but it may be possible to achieve concurrency via asynchronous (non-blocking) file reads and writes.
However, adding multiple threads to a program opens up a huge can-of-worms.
If you do decide to go the MT route, I'd suggest that you consider depending on a thread-aware I/O library. QT may provide that for you (I'm not sure.) If not, take a look at boost::asio (or ACE for an older, but still solid solution). You'll discover that using the MT capabilities of such a library involves a considerable investment in learning time; however as it turns out the time to add on multithreading "by-hand" and get it right is even worse.
So I'd say stay with your existing solution unless you are worried about unused Processor cycles and/or jitter in which case start learning QT's multithreading support or boost::asio.
Do you need to start a new thread for each new connection? Could you not just have a pool of threads that acts on requests as and when they arrive? This should reduce some of the latency. I have to say that, in general, a multi-threaded FTP server should be more responsive than a single-threaded one. Is it possible to have an event-based FTP server?
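One minimal form of such a pool (plain POSIX rather than Qt, purely as an illustration) is a fixed set of worker threads all blocking in accept() on the same listening socket, so nothing has to be spawned when a connection arrives; handle_connection() stands in for the per-connection FTP logic:

    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NUM_WORKERS 8

    static int listen_fd;                 /* created/bound/listen()ed elsewhere */

    static void handle_connection(int fd)
    {
        /* per-connection protocol work goes here */
        close(fd);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);   /* each worker sleeps here */
            if (conn >= 0)
                handle_connection(conn);
        }
        return NULL;
    }

    void start_pool(void)
    {
        pthread_t tid[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
    }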

What is the difference between busy-wait and polling?

From the Wikipedia article on Polling
Polling, or polled operation, in computer science, refers to actively sampling the status of an external device by a client program as a synchronous activity. Polling is most often used in terms of input/output (I/O), and is also referred to as polled I/O or software driven I/O.
Polling is sometimes used synonymously with busy-wait polling (busy waiting). In this situation, when an I/O operation is required the computer does nothing other than check the status of the I/O device until it is ready, at which point the device is accessed. In other words the computer waits until the device is ready.
Polling also refers to the situation where a device is repeatedly checked for readiness, and if it is not the computer returns to a different task. Although not as wasteful of CPU cycles as busy-wait, this is generally not as efficient as the alternative to polling, interrupt driven I/O.
So, when a thread doesn't use condition variables, is it said to be "polling" for the data change or "busy waiting"?
The difference between the two is what the application does between polls.
If a program polls a device say every second, and does something else in the mean time if no data is available (including possibly just sleeping, leaving the CPU available for others), it's polling.
If the program continuously polls the device (or resource or whatever) without doing anything in between checks, it's called a busy-wait.
This isn't directly related to synchronization. A program that blocks on a condition variable (that should signal when a device or resource is available) is neither polling nor busy-waiting. That's more like event-driven/interrupt-driven I/O.
(But for example a thread that loops around a try_lock is a form of polling, and possibly busy-waiting if the loop is tight.)
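To make the distinction concrete, here is a small sketch contrasting the three behaviours, using a hypothetical data_ready flag shared between a producer and a consumer:

    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    static bool            data_ready;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    /* 1. Busy-wait: burns a whole core doing nothing but re-checking. */
    void wait_busy(void)
    {
        while (!__atomic_load_n(&data_ready, __ATOMIC_ACQUIRE))
            ;                                   /* spin */
    }

    /* 2. Polling: re-check periodically, yielding the CPU in between. */
    void wait_polling(void)
    {
        while (!__atomic_load_n(&data_ready, __ATOMIC_ACQUIRE))
            usleep(1000);                       /* do other work, or sleep 1 ms */
    }

    /* 3. Blocking on a condition variable: the thread is descheduled until the
     *    producer signals. Neither polling nor busy-waiting; closest in spirit
     *    to event/interrupt-driven I/O. */
    void wait_blocking(void)
    {
        pthread_mutex_lock(&lock);
        while (!data_ready)
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* Producer side for case 3. */
    void signal_ready(void)
    {
        pthread_mutex_lock(&lock);
        data_ready = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }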
Suppose one has a microprocessor or microcontroller which is supposed to perform some action when it notices that a button is pushed.
A first approach is to have the program enter a loop which does nothing except look to see if the button has changed yet and, once it has, perform the required action.
A second approach in some cases would be to program the hardware to trigger an interrupt when the button is pushed, assuming the button is wired to an input that can cause an interrupt.
A third approach is to configure a timer to interrupt the processor at some rate (say, 1000x/second) and have the handler for that interrupt check the state of the button and act upon it.
The first approach uses a busy-wait. It can offer very good response time to one particular stimulus, at the expense of totally tuning out everything else. The second approach uses an event-triggered interrupt. It will often offer slightly slower response time than busy-waiting, but will allow the CPU to do other things while waiting for I/O. It may also allow the CPU to go into a low-power sleep mode until the button is pushed. The third approach will offer a response time that is far inferior to the other two, but will be usable even if the hardware would not allow an interrupt to be triggered by the button push.
In cases where rapid response is required, it will often be necessary to use either an event-triggered interrupt or a busy-wait. In many cases, however, a polled approach may be most practical. Hardware may not exist to support all the events one might be interested in, or the number of events one is interested in may substantially exceed the number of available interrupts. Further, it may be desirable for certain conditions to generate a delayed response. For example, suppose one wishes to count the number of times a switch is activated, subject to the following criteria:
Every legitimate switch event will consist of an interval from 0 to 900us (microseconds) during which the switch may arbitrarily close and reopen, followed by an interval of at least 1.1ms during which the switch will remain closed, followed by an interval from 0 to 900us during which the switch may arbitrarily open and reclose, followed by an interval of at least 1.1ms during which the switch will be open.
Software must ignore the state of the switch for 950us after any non-ignored switch opening or closure.
Software is allowed to arbitrarily count or ignore switch events which occur outside the above required blanking interval, but which last less than 1.1ms.
The software's reported count must be valid within 1.99ms of the time the switch is stable "closed".
The easiest way to enforce this requirement is to observe the state of the switch 1,000x/second; if it is seen "closed" when the previous state was "open", increment the counter. Very simple and easy; even if the switch opens and closes in all sorts of weird ways, during the 900us preceding and following a real event, software won't care.
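A sketch of that sampling scheme, assuming a hypothetical switch_is_closed() pin read called from a 1 kHz periodic timer tick:

    #include <stdbool.h>

    extern bool switch_is_closed(void);   /* hypothetical raw pin read */

    static bool         last_state;       /* state seen on the previous tick */
    static unsigned int press_count;

    void switch_tick_1khz(void)
    {
        bool now = switch_is_closed();
        if (now && !last_state)            /* open -> closed edge, sampled every 1 ms */
            press_count++;
        last_state = now;
        /* Per the timing rules above, each stable (>= 1.1 ms) interval is seen by at
         * least one sample, so a real event is counted exactly once, while the sub-
         * millisecond chatter around it does not need any special handling. */
    }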
It would be possible to use a switch-input-triggered interrupt along with a timer to yield faster response to the switch input, while meeting the required blanking requirement. Initially, the input would be armed to trigger the next time the switch closes. Once the interrupt was triggered, software would disable it but set a timer to trigger an interrupt after 950us. Once that timer expired, it would trigger an interrupt which would arm the interrupt to fire the next time the switch is "open". That interrupt would in turn disable the switch interrupt and again set the timer for 950us, so the timer interrupt would again re-enable the switch interrupt. Sometimes this approach can be useful, but the software is a lot more complicated than the simple polled approach. When the timer-based approach will be sufficient, it is often preferable.
In systems that use a multitasking OS rather than direct interrupts, many of the same principles apply. Periodic I/O polling will waste some CPU time compared with having code which the OS won't run until certain events occur, but in many cases both the event response time and the amount of time wasted when no event occurs will be acceptable when using periodic polling. Indeed, in some buffered I/O situations, periodic polling might turn out to be quite efficient. For example, suppose one is receiving a large amount of data from a remote machine via serial port, at most 11,520 bytes will arrive per second, the device will send up to 2K of data ahead of the last acknowledged packet, and the serial port has a 4K input buffer. While one could process data using a "data received" event, it may be just as efficient to simply check the port 100x/second and process all packets received up to that point. Such polling would be a waste of time when the remote device wasn't sending data, but if incoming data was expected it may be more efficient to process it in chunks of roughly 1.15K than to process every little piece of incoming data as soon as it comes in.
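A sketch of that chunked polling approach, assuming fd is a non-blocking serial descriptor set up elsewhere:

    #include <time.h>
    #include <unistd.h>

    void poll_serial_100hz(int fd)
    {
        unsigned char chunk[4096];
        struct timespec ten_ms = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

        for (;;) {
            nanosleep(&ten_ms, NULL);
            ssize_t n = read(fd, chunk, sizeof chunk);   /* roughly 1.15K expected per tick */
            if (n > 0) {
                /* parse all complete packets contained in chunk[0..n) */
            }
        }
    }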
