What is the difference between busy-wait and polling? - multithreading

From the Wikipedia article on Polling:
Polling, or polled operation, in computer science, refers to actively sampling the status of an external device by a client program as a synchronous activity. Polling is most often used in terms of input/output (I/O), and is also referred to as polled I/O or software driven I/O.
Polling is sometimes used synonymously with busy-wait polling (busy waiting). In this situation, when an I/O operation is required the computer does nothing other than check the status of the I/O device until it is ready, at which point the device is accessed. In other words the computer waits until the device is ready.
Polling also refers to the situation where a device is repeatedly checked for readiness, and if it is not the computer returns to a different task. Although not as wasteful of CPU cycles as busy-wait, this is generally not as efficient as the alternative to polling, interrupt driven I/O.
So, when a thread waits for data to change without using condition variables, is that called "polling" or "busy waiting"?

The difference between the two is what the application does between polls.
If a program polls a device, say, every second, and does something else in the meantime if no data is available (including possibly just sleeping, leaving the CPU available for others), it's polling.
If the program continuously polls the device (or resource or whatever) without doing anything in between checks, it's called a busy-wait.
This isn't directly related to synchronization. A program that blocks on a condition variable (that should signal when a device or resource is available) is neither polling nor busy-waiting. That's more like event-driven/interrupt-driven I/O.
(But for example a thread that loops around a try_lock is a form of polling, and possibly busy-waiting if the loop is tight.)
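A rough way to see the three cases side by side. This is only a sketch: Python's threading.Event stands in for the condition variable, and the function names (busy_wait, poll, block, run) are invented for this example.

```python
import threading
import time

def busy_wait(flag):
    """Check the flag continuously, doing nothing between checks."""
    checks = 0
    while not flag.is_set():
        checks += 1
    return checks

def poll(flag, interval=0.01):
    """Check the flag, then sleep (or do other work) before checking again."""
    checks = 0
    while not flag.is_set():
        checks += 1
        time.sleep(interval)
    return checks

def block(flag):
    """Block until signalled; the waiting thread never checks anything itself."""
    flag.wait()
    return 0

def run(waiter, delay=0.1):
    """Set the flag after `delay` seconds; return how many checks the waiter made."""
    flag = threading.Event()
    timer = threading.Timer(delay, flag.set)
    timer.start()
    checks = waiter(flag)
    timer.join()
    return checks

if __name__ == "__main__":
    for waiter in (busy_wait, poll, block):
        print(waiter.__name__, run(waiter))
```

The busy-waiting version typically reports a very large number of checks, the polling version about ten, and the blocking version none, since it simply sleeps until it is woken.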

Suppose one has a microprocessor or microcontroller which is supposed to perform some action when it notices that a button is pushed.
A first approach is to have the program enter a loop which does nothing except look to see if the button has changed yet and, once it has, perform the required action.
A second approach, where the hardware allows it, would be to program the hardware to trigger an interrupt when the button is pushed, assuming the button is wired to an input that can generate an interrupt.
A third approach is to configure a timer to interrupt the processor at some rate (say, 1000x/second) and have the handler for that interrupt check the state of the button and act upon it.
The first approach uses a busy-wait. It can offer very good response time to one particular stimulus, at the expense of totally tuning out everything else. The second approach uses an event-triggered interrupt. It will often offer slightly slower response time than busy-waiting, but will allow the CPU to do other things while waiting for I/O. It may also allow the CPU to go into a low-power sleep mode until the button is pushed. The third approach (periodic polling) will offer a response time that is far inferior to the other two, but will be usable even if the hardware does not allow an interrupt to be triggered by the button push.
In cases where rapid response is required, it will often be necessary to use either an event-triggered interrupt or a busy-wait. In many cases, however, a polled approach may be most practical. Hardware may not exist to support all the events one might be interested in, or the number of events one is interested in may substantially exceed the number of available interrupts. Further, it may be desirable for certain conditions to generate a delayed response. For example, suppose one wishes to count the number of times a switch is activated, subject to the following criteria:
Every legitimate switch event will consist of an interval from 0 to 900us (microseconds) during which the switch may arbitrarily close and reopen, followed by an interval of at least 1.1ms during which the switch will remain closed, followed by an interval from 0 to 900us during which the switch may arbitrarily open and reclose, followed by an interval of at least 1.1ms during which the switch will remain open.
Software must ignore the state of the switch for 950us after any non-ignored switch opening or closure.
Software is allowed to arbitrarily count or ignore switch events which occur outside the above required blanking interval, but which last less than 1.1ms.
The software's reported count must be valid within 1.99ms of the time the switch is stable "closed".
The easiest way to meet these requirements is to observe the state of the switch 1,000x/second; if it is seen "closed" when the previous state was "open", increment the counter. Very simple and easy; even if the switch opens and closes in all sorts of weird ways during the 900us preceding and following a real event, software won't care.
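To illustrate why the 1,000x/second sampling is insensitive to bounce, here is a small simulation. It is a sketch under assumptions: the switch trace in EDGES is made up (bounce shorter than 900us, stable intervals longer than 1.1ms), and all names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical switch trace as (time in microseconds, closed?) pairs.
# One real press: bouncy closure 5000-5850us, stable closed until 9000us,
# bouncy release 9000-9600us, then stable open.
EDGES = [(0, False), (5000, True), (5200, False), (5400, True), (5700, False),
         (5850, True), (9000, False), (9300, True), (9600, False)]
TIMES = [t for t, _ in EDGES]

def switch_closed(t_us):
    """State of the simulated switch at time t_us."""
    return EDGES[bisect_right(TIMES, t_us) - 1][1]

def count_closures(samples):
    """Increment whenever a sample is closed and the previous one was open."""
    count = 0
    previous = False              # the switch starts open
    for closed in samples:
        if closed and not previous:
            count += 1
        previous = closed
    return count

def polled_count(offset_us, period_us=1000, duration_us=15000):
    """Count closures seen when sampling every period_us, starting at offset_us."""
    samples = [switch_closed(t) for t in range(offset_us, duration_us, period_us)]
    return count_closures(samples)

def raw_closings():
    """Number of electrical open-to-closed edges, bounce included."""
    return sum(1 for _, closed in EDGES if closed)

if __name__ == "__main__":
    print("raw edges:", raw_closings())
    print("polled counts:", {polled_count(offset) for offset in range(0, 1000, 50)})
```

Because the sampling period is longer than any bounce interval, at most one sample can land inside a bounce, so the polled count comes out as one press regardless of where the samples fall, even though the trace contains four closing edges.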
It would be possible to use a switch-input-triggered interrupt along with a timer to yield faster response to the switch input while still meeting the blanking requirement. Initially, the input would be armed to trigger the next time the switch closes. Once the interrupt was triggered, software would disable it but set a timer to trigger an interrupt after 950us. Once that timer expired, it would trigger an interrupt which would arm the switch interrupt to fire the next time the switch is "open". That interrupt would in turn disable the switch interrupt and again set the timer for 950us, so the timer interrupt would again re-enable the switch interrupt. Sometimes this approach can be useful, but the software is a lot more complicated than the simple polled approach. When the simple timer-based polled approach is sufficient, it is often preferable.
In systems that use a multitasking OS rather than direct interrupts, many of the same principles apply. Periodic I/O polling will waste some CPU time compared with having code which the OS won't run until certain events occur, but in many cases both the event response time and the amount of time wasted when no event occurs will be acceptable when using periodic polling. Indeed, in some buffered I/O situations, periodic polling might turn out to be quite efficient. For example, suppose one is receiving a large amount of data from a remote machine via a serial port, where at most 11,520 bytes will arrive per second, the device will send up to 2K of data ahead of the last acknowledged packet, and the serial port has a 4K input buffer. While one could process data using a "data received" event, it may be just as efficient to simply check the port 100x/second and process all packets received up to that point. Such polling would be a waste of time when the remote device wasn't sending data, but if incoming data was expected it may be more efficient to process it in chunks of roughly 115 bytes (11,520 bytes/second divided by 100 polls/second) than to process every little piece of incoming data as soon as it comes in.
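The chunk size per poll follows directly from the byte rate and the polling rate. A quick sketch of that arithmetic, using the figures quoted above (the helper names are made up for illustration):

```python
BYTES_PER_SECOND = 11_520        # maximum arrival rate from the example
INPUT_BUFFER_BYTES = 4 * 1024    # serial port input buffer from the example

def bytes_per_poll(polls_per_second):
    """Most data that can arrive between two consecutive polls."""
    return BYTES_PER_SECOND / polls_per_second

def slowest_safe_poll_rate(buffer_bytes=INPUT_BUFFER_BYTES):
    """Poll rate (per second) below which one interval of data could exceed the buffer."""
    return BYTES_PER_SECOND / buffer_bytes

if __name__ == "__main__":
    for rate in (100, 10):
        print(f"{rate} polls/s -> {bytes_per_poll(rate)} bytes per poll")
    print("slowest safe poll rate:", slowest_safe_poll_rate(), "polls/s")
```

At 100 polls per second each poll sees at most about 115 bytes; at 10 polls per second it would be about 1.15K, which is still well under the 4K buffer.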

Related

serial port: control, predict or at least optimize read() latency

I am designing an application running on ARM9 that communicates through a serial port using Modbus. You may know that the Modbus protocol is based on timing.
Originally I opened the port in non-blocking mode and polled it with read(). Then I learned that, while it seems to work, it is not the best, or even a good, solution for this environment. I have seen "holes" in my thread's execution of up to 60 ms (yes, milliseconds), and that is too much. I do not know if my measurements are correct - this is what I see on the screen, and it is not actually a question here.
I have learned there are a number of ways of doing the "high level" reading differently:
use another way of polling, e.g. epoll with epoll_wait;
open serial port in blocking mode, and, in another thread, measure the time while read() is waiting for the data (e.g. timer somehow connected to the signal).
However, as I was told, the nature of Linux by default is not real-time, and nothing is guaranteed.
I am looking for advice and information on whether there are any hacks to make read() get characters received from the UART through all the software layers as quickly as possible. Controlled delays of up to 1ms would be acceptable (9600 baud).
For example, is there a way to write the code so that, when the thread loops waiting for some condition, the CPU switches to other threads, but as soon as this thread's condition is met (no idea how - interrupt? watcher?) it switches back to this thread as soon as it can and proceeds with it?

Smalltalk: Can a single object block the entire system by entering an infinite loop?

Since Smalltalk scheduling is non-preemptive, processes must explicitly yield or wait on a semaphore
Does this mean that one object entering an infinite loop could stall the entire system?
the loop can be interrupted at any time. Even an atomic loop like [true] whileTrue can be interrupted before "executing" the true object
By what can it be interrupted?
It is the Virtual Machine that may interrupt the image. Under a normal execution flow, the VM is basically sending messages, one after the other. However, certain events may impact the natural flow of execution by interrupting it, if needed. While concrete examples may change from one dialect to the other, these usually correspond to OS events that need to be communicated to the image for its consideration.
An interruption may also be caused if the VM is running out of memory. In this case it will interrupt the image requesting it to do garbage collection.
Loops are interesting because they have the semantics of regular messages, so what happens is that the block of code inside the loop is evaluated (#value & friends) every time the loop repeats. So, you should think of loops as regular messages. However, this semantics is usually optimized so the re-evaluation is not explicitly requested by a Smalltalk message. In that case the VM will check for interruptions before executing the block. Thus, if you run
[true] whileTrue
before designating the object true as the current receiver (in this case, of no message) the VM will check whether there is any interrupt to pay attention to (in the same way it checks for interruptions before starting to execute any given method).
Most dialects implement some "break" keystroke that would produce a "halt" and open a debugger for the programmer to recover manual control.
Note that, depending on the dialect, an interruption may only consist of the signaling of a semaphore. This will have the effect of moving the waiting process (if any) to the ready queue of the ProcessScheduler. So, the intended "routine" may not run immediately but change to the ready state for the next time there is a process switch (at that level of priority).
The last example that comes to mind is the StackOverflow exception (no pun intended), where the VM realizes that it is running out of stack space and interrupts the image by signaling an exception.
You may also think of the #messageNotUnderstood: as an interruption generated by the VM when it realizes that an object has received a message for which it has no implementation. In this case, the natural flow will change so that the object will receive the message #messageNotUnderstood: with the actual message as the argument.
One more thing. Whether a loop may or may not stall the system depends on the priority of the process it is running. If the loop is running with low priority an interruption that awakes a process of higher priority will take precedence and be run while the loop is sent to sleep. By the same logic, if your endless loop runs in a process at a higher priority no interruption will stop it.
Yes, it is super simple to just run
[ true ] whileTrue: [ ]
and you won't be able to do anything else.
Pharo has a "ripcord": press Command + . on Mac. On Windows or Linux, use Alt or Control instead of Command. This action should halt whatever is running and allow you to intervene.

How does the kernel track which processes receive data from an interrupt?

In a preemptive kernel (say Linux), say process A makes a call to getc on stdin, so it's blocked waiting for a character. I feel like I have a fundamental misunderstanding of how the kernel then knows to wake process A and deliver the data once it's received.
My understanding is then this process can be put into a suspended state while the scheduler schedules other processes/threads to run, or it gets preempted. When the keypress happens, through polling/interrupts depending on the implementation, the OS runs a device driver that decodes the key that was pressed. However it's possible (and likely) that my process A isn't currently running. At this point, I'm confused on how my process that was blocked waiting on I/O is now queued to run again, especially how it knows which process is waiting for what. It seems like the device drivers hold some form of a wait queue.
Similarly, and I'm not sure if this is exactly related to the above, but if my browser window, for example, is in focus, it seems to receive key presses but not other windows. Does every window/process have the ability to "listen" for keyboard events even if they're not in focus, but just don't for user experience sake?
So I'm curious how kernels (or how some) keep track of what processes are waiting on which events, and when those events come in, how it determines which processes to schedule to run?
The events that processes wait on are abstract software events, such as a particular queue becoming non-empty, rather than concrete hardware events, such as interrupt 4635 occurring.
Some configuration ( perhaps guided by a hardware description like device tree ) identifies interrupt 4635 as being a signal from a given serial device with a given address. The serial device driver configures itself so it can access the device registers of this serial port, and attaches its interrupt handler to the given interrupt identifier (4635).
Once configured, when an interrupt from the serial device is raised, the lowest level of the kernel invokes this serial device's interrupt handler. In turn, when the handler sees a new character arriving, it places it in the input queue of that device. As it enqueues the character, it may notice that some process(es) are waiting for that queue to be non-empty, and cause them to be run.
That approximately describes the situation using condition variables as the signalling mechanism between interrupts and processes, as was established in UNIX-y kernels 44 years ago. Other approaches involve releasing a semaphore on each character in the queue; or replying with messages for each character. There are many forms of synchronization that can be used.
Common to all such mechanisms is that the caller chooses to suspend itself to wait for I/O to complete, and does so by associating its suspension with the instance of the object from which it is expecting input.
What happens next can vary; typically the waiting process, which is now running, reattempts to remove a character from the input queue. It is possible some other process got to it first, in which case, it merely goes back to waiting for the queue to become non empty.
So, the OS doesn't explicitly route the character from the device to the application; a series of implicit and indirect steps does.
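A toy user-space model of that arrangement may help. This is only a sketch, not kernel code: the class and method names are invented, threads stand in for processes, and a Python threading.Condition stands in for the kernel's wait queue.

```python
import threading
from collections import deque

class DeviceQueue:
    """Toy model of a driver's input queue plus the waiters attached to it."""

    def __init__(self):
        self._chars = deque()
        self._not_empty = threading.Condition()

    def interrupt_handler(self, char):
        """Simulated interrupt: enqueue the character and wake any waiters."""
        with self._not_empty:
            self._chars.append(char)
            self._not_empty.notify_all()

    def read_char(self):
        """Suspend the caller until the queue is non-empty, then take one character."""
        with self._not_empty:
            # Recheck after every wake-up: another reader may have got there first.
            while not self._chars:
                self._not_empty.wait()
            return self._chars.popleft()

def demo():
    """Two readers block on the same device; two 'interrupts' deliver 'a' and 'b'."""
    device = DeviceQueue()
    received = []
    received_lock = threading.Lock()

    def reader():
        char = device.read_char()
        with received_lock:
            received.append(char)

    readers = [threading.Thread(target=reader) for _ in range(2)]
    for thread in readers:
        thread.start()
    for char in "ab":
        device.interrupt_handler(char)
    for thread in readers:
        thread.join()
    return sorted(received)

if __name__ == "__main__":
    print(demo())
```

Note that the handler never names a particular reader. Waiters attach themselves to the queue object, and whichever one runs first after the wake-up takes the character.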

InfiniBand: transfer rate depends on MPI_Test* frequency

I'm writing a multi-threaded OpenMPI application, using MPI_Isend and MPI_Irecv from several threads to exchange hundreds of messages per second between ranks over InfiniBand RDMA.
Transfers are on the order of 400-800 KByte, generating about 9 Gbps in and out for each rank, well within the capacity of FDR. Simple MPI benchmarks also show good performance.
The completion of the transfers is checked upon by polling all active transfers using MPI_Testsome in a dedicated thread.
The transfer rates I achieve depend on the message rate, but more importantly also on the polling frequency of MPI_Testsome. That is, if I poll, say, every 10ms, the requests finish later than if I poll every 1ms.
I'd expect that if I poll every 10ms instead of every 1ms, I'd at most be informed of finished requests 9ms later. I would not expect fewer calls to MPI_Testsome to delay the completion of the transfers themselves, and thus slow down the total transfer rates. I'd expect MPI_Testsome to be entirely passive.
Anyone here have a clue why this behaviour could occur?
The observed behaviour is due to the way operation progression is implemented in Open MPI. Posting a send or receive, no matter if it is done synchronously or asynchronously, results in a series of internal operations being queued. Progression is basically the processing of those queued operations. There are two modes that you can select at library build time: one with an asynchronous progression thread and one without, the latter being the default.
When the library is compiled with the async progression thread enabled, a background thread takes care of processing the queue. This allows background transfers to proceed in parallel with the user's code but increases the latency. Without async progression, operations are faster but progression can only happen when the user code calls into the MPI library, e.g. while in MPI_Wait or MPI_Test and family. The MPI_Test family of functions is implemented in such a way as to return as fast as possible. That means that the library has to balance a trade-off between doing work in the call, thus slowing it down, and returning quickly, which means fewer operations are progressed on each call.
Some of the Open MPI developers, notably Jeff Squyres, visit Stack Overflow every now and then. He could possibly provide more details.
This behaviour is hardly specific to Open MPI. Unless heavily hardware-assisted, MPI is usually implemented following the same methods.
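The effect can be seen in a deliberately simplified model. This is not Open MPI's implementation and uses no MPI calls: the classes, the "budget" of operations per call, and the step counts are all assumptions made up for illustration.

```python
class ToyTransfer:
    """A transfer that needs a number of internal operations before it is complete."""

    def __init__(self, steps):
        self.remaining = steps

class ToyLibrary:
    """Library without a progress thread: work happens only inside test()."""

    def __init__(self, budget=1):
        self.budget = budget      # operations progressed per call, kept small to return fast

    def test(self, transfer):
        progressed = min(self.budget, transfer.remaining)
        transfer.remaining -= progressed
        return transfer.remaining == 0

def completion_time_ms(steps, poll_interval_ms, budget=1):
    """Simulated time at which test() first reports the transfer as complete."""
    library = ToyLibrary(budget)
    transfer = ToyTransfer(steps)
    now_ms = 0
    while True:
        now_ms += poll_interval_ms    # the application does other things between polls
        if library.test(transfer):
            return now_ms

if __name__ == "__main__":
    print(completion_time_ms(steps=5, poll_interval_ms=1))
    print(completion_time_ms(steps=5, poll_interval_ms=10))
```

In this model a transfer needing five internal operations completes after 5 ms when polled every 1 ms, but only after 50 ms when polled every 10 ms. The poll is not a passive status check; each call is also what moves the transfer forward.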

How do system calls like select() or poll() work under the hood?

I understand that async I/O ops via select() and poll() do not use processor time, i.e. it's not a busy loop, but then how are these really implemented under the hood? Is it supported in hardware somehow, and is that why there is not much apparent processor cost for using these?
It depends on what the select/poll is waiting for. Let's consider a few cases; I'm going to assume a single-core machine for simplification.
First, consider the case where the select is waiting on another process (for example, the other process might be carrying out some computation and then outputs the result through a pipeline). In this case the kernel will mark your process as waiting for input, and so it will not provide any CPU time to your process. When the other process outputs data, the kernel will wake up your process (give it time on the CPU) so that it can deal with the input. This will happen even if the other process is still running, because modern OSes use preemptive multitasking, which means that the kernel will periodically interrupt processes to give other processes a chance to use the CPU ("time-slicing").
The picture changes when the select is waiting on I/O; network data, for example, or keyboard input. In this case, while archaic hardware would have to spin the CPU waiting for input, all modern hardware can put the CPU itself into a low-power "wait" state until the hardware raises an interrupt, an event that the kernel handles specially. In the interrupt handler the CPU will record the incoming data, and after returning from the interrupt it will wake up your process to allow it to handle the data.
There is no hardware support. Well, there is... but it is nothing special, and it depends on what kind of file descriptor you are watching. If there is a device driver involved, the implementation depends on the driver and/or the device. Take sockets, for example. If you wait for some data to read, there is a sequence of events:
1. Some process calls the poll()/select()/epoll() system call to wait for data on a socket. There is a context switch from user mode to the kernel.
2. The NIC interrupts the processor when a packet arrives. The interrupt routine in the driver pushes the packet onto the back of a queue.
3. A kernel thread takes data from that queue and wakes up the network code inside the kernel to process that packet.
4. When the packet is processed, the kernel determines which socket was expecting it, saves the data in the socket buffer and returns from the system call to user space.
This is just a very brief description, there are a lot of details missing but I think that is enough to get the point.
Another example where no drivers are involved is a unix socket. If you wait for data from one of them, the process that waits is added to a list. When another process on the other side of the socket writes data, the kernel checks that list and step 4 applies again.
I hope it helps. I think that examples are the best way to understand it.
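From user space the effect is easy to observe. The following sketch (the function name and the 0.1 second delay are arbitrary choices for illustration) waits in select() on one end of a local socket pair while another thread writes to the other end later, and measures how much CPU time the process used while waiting:

```python
import select
import socket
import threading
import time

def wait_for_data(delay=0.1):
    """Block in select() until a peer writes; return the data and CPU seconds used."""
    reader, writer = socket.socketpair()

    def send_later():
        time.sleep(delay)
        writer.sendall(b"ping")

    sender = threading.Thread(target=send_later)
    sender.start()

    cpu_before = time.process_time()
    # The calling thread sleeps in the kernel here until `reader` becomes readable
    # (or the 5 second timeout expires).
    ready, _, _ = select.select([reader], [], [], 5.0)
    cpu_used = time.process_time() - cpu_before

    data = reader.recv(16) if ready else b""
    sender.join()
    reader.close()
    writer.close()
    return data, cpu_used

if __name__ == "__main__":
    data, cpu_used = wait_for_data()
    print(data, f"CPU seconds while waiting: {cpu_used:.4f}")
```

The wall-clock wait is about a tenth of a second, but the CPU time consumed during it should be close to zero, because the waiting thread is not scheduled at all until the kernel marks the socket readable.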

Resources