How does D3D9's Presentation Interval work?

If I set the presentation interval in Direct3D9 to D3DPRESENT_INTERVAL_ONE, when I call Present it waits until the monitor updates. It always waits the correct amount of time and (presumably) doesn't use a spinlock.
I'd like to be able to do the same "waiting" that Present does in Direct3D9, however I don't want to use Direct3D. How exactly does it wait for vsync perfectly without using a spinlock? Can just the waiting be programmed without Direct3D?

Synchronization with the vertical retrace is handled by the driver in a device-dependent manner. It's not inconceivable that some implementation just busy-waits, polling a device register until it detects the beginning of the retrace interval. The alternative would be to sleep while waiting on a device interrupt, which frees up the CPU for other tasks but increases latency because of the necessary kernel-mode/user-mode transitions. It's also possible for a driver to implement a hybrid approach: estimate the time until the retrace, sleep for a bit less than that, and then busy-wait.
I don't know which of these three possible implementations is typical, but it doesn't really matter. Windows doesn't provide any device-independent means for an application to synchronize with the vertical retrace outside of DirectX (and, I guess, OpenGL). Unlike a video card driver, applications don't have direct access to the hardware, so they can't read device registers or request and handle interrupts.
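To illustrate the hybrid strategy mentioned above, here is a sketch in Win32 terms: sleep through most of the interval, then busy-wait on a high-resolution clock for the final stretch. It assumes you already have a predicted timestamp for the next retrace (next_vblank_qpc), which is exactly the piece of information plain user-mode code can't obtain without going through DirectX or the driver.

#include <windows.h>

/* Hybrid wait: coarse Sleep() for most of the interval, then a short
 * busy-wait on QueryPerformanceCounter for precision. next_vblank_qpc is
 * an assumed, externally supplied prediction in QPC ticks. */
void wait_for_predicted_vblank(LONGLONG next_vblank_qpc)
{
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);

    /* Sleep until roughly 2 ms before the predicted retrace, freeing the CPU. */
    LONGLONG remaining_ms =
        (next_vblank_qpc - now.QuadPart) * 1000 / freq.QuadPart;
    if (remaining_ms > 2)
        Sleep((DWORD)(remaining_ms - 2));

    /* Busy-wait across the last couple of milliseconds. */
    do {
        QueryPerformanceCounter(&now);
    } while (now.QuadPart < next_vblank_qpc);
}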

Related

Thread sleeps longer than expected

I have this code:
use std::{thread, time};

let k = time::Instant::now();
thread::sleep(time::Duration::from_micros(10));
let elapsed = k.elapsed().as_micros();
println!("{}", elapsed);
My output is always somewhere between 70 and 90. I expect it to be 10, so why is this number roughly 7x higher?
This actually doesn't really have anything to do with Rust.
On a typical multi-processing, user-interactive operating system (i.e., every consumer OS you've used), your thread isn't special. It's one among many, and the CPUs need to be shared.
Your operating system has a component called a scheduler, whose job it is to share the hardware resources. It will boot your thread off the CPU quite often. This typically happens:
On every system call
Every time an interrupt hits the CPU
When the scheduler kicks you off to give other processes/threads a chance (this is called preemption, and typically happens 10s of times a second)
Thus, your userland process can't possibly do anything timing-related with such fine precision.
There are several solution paths you can explore:
Increase the amount of CPU your operating system gives you. Some ideas:
Increase the process' priority
Pin the thread to a particular CPU core, to give it exclusive use (this means you lose throughput, because if your thread is idle, no other thread's work can borrow that CPU)
Switch to a real-time operating system which makes guarantees about latency and timing.
Offload the work to some hardware that's specialized to do it, without the involvement of your process.
E.g. offload sine wave generation to a hardware sound-card, WiFi radio processing to a radio controller, etc.
Use your own microcontroller to do the real-time stuff, and communicate with it over something like I2C or SPI.
In your case of running some simple code in a userland process, I think your easiest bet is to just pin your process. Your existing code will work as-is; you'll just lose the throughput of one of your cores (but luckily, you have multiple).
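For example, on Linux the pinning idea looks like this at the C level (other platforms and runtimes expose the same thing under different names, e.g. SetThreadAffinityMask on Windows or affinity crates in Rust):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core so the scheduler stops migrating
 * it (Linux-specific sketch; truly exclusive use of the core also requires
 * keeping other work off it). */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* pid 0 means "the calling thread"; returns 0 on success. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}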

How does an OS handle input operations?

I'm learning how an OS works and I know that peripheral devices can send interrupts that the OS then handles. But I don't have a clear picture of how it actually handles them.
What happens when I move the mouse around? Does it send interrupts every millisecond? How can the OS handle the execution of a process and mouse positioning at the same time, especially if there is only one CPU? How can the OS perform context switches efficiently in this case?
Or, for example, say there are 3 launched processes. Process 1 is active; processes 2 and 3 are ready to go but in a pending state. The user inputs something with the keyboard in process 1. As I understand it, the OS scheduler can run process 2 or process 3 while process 1 is awaiting input. I assume that the trick is in the timing: the processor is so fast that it's able to run processes 2 and 3 between the user's key presses.
Also, I would appreciate any literature references where I could get familiar with how I/O works, especially in terms of timing and scheduling.
Let's assume it's some kind of USB device. For USB you have 2 layers of device drivers - the USB controller driver and the USB peripheral (keyboard, mouse, joystick, touchpad, ...) driver. The USB peripheral driver asks the USB controller driver to poll the device regularly (e.g. maybe every 8 milliseconds) and the USB controller driver sets that up and the USB controller hardware does this polling (not software/driver), and if it receives something from the USB peripheral it'll send an IRQ back to the USB controller driver.
When the USB controller sends an IRQ it causes the CPU to interrupt whatever it was doing and execute the USB controller driver's IRQ handler. The USB controller driver's IRQ handler examines the state of the USB controller and figures out why it sent an IRQ; and notices that the USB controller received data from a USB peripheral; so it determines which USB peripheral driver is responsible and forwards the received data to that USB peripheral's device driver.
Note: Because it's bad to spend too much time handling an IRQ (it can cause the handling of other, more important IRQs to be postponed), there will often be some kind of separation between the IRQ handler and the higher-level logic at some point. This is almost always some variation of a queue, where the IRQ handler puts a notification on a queue and then returns, and the notification on the queue causes something else to be run later. This might happen in the middle of the USB controller driver (e.g. the USB controller driver's IRQ handler does a little bit of work, then creates a notification that causes the rest of the USB controller driver to do the rest of the work). There are multiple ways to implement this "queue of notifications" (deferred procedure calls, message passing, some other form of communication, etc.) and different operating systems use different approaches.
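As a rough user-space analogue of that "notification queue" pattern (a pthread sketch, not actual kernel code; in a kernel the equivalents are things like tasklets, work queues, or DPCs), the "top half" only records that work exists and wakes a worker, which does the slow part later:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static int pending;                       /* the "queue" of notifications */

void top_half(void)                       /* keep this as short as possible */
{
    pthread_mutex_lock(&lock);
    pending++;                            /* record that work exists */
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cv);             /* wake the deferred worker */
}

void *bottom_half(void *arg)              /* runs in its own thread */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending == 0)
            pthread_cond_wait(&cv, &lock);   /* sleep until notified */
        pending--;
        pthread_mutex_unlock(&lock);
        puts("doing the slow part of the work here");
    }
}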
The USB peripheral's device driver (e.g. keyboard driver, mouse driver, ...) receives the data sent by the USB controller's driver (that came from the USB controller that got it from polling the USB peripheral); and examines that data. Depending on what the data contains the USB peripheral's device driver will probably construct some kind of event describing what happened in a "standard for that OS" way. This can be complicated (e.g. involve tracking past state of the device and lookup tables for keyboard layout, etc). In any case the resulting event will be forwarded to something else (often a user-space process) using some form of "queue of notifications". This might be the same kind of "queue of notifications" that was used before; but might be something very different (designed to suit user-space instead of being designed for kernel/device drivers only).
Note: In general every OS that supports multi-tasking provides one or more ways that normal processes can use to communicate with each other; called "inter-process communication". There are multiple possibilities - pipes, sockets, message passing, etc. All of them interact with scheduling. E.g. a process might need to wait until it receives data and call a function (e.g. to read from a pipe, or read from a socket, or wait for a message, or ..) that (if there's no data in the queue to receive) will cause the scheduler to be told to put the task into a "blocked" state (where the task won't be given any CPU time); and when data arrives the scheduler is told to bump the task out of the "blocked" state (so it can/will be given CPU time again). Often (for good operating systems), whenever a task is bumped out of the "blocked" state the scheduler will decide if the task should preempt the currently running task immediately, or not; based on some kind of task/thread priorities. In other words; if a lower priority task is currently running and a higher priority task is waiting to receive data, then when the higher priority task receives the data it was waiting for the scheduler may immediately do a task switch (from lower priority task to higher priority task) so that the higher priority task can examine the data it received extremely quickly (without waiting for ages while the CPU is doing less important work).
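As a tiny user-space illustration of the "blocked until data arrives" behaviour described above (plain POSIX; error handling omitted, and nothing here is specific to keyboards or mice):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    pipe(fds);

    if (fork() == 0) {                   /* child plays the role of the producer */
        sleep(1);                        /* pretend we're waiting for hardware */
        write(fds[1], "key: A\n", 7);
        _exit(0);
    }

    char buf[32];
    /* Blocks here: this process gets no CPU time until data is queued,
     * then the scheduler wakes it up again. */
    ssize_t n = read(fds[0], buf, sizeof buf);
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    return 0;
}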
In any case; the event (from the USB peripheral's device driver) is received by something (likely a process in user-space, likely causing that process to be unblocked and given CPU time immediately by the scheduler). This is the top of a "hierarchy/tree of stuff" in user-space; where each thing in the tree might look at the data it receives and may forward it to something else in the tree (using the same inter-process communication to forward the data to something else). For example; that "hierarchy/tree of stuff" might have a "session manager" at the top of the tree, then "GUI" under that, then several application windows under that. Sometimes an event will be consumed and not forwarded to something else (e.g. if you press "alt+tab" then the GUI might handle that itself, and the GUI won't forward it to the application window that currently has keyboard focus).
Eventually most events will end up at a normal application. Normal applications often have a language run-time that will abstract the operating system's details to make the application more portable (so that the programmer doesn't have to care which OS their application is running on). For example, for Java, the Java virtual machine might convert the operating system's event (that arrived in an "OS specific" format via an "OS specific" communication mechanism) into a generic "KeyEvent" (and notify any "KeyListener").
The entire path (from drivers to a function/method inside an application) could involve many thousands of lines of code written by hundreds of people spread across many separate layers; where the programmers responsible for one piece (e.g. GUI) don't have to worry much about what the programmers working on other pieces (e.g. drivers) do. For this reason; you probably won't find a single source of information that covers everything (at all layers). Instead, you'll find information for device driver developers only, or information for C++ application developers only, or ...
This is also why nobody will be able to provide more than a generic overview (without any OS specific or "layer specific" details) - they'd have to write 12 entire books to provide an extremely detailed answer.

Creating real-time thread on OSX

I'm working on an OSX application that transmits data to a hardware device over USB serial. The hardware has a small serial buffer that is drained at a variable rate and should always stay non-empty.
We have a write loop in its own NSThread that checks if the hardware buffer is full, and if not, writes data until it is. The majority of loop iterations don't write anything and take almost no time, but they can occasionally take up to a couple milliseconds (as timed with CACurrentMediaTime). The thread sleeps for 100ns after each iteration. (I know that sleep time seems insanely short, but if we bump it up, the hardware starts getting data-starved.)
This works well much of the time. However, if the main thread or another application starts doing something processor-intensive, the write thread slows down and isn't able to stream data fast enough to keep the device's queue from emptying.
So, we'd like to make the serial write thread real-time. I read the Apple docs on requesting real-time scheduling through the Mach API, then tried to adapt the code snippet from SetPriorityRealtimeAudio(mach_port_t mach_thread_id) in the Chromium source.
However, this isn't working - the application remains just as susceptible to serial communication slowdowns. Any ideas? I'm not sure if I need to change the write thread's behavior, or if I'm passing in the wrong thread policy parameters, or both. I experimented with various period/computation/constraint values, and with forcing a more consistent duty cycle (write for 100ns max and then sleep for 100ns) but no luck.
A related question: How can I check the thread's priority directly, and/or tell if it's starting off as real-time and then being demoted vs not being promoted to begin with? Right now I'm just making inferences from the hardware performance, so it's hard to tell exactly what's going on.
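For reference, the policy request I'm making follows this general pattern; the period/computation/constraint numbers below are placeholders rather than the exact values I used, since I tried several combinations:

#include <mach/mach.h>
#include <mach/mach_time.h>
#include <mach/thread_policy.h>
#include <stdint.h>

/* Ask Mach to treat the calling thread as time-constrained (real-time).
 * The numbers are illustrative only. */
kern_return_t make_thread_time_constrained(void)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);

    /* Convert one millisecond to Mach absolute-time units. */
    uint64_t ms = 1000000ULL * tb.denom / tb.numer;

    thread_time_constraint_policy_data_t policy;
    policy.period      = (uint32_t)(5 * ms);  /* expect to run every ~5 ms      */
    policy.computation = (uint32_t)(1 * ms);  /* need ~1 ms of CPU per period   */
    policy.constraint  = (uint32_t)(2 * ms);  /* ...within 2 ms of period start */
    policy.preemptible = FALSE;               /* another placeholder choice     */

    return thread_policy_set(mach_thread_self(),
                             THREAD_TIME_CONSTRAINT_POLICY,
                             (thread_policy_t)&policy,
                             THREAD_TIME_CONSTRAINT_POLICY_COUNT);
}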
My suggestion is to move the thread of execution that requires the highest priority into a separate process. Apple often does this for realtime processes such as driving the built-in camera. Depending on what versions of the OS you are targeting you can use Distributed Objects (predecessor to XPC) or XPC.
You can also roll your own RPC mechanism and use standard Unix fork techniques to create a separate child process. Since your main app is the owner of the child process, you should also be able to set the scheduling priority of the process in addition to the individual thread priority within the process.
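A rough sketch of the fork-based variant (the nice value and the division of work are placeholders; a production version would use XPC and a real IPC protocol rather than this skeleton):

#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        /* Child: run only the serial write loop here, and/or request the
         * Mach time-constraint policy for its single thread. */
        /* ... write loop ... */
        _exit(0);
    }

    /* The parent owns the child, so it can raise the child's scheduling
     * priority (lower nice value = higher priority; may need privileges). */
    setpriority(PRIO_PROCESS, (id_t)child, -10);

    /* ... rest of the app, talking to the child over your RPC channel ... */
    return 0;
}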
As I edit this post, I have a WWDC video playing in the background and also started a QuickTime Movie Recording task. As you can see, the real-time aspects of both those apps are running in separate XPC processes:
ps -ax | grep Video
1933 ?? 0:00.08 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
2332 ?? 0:08.94 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
XPC Services at developer.apple.com
Distributed Objects at developer.apple.com

Need help handling multiple I2C MAX3107 chips on a shared ARM9 GPIO interrupt (Linux)

Our group is working with an embedded processor (Phytec LPC3180, ARM9). We have designed a board that includes four MAX3107 UART chips on one of the LPC3180's I2C busses. In case it matters, we are running kernel 2.6.10, the latest version available for this processor (support for this product has not been very good; we've had to develop or fix a number of the drivers provided by Phytec, and Phytec seems to have no interest in upgrading the Linux code, especially the kernel version, for this product). This is too bad, in that the LPC3180 is a nice device, especially in the context of low-power embedded products that DO NOT require Ethernet and in fact don't want Ethernet (owing to the associated power consumption of Ethernet controller chips). The handler that is installed now (developed by someone else) is based on a top-half handler and bottom-half work queue approach.
When one of the four devices (MAX3107 UART chips) on the I2C bus receives a character, it generates an interrupt. The interrupt lines of all four MAX3107 chips are shared (open drain, pull-down) and the line is connected to a GPIO pin of the 3180 which is configured for level interrupts. When one of the 3107s generates an interrupt, a handler is run which does the following processing (roughly):
spin_lock_irqsave(&lock, flags);
disable_irq_nosync(irqno);
irq_enabled = 0;
irq_received = 1;
spin_unlock_irqrestore(&lock, flags);
set_queued_work();  // Queue up work for all four devices for every interrupt,
                    // because at this point we don't know which of the four
                    // 3107s generated the interrupt
return IRQ_HANDLED;
Note, and this is what I find somewhat troubling, that the interrupt is not re-enabled before leaving the above code. Rather, the driver is written such that the interrupt is re-enabled by a bottom-half work queue task (using the enable_irq(LPC_IRQ_LINE) function call). Since the work queue tasks do not run in interrupt context I believe they may sleep, something that I believe to be a bad idea for an interrupt handler.
The rationale for the above approach follows:
1. If one of the four MAX3107 uart chips receives a character and generates an interrupt (for example), the interrupt handler needs to figure out which of the four I2C devices actually caused the interrupt. However, and apparently, one cannot read the I2C devices from within the context of the upper half interrupt handler since the I2C reads can sleep, something considered inappropriate for an interrupt handler upper-half.
2. The approach taken to address the above problem (i.e. which device caused the interrupt) is to leave the interrupt disabled and exit the top-half handler after which non-interrupt context code can query each of the four devices on the I2C bus to figure out which received the character (and hence generated the interrupt).
3. Once the bottom-half handler figures out which device generated the interrupt, the bottom-half code disables the interrupt on that chip so that it doesn't re-trigger the interrupt line to the LPC3180. After doing so it reads the serial data and exits.
The primary problem here seems to be that there is no way to query the four MAX3107 UART chips from within the interrupt handler's top half. If the top half simply re-enabled the interrupt before returning, the same chip would generate the interrupt again, leading, I think, to a situation where the top half disables the interrupt, schedules the bottom-half work queues, and re-enables the interrupt, only to find itself back in the same place, because another interrupt occurs before the bottom-half code can get to the chip that caused it, and so forth.
Any advice for dealing with this driver will be much appreciated. I really don't like the idea of allowing the interrupt to be disabled in the top half of the driver yet not re-enabled prior to exiting the top-half driver code. This does not seem safe.
Thanks,
Jim
PS: In my reading I've discovered threaded interrupts as a means to deal with the above-described requirements (at least that's my interpretation of web site articles such as http://lwn.net/Articles/302043/). I'm not sure if the 2.6.10 kernel as provided by Phytec includes threaded interrupt functions. I intend to look into this over the next few days.
If your code is written properly, it shouldn't matter if a device issues interrupts before handling of prior interrupts is complete. You are correct that you don't want to do blocking operations in the top half, but blocking operations are acceptable in a bottom half; in fact, that is part of the reason bottom halves exist!
In this case I would suggest an approach where the top half just schedules the bottom half, and then the bottom half loops over all 4 devices and handles any pending requests. It could be that multiple devices need processing, or none.
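Something like the following structure, in outline (this uses the modern workqueue API for brevity; the 2.6.10 interface differs slightly, and the max3107_* helpers are hypothetical stand-ins for "query chip N over I2C and drain its FIFO"):

#include <linux/interrupt.h>
#include <linux/workqueue.h>

#define NUM_UARTS 4

int  max3107_has_pending(int idx);   /* hypothetical: read IRQ status over I2C */
void max3107_drain_fifo(int idx);    /* hypothetical: read RX FIFO, clear source */

static void uart_work_fn(struct work_struct *work);
static DECLARE_WORK(uart_work, uart_work_fn);

static irqreturn_t uart_irq_handler(int irq, void *dev_id)
{
    /* Top half: do as little as possible, just defer to the bottom half.
     * (With a level-triggered shared line you may still need to mask the
     * IRQ here until the sources are cleared, as the existing driver does.) */
    schedule_work(&uart_work);
    return IRQ_HANDLED;
}

static void uart_work_fn(struct work_struct *work)
{
    int i;

    /* Bottom half: scan every chip sharing the line; zero, one or several
     * of them may have data. The I2C transfers may sleep, which is fine
     * here because this runs in process context. */
    for (i = 0; i < NUM_UARTS; i++) {
        if (max3107_has_pending(i))
            max3107_drain_fifo(i);
    }
}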
Update:
It is true that you may overload the system with a load test, and the software may need to be optimized to handle heavy loads. Additionally, I don't have a 3180 and four 3107s (or similar) of my own to test this out on, so I am speaking theoretically, but I am not clear on why you need to disable interrupts at all.
Generally speaking when a hardware device asserts an interrupt it will not assert another one until the current one is cleared. So you have 4 devices sharing one int line:
Your top half fires and adds something to the work queue (ie triggers bottom half)
Your bottom half scans all devices on that int line (ie all four 3107s)
If one of them caused the interrupt you will then read all data necessary to fully process the data (possibly putting it in a queue for higher level processing?)
You "clear" the interrupt on the current device.
When you clear the interrupt then the device is allowed to trigger another interrupt, but not before.
More details about this particular device:
It seems that this device (MAX3107) has a buffer of 128 words, and by default you are getting interrupted after every single word. But it seems that you should be able to take better advantage of the buffer by setting the FIFO level registers. Then you will get interrupted only after that number of words has been received (or if you fill your TX FIFO up beyond the threshold, in which case you should slow down the transmit speed, i.e. buffer more in software).
It seems the idea is to basically pull data off the devices periodically (maybe every 100ms or 10ms or whatever seems to work for you) and then only have the interrupt act as a warning that you have crossed a threshold, which might schedule the periodic function for immediate execution, or increase the rate at which it is called.
Interrupts are enabled & disabled because we use level-based interrupts, not edge-based. The ramifications of that are explicitly explained in the driver code header, which you have, Jim.
Level-based interrupts were required to avoid losing an edge interrupt from a character that arrives on one UART immediately after one arriving on another: servicing the first effectively eliminates the second, so that second character would be lost. In fact, this is exactly what happened in the initial, edge-interrupt version of this driver once >1 UART was exercised.
Has there been an observed failure with the current scheme?
Regards,
The Driver Author (someone else)

What is the difference between busy-wait and polling?

From the Wikipedia article on Polling
Polling, or polled operation, in computer science, refers to actively sampling the status of an external device by a client program as a synchronous activity. Polling is most often used in terms of input/output (I/O), and is also referred to as polled I/O or software driven I/O.
Polling is sometimes used synonymously with busy-wait polling (busy waiting). In this situation, when an I/O operation is required the computer does nothing other than check the status of the I/O device until it is ready, at which point the device is accessed. In other words the computer waits until the device is ready.
Polling also refers to the situation where a device is repeatedly checked for readiness, and if it is not the computer returns to a different task. Although not as wasteful of CPU cycles as busy-wait, this is generally not as efficient as the alternative to polling, interrupt driven I/O.
So, when a thread doesn't use condition variables, is it called "polling" for the data change or "busy waiting"?
The difference between the two is what the application does between polls.
If a program polls a device, say, every second, and does something else in the meantime if no data is available (including possibly just sleeping, leaving the CPU available for others), it's polling.
If the program continuously polls the device (or resource or whatever) without doing anything in between checks, it's called a busy-wait.
This isn't directly related to synchronization. A program that blocks on a condition variable (that should signal when a device or resource is available) is neither polling nor busy-waiting. That's more like event-driven/interrupt-driven I/O.
(But for example a thread that loops around a try_lock is a form of polling, and possibly busy-waiting if the loop is tight.)
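In code, the distinction looks something like this (device_ready() and do_other_work() are hypothetical placeholders):

#include <stdbool.h>
#include <unistd.h>

bool device_ready(void);    /* hypothetical status check */
void do_other_work(void);   /* hypothetical useful work  */

void busy_wait(void)
{
    while (!device_ready())
        ;                   /* spin: nothing else happens on this CPU */
}

void poll_periodically(void)
{
    while (!device_ready()) {
        do_other_work();    /* or just sleep, freeing the CPU */
        usleep(1000);       /* check again roughly every millisecond */
    }
}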
Suppose one has a microprocessor or microcontroller which is supposed to perform some action when it notices that a button is pushed.
A first approach is to have the program enter a loop which does nothing except look to see if the button has changed yet and, once it has, perform the required action.
A second approach in some cases would be to program the hardware to trigger an interrupt when the button is pushed, assuming the button is wired to an input that can cause an interrupt.
A third approach is to configure a timer to interrupt the processor at some rate (say, 1000x/second) and have the handler for that interrupt check the state of the button and act upon it.
The first approach uses a busy-wait. It can offer very good response time to one particular stimulus, at the expense of totally tuning out everything else. The second approach uses an event-triggered interrupt. It will often offer slightly slower response time than busy-waiting, but will allow the CPU to do other things while waiting for I/O. It may also allow the CPU to go into a low-power sleep mode until the button is pushed. The third approach will offer a response time that is far inferior to the other two, but will be usable even if the hardware would not allow an interrupt to be triggered by the button push.
In cases where rapid response is required, it will often be necessary to use either an event-triggered interrupt or a busy-wait. In many cases, however, a polled approach may be most practical. Hardware may not exist to support all the events one might be interested in, or the number of events one is interested in may substantially exceed the number of available interrupts. Further, it may be desirable for certain conditions to generate a delayed response. For example, suppose one wishes to count the number of times a switch is activated, subject to the following criteria:
Every legitimate switch event will consist of an interval from 0 to 900us (microseconds) during which the switch may arbitrarily close and reopen, followed by an interval of at least 1.1ms during which the switch will remain closed, followed by an interval from 0 to 900us during which the switch may arbitrarily open and reclose, followed by an interval of at least 1.1ms during which the switch will be open.
Software must ignore the state of the switch for 950us after any non-ignored switch opening or closure.
Software is allowed to arbitrarily count or ignore switch events which occur outside the above required blanking interval, but which last less than 1.1ms.
The software's reported count must be valid within 1.99ms of the time the switch is stable "closed".
The easiest way to enforce this requirement is to observe the state of the switch 1,000x/second; if it is seen "closed" when the previous state was "open", increment the counter. Very simple and easy; even if the switch opens and closes in all sorts of weird ways, during the 900us preceding and following a real event, software won't care.
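In code, that 1,000x/second approach is little more than this (read_switch() is a hypothetical raw input read, and the function is assumed to be called from a 1 kHz timer interrupt or tick):

#include <stdbool.h>
#include <stdint.h>

bool read_switch(void);                 /* hypothetical: is the switch closed now? */

static volatile uint32_t switch_count;  /* events counted so far */
static bool previous_closed;            /* state seen at the last tick */

void timer_tick_1khz(void)              /* invoked every millisecond */
{
    bool closed = read_switch();
    if (closed && !previous_closed)     /* open -> closed, as sampled */
        switch_count++;
    previous_closed = closed;
}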
It would be possible to use a switch-input-triggered interrupt along with a timer to yield faster response to the switch input, while meeting the blanking requirement. Initially, the input would be armed to trigger the next time the switch closes. Once the interrupt was triggered, software would disable it but set a timer to trigger an interrupt after 950us. Once that timer expired, it would trigger an interrupt which would arm the interrupt to fire the next time the switch is "open". That interrupt would in turn disable the switch interrupt and again set the timer for 950us, so the timer interrupt would again re-enable the switch interrupt. Sometimes this approach can be useful, but the software is a lot more complicated than the simple polled approach. When the timer-based approach is sufficient, it is often preferable.
In systems that use a multitasking OS rather than direct interrupts, many of the same principles apply. Periodic I/O polling will waste some CPU time compared with having code which the OS won't run until certain events occur, but in many cases both the event response time and the amount of time wasted when no event occurs will be acceptable when using periodic polling. Indeed, in some buffered I/O situations, periodic polling might turn out to be quite efficient. For example, suppose one is receiving a large amount of data from a remote machine via serial port, at most 11,520 bytes will arrive per second, the device will send up to 2K of data ahead of the last acknowledged packet, and the serial port has a 4K input buffer. While one could process data using a "data received" event, it may be just as efficient to simply check the port 100x/second and process all packets received up to that point. Such polling would be a waste of time when the remote device wasn't sending data, but if incoming data was expected it may be more efficient to process it in chunks of roughly 1.15K than to process every little piece of incoming data as soon as it comes in.
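A sketch of that periodic, chunked processing (serial_available(), serial_read() and process_chunk() are hypothetical wrappers around whatever serial API is actually in use):

#include <stddef.h>
#include <unistd.h>

size_t serial_available(void);                      /* hypothetical */
size_t serial_read(char *buf, size_t len);          /* hypothetical */
void   process_chunk(const char *buf, size_t len);  /* hypothetical */

void poll_serial_forever(void)
{
    char buf[4096];                     /* matches the 4K input buffer */

    for (;;) {
        /* Drain whatever has accumulated since the last check. */
        while (serial_available() > 0) {
            size_t n = serial_read(buf, sizeof buf);
            process_chunk(buf, n);
        }
        usleep(10000);                  /* roughly 100 checks per second */
    }
}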
