when polling is better than interrupt? - io

I was looking at this pic:
and have 2 questions regarding it:
1. how much faster should a disk be in order for polling to be refered over the interrupt?
I thought that beacuse of the ISR and the process jumping (when using interrupt) - that polling will be better when using fast SSD for example , where the polling takes less time than the the interrupt(ISR+ scheduler ). Am I mistaken?
and the second question is : If my disk is slower than the SSD in my first question, but still fast - is there any reason to prefer polling?
I was wondering if the fact that I'll have lot's of I/O - read requests is a good enough reaon to prefer polling.
thanks!

You could imagine a case where the overhead of running interrupt handlers (invalidating your caches, setting up the interrupt stack to run on) could be slower than actually doing the read or write, in which case I guess polling would be faster.
However, SSDs are fast compared to disk, but still much slower than memory. SSDs still take anywhere from tens of microseconds to milliseconds to complete each I/O, whereas doing the interrupt setup and teardown uses all in-memory operations and probably takes a maximum of, say, 100-1000 cycles (~100ns to 1us).
The main benefit of using interrupts instead of polling is that the "disabled" effect of using interrupts is much lower, since you don't have to schedule your I/O threads to continuously poll for more data while there is none available. It has the added benefit that I/O is handled immediately, so if a user types a key, there won't be a pause before the letter appears on the screen while the I/O thread is being scheduled. Combining these issues is a mess - inserting arbitrary stalls into your I/O thread makes polling less resource-intense at the expense of even slower response times. These may be the primary reasons nobody uses polling in kernel I/O designs.
From a user process' perspective (using software interrupt systems such as Unix signals), either way can make sense since polling usually means a blocking syscall such as read() or select() rather than (say) checking a boolean memory-mapped variable or issuing device instructions like the kernel version of polling might do. Having system calls to do this work for us in the OS can be beneficial for performance because the userland thread doesn't get its cache invalidated any more or less by using signals versus any other syscall. However this is pretty OS dependent, so profiling is your best bet for figuring out which codepath is faster.

Related

Would it makes the kernel level thread clearly preferable to user level thread if system calls is as fast as procedure calls?

Some web searching results told me that the only deficiency of kernel-level thread is the slow speed of its management(create, switch, terminate, etc.). It seems that if the operation on the kernel-level thread is all through system calls, the answer to my question will be true. However, I've searched a lot to find whether the management of kernel-level thread is all through system call but find nothing. And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time. So it seems impossible for programmers to write some explicit system calls to manage threads. I'm appreciative of any ideas.
Some web searching results told me that the only deficiency of kernel-level thread is the slow speed of its management(create, switch, terminate, etc.).
It's not that simple. To understand, think about what causes task switches. Here's a (partial) list:
a device told a device driver that an operation completed (some data arrived, etc) causing a thread that was waiting for the operation to unblock and then preempt the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
enough time passed; either causing an "end of time slice" task switch, or causing a sleeping thread to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the thread accessed virtual memory that isn't currently accessible, triggering the kernel's page fault handler which finds out that the current task has to wait while the kernel fetches data from from swap space or from a file (if the virtual memory is part of a memory mapped file), or has to wait for kernel to free up RAM by sending other pages to swap space (if virtual memory was involved in some kind of "copy on write"); causing a task switch because the currently running task can't continue. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
a new process is being created, and its initial thread preempts the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread asked kernel to do something with a file and kernel got "VFS cache miss" that prevents the request from being performed without any task switches. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to a different process to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to the same process to unblock and preempt. For this case you're running user-space code when you find out that a task switch is needed, so in theory user-space task switching is faster, but in practice it can just as easily be an indicator of poor design (using too many threads and/or far too much lock contention).
a new thread is being created for the same process; and the new thread preempts the currently running thread. For this case you're running user-space code when you find out that a task switch is needed, so in user-space task switching is faster; but only if kernel isn't informed (e.g. so that utilities like "top" can properly display details for threads) - if kernel is informed anyway then it doesn't make much difference where the task switch happens.
For most software (which doesn't use very many threads); doing task switches in the kernel is faster. Of course it's also (hopefully) fairly irrelevant for performance (because time spent switching tasks should be tiny compared to time spend doing other work).
And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time.
Yes; but possibly not for the reason you think.
Another problem with user-space threading (besides making most task switches slower) is that it can't support global thread priorities without becoming a severe security disaster. Specifically; a process can't know if its own thread is higher or lower priority than a thread belonging to a different process (unless it has information about all threads for the entire OS, which is information that normal processes shouldn't be trusted to have); so user-space threading leads to wasting CPU time doing unimportant work (for one process) when there's important work to do (for a different process).
Another problem with user-space threading is that (for some CPUs - e.g. most 80x86 CPUs) the CPUs are not independent, and there may be power management decisions involved with scheduling. For examples; most 80x86 CPUs have hyper-threading (where a core is shared by 2 logical processors), where a smart scheduler may say "one logical processor in the core is running a high priority/important thread, so the other logical processor in the same core should not run a low priority/unimportant thread because that would make the important work slower"; most 80x86 CPUs have "turbo boost" (with similar "don't let low priority threads ruin the turbo-boost/performance of high priority thread" possibilities); and most CPUs have thermal management (where scheduler might say "Hey, these threads are all low priority, so let's underclock the CPU so that it cools down and can go faster later (has more thermal headroom) when there's high priority/more important work to do!").
Would it makes the kernel level thread clearly preferable to user level thread if system calls is as fast as procedure calls?
If system calls were as fast as normal procedure calls, then the performance differences between user-space threading and kernel threading would disappear (but all the other problems with user-space threading would remain). However, the reason why system calls are slower than normal procedure calls is that they pass through a kind of "isolation barrier" (that isolates kernel's code and data from malicious user-space code); so to make system calls as fast as normal procedure calls you'd have to get rid of the isolation (effectively turning the kernel into a kind of "global shared library" that can be dynamically linked) but without that isolation you'll have an extreme security disaster. In other words; to have any hope of achieving acceptable security, system calls must be slower than normal procedure calls.
Your basic premise is wrong. System calls are much slower than procedure calls in almost every interesting architecture.
The perceived cpu throughput is based on pipelining, speculative execution and fetching. The syscall stops the pipeline, invalidates the speculative execution and halts the speculative fetching, is a store and instruction barrier, and may flush the write fifo.
So, the processor slows down to its ‘spec’ speed around the syscall, accelerating back up until the syscall return, whereupon it does about the exact same thing.
Attempts to optimise this area have given rise to lots of papers named after fictional James Bond organizations, and not conciliatory enough apologies from not embarrassed enough cpu product managers. Google spectre as an example, then follow the associated links.
The other cost of syscall
A bit over 30 years ago, some smart guys wrote a paper about least privilege. Conceptually, it is a stunner. The basic premise is that whatever your program is doing, it should do it with the least privilege possible.
If your program is inverting arrays, according to the notion of least privilege, it should not be able to disable interrupts. Disabling interrupts can cause a very difficult to diagnose system failure. Simple user code should not have this ability.
The notion of user and kernel modes of execution evolved from early computer systems, and (with the possible exception of the iax32 / 80286 ) are increasingly showing their inadequacy in the connected computer environment. At one point in time you could say "this is a single user system"; but the IoT dweebs have made everything multi-user.
Least privilege insists that all code should execute with the minimum privilege required to complete the task at hand. Thus, nothing should be in the kernel that absolutely doesn't need to be. If you think that is a radical thought, in Ken Thompson's 1977(?) paper on the UNIX kernel he states exactly the same thing.
So no, putting your junk in the kernel just means you have increased the attack surface for no valid reason. Try to think in terms of exposing minimum risk, it leads to better software and better sleep.

What consequences are there to disabling interrupts/preemption for a long period?

In the Linux kernel, there are a lot of functions, for example on_each_cpu_mask, that have documentation warning against passing in callbacks that run for long periods of time because interrupts and/or preemption will be disabled for the duration of the callback. It's unclear though if the callback "must" be short because it being long will just cause terrible system performance, or because a long running callback will actually break the correctness of the system somehow.
Obviously while interrupts and preemption are disabled the busy core can't do any other work, and you can concoct situations where you could force deadlock by having two CPUs waiting for one another. But for the sake of argument say the callback just does a bunch of pure computation that takes a significant amount of time and then returns. Can this somehow break the kernel? If so how long is too long? Or does performance just suffer until the computation is done?
Disabling interrupts on one CPU for any period of time will eventually result in all other CPUs hanging, as the kernel frequently needs to perform short operations on all CPUs. Leaving interrupts off on any CPU will prevent it from completing these operations. (ref)
Disabling interrupts on all CPUs, either intentionally or unintentionally, will make the system completely unresponsive. This includes the user interface (including TTY switching at the console), as well as tasks which the kernel normally performs automatically, like responding to network activity (like ARP responses and TCP acknowledgements) or performing thermal management tasks (like adjusting system fan speeds). This is bad. Don't do it.

Process & thread scheduling overhead

There are a few things I don't quite understand when it come to scheduling:
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is ran. Is that assumption correct? Are there any ball park numbers how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
Does the OS schedule by procces or by an individual kernel thread? It would make more sense to schedule each process and within that time window run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
If I'm reading a socket (blocking) my thread will get swapped out until data is available then a hardware interrupt will be triggered and the data will be moved to the RAM (either by the CPU or by the NIC if it supports DMA) . Am I correct to assume that the thread will not necessarily be swapped back in at that point to handle he incoming data?
I'm asking primarily about Linux, but I would imagine the info would also be applicable to Windows as well.
I realize it's a bunch of different questions, I'm trying to clear up my understanding on this topic.
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is ran. Is that assumption correct? Are there any ball park numbers how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
No. Pretty much all modern operating systems use pre-emption, allowing interactive processes that suddenly need to do work (because the user hit a key, data was read from the disk, or a network packet was received) to interrupt CPU bound tasks.
Does the OS schedule by proces or by an individual kernel thread? It would make more sense to schedule each process and within that time window run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
That's a complex optimization decision. The cost of blowing out the instruction and data caches is typically large compared to the cost of changing the address space, so this isn't as significant as you might think. Typically, picking which thread to schedule of all the ready-to-run threads is done first and process stickiness may be an optimization affecting which core to schedule on.
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
Obviously, threads have to run through a very large number of RAM accesses because switching threads requires a large number of such accesses. Hard drive and network I/O are generally slow enough that a thread that's waiting for such a thing is descheduled.
Fast SSDs change things a bit. One thing I'm seeing a lot of lately is long-treasured optimizations that use a lot of CPU to try to avoid disk accesses can be worse than just doing the disk access on some modern machines!

libevent / epoll number of worker threads?

I am following this example. Line#37 says that number of worker threads should be equal of number of cpu cores. Why is that so?
If there are 10k connections and my system has 8 cores, does that mean 8 worker threads will be processing 10k connections? Why shouldn't I increase this number?
Context Switching
For an OS to context switch between threads takes a little bit of time. Having a lot of threads, each one doing comparatively little work, means that the context switch time starts becoming a significant portion of the overall runtime of the application.
For example, it could take an OS about 10 microseconds to do a context switch; if the thread does only 15 microseconds worth of work before going back to sleep then 40% of the runtime is just context switching!
This is inefficient, and that sort of inefficiency really starts to show up when you're up-scaling as your hardware, power and cooling costs go through the roof. Having few threads means that the OS doesn't have to switch contexts anything like as much.
So in your case if your requirement is for the computer to handle 10,000 connections and you have 8 cores then the efficiency sweet spot will be 1250 connections per core.
More Clients Per Thread
In the case of a server handling client requests it comes down to how much work is involved in processing each client. If that is a small amount of work, then each thread needs to handle requests from a number of clients so that the application can handle a lot of clients without having a lot of threads.
In a network server this means getting familiar with the the select() or epoll() system call. When called these will both put the thread to sleep until one of the mentioned file descriptors becomes ready in some way. However if there's no other threads pestering the OS for runtime the OS won't necessarily need to perform a context switch; the thread can just sit there dozing until there's something to do (at least that's my understanding of what OSes do. Everyone, correct me if I'm wrong!). When some data turns up from a client it can resume a lot faster.
And this of course makes the thread's source code a lot more complicated. You can't do a blocking read of data from the clients for instance; being told by epoll() that a file descriptor has become ready for reading does not mean that all the data you're expecting to receive from the client can be read immediately. And if the thread stalls due to a bug that affects more than one client. But that's the price paid for attaining the highest possible efficiency.
And it's not necessarily the case that you would want just 8 threads to go with your 8 cores and 10,000 connections. If there's something that your thread has to do for each connection every time it handles a single connection then that's an overhead that would need to be minimised (by having more threads and fewer connections per thread). [The select() system call is like that, which is why epoll() got invented.] You have to balance that overhead up against the overhead of context switching.
10,000 file descriptors is a lot (too many?) for a single process in Linux, so you might have to have several processes instead of several threads. And then there's the small matter of whether the hardware is fundamentally able to support 10,000 within whatever response time / connection requirements your system has. If it doesn't then you're forced to distribute your application across two or more servers, and that can start getting really complicated!
Understanding exactly how many clients to handle per thread depends on what the processing is doing, whether there's harddisk activity involved, etc. So there's no one single answer; it's different for different applications, and also for the same application on different machines. Tuning the clients / thread to achieve peak efficiency is a really hard job. This is where profiling tools like dtrace on Solaris, ftrace on Linux, (especially when used like this, which I've used a lot on Linux on x86 hardware) etc. can help because they allow you to understand at a very fine scale precisely what runtime is involved in your thread handling a request from a client.
Outfits like Google are of course very keen on efficiency; they get through a lot of electricity. I gather that when Google choose a CPU, hard drive, memory, etc. to put into their famously home grown servers they measure performance in terms of "Searches per Watt". Obviously you have to be a pretty big outfit before you get that fastidious about things, but that's the way things go ultimately.
Other Efficiencies
Handling things like TCP network connections can take up a lot of CPU time in it's own right, and it can be difficult to understand whereabouts in a system all your CPU runtime has gone. For network connections things like TCP offload in the smarter NICs can have a real benefit, because that frees the CPU from the burden of doing things like the error correction calculations.
TCP offload mirrors what we do in the high speed large scale real time embedded signal processing world. The (weird) interconnects that we use require zero CPU time to run them. So all of the CPU time is dedicated to processing data, and specialised hardware looks after moving data around. That brings about some quite astonishing efficiencies, so one can build a system with more modest, lower cost, less power hungry CPUs.
Language can have a radical effect on efficiency too; Things like Ruby, PHP, Perl are all very well and good, but everyone who has used them initially but has then grown rapidly ended up going to something more efficient like Java/Scala, C++, etc.
Your question is even better than you think! :-P
If you do networking with libevent, it can do non-blocking I/O on sockets. One thread could do this (using one core), but that would under-utilize the CPU.
But if you do “heavy” file I/O, then there is no non-blocking interface to the kernel. (Many systems have nothing to do that at all, others have some half-baked stuff going on in that field, but non-portable. –Libevent doesn’t use that.) – If file I/O is bottle-necking your program/test, then a higher number of threads will make sense! If a hard-disk is used, and the i/o-scheduler is reordering requests to avoid disk-head-moves, etc. it will depend on how much requests the scheduler takes into account to do its job the best. 100 pending requests might work better then 8.
Why shouldn't you increase the thread number?
If non-blocking I/O is done: all cores are working with thread-count = core-count. More threads only means more thread-switching with no gain.
For blocking I/O: you should increase it!

Is the timeslice given to a thread that is waiting on I/O "wasted"?

I'm currently analyzing the pros and cons of writing a server using a threaded model or event driven model. I already know the many cons of the threaded model (does not scale well due to context switching overhead, virtual memory limitations, etc.) but I came upon another one in my analysis and would like to verify that my understanding of threads is correct.
If I have 5 threads, 1 which is doing work (not being blocked), 4 which are being blocked waiting for I/O (for example waiting on data from a socket), isn't the CPU time given to those 4 threads essentially wasted since no work is actually being done (assuming no data arrives)? The timeslice given to those 4 blocked threads is taking away time from the 1 thread actually doing work, correct?
In this case I'm explicitly saying that the socket is a blocking one.
No. Although it actually depends on the type of OS, type of I/O (polled/DMA) and device driver architecture, most device I/O is performed using DMA + interrupts. In such cases a thread is put into a sleep state until an interrupt is triggered for such I/O operations and scheduler does not visit those threads until their pending I/O is complete. Only polling I/O can cause consumption of CPU, such as PIO mode for hard disks.
Threads don't need to use their entire timeslice. I don't know the specifics, but if blocked threads even get time, they certainly don't use it all.
Obviously, these details vary platform-to-platform-to-environment-to-etc.

Resources