Does system time consume allocated quanta for a realtime thread? - multithreading

In a real-time thread, the user requests that in a certain period P, their thread gets a minimum number of time quanta Q. I would like to know whether the following proposition is true or false on various OSes, or is at least conditionally true, depending on the specific call or otherwise. The proposition is:
"The time spent in a system call does not count as part of the requested quanta Q."
One can derive from this the conclusion that if the program, for example, signals a condition variable, the onus is on the operating system to accept the request and return control to the user without reducing the time available for other work. That is, the user pays the cost of making the system call but NOT the cost of performing the requested work.
Clearly, this cannot be the case for some calls. For example, a memory request may require the OS to access the backing store on disk, in which case the OS cannot ensure the user gets the memory fast enough.
The converse of the signal case is also pertinent: if the waiting thread is real time, does the OS have to ensure a signal is delivered to it quickly enough to still be able to honour its promise?
The question derives from the following observation: if the real-time thread is pre-empted, that's fine as long as the OS returns control fast enough that the client thread gets the requested quanta Q. The time spent suspended obviously MUST NOT count in this case.
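For what it is worth, the split between user time and system time that the proposition relies on can at least be observed from user space. Below is a minimal sketch in Python, assuming `os.times()` reflects the OS's own accounting, with repeated `os.stat(".")` calls as an arbitrary stand-in for "a system call". The function name `cpu_time_split` is made up for illustration. It says nothing about whether a given real-time scheduler charges the system part against Q.

```python
import os

def cpu_time_split(n_calls=20000):
    """Return (user_seconds, system_seconds) consumed while making n_calls system calls."""
    before = os.times()
    for _ in range(n_calls):
        os.stat(".")  # each call enters the kernel; chosen only as an example
    after = os.times()
    # The OS reports time spent in user mode and in kernel mode separately.
    return after.user - before.user, after.system - before.system

if __name__ == "__main__":
    user, system = cpu_time_split()
    print(f"user={user:.3f}s system={system:.3f}s")
```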

Related

Does a large Max Degree Of Parallelism cause queuing?

I would like to know whether my understanding is correct: does setting a Max Degree Of Parallelism (MDOP) value larger than the machine's available processor count cause the queueing effect I describe below?
Please see this as a purely I/O asynchronous operation:
A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.
If there is a requirement for 100 HTTP endpoints to be called and the MDOP is set to 100, this will create 100 HTTP request tasks at the same time, all run in parallel. The problem is that only 16 will ever be handled at once, meaning the rest are effectively queued and will be handled once a processor frees up, resulting in an increased response time. In addition, the process will be slowed down further by other parts of the system demanding use of the 16 available processors.
Setting the MDOP to half the available processor count (8, for example, on a 16-processor machine) means that 8 HTTP request tasks will be in flight at any one time. The response times of the 8 requests will be minimal because there is no queueing of the tasks, as the set MDOP is well under the machine's available processor resources. Furthermore, another 8 processors remain available to handle any other tasks required by the machine.
The main difference is that the overall response for 100 calls will return faster with an MDOP of 100, as all 100 tasks are started at the same time, whereas with 8 there are only ever 8 requests in flight at once.
The implicit assumptions made in the question are not correct.
IO operations are generally far from saturating a core. Synchronous and asynchronous requests result in different behaviours. The former are inefficient and should be avoided. Neither should be limited to the number of available cores, but rather to the maximum concurrency of the target device completing the IO operations, assuming the software stack is doing its job correctly.
For synchronous requests, most of the time is spent waiting for the operation to complete. For example, for a network operation, the OS sends the request buffer to the NIC, which sends it asynchronously over the network link. It takes some time to be sure the data has been sent, so the NIC needs to wait a bit before marking the send request as completed. It also sometimes needs to wait for the link to be ready. During this time, the processor is free and can actually queue new requests to the NIC. Not to mention that the response to the request will take a significant time to arrive (during which neither the processor nor the link works on this specific request). When a synchronous operation needs to wait for the target device, the OS does a context switch (assuming the user code does a proper passive wait). This enables the processor to start new IO requests from other threads or to overlap the IO requests with computation when the load is high. If there are not enough threads doing IO operations, then this is the main issue, not the number of cores itself. Increasing the number of threads is not efficient, though: it just increases the number of context switches and thread migrations, resulting in significant overhead. Asynchronous operations should be used instead. In the OS stack, they may also cause many context switches, but they are generally scheduled more efficiently by the OS. Moreover, using asynchronous IO operations removes the artificial limitation imposed by the number of threads (i.e. the maximum degree of parallelism).
For asynchronous operations, one thread can start a lot of IO requests before they can actually be completed. Having more cores does not directly mean more requests can be completed in a given fixed time. This is only true if the OS IO stack is truly parallel and if the operations are limited by the OS stack rather than by the concurrency of the target device (this tends to be true nowadays, for example on SSDs, which are massively parallel). The thing is that modern processors are very fast, so a few threads should theoretically be enough to saturate the queue of most target devices, although in practice not all OS stacks are efficiently designed for modern IO devices.
Every software and hardware stack has a maximum degree of parallelism meant to saturate the device and so to mitigate the latency of IO requests. Because IO latency is generally high, IO request queues are large. "Queuing" does not mean much here since requests are eventually queued anyway. The question is whether they are queued in the OS stack rather than in the device's queue, that is, whether the degree of parallelism of the software stack (including the OS) is bigger than that of the target device (which may or may not truly process the requests in its queue in parallel). The answer is generally yes if the target application sends a lot of requests and the OS stack does not provide any mechanism to regulate the amount of incoming requests. That being said, some APIs provide one or even guarantee it (asynchronous IO ring buffers are a good example).
In short, it depends on the exact target device, the target operating system, the OS API/stack used, as well as the application itself. The system can be seen as a big platform-dependent dataflow where queues are everywhere, so one needs to carefully specify what "MDOP" and "queuing" mean in this context.
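To make the point about asynchronous requests concrete, here is a rough sketch in Python using `asyncio`. It is only an illustration under stated assumptions: `asyncio.sleep` stands in for network latency (no real HTTP is done), and `fake_request` and `run_batch` are made-up names. A single thread keeps up to `max_in_flight` simulated requests pending at once, regardless of how many cores the machine has.

```python
import asyncio
import time

async def fake_request(i, limiter, stats):
    # The semaphore caps how many simulated requests are in flight at once.
    async with limiter:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.05)  # stand-in for network latency; no CPU is used while waiting
        stats["in_flight"] -= 1
    return i

async def run_batch(n_requests, max_in_flight):
    limiter = asyncio.Semaphore(max_in_flight)
    stats = {"in_flight": 0, "peak": 0}
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_request(i, limiter, stats) for i in range(n_requests))
    )
    elapsed = time.perf_counter() - start
    return results, stats["peak"], elapsed

if __name__ == "__main__":
    for limit in (8, 100):
        _, peak, elapsed = asyncio.run(run_batch(100, limit))
        print(f"limit={limit:3d} peak in flight={peak:3d} elapsed={elapsed:.2f}s")
```

With a limit of 8 the batch needs roughly thirteen rounds of waiting, while with a limit of 100 all waits overlap. In a real system the sensible limit would come from the target device or remote service, not from the core count.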
You cannot expect anyone to know what you mean by MDOP unless you mention the precise technology in the context of which you are using this term. Microsoft SQL Server has a concept of MDOP, but you are talking about HTTP requests, so you are probably not talking about MSSQL. So, what are you talking about? Anyway, on with the question.
A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.
No, it doesn't mean that. It means that the computer can execute 16 CPU instructions simultaneously. (If we disregard pipelines, superscalar pipelines, memory contention, etc.) A "Task" is a very high-level concept which involves all sorts of things besides just executing CPU instructions. For example, it involves waiting for I/O to complete, or even waiting for events to occur, events which might be raised by other tasks.
When a system allows you to set the value of some concept such as a "degree of parallelism", this means that there is no silver bullet for that value, so depending on the application at hand, different values will yield different performance benefits. Knowing your specific usage scenario you can begin with an educated guess for a good value, but the only way to know the optimal value is to try and see how your actual system performs.
Specifically with degree of parallelism, it depends on what your threads are doing. If your threads are mostly computation-bound, then a degree of parallelism close to the number of physical cores will yield best results. If your threads are mostly I/O bound, then your degree of parallelism should be orders of magnitude larger than the number of physical cores, and the ideal value depends on how much memory each thread is consuming, because with too many threads you might start hitting memory bottlenecks.
Proof: check how many threads are currently alive on your computer. (Use some built-in system monitor utility, or download one if need be.) Notice that you have thousands of threads running. And yet look at your CPU utilization: it is probably close to zero. That's because virtually all of those thousands of threads are doing nothing most of the time but waiting for stuff to happen, like for you to press a key or for a network packet to arrive.
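The same observation can be reproduced in a few lines. The sketch below is in Python and assumes `time.sleep` is an acceptable stand-in for a thread blocked on I/O; `blocked_worker` and `run_blocked_threads` are hypothetical names. It starts many more threads than any desktop has cores and compares wall-clock time with CPU time.

```python
import threading
import time

def blocked_worker(delay):
    # time.sleep blocks the thread in the OS without consuming CPU,
    # much like a thread waiting for a key press or a network packet.
    time.sleep(delay)

def run_blocked_threads(n_threads, delay):
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    threads = [
        threading.Thread(target=blocked_worker, args=(delay,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

if __name__ == "__main__":
    wall, cpu = run_blocked_threads(200, 0.2)
    print(f"200 threads: wall={wall:.2f}s cpu={cpu:.2f}s")
```

If the threads were limited by the number of cores, 200 waits of 0.2 s each would take far longer than the wall time actually measured.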

How to choose priority for threads?

Considering one core, when multiple requests arrive at a server at the same timestamp and all have the same priority, which request would be allotted a thread first?
Ex: The CPU has a single core and 2 threads. Now 4 people have made requests (processes) A, B, C, D to a server, and the server needs to assign threads to the requests in the message queue in order to process them. But which 2 processes would be given the chance to be assigned those 2 threads first?
Assume they all arrived at the same timestamp and have equal priority.
TUSHAR, there is a bit of a language gap occurring here. Considering you chose, kernel, and didn't seem to think it was something to do with algebra, I am going to translate your question:
In a single CPU system, when multiple interrupts are asserted
simultaneously, and all have the same priority, which handler would be
serviced first?
The first bit of info is that most interrupt controllers are little more than a priority encoder with some extra glue. As such, they have no notion of same priority, but that is less important than you might think.
Real Time Operating Systems, in particular, seek to disassociate their implementation with the hardware, and may even dynamically adjust interrupt priorities to suit the current workload. The key here is that the OS spends a minimal time at the mercy of the interrupt controller, and chooses what to do based upon its state. As the system designer, you can choose what happens.
Time Sharing Operating Systems also have some control over this; but typically less as they strive for maximum throughput rather than predictable response. As such, they might do anything from first-in-first-served, random-served, or even random-starved.
So the answer to your question depends upon your environment. For the most part, if you have a very simple environment (eg. an executive like vxWorks or freeRTOS), expect it to follow the dictates of the interrupt controller. If you have a more sophisticated device OS (eg. INTEGRITY or QNX) it is up to your configuration. If you have Linux/winDOS, there are likely 320 control knobs that all result in burning the toast.
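As one illustration of "you can choose what happens", here is a sketch of just one possible tie-breaking policy, first-come-first-served among equal priorities. It is written in Python, and `ReadyQueue` and `dispatch` are hypothetical names, not how any particular OS or interrupt controller does it. An arrival counter breaks ties, so even requests with the same priority and the same timestamp get a definite order: the order in which they were enqueued.

```python
import heapq
import itertools

class ReadyQueue:
    """Priority queue that serves equal-priority requests in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def push(self, priority, name):
        # Lower number means higher priority; the arrival counter breaks ties.
        heapq.heappush(self._heap, (priority, next(self._arrival), name))

    def pop(self):
        _priority, _arrival, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)

def dispatch(requests, n_workers):
    """Return (requests given a worker first, requests left waiting)."""
    queue = ReadyQueue()
    for priority, name in requests:
        queue.push(priority, name)
    first_batch = [queue.pop() for _ in range(min(n_workers, len(queue)))]
    waiting = [queue.pop() for _ in range(len(queue))]
    return first_batch, waiting

if __name__ == "__main__":
    print(dispatch([(1, "A"), (1, "B"), (1, "C"), (1, "D")], n_workers=2))
```

A random or starvation-prone policy would be just as easy to write, which is the point: the outcome is a design decision.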

Process & thread scheduling overhead

There are a few things I don't quite understand when it comes to scheduling:
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
Does the OS schedule by process or by an individual kernel thread? It would make more sense to schedule each process and within that time window run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
If I'm reading a socket (blocking), my thread will get swapped out until data is available, then a hardware interrupt will be triggered and the data will be moved to RAM (either by the CPU or by the NIC if it supports DMA). Am I correct to assume that the thread will not necessarily be swapped back in at that point to handle the incoming data?
I'm asking primarily about Linux, but I would imagine the info would also be applicable to Windows as well.
I realize it's a bunch of different questions, I'm trying to clear up my understanding on this topic.
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
No. Pretty much all modern operating systems use pre-emption, allowing interactive processes that suddenly need to do work (because the user hit a key, data was read from the disk, or a network packet was received) to interrupt CPU bound tasks.
Does the OS schedule by process or by an individual kernel thread? It would make more sense to schedule each process and within that time window run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
That's a complex optimization decision. The cost of blowing out the instruction and data caches is typically large compared to the cost of changing the address space, so this isn't as significant as you might think. Typically, picking which thread to schedule of all the ready-to-run threads is done first and process stickiness may be an optimization affecting which core to schedule on.
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
Obviously, threads have to run through a very large number of RAM accesses because switching threads requires a large number of such accesses. Hard drive and network I/O are generally slow enough that a thread that's waiting for such a thing is descheduled.
Fast SSDs change things a bit. One thing I'm seeing a lot of lately is long-treasured optimizations that use a lot of CPU to try to avoid disk accesses can be worse than just doing the disk access on some modern machines!
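For the ballpark question about the length of the time window, one option is to ask the kernel rather than guess. A small sketch in Python follows, assuming a platform where `os.sched_rr_get_interval` is available (Linux and some other Unix systems). The helper name `round_robin_interval_ms` is made up, and the reported value is only what the kernel returns for the calling process under its current policy. It is not a guarantee of how long a thread actually runs before being pre-empted.

```python
import os

def round_robin_interval_ms():
    """Return the kernel-reported round-robin interval for this process in ms, or None."""
    if not hasattr(os, "sched_rr_get_interval"):
        return None  # not exposed on this platform
    try:
        # pid 0 means "the calling process"; the result is in seconds.
        return os.sched_rr_get_interval(0) * 1000.0
    except OSError:
        return None

if __name__ == "__main__":
    print("reported interval (ms):", round_robin_interval_ms())
```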

Memory addressing in assembly / multitasking

I understand how programs in machine code can load values from memory into registers, perform jumps, or store values in registers to memory, but I don't understand how this works for multiple processes. A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses?
I have another question regarding multitasking that is somewhat related. How does the OS, which isn't running, stop a thread and move on to the next? Is this done with timed interrupts? If so, then how can the values in registers be preserved for a thread? Are they saved to memory before control is given to a different thread? Or, rather than timed interrupts, does the thread simply choose a good time to give up control? In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it? Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Edit: Or are executables edited before being run to compensate for the correct offsets?
That's not how it works. All modern operating systems virtualize the available memory, giving every process the illusion that it has 2 gigabytes of memory (or more) and doesn't have to share it with anybody. The key component in a machine that does this is the MMU, nowadays built into the processor itself. Another core feature of this virtualization is that it isolates processes: one misbehaving process cannot bring another one down with it.
Yes, a clock tick interrupt is used to interrupt the currently running code. Processor state is simply saved on the stack. The operating system scheduler then checks if any other thread is ready to run and has a high enough priority to get first in line. Some extra code ensures that everybody gets a fair share. Then it is just a matter of setting up the MMU to resume execution on the other thread. If no thread is ready to run then the CPU gets physically turned off with the HALT instruction, to be woken again by the next clock interrupt.
This is the ten-thousand-foot view; it is well covered in any book about operating system design.
A process is allocated memory on the fly, so must it use relative addressing?
No, it can use relative or absolute addressing depending on what it is trying to address.
At least historically, the various different addressing modes were more about local versus remote memory. Relative addressing was for memory addresses close to the current address while absolute was more expensive but could address anything. With modern virtual memory systems, these distinctions may be no longer necessary.
A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I'm not sure about this one. This is normally taken care of by the compiler. Again, modern virtual memory systems make this complexity unnecessary.
Are they saved to memory before control is given to a different thread?
Yes. Typically all of the state (registers, etc.) is stored in a process control block (PCB), the registers and other context are loaded from the new PCB, and execution begins in the new context. The PCB can be stored on the stack or in kernel memory, or it can utilize processor-specific operations to optimize this process.
Or, rather than timed interrupts, does the thread simply choose a good time to give up control.
The thread can yield control -- put itself back at the end of the run queue. It can also wait for some IO or sleep. Thread libraries then put the thread in wait queues and switch to another context. When the IO is ready or the sleep expires, the thread is put back into the run queue. The same happens with mutex locks. It waits for the lock in a wait queue. Once the lock is available, the thread is put back into the run queue.
In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Either the thread can run (perform CPU instructions) or it is waiting -- either on IO or a sleep. It can ask to yield but typically it is doing so by [again] sleeping or waiting on IO.
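The run-queue behaviour described above can be imitated with a toy cooperative scheduler. This is a sketch in Python and not how a real kernel is written: generators play the role of threads, each `yield` plays the role of a thread giving up control, and the generator's suspended frame stands in for the saved registers. The names `worker` and `run_round_robin` are made up for the example.

```python
from collections import deque

def worker(name, steps):
    for i in range(steps):
        # Each yield hands control back to the scheduler, like a thread yielding the CPU.
        yield f"{name} step {i}"

def run_round_robin(tasks):
    run_queue = deque(tasks)
    trace = []
    while run_queue:
        task = run_queue.popleft()
        try:
            trace.append(next(task))  # resume the task exactly where it left off
        except StopIteration:
            continue                  # the task has finished; do not requeue it
        run_queue.append(task)        # put it back at the end of the run queue
    return trace

if __name__ == "__main__":
    for line in run_round_robin([worker("A", 2), worker("B", 1)]):
        print(line)
```

A pre-emptive system differs in that the timer interrupt forces the switch instead of the task volunteering, but the bookkeeping around the queue is similar.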
I probably walked into this question quite late, but then, it may be of use to some other programmers. First - the theory.
The modern day operating system will virtualize the memory, and to do so, it maintains, within its system memory area, a series of page pointers. Each page is of a fixed size (usually 4K), and when any program seeks some memory, it is allocated memory addresses that are virtualized using the memory page pointers. This approximates the behaviour of the "segment" registers in the prior generation of processors.
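The translation step can be sketched as a toy model. This is written in Python and is a simplification under stated assumptions: a single-level table, 4K pages, and a hypothetical `AddressSpace` class. Real hardware uses multi-level tables walked by the MMU. The point it shows is that two processes can use the same virtual address and still end up at different physical locations.

```python
PAGE_SIZE = 4096  # 4K pages, as in the text

class AddressSpace:
    """Toy per-process page table: virtual page number -> physical frame number."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.page_table:
            # A real OS would handle this as a page fault.
            raise LookupError(f"page fault at {virtual_address:#x}")
        return self.page_table[page] * PAGE_SIZE + offset

if __name__ == "__main__":
    process_a = AddressSpace({0: 7})
    process_b = AddressSpace({0: 3})
    # Same virtual address, different physical addresses.
    print(hex(process_a.translate(0x10)), hex(process_b.translate(0x10)))
```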
Now when the scheduler decides to get another process running, it may or may not keep the previous process in memory. If it keeps it in memory, then all that the scheduler does is save the entire register snapshot (now including the YMM registers; this bit was a complex issue earlier, as there was no single instruction that saved the entire context: read up on XSAVE), which has a fixed format (available in the Intel SW manual). This is stored in the memory space of the scheduler itself, along with the information on the memory pages that were being used.
If, however, the scheduler needs to "dump" the context of the process that is about to go to sleep to the hard disk (this situation usually arises when the process that is waking up needs an extraordinary amount of memory), then the scheduler writes the memory pages to disk blocks (called the pagefile, a reserved area of the disk, and also the source of the "old grandmother wisdom" that the pagefile must be equal in size to real memory) and preserves the memory page pointer addresses as offsets in the pagefile. When the process wakes up, the scheduler reads the offset addresses from the pagefile, allocates real memory, populates the memory page pointers, and then loads the contents from the disk blocks.
Now, to answer your specific questions :
1. Do you need to use only relative addressing, or can you use absolute?
Ans. You may use either. Whatever you perceive to be absolute is also relative, as the memory page pointer relativizes that address invisibly. There is no truly absolute memory address anywhere (including the IO device memories) except in the kernel of the operating system itself. To test this, you may disassemble any .EXE program to see that the entry point is always CALL 0010, which clearly implies that each thread gets a different "0010" to start the execution.
2. How do threads get life, and what if a thread surrenders the unused slice?
Ans. The threads usually get a slice - modern systems have 20ms as the usual standard - but this is sometimes changed in special purpose compilation for servers that do not have many hardware interrupts to deal with - in order of their position on the process queue. A thread usually surrenders its slice by calling function sleep(), which is a formal (and very nice way) to surrender your balance part of the time slice. Most libraries implementing asynchronous reads, or interrupt actions, call sleep() internally, but in many instances, top level programs also call sleep() - e.g. to create a time gap. An invocation to sleep will certainly change the process context - the CPU actually is not given the liberty to sleep using NOP.
The other method is to wait for an IO to complete, and this is handled differently. The program on asking for an IO process, will cede its time slice, and the process scheduler flags this thread to be in "WAITING FOR AN IO" state - and this thread will not be given a time slice by the processor till its intended IO is completed, or timed out. This feature helps programmers as they do not have to explicitly write a sleep_until_IO() kind of interface.
Trust this sets you going further in your explorations.

Two processes on two CPUs -- is it possible that they complete at exactly the same moment?

This is sort of a strange question that's been bothering me lately. In our modern world of multi-core CPUs and multi-threaded operating systems, we can run many processes with true hardware concurrency. Let's say I spawn two instances of Program A in two separate processes at the same time. Disregarding OS-level interference which may alter the execution time for either or both processes, is it possible for both of these processes to complete at exactly the same moment in time? Is there any specific hardware/operating-system mechanism that may prevent this?
Now before the pedants grill me on this, I want to clarify my definition of "exactly the same moment". I'm not talking about time in the cosmic sense, only as it pertains to the operation of a computer. So if two processes complete at the same time, that means that they complete with a time difference that is so small, the computer cannot tell the difference.
EDIT : by "OS-level interference" I mean things like interrupts, various techniques to resolve resource contention that the OS may use, etc.
Actually, thinking about time in the "cosmic sense" is a good way to think about time in a distributed system (including multi-core systems). Not all systems (or cores) advance their clocks at exactly the same rate, making it hard to actually tell which events happened first (going by wall clock time). Because of this inability to agree, systems tend to measure time by logical clocks. Two events happen concurrently (i.e., "exactly at the same time") if they are not ordered by sharing data with each other or otherwise coordinating their execution.
Also, you need to define when exactly a process has "exited." Thinking in Linux, is it when it prints an "exiting" message to the screen? When it returns from main()? When it executes the exit() system call? When its process state is set to "exiting" in the kernel? When the process's parent receives a SIGCHLD?
So getting back to your question (with a precise definition for "exactly at the same time"), the two processes can end (or do any other event) at exactly the same time as long as nothing coordinates their exiting (or other event). What counts as coordination depends on your architecture and its memory model, so some of the "exited" conditions listed above might always be ordered at a low level or by synchronization in the OS.
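One common way to make "not ordered by coordination" precise is a vector clock, which is one particular kind of logical clock and is picked here only as an illustration. In the Python sketch below, each event carries one counter per process, and two events are concurrent when neither clock is less than or equal to the other in every component. The function names and the example clock values are made up.

```python
def happened_before(a, b):
    """True if the event with vector clock a is ordered strictly before the one with b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def concurrent(a, b):
    """True if neither event is ordered before the other."""
    return not happened_before(a, b) and not happened_before(b, a)

if __name__ == "__main__":
    exit_a = (1, 0)            # process A exits without hearing from B
    exit_b = (0, 1)            # process B exits without hearing from A
    exit_b_after_msg = (1, 1)  # B exits after receiving a message A sent before exiting
    print(concurrent(exit_a, exit_b))                  # neither ordered before the other
    print(happened_before(exit_a, exit_b_after_msg))   # A's exit precedes B's here
```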
You don't even need "exactly" at the same time. Sometimes you can be close enough to seem concurrent. Even on a single core with no true concurrency, two processes could appear to exit at the same time if, for instance, two child processes exited before their parent was next scheduled. It doesn't matter which one really exited first; the parent will see that in an instant while it wasn't running, both children died.
So if two processes complete at the same time, that means that they complete with a time difference that is so small, the computer cannot tell the difference.
Sure, why not? Except for shared memory (and other resources, see below), they're operating independently.
Is there any specific hardware/operating-system mechanism that may prevent this?
Anything that is a resource contention:
memory access
disk access
network access
explicit concurrency management via locks/semaphores/mutexes/etc.
To be more specific: these are separate CPU cores. That means they have computing circuitry implemented in separate logic circuits. From the wikipedia page:
The fact that each core can have its own memory cache means that it is quite possible for most of the computation to occur as interaction of each core with its own cache. Once you have that, it's just a matter of probability. That's not to say that algorithms take a nondeterministic amount of time, but their inputs may come from a probabilistic distribution and the amount of time it takes to run is unlikely to be completely independent of input data unless the algorithm has been carefully designed to take the same amount of time.
Well I'm going to go with I doubt it:
Internally any sensible OS maintains a list of running processes.
It therefore seems sensible for us to define the moment that the process completes as the moment that it is removed from this list.
It also strikes me as fairly unlikely (but not impossible) that a typical OS will go to the effort to construct this list in such a way that two threads can independently remove an item from this list at exactly the same time (processes don't terminate that frequently and removing an item from a list is relatively inexpensive - I can't see any real reason why they wouldn't just lock the entire list instead).
Therefore for any two terminating processes A and B (where A terminates before B), there will always be a reasonably large time period (in a cosmic sense) where A has terminated and B has not.
That said it is of course possible to produce such a list, and so in reality it depends on the OS.
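The "just lock the entire list" design can be sketched as follows. This is a toy model in Python: threads play the role of terminating processes and a set plays the role of the process table. `terminate_all` is a hypothetical helper, and no claim is made that any real OS is implemented this way. Because every removal happens while holding the same lock, the removals form a strict sequence and no two of them happen at the same instant.

```python
import threading

def terminate_all(n_processes):
    table = set(range(n_processes))  # toy "process table" of pids
    table_lock = threading.Lock()
    removal_order = []

    def terminate(pid):
        # Holding the lock serializes removals from the table.
        with table_lock:
            table.remove(pid)
            removal_order.append(pid)

    threads = [
        threading.Thread(target=terminate, args=(pid,))
        for pid in range(n_processes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table, removal_order

if __name__ == "__main__":
    table, order = terminate_all(8)
    print("remaining:", table, "removal order:", order)
```

The order itself may differ from run to run, but there is always exactly one order.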
Also I don't really understand the point of this question, in particular what do you mean by
the computer cannot tell the difference
In order for the computer to tell the difference, it has to be able to check the running process table at a point where A has terminated and B has not. If the OS schedules removing process B from the process table immediately after process A, then it could very easily be that no such code gets a chance to execute, and so by some definitions it isn't possible for the computer to tell the difference. This situation holds true even on a single core / CPU processor.
Yes, without any OS scheduling interference they could finish at the same time, if they don't have any resource contention (shared memory, external IO, system calls). When either of them has a lock on a resource, it will force the other to stall waiting for the resource to free up.

Resources