I'm developing a multi-process application, which runs on an iMX6 quad core CPU running linux 3.0.35.
One of the processes/threads I am running requires a higher priority than the others.
So I use SCHED_FIFO for that thread, and SCHED_OTHER for the others.
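For reference, the high-priority thread is configured along these lines (a minimal sketch, not my exact code, assuming pthreads; the priority value 80 is an arbitrary example):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

void* rt_work(void*) {
    // ... time-critical work ...
    return nullptr;
}

int main() {
    pthread_t t;
    pthread_create(&t, nullptr, rt_work, nullptr);

    sched_param sp{};
    sp.sched_priority = 80;  // SCHED_FIFO priorities run 1..99 on Linux
    // Needs CAP_SYS_NICE or root; returns an error code on failure.
    if (int err = pthread_setschedparam(t, SCHED_FIFO, &sp))
        std::fprintf(stderr, "pthread_setschedparam: %d\n", err);

    pthread_join(t, nullptr);
}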
While the machine is under heavy load, I noticed the following scenario (using the DS-5 Streamline profiling tool):
A low priority thread gets preempted by a high priority thread on CPU x.
While the high priority thread is on CPU x, the other low priority threads get CPU time on other CPUs. But not the original thread - it doesn't get CPU time for a very long time (even a few seconds).
When the high-priority thread is done using CPU x, the CPU remains IDLE for a long time (up to 2 seconds), even though the machine is under heavy load (10+ threads need CPU time; the profiling tool marks the CPU as IDLE).
I'm looking for ways to understand what this means:
- Isn't the Linux scheduler supposed to give the preempted thread CPU time after it was kicked out of CPU x?
- How come the CPU is idle for so long even though there are many threads that require CPU time?
What am I looking at?
I have full access to the kernel code, and some more profiling tools, if it helps.
Thanks,
Ofer
I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, threads do not share memory, and each thread does CPU bound work only.
If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
A thread that is asleep only costs memory, not CPU time, while it remains asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you get to a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? If so, why?
Edit
When I say the thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use sleep_for, which "blocks the execution of the current thread for at least the specified sleep_duration".
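A minimal sketch of what I mean (standard C++; the 100 ms duration is just an example):

#include <chrono>
#include <thread>

int main() {
    using namespace std::chrono_literals;
    // The thread is descheduled here: it consumes no CPU while blocked,
    // and sleeps for at least 100 ms (possibly longer).
    std::this_thread::sleep_for(100ms);
}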
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. This happens in many places, including:
Managing the file system.
Managing physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
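A concrete instance of this is false sharing: two threads repeatedly writing counters that happen to sit in the same cache line will invalidate each other's cached copy on every write, even though they never touch each other's data. A minimal sketch (the hard-coded 64-byte line size is a common value, not a guarantee):

#include <atomic>
#include <thread>

// Both counters share one cache line, so each write by one thread
// invalidates the line in the other core's cache.
struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padding each counter onto its own (assumed) 64-byte line removes the
// contention; std::hardware_destructive_interference_size is the
// portable spelling of 64 where your library provides it.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class T>
void hammer(T& s) {
    std::thread t1([&] { for (int i = 0; i < 50'000'000; ++i) s.a++; });
    std::thread t2([&] { for (int i = 0; i < 50'000'000; ++i) s.b++; });
    t1.join();
    t2.join();
}

int main() {
    Shared s; hammer(s);  // typically measurably slower...
    Padded p; hammer(p);  // ...than this, on most multicore machines
}

Both versions are accounted as 100% "user" time; the unpadded one simply gets less work done per cycle.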
So:
1) If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) A thread that is asleep only costs memory, not CPU time, while it remains asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another one is that the memory used by the threads may also come at a cost. For instance, if the sum of the memory used by all processes (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging. And that also uses CPU resources.
5) In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you get to a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact on the CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the old thread's registers.
Flush the processor's memory cache.
Invalidate the VM mapping registers, etcetera. This includes the TLBs that #bazza mentioned.
Load the new thread's registers.
Take performance hits from extra main memory reads and VM page translations caused by the earlier cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound like much, but if your application is switching threads rapidly, that could amount to many milliseconds in each second.
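If you want a rough feel for that number on your own hardware, here is a hedged micro-benchmark sketch: two threads ping-pong one byte through a pair of pipes, so every round trip forces at least two switches. It measures the whole round trip (pipe overhead included), so read the result as an upper bound; pinning both threads to one core (e.g. with taskset) brings it closer to a pure switch measurement.

#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    int ab[2], ba[2];  // one pipe per direction
    if (pipe(ab) != 0 || pipe(ba) != 0) return 1;

    constexpr int kRounds = 100000;
    std::thread echo([&] {
        char c;
        for (int i = 0; i < kRounds; ++i) {
            read(ab[0], &c, 1);   // block until pinged...
            write(ba[1], &c, 1);  // ...then pong back
        }
    });

    char c = 'x';
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kRounds; ++i) {
        write(ab[1], &c, 1);
        read(ba[0], &c, 1);
    }
    auto dt = std::chrono::steady_clock::now() - t0;
    echo.join();

    // Each round trip implies at least two context switches.
    std::printf("%.2f us per round trip\n",
        std::chrono::duration<double, std::micro>(dt).count() / kRounds);
}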
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days, a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed); platforms such as MS-DOS worked this way, and the CPU would remain 100% busy. Nowadays a sleep() will actually result in the thread being descheduled for the requisite time; any multitasking OS has had a proper implementation for decades.
10,000 sleeping threads will take up more CPU time, because the OS has to make scheduling judgements every timeslice tick (commonly somewhere between 1 and 16 ms, depending on the OS). The more threads it has to check for being ready to run, the more CPU time that checking takes.
Translation Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translation Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they are effectively content-addressable memory. That is really hard to implement, so there's never that much of it. Thus the more memory allocations there are (which there will be if you add more and more threads), the more this resource is eaten up, to the point where the OS may have to start swapping in and out different loadings of the TLB in order for all the virtual memory allocations to be accessible. If this starts happening, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and thence everything else built on top, e.g. Java, C#, the lot) are actually quite careful in how requests for virtual memory are managed, minimising the times they actually have to ask the OS for more virtual memory. Basically, they seek to satisfy requested allocations out of pools they've already got, rather than each malloc() resulting in a call to the OS. This takes the pressure off the TLBs.
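The idea in miniature: grab one large block from the OS up front, then carve small requests out of it, so thousands of allocations cost one system call (and one set of page mappings) instead of thousands. A deliberately oversimplified bump-allocator sketch, with no free() support:

#include <cstddef>
#include <cstdlib>
#include <new>

// Toy bump allocator: one big upfront allocation, then pointer
// arithmetic per request. Real allocators (glibc malloc, jemalloc,
// etc.) are far more sophisticated, but the principle -- serve small
// requests from a pool you already own -- is the same.
class Pool {
    char* base_;
    std::size_t used_ = 0;
    std::size_t cap_;
public:
    explicit Pool(std::size_t cap)
        : base_(static_cast<char*>(std::malloc(cap))), cap_(cap) {
        if (!base_) throw std::bad_alloc{};
    }
    ~Pool() { std::free(base_); }

    void* alloc(std::size_t n) {
        n = (n + 15) & ~std::size_t{15};       // keep 16-byte alignment
        if (used_ + n > cap_) return nullptr;  // pool exhausted
        void* p = base_ + used_;
        used_ += n;
        return p;                              // no OS call on this path
    }
};

int main() {
    Pool pool(1 << 20);           // one 1 MiB request to the allocator/OS
    for (int i = 0; i < 1000; ++i)
        (void)pool.alloc(64);     // 1000 allocations, zero further OS calls
}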
I was studying Operating Systems and am stuck on a doubt: when a currently running process on the processor requests some I/O, the CPU becomes idle and the scheduler then schedules another process to execute on the CPU. How does the kernel come to know that the CPU has become idle? Is there some kind of hardware interrupt sent by the processor?
The OS 'knows' that a CPU needs to become idle when it performs a scheduling run and has fewer ready threads than cores.
If the scheduler runs and has only two ready threads that can use the CPU, but has four actual cores available, then it will direct the 'surplus' cores to an 'idle' thread that is a loop around an 'HLT' or similar instruction, which causes the core to stop fetching and executing instructions until an interrupt is received.
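The idle loop itself is privileged code, but the pattern ("do nothing, without burning CPU, until something wakes you") can be sketched in user space, with a condition variable standing in for the hardware interrupt:

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool work_ready = false;

int main() {
    // The "idle" thread consumes no CPU while blocked in wait();
    // notify_one() plays the role of the interrupt that un-halts a core.
    std::thread idle([] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return work_ready; });  // "halted" until woken
        // ... a real scheduler would now pick a ready thread ...
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    {
        std::lock_guard<std::mutex> lk(m);
        work_ready = true;
    }
    cv.notify_one();  // the "interrupt"
    idle.join();
}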
In my opinion, the kernel is always what is running on the CPU at these decision points, and it is the kernel that decides which process or interrupt handler to schedule.
I understand that only threads in the running state actually consume CPU, but as shown below by top on a QNX platform, the total CPU usage is 99.3%, which is cumulative across four threads, of which only one is in the running state.
Any idea why the CPU usage is higher than what the running thread alone accounts for?
CPU states: 99.3% user, 0.6% kernel
CPU 0 Idle: 0.0%
CPU 1 Idle: 0.0%
Memory: 0 total, 1G avail, page size 4K
PID TID PRI STATE HH:MM:SS CPU COMMAND
704585 11 10 Run 0:01:52 24.82% App
704585 10 10 Rdy 0:01:52 24.68% App
704585 13 10 Rdy 0:01:52 24.53% App
704585 16 10 Rdy 0:01:49 24.19% App
The threads that are ready were in a running state when they consumed the CPU. Given the very similar CPU values, I'll bet that all of those threads are always either ready to run or running.
Threads in the RUNNING state are the only ones consuming CPU at any given instant, while those in the READY state are eligible to consume CPU over an interval of time.
Your processor has two cores, so up to two threads can be RUNNING at once. Any number can be READY (i.e. unblocked and runnable, but not necessarily currently executing on a core), and those will be run according to the priority and scheduling method that applies. Since you are querying the process manager for thread states, one of those two cores will at that instant be running a thread in the process manager. The other core will still be running an available READY thread from amongst the set of unblocked threads in the system, again based on priority and scheduling algorithm. This is why just one of your four threads shows as RUNNING, while the others are merely READY.
That the other three threads are READY means that, assuming they are at the same priority as your currently-running thread, the scheduler will run them on the available cores according to the scheduling algorithm you are using, as long as no higher-priority threads are or become READY. The thread state reflects the instantaneous state at the time the process manager is asked for thread state information from the kernel, while the usage statistic reflects activity over time, not an instantaneous state.
Over a brief period of time, if you have four threads, all READY and at the same priority, scheduled round-robin, you would see close to 25% utilization attributable to each of the four threads. But only two can be RUNNING at any one instant if you have only two cores, and if you are busy actually getting the information about thread states, then one of those two available cores is occupied grabbing that info, so you will only ever see up to one other thread in a RUNNING state.
If you are using QNX, I suggest you read and memorize the System Architecture manual (http://www.qnx.com/download/feature.html?programid=26183). Ch. 2's discussion of thread lifecycle and scheduling addresses this question.
Hope that helps.
Let's say we have a CPU with 20 cores and a process with 20 CPU-intensive threads that are independent of each other: one thread per CPU core.
I'm trying to figure out whether context switching happens in this case. I believe it does, because there are system processes in the operating system that need CPU time too.
I understand that there are different CPU architectures and some answers may vary but can you please explain:
How context switching happens e.g. on Linux or Windows and some known CPU architectures? And what happens under the hood on modern hardware?
What if we have 10 cores and 20 threads or the other way around?
How to calculate how many threads we need if we have n CPUs?
Does the CPU cache (L1/L2) get emptied after a context switch?
Thank you
How context switching happens e.g. on Linux or Windows and some known CPU architectures? And what happens under the hood on modern hardware?
A context-switch happens when an interrupt occurs and that interrupt, together with the kernel thread and process state data, specify a set of running threads that is different than the set running before the interrupt. Note that, in OS terms, an interrupt may be either a 'real' hardware interrupt that causes a driver to run and that driver requests a scheduling run, or a syscall from a thread that is already running. In either case, the OS scheduling state-machine decides whether to change the set of threads running on the available cores.
The kernel can change the set of running threads by stopping some and running others. It can stop any thread running on any core by queueing up a preemption request and generating a hardware interrupt on that core, forcing the core to run its interprocessor driver to handle the request.
What if we have 10 cores and 20 threads?
Depends on what the threads are doing. If they are in any state other than ready/running (e.g. blocked on I/O or inter-thread comms), there will be no context switching between them because nothing is running. If they are all ready/running, 10 of them will run forever on the 10 cores until there is an interrupt. Most systems have a periodic timer interrupt that can have the effect of sharing the available cores around the threads.
or the other way around
10 threads run on 10 cores. The other 10 cores are halted. The OS may move the threads around the cores, e.g. to prevent uneven heat dissipation across the die.
How to calculate how many threads we need if we have n CPUs?
App-dependent. It would be nice if all cores were always 100% busy with exactly as many ready threads as there are cores but, since most threads are blocked for much more time than they are running, it's difficult, except in some end cases (e.g. your '20 CPU-intensive threads on 20 cores'), to come up with an optimal number.
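As a starting point in code (a rule-of-thumb sketch, not an optimal formula; the scaling factor is the fraction of time you expect a thread to actually be runnable rather than blocked):

#include <algorithm>
#include <thread>

// Rough heuristic: CPU-bound work wants about one thread per core;
// work that blocks frequently can tolerate proportionally more.
unsigned suggested_threads(double fraction_of_time_runnable) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // hardware_concurrency() may report 0
    return std::max(1u,
        static_cast<unsigned>(cores / fraction_of_time_runnable));
}

int main() {
    unsigned cpu_bound = suggested_threads(1.0);   // roughly one per core
    unsigned io_heavy  = suggested_threads(0.25);  // blocked ~75% of the time
    (void)cpu_bound;
    (void)io_heavy;
}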
Does the CPU cache (L1/L2) get emptied after a context switch?
Maybe - it depends entirely on the data usage of the threads. The caches will get reloaded on demand, as usual. There is no 'context-switch total cache reload' but, if the threads access different, large arrays of data while running, then the cache (L1 at least) will indeed get fully reloaded during the thread's run.
I have a Java program which spawns multiple threads, say 10-20. This program is scheduled to run on a machine that has 32 processors.
I am keen to know if all the processors' power would be utilized by these threads.
Solaris is the environment; does that make any difference?
A good profiler should tell you this. If the threads are compute bound, then yes, you will use as many cores as you have threads; if you are blocked doing I/O or on contention, it will be less than that.
Given that you aren't on Windows, the following doesn't apply, but a decent profiler should still be able to provide a measurement of CPU cycles burned by your process for a given period of time...
If you are on Windows a good free tool to use is the Windows Performance Toolkit (xperf) which is now part of the platform sdk. It will show you the processor cycles burned for each thread or process for a period of time (as opposed to just elapsed times).