CPU percentage and heavy multi-threading - Linux

I am observing strange effects with the CPU percentage shown by e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads wake up essentially simultaneously within a short period of time, do their (relatively short) computation and go back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores, each with 2 threads; it's an i7-3612QE), only 8 threads can really run at a time, so many threads will have to wait. Also, some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: if I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks at the CPU time (user + system) the kernel accounts to this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process onto a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses less CPU resources. How can that be? Can I really trust the kernel's CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU percentages.
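For illustration, such a CPU-eater can be as simple as the sketch below (a stand-in for the "while(true);" processes mentioned above; the volatile counter just keeps the compiler from removing the empty loop), started at low priority with e.g. nice -n 19:

    // A trivial CPU-eater: burns one virtual CPU doing nothing useful.
    // Run at low priority, e.g.:  nice -n 19 ./eater
    int main() {
        volatile unsigned long long counter = 0;
        for (;;) ++counter;   // never blocks, never yields
    }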
Not sure if this plays a role: I have used the profiler VTune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it is memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).

My question was essentially already answered by myself in the last paragraph: the process is memory bound. Hence the limited resource is not the CPU but the memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel mainly has the effect that more CPU cores end up waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just rather slowly. All my other observations are consistent with this.
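For anyone who wants to reproduce the effect, here is a minimal sketch (not the original application; the buffer size and thread counts are made up) of a memory-bound workload: every thread streams through a shared read-only buffer far larger than the CPU caches, so adding threads mostly adds cores that wait on RAM, and the reported CPU% rises much faster than the throughput.

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Each worker reads the whole shared buffer once. The work per thread is
    // identical, so with perfect scaling the wall time would stay roughly
    // constant as threads are added; on a memory-bound machine it does not.
    static std::uint64_t scan(const std::vector<std::uint64_t>& buf) {
        std::uint64_t sum = 0;
        for (std::uint64_t v : buf) sum += v;
        return sum;
    }

    static double run(unsigned nthreads, const std::vector<std::uint64_t>& buf) {
        std::atomic<std::uint64_t> total{0};
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < nthreads; ++i)
            workers.emplace_back([&] { total += scan(buf); });
        for (auto& t : workers) t.join();
        std::chrono::duration<double> wall = std::chrono::steady_clock::now() - start;
        return wall.count();
    }

    int main() {
        // 512 MiB: much larger than any L1/L2/L3 cache, so every pass streams from RAM.
        std::vector<std::uint64_t> buf(64 * 1024 * 1024, 1);
        for (unsigned n : {1u, 2u, 4u, 8u})
            std::cout << n << " thread(s): " << run(n, buf) << " s wall time\n";
    }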

Related

Concurrent threads, processes and multiple cores

I'm trying to understand the usage of CPU cores with regard to concurrent threads and processes. Please see the below questions:
Assume I have 2 CPU cores. When there are 2 processes running, each process has only 1 thread. Are the two processes using the 2 cores?
Assume I have 2 CPU cores. When there is 1 process running, which has 2 threads. Are the two threads using the 2 cores?
Assume I have 2 CPU cores. When there are 2 processes running, each process has 2 threads. How are the two cores used by those processes and threads?
How do I calculate the maximum real concurrent execution given the CPU cores? What other factors should I take into account?
1,2: Quite likely, but not definitely. A portion of the system software determines what runs where. It would be unlikely to choose to keep a process or thread waiting for CPU attention when there is a CPU that is otherwise idle, but it isn't absolute.
Most processing involves some sort of transfer to and from a device, network, etc. Typically this necessitates a period of inactivity waiting for the transfer to complete. During this inactivity, another process / thread can run on that CPU. So, if a given process spends 30% of its time on the CPU and 70% on I/O, then I can run about 3 of them concurrently on a single CPU without degrading performance.
3,4: As the paragraph above implies, depending upon the workload, there could be any distribution of the threads among the CPUs. If the threads were all compute bound (100% CPU), most operating systems switch between them at a granularity small enough that all remain lively, and large enough that the switching has a minimal impact on them.
This scheduling may take other notions into consideration, such as data affinity. Recently touched bits of data are likely to remain in the CPU cache after a thread has relinquished it. The next time the thread is to be scheduled, it would be best to put it onto that CPU, to avoid repeating the effort of warming the cache for it. The scheduler might also assume that two threads of one process (address space) are more likely to share data, so should prefer the same CPU.
4: Depending upon your system, there are likely to be many performance analysis tools available. top, on UNIX-inspired systems, is a simple tool which gives system-wide utilization information, and the simple tool time will show how much time a process spent on a CPU vs real-world time. If you run each of your tasks sequentially, noting the CPU time they take, and then time them running concurrently, the ratio between these CPU times indicates the scaling factor of your concurrent app. Note that real-world time can be misleading because of I/O overlap.
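As an illustration of that CPU-time vs real-world-time comparison, here is a rough sketch (assuming a POSIX system, where std::clock() reports process CPU time rather than wall time; the loop body is just an arbitrary compute-bound task and the thread count is hypothetical):

    #include <chrono>
    #include <cmath>
    #include <ctime>
    #include <iostream>
    #include <thread>
    #include <vector>

    static void burn() {                       // a purely compute-bound task
        volatile double x = 0.0;
        for (int i = 0; i < 50'000'000; ++i) x += std::sqrt(static_cast<double>(i));
    }

    int main() {
        const unsigned nthreads = 4;           // hypothetical thread count
        std::clock_t cpu0 = std::clock();      // process CPU time (POSIX)
        auto wall0 = std::chrono::steady_clock::now();

        std::vector<std::thread> workers;
        for (unsigned i = 0; i < nthreads; ++i) workers.emplace_back(burn);
        for (auto& t : workers) t.join();

        double cpu  = double(std::clock() - cpu0) / CLOCKS_PER_SEC;
        double wall = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - wall0).count();
        std::cout << "cpu time:  " << cpu  << " s\n"
                  << "wall time: " << wall << " s\n"
                  << "scaling:   " << cpu / wall << "x\n";
    }

On an otherwise idle machine with at least 4 free cores the ratio should approach 4; pinned to a single core it stays near 1.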

Threads vs cores when threads are asleep

I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, threads do not share memory, and each thread does CPU bound work only.
If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
A thread that is asleep only costs memory, not CPU time, while the thread is still asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you get to a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? If so, why?
Edit
When I say the thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use sleep_for, which "blocks the execution of the current thread for at least the specified sleep_duration".
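For completeness, a minimal example of that kind of sleeping thread (the duration here is arbitrary):

    #include <chrono>
    #include <iostream>
    #include <thread>

    int main() {
        std::thread sleeper([] {
            // The thread is blocked here and consumes no CPU time while it waits.
            std::this_thread::sleep_for(std::chrono::seconds(5));
            std::cout << "woke up\n";
        });
        sleeper.join();
    }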
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:
Managing the file system.
Managing physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
So:
1) If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) A thread that is asleep only costs memory, not CPU time, while the thread is still asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another one is that the memory used by the threads may also come at a cost. For instance, if the sum of the memory used by all processes (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging. And that also uses CPU resources.
5) In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you get to a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact on the CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the old thread's registers.
Flush the processor's memory cache.
Invalidate the VM mapping registers, etcetera. This includes the TLBs that #bazza mentioned.
Load the new thread's registers.
Take performance hits due to having to do more main-memory reads and VM page translations because of the previous cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound like much, but if your application switches threads rapidly, it can add up to many milliseconds in each second.
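If you want a feel for that number on your own machine, here is a rough sketch (my own construction, not a rigorous benchmark): two threads hand a "turn" flag back and forth through a condition variable, so every round trip forces two switches; the measured time also includes the mutex and wake-up overhead.

    #include <chrono>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    int main() {
        constexpr int rounds = 100000;
        std::mutex m;
        std::condition_variable cv;
        bool ping = true;                       // whose turn it is

        auto player = [&](bool my_turn_value) {
            for (int i = 0; i < rounds; ++i) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return ping == my_turn_value; });
                ping = !my_turn_value;          // hand the turn to the other thread
                cv.notify_one();
            }
        };

        auto start = std::chrono::steady_clock::now();
        std::thread a(player, true), b(player, false);
        a.join(); b.join();
        std::chrono::duration<double, std::nano> elapsed =
            std::chrono::steady_clock::now() - start;
        std::cout << "~" << elapsed.count() / (2.0 * rounds)
                  << " ns per switch (includes lock/wake overhead)\n";
    }

Pinning both threads to a single core (e.g. taskset -c 0 ./a.out) turns every handoff into a genuine context switch rather than a cross-core wakeup.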
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed). In that case the CPU would still be 100% busy. Platforms such as MS-DOS worked this way, but any multitasking OS has had a proper implementation for decades: nowadays a sleep() actually results in the thread being descheduled for the requisite time.
10,000 sleeping threads will take up more CPU time, because the OS has to make scheduling judgements every timeslice tick (every 60ms, or thereabouts). The more threads it has to check for being ready to run, the more CPU time that checking takes.
Translation Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translation Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they are effectively content-addressable memory. This is really hard to implement, so there's never that much of it. Thus, the more memory allocations there are (which there will be if you add more and more threads), the more this resource is eaten up, to the point where the OS may have to start swapping in and out different loadings of the TLB in order for all the virtual memory allocations to be accessible. If this starts happening, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and hence everything else built on top, e.g. Java, C#, the lot) are actually quite careful in how requests for virtual memory are managed, minimising the times they actually have to ask the OS for more virtual memory. Basically they seek to satisfy requested allocations out of pools they've already got, rather than each malloc() resulting in a call to the OS. This takes the pressure off the TLBs.
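To make the pooling idea concrete, here is a toy sketch (nothing like the real glibc allocator, just the principle): reserve one big block up front, then satisfy many small allocations from it without going back to the system, so the number of distinct requests for memory stays low.

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Toy bump allocator: one large up-front reservation (here via malloc; a
    // real allocator would use mmap or sbrk), then cheap hand-outs from it.
    class BumpPool {
    public:
        explicit BumpPool(std::size_t bytes)
            : base_(static_cast<char*>(std::malloc(bytes))), size_(bytes), used_(0) {
            if (!base_) throw std::bad_alloc();
        }
        ~BumpPool() { std::free(base_); }

        // Hand out memory from the pre-reserved block; no system call here.
        void* allocate(std::size_t bytes) {
            constexpr std::size_t align = alignof(std::max_align_t);
            std::size_t offset = (used_ + align - 1) / align * align;
            if (offset + bytes > size_) return nullptr;   // pool exhausted
            used_ = offset + bytes;
            return base_ + offset;
        }

    private:
        char*       base_;
        std::size_t size_;
        std::size_t used_;
    };

    int main() {
        BumpPool pool(1 << 20);                 // one 1 MiB reservation up front
        int*    a = static_cast<int*>(pool.allocate(sizeof(int) * 100));
        double* b = static_cast<double*>(pool.allocate(sizeof(double) * 100));
        (void)a; (void)b;                        // many "allocations", one reservation
    }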

Context switch: what happens in a worst case scenario?

I want to understand how a certain worst case scenario of context switch happens. Say I have 10 CPU cores running a single process. Everything is CPU intensive, no thread is sleeping (waiting for I/O).
(I am mainly concerned with mainstream modern personal computer architectures and systems, typically x64 with Windows, Linux...)
Correct me if I'm wrong: running 10 CPU/RAM-intensive independent threads is most often a near-optimal situation. The amount of time spent in context switches is rather negligible. While the system may sometimes decide to reassign threads to different cores in a round-robin fashion, causing the per-core caches to go cold, the effect is minor and it works almost as if each thread were running on a single fixed core.
Only the main RAM bus may be a limitation since all threads share it, but it's not the point I'm interested in here. Reducing the number of threads will not increase the throughput anyway.
Now assume you still have 10 cores but run 1000 threads. The scheduler could theoretically decide to switch rarely (say every second) running 10 threads for a second, then 10 others... and the whole thing would still be close to optimal performance (throughput).
But that does not seem to be the case, and it looks like threads are switched intensively, causing strongly suboptimal performance (throughput). Am I right about this? What is the main cause of this suboptimal performance? A few numbers would be nice if you have any idea of the orders of magnitude involved (for example: switches per second, performance loss caused by switching...).
I'm going to answer my own question (after some search).
On windows, the number of context switches can be measured with performance counters: https://technet.microsoft.com/en-us/library/cc938606.aspx
I measured it on my machine (Core i7 / Windows 10) and the order of magnitude is around 1000 switches per second per core when the number of running threads is greater than the number of cores (and these threads are fully CPU bound).
The time needed for a context switch varies quite a bit depending on:
what registers need to be saved
if FPU registers need to be saved
the processor model (of course)
You can read: https://www.quora.com/How-long-does-a-context-switch-take or http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
A slightly pessimistic average order of magnitude seems to be 1000 ns per switch. Thus the total time spent on context switches on each core is about 1 ms per second, that is 0.1%.
This does not depend on the number of threads: whether you run 100 or 1000 threads, the number of switches does not change. As a conclusion, the time spent in context switching itself is somewhat negligible.
This reasoning is correct as long as the threads are pure CPU with only small memory reads/writes, like a few local variables. I ran a test with fully CPU-bound threads and the difference between a few threads and 1000 threads is not noticeable.
But the situation changes when RAM is involved and the switches make the CPU (memory) caches less efficient. A worst case is when:
computation can be split into 1000 independent "data" parts
each part of the data fits just into the memory cache (say L1 or L2) of a core
each part needs to be read many times
In this situation, running 10 threads to completion, then ten others, and so on, would take full advantage of the cache, while running 1000 threads at a time would cause the cache to be useful only for about 1 ms at a time.
But if the data of several threads could fit into the cache, or if the threads read common data to some degree, or if each thread reads the data just once, then it is possible that running 1000 threads vs. running 10 threads a hundred times will have similar throughput.
It is more a matter of adapting the parallelism to the memory access pattern, and it depends very much on the way memory needs to be accessed.
The time spent in context switching is negligible; the time lost because of poor cache usage may sometimes be a problem and sometimes not, depending on how the memory is accessed and shared.
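As a hypothetical sketch of the cache-friendly schedule (the names and sizes are mine, purely illustrative): split the work into parts small enough to fit in a core's cache and run them in batches of one thread per core, finishing a batch before starting the next, instead of having 1000 threads time-slice against each other.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    constexpr std::size_t kParts    = 1000;
    constexpr std::size_t kPartSize = 32 * 1024;   // ~32 KiB: sized to fit in L1/L2
    constexpr int kPasses           = 100;         // each part is read many times

    static void process(std::vector<float>& part) {
        float acc = 0.f;
        for (int pass = 0; pass < kPasses; ++pass)
            for (float v : part) acc += v;          // stays cache-resident after pass 0
        part[0] = acc;                              // keep the result so the work isn't optimized away
    }

    int main() {
        std::vector<std::vector<float>> parts(
            kParts, std::vector<float>(kPartSize / sizeof(float), 1.f));
        const std::size_t batch =
            std::max<std::size_t>(1, std::thread::hardware_concurrency());

        for (std::size_t first = 0; first < kParts; first += batch) {
            std::vector<std::thread> workers;
            for (std::size_t i = first; i < std::min(first + batch, kParts); ++i)
                workers.emplace_back(process, std::ref(parts[i]));
            for (auto& t : workers) t.join();       // finish a batch before starting the next
        }
    }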

Process & thread scheduling overhead

There are a few things I don't quite understand when it come to scheduling:
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
Does the OS schedule by process or by an individual kernel thread? It would make more sense to schedule each process and, within that time window, run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
If I'm reading a socket (blocking), my thread will get swapped out until data is available; then a hardware interrupt will be triggered and the data will be moved to RAM (either by the CPU or by the NIC if it supports DMA). Am I correct to assume that the thread will not necessarily be swapped back in at that point to handle the incoming data?
I'm asking primarily about Linux, but I would imagine the info would also be applicable to Windows as well.
I realize it's a bunch of different questions, I'm trying to clear up my understanding on this topic.
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
No. Pretty much all modern operating systems use pre-emption, allowing interactive processes that suddenly need to do work (because the user hit a key, data was read from the disk, or a network packet was received) to interrupt CPU bound tasks.
Does the OS schedule by process or by an individual kernel thread? It would make more sense to schedule each process and, within that time window, run whatever threads that process has available. That way the process context switching is minimized. Is my understanding correct?
That's a complex optimization decision. The cost of blowing out the instruction and data caches is typically large compared to the cost of changing the address space, so this isn't as significant as you might think. Typically, picking which thread to schedule of all the ready-to-run threads is done first and process stickiness may be an optimization affecting which core to schedule on.
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
Obviously, a thread gets to run through a very large number of RAM accesses before being switched out, because switching threads itself requires a large number of such accesses. Hard drive and network I/O are generally slow enough that a thread that's waiting for such a thing is descheduled.
Fast SSDs change things a bit. One thing I'm seeing a lot of lately is that long-treasured optimizations that use a lot of CPU to try to avoid disk accesses can be worse than just doing the disk access on some modern machines!

How can I utilize multithread CPU most in Matlab?

I just bought the Matlab Parallel Computing toolbox.
The command matlabpool open opens parallel workers with the number of the cores in my CPU.
But each of my CPU cores has two threads. According to Windows Task Manager, each worker can only use half the performance of one CPU core, which seems to mean one worker = one thread = "half a core".
Therefore, after all workers are opened, only half of the total power of the CPU can be utilized.
Is there any other command that could help with that?
By default, the local cluster type for matlabpool considers only "real" cores when choosing the default number of workers to launch. This is because for MATLAB workloads, hyperthreading often does not provide much benefit. However, this value is only a default - you can edit the cluster type and run anything up to 12 local workers.
You need to understand HyperThreading to answer this question.
Matlab launches a worker thread for every CPU. Suppose you now use a directive like parfor to distribute computation over multiple threads. Every thread will now be crunching numbers happily.
Suppose you are doing a sum of a large vector of numbers. What actually happens is the following:
sum = sum + a[0]
(array a is not in the CPU cache yet, so a small part of a is fetched from main memory and put into the CPU cache)
sum = sum + a[1]
sum = sum + a[2]
...
During the fetch of a, the CPU stalls, waiting for the system memory. This is called a pipeline bubble, and it is not good for performance. Sometimes part of the array a has been swapped out to the hard drive. The operating system then needs to access the drive to bring that part back into main memory, after which it is transferred to the CPU cache. When this happens, your operating system will not let the CPU wait for 200+ ms. It will use that time to execute another task instead (like the backup running on your system, or refreshing your screen, or ...).
Switching tasks on a CPU results in a performance penalty. To switch to a different task, the operating system must save the CPU registers in main memory, and load the CPU registers of the other task back into the CPU first. This takes time.
With HyperThreading, the number of registers per CPU is doubled. This means that two processes can 'occupy' the CPU. Only one can be executed, but during a stall, the operating system can switch to the second process without any performance penalty.
Forget how Microsoft Windows reports CPU usage. It's wrong. CPU usage is a lot more complicated than a single simple 47% figure. The real question is rather: should Matlab register two threads per core, or only one?
Arguments pro:
During a stall, the CPU can quickly switch to the other thread and continue executing.
Arguments contra:
There are more threads, and the problem is divided into smaller pieces. This may actually reduce performance, as you need to put more pieces together to get the final result.
A context switch will still 'poison' the L1 and L2 cache, loading in pieces of memory that are of no use to the other thread on the CPU.
If there are no stalls, you have more overhead.
On a desktop, the operating system will also want to run: redrawing the screen, moving your mouse, etc. When all logical CPUs are in use, the operating system is forced to do an actual (costly) context switch.
Your problem will only be complete if all pieces of the problem have been calculated. Using all the cores / threads increases the risk of one thread taking more time.
My guess is that the Matlab developers considered the arguments contra to be more important than the arguments pro. My own performance tests certainly suggest that there is little performance gain from HyperThreading for cpu-intensive calculations.

Resources