CPU time jumps a lot in a virtual machine - Linux

I have a C++ program running with 20 threads (boost threads) on one of the RHEL 6.5 systems virtualized on a Dell server. The result is deterministic, but the CPU time and wall time vary a lot between runs. Sometimes it takes 200s of CPU time to finish; sometimes it takes up to 300s. This bothers me, as performance is a criterion for our testing.
I've replaced the boost::timer::cpu_timer originally used for wall/CPU time measurement with the system APIs clock_gettime and getrusage. It doesn't help.
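For reference, the measurement boils down to something like the sketch below (a minimal illustration of the clock_gettime/getrusage approach, not the exact code; getrusage(RUSAGE_SELF) sums CPU time across all threads of the process):

    #include <cstdio>
    #include <ctime>
    #include <sys/resource.h>

    // Wall time from the monotonic clock, CPU time (user + sys) from getrusage.
    static double wall_seconds() {
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    static double cpu_seconds() {
        rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) +
               (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
    }

    int main() {
        double w0 = wall_seconds(), c0 = cpu_seconds();
        // ... run the 20-thread workload here ...
        std::printf("wall: %.3fs  cpu: %.3fs\n",
                    wall_seconds() - w0, cpu_seconds() - c0);
    }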
Is it because of 'steal time' taken by the hypervisor (VMware)? Is steal time included in the user/sys time collected by getrusage?
Does anyone have knowledge of this? Many thanks.

It would be useful if you provided some extra information. For example, are your threads dependent on one another, i.e. is there any synchronization going on among them?
Since you are using a virtual machine, how is your CPU shared with other users of the server? It might be that even a single CPU core is shared, so you don't get the same allocation of CPU resources on every run [this is the steal time you mention above].
You also mention that the CPU time differs: this is the time spent in user code. If there is synchronization among the threads (such as a mutex), then depending on how the operating system wakes threads up, the overall time may vary.
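On a Linux guest, steal time (when the hypervisor reports it to the guest; not all hypervisors do) shows up as the 8th field of the cpu line in /proc/stat. A minimal sketch to read it; sampling it twice over an interval shows how much CPU the host took away in between:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Read the aggregate "cpu" line from /proc/stat and print the steal field
    // (8th value, in clock ticks). Compare two samples taken a few seconds
    // apart to see the steal rate.
    int main() {
        std::ifstream stat("/proc/stat");
        std::string label;
        long long user, niced, sys, idle, iowait, irq, softirq, steal;
        stat >> label >> user >> niced >> sys >> idle >> iowait
             >> irq >> softirq >> steal;
        std::cout << "steal ticks so far: " << steal << '\n';
    }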

Related

Why would a process use over twice as many CPU resources on a machine with double the cores?

I'm hoping someone could point me in the right direction here so I can learn more about this.
We have a process running on our iMac Pro 8-core machine which utilises ~78% CPU.
When we run the same process on our new Mac Pro 16-core machine it utilises ~176% CPU.
What reasons could there be for this? We were hoping the extra cores would allow us to run more processes simultaneously; however, if each uses over double the CPU resources, surely that means we will be able to run fewer processes on the new machine?
There must be something obvious I'm missing about the architecture. Could someone please help? I understand I haven't included any code examples; I'm asking in a more general sense about scenarios that could lead to this.
I suspect that the CPU thread manager tries to use as much CPU as it can/needs. If there are more processes needing CPU time, then the cycles will be shared out more sparingly to each. Presumably your task runs correspondingly faster on the new Mac?
The higher CPU utilization just indicates that it's able to make use of more hardware. That's fine. You should expect it to use that hardware for a shorter period, and so more things should get done in the same overall time.
As to why, it completely depends on the code. Some code decides how many threads to use based on the number of cores. If there are non-CPU bottlenecks (the hard drive or GPU for example), then a faster system may allow the process to spend more time computing and less time waiting for non-CPU resources, which will show up as higher CPU utilization, and also faster throughput.
If your actual goal is to have more processes rather than more throughput (which may be a very reasonable goal), then you will need to tune the process to use fewer resources even when they are available. How you do that completely depends on the code. Whether you even need to do that will depend on testing how the system behaves when there is contention between many processes. In many systems it will take care of itself. If there's no problem, there's no problem. A higher or lower CPU utilization number is not in itself a problem. It depends on the system, where your bottlenecks are, and what you're trying to optimize for.
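As a concrete illustration of the "code sizes itself from the core count" case, a minimal sketch (illustrative only, not necessarily what your process does):

    #include <algorithm>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        // Many libraries do something like this: one worker per reported core.
        unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        std::cout << "spawning " << workers << " workers\n";

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < workers; ++i)
            pool.emplace_back([] { /* a chunk of the workload */ });
        for (auto& t : pool) t.join();
    }

On a machine with twice the cores this spawns twice the workers, so the reported CPU percentage roughly doubles even though each job typically finishes sooner.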

CPU percentage and heavy multi-threading

I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing their (relatively short) computation and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 hardware threads each; it's an i7-3612QE), only 8 threads can really run at a time, so many threads will have to wait. Also, some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: if I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks at the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process onto a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses fewer CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
I essentially answered my own question in the last paragraph: the process is memory bound. Hence the limiting resource is not the CPU but memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel mainly means that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just rather slowly. All my other observations are consistent with this.
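A toy experiment makes the effect visible (a sketch, not the original application): run it with different thread counts, or under taskset, and compare the %CPU shown in top against the reported throughput. Once memory bandwidth is saturated, extra threads tend to raise the CPU figure far more than the passes per second.

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    // Toy memory-bound workload: every thread streams over a buffer much
    // larger than the last-level cache. Beyond the point where memory
    // bandwidth is saturated, adding threads mostly adds CPUs that wait on
    // RAM, so reported CPU time rises while throughput barely moves.
    int main(int argc, char** argv) {
        const int threads = argc > 1 ? std::atoi(argv[1]) : 1;
        std::vector<std::uint64_t> buf(1u << 25);   // 32M values = 256 MiB
        std::atomic<std::uint64_t> passes{0};

        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back([&] {
                std::uint64_t sum = 0;
                for (int p = 0; p < 8; ++p) {
                    for (std::uint64_t v : buf) sum += v;
                    passes.fetch_add(1, std::memory_order_relaxed);
                }
                volatile std::uint64_t sink = sum;  // keep the loop from being optimized away
                (void)sink;
            });
        for (auto& t : pool) t.join();

        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%d threads: %.2f passes/s\n", threads, passes.load() / secs);
    }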

Process & thread scheduling overhead

There are a few things I don't quite understand when it come to scheduling:
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
Does the OS schedule by process or by an individual kernel thread? It would make more sense to schedule each process and, within that time window, run whatever threads that process has available. That way process context switching is minimized. Is my understanding correct?
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
If I'm reading a socket (blocking), my thread will get swapped out until data is available, then a hardware interrupt will be triggered and the data will be moved to RAM (either by the CPU or by the NIC if it supports DMA). Am I correct to assume that the thread will not necessarily be swapped back in at that point to handle the incoming data?
I'm asking primarily about Linux, but I would imagine the info is applicable to Windows as well.
I realize it's a bunch of different questions, I'm trying to clear up my understanding on this topic.
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
No. Pretty much all modern operating systems use pre-emption, allowing interactive processes that suddenly need to do work (because the user hit a key, data was read from the disk, or a network packet was received) to interrupt CPU bound tasks.
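One way to observe that pre-emption is to count context switches for a CPU-bound loop (a small sketch using the voluntary/involuntary counters that getrusage reports):

    #include <chrono>
    #include <cstdio>
    #include <sys/resource.h>

    // Spin on pure CPU work for a few seconds, then report how often the
    // kernel pre-empted us (involuntary context switches) versus how often
    // we gave up the CPU ourselves (voluntary ones).
    int main() {
        auto end = std::chrono::steady_clock::now() + std::chrono::seconds(3);
        volatile unsigned long x = 0;
        while (std::chrono::steady_clock::now() < end) x = x + 1;  // CPU-bound busy loop

        rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        std::printf("voluntary: %ld  involuntary: %ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
    }

Dividing the loop's CPU time by the involuntary count gives a rough feel for the effective slice length on a loaded system.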
Does the OS schedule by process or by an individual kernel thread? It would make more sense to schedule each process and, within that time window, run whatever threads that process has available. That way process context switching is minimized. Is my understanding correct?
That's a complex optimization decision. The cost of blowing out the instruction and data caches is typically large compared to the cost of changing the address space, so this isn't as significant as you might think. Typically, picking which thread to schedule of all the ready-to-run threads is done first and process stickiness may be an optimization affecting which core to schedule on.
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
Obviously, a thread gets to run through a very large number of RAM accesses before being switched out, because switching threads itself requires a large number of such accesses. Hard drive and network I/O are generally slow enough that a thread waiting for them is descheduled.
Fast SSDs change things a bit. One thing I'm seeing a lot of lately is that long-treasured optimizations which burn a lot of CPU trying to avoid disk accesses can be worse than just doing the disk access on some modern machines!

Considerate, dynamic CPU load management

I am writing a CPU-intensive image processing library. To make best use of the available CPU, I can detect the total number of cores on my machine and have my library run with that number of threads. When my library allocates one thread for each core, it performs optimally, using 100% of the available processor time.
The above approach works fine when mine is the only CPU-heavy process running. If another CPU-intensive process is running, or even another instance of my own code, then the OS allocates us only a fraction of the available cores, and my library then has too many threads running, which is both inefficient and inconsiderate to other processes.
So I would like to find a way to determine the "fair share" number of threads to run given a specific load. For example, if two instances of my process are running on an 8-core machine, each would run with 4 threads. Each would need a way to adapt thread count dynamically according to fluctuations in machine load.
So, my question:
Is there any OS feature or third-party library which allows my process to adapt thread count dynamically to use its fair share of the CPU?
My focus is Windows, but I'm interested in non-Windows solutions too.
Edit: to be clear, this is about optimization. I am trying to achieve peak efficiency by running the optimal number of threads appropriate to my fair share of the CPU.
In my eyes, the application shouldn't decide how many threads to spawn. That is information the caller should provide. On Linux, the "-j" or "--jobs" parameter is widely used (default: 1).
What about also setting the priority of the processing tasks? If the caller knows the processing is mission-critical, they can increase the priority (accepting that this may block the whole system). Your processing library can never know how important the processing of a given image is.
If the caller doesn't care, then the default low priority is used, which shouldn't affect the rest of the system. If it does, you should look at what exactly is blocking the system (maybe writing image files to the HDD, or reducing RAM usage to prevent swapping, ...). Once you have figured that out, you can optimize exactly that point.
If you start the processing with (cpu-cores)*2 threads at low to normal priority, your system should remain usable. No one would expect that to kill the system.
Just my 2 cents.
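For the non-Windows side, the suggestion above might look roughly like the sketch below (illustrative, not library code; on Windows, SetPriorityClass/SetThreadPriority would play the same role):

    #include <algorithm>
    #include <thread>
    #include <unistd.h>   // nice()
    #include <vector>

    // Drop the whole process to a lower priority, then start (cores * 2)
    // workers. The factor and the priority level are policy decisions that
    // belong to the caller, not the library.
    int main() {
        (void)nice(10);   // lower priority (POSIX); the ambiguous return value is ignored here

        unsigned workers = std::max(1u, std::thread::hardware_concurrency()) * 2;
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < workers; ++i)
            pool.emplace_back([] { /* process one image tile */ });
        for (auto& t : pool) t.join();
    }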
Actually it's not a problem of multithreading but a problem of running many programs simultaneously. This is hard on most PC operating systems because it conflicts with the idea of time-sharing.
Let's assume a workflow.
Suppose we have 8 cores and we create 8 threads to feed them; OK, that's easy. Next we choose to monitor core load to estimate how many tasks are running on each core; well, that needs some statistical assumptions, e.g. on Linux you can get 1/5/15-minute load averages, but it can be done. The statistics are clear, and now we have a picture of how many CPU-bound processes are running; say we see 3 other CPU-intensive processes.
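On Linux and the BSDs those load averages are available directly from code via getloadavg; a minimal sketch:

    #include <cstdio>
    #include <cstdlib>   // getloadavg() on glibc and the BSDs

    // Sample the 1/5/15-minute load averages. Comparing the 1-minute value
    // against the core count gives a rough estimate of how many other
    // CPU-bound tasks are competing for the machine.
    int main() {
        double load[3];
        if (getloadavg(load, 3) == 3)
            std::printf("load averages: %.2f %.2f %.2f\n", load[0], load[1], load[2]);
        else
            std::fprintf(stderr, "getloadavg failed\n");
    }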
Then we come to the point: we have to put 3 redundant threads to sleep, but which 3?
Usually we choose 3 threads arbitrarily, because the scheduler arranges the other 8 CPU-bound threads automatically. In some cases we explicitly put the threads on highly loaded cores to sleep, assign other threads to certain lightly loaded cores, and let the scheduler do the rest. Most scheduling policies also try to "keep the CPU cache hot", which means they tend to avoid migrating threads between cores. We can reasonably expect our CPU-intensive threads to keep using their core caches, since the other processes are scheduled onto the 3 crowded cores. Everything looks good.
However, this can fail in tightly synchronized computation. In that scenario we need to run our 5 threads simultaneously, meaning the 5 threads have to get CPUs and run at almost the same time. I don't know of any scheduler on a PC that can guarantee this for us. In most low-load cases things still work fine, because the cost of waiting for simultaneity is trivial. But when the load on a core is high and even 1 of our 5 threads is disturbed, we will occasionally find ourselves spending many cycles waiting.
It may help to schedule your program as a real-time program, but it's not a perfect solution. Statistically it leads to a wider time window for simultaneity, since it gains higher priority for the CPU. I have to say, it's not guaranteed.

Processor Utilization by threads spawned by a program

I have a Java program which spawns multiple threads, say 10-20 threads. This program is scheduled to run on a machine that has 32 processors.
I am keen to know if all the processors' power would be utilized by these threads.
Solaris is the environment; does that make any difference?
A good profiler should tell you this. If the threads are compute bound, then yes you will use as many cores as you have threads, if you are blocked doing I/O or on contention it will be less than that.
Given that you aren't on Windows, the following doesn't apply, but a decent profiler should still be able to provide a measurement of CPU cycles burned by your process for a given period of time...
If you are on Windows a good free tool to use is the Windows Performance Toolkit (xperf) which is now part of the platform sdk. It will show you the processor cycles burned for each thread or process for a period of time (as opposed to just elapsed times).
