Discrepancy in Linux time command output

I am aware that the output of the time command can show greater time under the user section than the real section for multi-processor cases, but recently, I was trying to profile a program when I saw that real was substantially greater than user + sys.
$ time ./test.o
real 0m5.576s
user 0m1.270s
sys 0m0.540s
Can anybody explain why this happens?

That's the normal behavior.
"Real" is the wall-clock time. In your example, it literally took 5.576 seconds to run ./test.o.
'user' is the User CPU time, or (roughly) the CPU time used by user-space processes. This is essentially the time your CPU spent actually executing ./test.o: 1.270 seconds.
And finally, 'sys' is the System CPU time, or (roughly) the CPU time used by your kernel: 0.540 seconds.
If you add sys + user, you get the amount of time your CPU had to spend executing the program.
real - (user + sys) is, then, the time spent not running your program. 3.766 seconds were spent between invocation and termination not running your program--probably waiting for the CPU to finish running other programs, waiting on disk I/O, etc.
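If you want to see that split in isolation, time a command that does nothing but sleep; the exact figures below are only illustrative, but on a reasonably idle machine the output will look roughly like this:
$ time sleep 3
real 0m3.004s
user 0m0.001s
sys 0m0.002s
Almost the entire real time is spent blocked, so neither user nor sys accumulates.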

Time your process spends sleeping (e.g., waiting for I/O) is not counted by either "user" or "system", but "real" time still elapses.
Try:
time cat
...then wait 10 seconds and hit ctrl-D.

There are at least two possibilities:
The system is busy with other competing processes
The program is sleeping a lot, or doing other operations which cause it to wait, such as I/O (waiting for user input, disk I/O, network I/O, etc.)

Related

cpu time jumps a lot in virtual machine

I have a C++ program running with 20 threads (boost threads) on a RHEL 6.5 system virtualized on a Dell server. The result is deterministic, but the CPU time and wall time vary a lot between runs. Sometimes it takes 200s of CPU time to finish, sometimes up to 300s. This bothers me, as performance is a criterion for our testing.
I've replaced the boost::timer::cpu_timer originally used for wall/CPU time calculation with the system APIs clock_gettime and getrusage. It doesn't help.
Is it because of the 'steal time' taken by the hypervisor (VMware)? Is steal time included in the user/sys time collected by getrusage?
Does anyone have knowledge on this? Many thanks.
It would be useful if you provided some extra information. For example, are your threads dependent? That is, is there any synchronization going on among them?
Since you are using a virtual machine, how is your CPU shared with other users of the server? It might be that even a single CPU core is shared, so you do not get the same allocation of CPU resources on every run [this is the steal time you mention above].
You also mention that the CPU time differs: this is the time spent in user code. If you have synchronization among threads (such as a mutex), then depending on how the operating system wakes threads up, the overall time might vary.
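If you suspect steal time, one quick sanity check (assuming a Linux guest, where /proc/stat exposes a steal counter) is to watch the aggregate cpu line; the eighth number after the "cpu" label is steal time in clock ticks:
$ grep '^cpu ' /proc/stat
# fields after "cpu": user nice system idle iowait irq softirq steal ...
# take two samples a few seconds apart; if the steal counter grows quickly,
# the hypervisor is withholding CPU from the guest during that interval
top's %st column shows the same information as a percentage.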

Which 'time' is used for the timeout by the subprocess module on UNIX/Linux OSes?

Which time measurement is used for the timeout by the Python 3 subprocess module on UNIX/Linux OSes?
UNIX-like OSes report 3 different times for process execution: real, user, and system. Even with processes that are alive for only a few milliseconds, the real time is often several hundred percent longer than the user and system time.
I'm making calls using subprocess.call() and subprocess.check_output() with the timeout set to a quarter of a second for processes that the time utility reports taking 2-18 milliseconds for the various times reported. There is no problem and my enquiry is purely out of interest.
This is wall-clock time (real), not time spent in either userland (user) or the kernel (system).
You can test this yourself by running a process such as sleep 60, which uses almost no user or system time at all, and observing that it still times out.
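A shell-level illustration of the same point (the numbers are approximate): sleep accumulates essentially no user or system time, so a timeout based on CPU time would never fire for it, while a wall-clock timeout does:
$ time sleep 60
real 1m0.004s
user 0m0.001s
sys 0m0.002s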

Understanding software parallelization on a linux workstation

Summary
I am trying to understand the limits of my compute resources when performing multiple simulations. My task is trivial in terms of parallelisation - I need to run a large number of simple independent simulations, i.e. each simulation program does not rely on another for information. Each simulation has roughly the same running time. For this purpose I have created an experiment that is detailed below.
Details
I have two shell scripts located in the same directory.
First script called simple:
#!/bin/bash
# Simple Script
echo "Running sleep with arg= $1 "
sleep 5s
echo "Finished sleeping with arg= $1"
Second script called runall:
#!/bin/bash
export PATH="$PATH:./"
# Fork off a new process for each program by running in background
# Run N processes at a time and wait until all of them have finished
# before executing the next batch. This is sub-optimal if the running
# time of each process varies significantly.
# Note: if the total number of processes is not divisible by the allotted pool size, something weird happens
echo "Executing runall script..."
for ARG in $(seq 600); do
    simple $ARG &
    NPROC=$(($NPROC+1))
    if [ "$NPROC" -ge 300 ]; then
        wait
        echo "New batch"
        NPROC=0
    fi
done
Here are some specs on my computer (Mac OS X):
$ ulimit -u
709
$ sysctl hw.ncpu
hw.ncpu: 8
$ sysctl hw.physicalcpu
hw.physicalcpu: 4
From this I interpret that I have 709 processes at my disposal and 8 processor cores available.
However when I execute $ ./runall I eventually end up with:
...
Running sleep with arg= 253
Running sleep with arg= 254
Running sleep with arg= 255
Running sleep with arg= 256
Running sleep with arg= 257
Running sleep with arg= 258
./runall: fork: Resource temporarily unavailable
Running sleep with arg= 259
./simple: fork: Resource temporarily unavailable
Running sleep with arg= 260
$ Running sleep with arg= 261
Finished sleeping with arg= 5
Finished sleeping with arg= 7
Finished sleeping with arg= 4
Finished sleeping with arg= 8
Finished sleeping with arg= 3
...
SO:
Question 1
Does this mean that out of the 709 processes available, only 258 can be dedicated to my runall program, with the rest probably being used by other processes on my computer?
Question 2
I substituted the simple script with something that does more than just sleep (it reads a file and processes the data in it to create a graph), and now I start to notice some differences. Using $ time ./runall I can get the total run time. Before, when calling simple with up to 258 processes, I always got a run time of about 5s:
real 0m5.071s
user 0m0.184s
sys 0m0.263s
i.e., running many simulations in parallel gave the same runtime as a single simulation. However, now that I am calling a more complex program instead of simple, I get a longer total run time than the single-simulation time (a single simulation takes 1.5s, whereas 20 simulations in parallel take about 8.5s). How do I explain this behavior?
Question 3
I'm not sure how the number of processor cores is related to parallel performance. Since I have 8 cores at my disposal, I thought I would be able to run 8 programs in parallel in the same time it would take to run just one. I'm not sure about my reasoning on this...
If you have 8 CPU threads available and your programs each consume 100% of a single CPU, it does not make sense to run more than 8 programs at a time.
If your programs are multi-threaded, then you may want to have fewer than 8 processes running at a time. If your programs occasionally use less than 100% of a single CPU (perhaps if they're waiting for IO), then you may want to run more than 8 processes at a time.
Even if the process limit for your user is extremely high, other resources could be exhausted much sooner - for instance, RAM. If you launch 200 processes and they exhaust RAM, then the operating system will respond by satisfying requests for RAM by swapping out some other process's RAM to disk; and now the computer needlessly crawls to a halt because 200 processes are waiting on IO to get their memory back from disk, only to have it be written out again because some other process wants to run. This is called thrashing.
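If you want to check whether a batch is pushing the machine into swapping while it runs, watching the swap-in/swap-out columns is usually enough; a rough sketch (vmstat is the Linux tool, macOS has vm_stat instead):
$ vmstat 1
# the si and so columns report pages swapped in and out per second;
# sustained non-zero values while your batch runs are a sign of thrashing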
If your goal is to perform some batch computation, it does not make sense to load the computer any more than enough processes to keep all CPU cores at 100% utilization. Anything more is waste.
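One simple way to keep the load at that level from the shell is to let xargs cap the number of concurrent jobs instead of forking everything at once; a minimal sketch, assuming the simple script from the question is executable and on your PATH (runall already adds ./ to PATH):
# run the 600 jobs with at most 8 running at any one time
$ seq 600 | xargs -n 1 -P 8 simple
Unlike the batch-and-wait loop in runall, this starts a new job as soon as any running one finishes, so slow stragglers don't hold up a whole batch.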
Edit - Clarification on terminology.
A single computer can have more than one CPU socket.
A single CPU can have more than one CPU core.
A single CPU core can support simultaneous execution of more than one stream of instructions. Hyperthreading is an example of this.
A stream of instructions is what we typically call a "thread", either in the context of the operating system, processes, or in the CPU.
So I could have a computer with 2 sockets, with each socket containing a 4-core CPU, where each of those CPUs supports hyperthreading and thus supports two threads per core.
Such a computer could execute 2 * 4 * 2 = 16 threads simultaneously.
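On a Linux workstation you can read these counts straight from lscpu (the sysctl calls in the question are the macOS equivalent); for the hypothetical machine above the relevant lines would look like:
$ lscpu | grep -E 'Socket|Core|Thread'
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2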
A single process can have as many threads as it wants, until some resources is exhausted - raw RAM, internal operating system data structures, etc. Each process has at least one thread.
It's important to note that tricks like hyperthreading may not scale performance linearly. A non-hyperthreaded CPU core contains enough parts to execute a single stream of instructions all by itself; aside from memory access, it doesn't share anything with the other cores, and so performance can scale linearly.
However, each core has a lot of parts, and during some types of computation some of those parts are inactive while others are active; during other types of computation it could be the opposite. Doing a lot of floating-point math? Then the integer math unit in the core might be idle. Doing a lot of integer math? Then the floating-point math unit might be idle.
Hyperthreading seeks to increase performance, even if only a little bit, by exploiting these temporarily unused units within a core; while the floating-point unit is busy, schedule something that can use the integer unit.
...
What matters to the operating system when it comes to scheduling is how many threads across all processes are runnable. If I have one process with 3 runnable threads, a second process with one runnable thread, and a third process with 10 runnable threads, then the OS will want to run a total of 3 + 1 + 10 = 14 threads.
If there are more runnable program threads than there are CPU execution threads, then the operating system will run as many as it can, and the others will sit there doing nothing, waiting. Meanwhile, those programs and those threads may have allocated a bunch of memory.
Let's say I have a computer with 128 GB of RAM and CPU resources such that the hardware can execute a total of 16 threads at the same time. I have a program that uses 2 GB of memory to perform a simple simulation, that program creates only one thread for its execution, and each instance needs 100 s of CPU time to finish. What would happen if I were to try to run 16 instances of that program at the same time?
Each program would allocate 2 GB of RAM, 16 * 2 GB = 32 GB in total, to hold its state, and then begin performing its calculations. Since each program creates a single thread, and there are 16 CPU execution threads available, every program can run on the CPU without competing for CPU time. The total time we'd need to wait for the whole batch to finish would be 100 s: 16 processes / 16 CPU execution threads * 100 s.
Now what if I increase that to 32 programs running at the same time? Well, we'll allocate a total of 64GB of RAM, and at any one point in time, only 16 of them will be running. This is fine, nothing bad will happen because we've not exhausted RAM (and presumably any other resource), and the programs will all run efficiently and eventually finish. Runtime will be approximately twice as long at 200s.
Ok, now what happens if we try to run 128 programs at the same time? We'll run out of memory: 128 * 2 GB = 256 GB of RAM, more than double what the hardware has. The operating system will respond by swapping memory out to disk and reading it back in as needed, but it'll have to do this very frequently, and it'll have to wait for the disk.
If you had enough ram, this would run in 800s (128 / 16 * 100). Since you don't, it's very possible it could take an order of magnitude longer.
Your questions are a little confusing. But here's an attempt to explain some of it:
Question 1 Does this mean that out of the 709 processes available, only 258 can be dedicated to my runall program, the rest remaining probably being used by other processes on my computer?
As the ulimit manpage explains, -u tells you how many processes you can start as a user. As you know, every process on Unix has a uid (there are some nitty-gritty details here like euid, setuid, etc.) which refers to the user on the system that owns that process. What -u tells you is the number of processes you (since you are logged in and executing the ulimit command) can start and simultaneously run on the computer. Note that once a process with pid p exits, the OS is free to recycle that number p for some other process.
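To see how much of that per-user allowance is already consumed before runall even starts, you can compare the limit against the processes your user currently owns; a rough check (the exact ps flags vary a little between Linux and macOS):
$ ulimit -u
$ ps -U "$USER" | wc -l
# the difference between the two numbers is roughly how many additional
# processes runall can fork before hitting "Resource temporarily unavailable"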
Question 2
The answer to question 2 (which seems to be your main confusion) can only be given when we understand what the time command actually reports. Understanding the output of the time command needs some experimentation. For instance, when I run your experiment (on a comparable Mac) with 100 processes (i.e. $(seq 100)), I get:
./runall.sh 0.01s user 0.02s system 39% cpu 0.087 total
This means that only 39% of the available computing power was used, resulting in 0.087s of wall-clock time. Roughly speaking, the wall-clock time multiplied by the CPU utilization gives the running time (the user time that your code needs plus the system time that system calls need to execute). Your simple script is rather too simple: it does not cause the CPUs to do any work; it just makes the sleep system call!
Compare this with a more real-life example: finding a subset of a given set with a given sum. This (Java) program, on the same computer, produces the following times:
java SubsetSum 38.25s user 1.09s system 510% cpu 7.702 total
This means that the total wall-clock time is about 7.7 seconds, but all the available cores are stressed extremely highly to execute this program. On a 4-CPU (8 logical CPUs) machine, I get over 500% CPU utilization! (And you can see that the wall-clock time (7.7) multiplied by the CPU utilization (5.1), i.e. 39.27, is roughly equal to the total time (38.25 + 1.09 = 39.34).)
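You can check that relationship directly from the numbers time prints; using bc with the figures above:
$ echo "scale=2; (38.25 + 1.09) / 7.702" | bc
5.10
# i.e. about 510% CPU, which is what time reported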
Question 3
Well, the way to parallelize your programs is by finding the parallelizable activity in solving the problem. You have 8 cores available, and the OS will decide how to allocate them to the processes that ask for them. But what if a process goes into a BLOCKED state (blocked on I/O)? Then the OS will schedule this process out and schedule something else in. A simplistic view like "8 cores => 8 programs at the same time" is hardly true once you take into account the way scheduling works.

Why do I receive a different runtime every time I run "time ./a.out" on the same program?

I am currently trying to reduce the runtime of a k-means program; however, every time I run the "time ./a.out" command the terminal gives me a different answer, even though I haven't changed any of the code. Does anyone have any idea why this is?
real 0m0.100s
user 0m0.082s
sys 0m0.009s
bash-4.1$ time ./a.out
real 0m0.114s
user 0m0.084s
sys 0m0.006s
bash-4.1$ time ./a.out
real 0m0.102s
user 0m0.087s
sys 0m0.005s
bash-4.1$ time ./a.out
real 0m0.099s
user 0m0.082s
sys 0m0.008s
bash-4.1$ time ./a.out
real 0m0.101s
user 0m0.083s
sys 0m0.006s
this is after running the same command consecutively.
On a modern system, many processes run in parallel (or, more precisely, quasi-parallel). That means the system switches between all processes. Note that it does not wait for one process to finish before switching to the next; that would mean processes would have to wait, i.e. get blocked. Instead, each process gets a bit of time now and then, until it has finished.
The more processes there are, the slower the system becomes altogether, and the slower each single process appears when measured in absolute duration. That is what you see.
The typical strategy for this is called "round robin". You may want to google that term to read more about this topic.
First, let us understand that the time command records the elapsed (wall-clock) time and the CPU time used by the program; the CPU time is how much time the program actually runs on the processor. As you have noted, there are different times reported for each run of the program in all categories: real time, user time, and system time.
Second, let us understand that modern systems will share the processor with all the other processes running on a system (only one process is in control of any core of the processor at any given time), and use many different schemes for how these processes share the processor and system resources, hence the different Real and User times. These times depend on how your system swaps out programs.
The sys time will depend on the program itself, and what resources it is requesting. As with any process, if the resources have been requested by another process, it will be put to sleep, waiting on the resource. Depending on the resource and how your particular system handles shared resources, a process may spend some idle time waiting for the resource and be put to sleep only after a timer times it out, or immediately if the processor can guess that the resource will take longer than the timer. Again, this is highly dependent on how your particular system handles these tasks, on your processor, and on the resources being requested.
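If you want a number that is less sensitive to this run-to-run noise, a common trick is to time a batch of consecutive runs and average; a minimal sketch (output redirected so terminal I/O is not part of the measurement):
$ time ( for i in $(seq 10); do ./a.out > /dev/null; done )
# divide the reported real/user/sys by 10 to get per-run averages that
# smooth out scheduling and cache effects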

How can OS actually measure the CPU power?

Currently I think that a processor has only two states: running and not running. If it's running, it will use its full power to process a task. If there are multiple processes, each gets a portion of the CPU.
How can the computing power be divided into "portions"? For example, suppose a CPU has 1 million transistors; are only half of the transistors used if the CPU is only at 50%?
Or is this related to the processing time allocated to each process? I.e., assume "100%" means a process holds the CPU for 200 milliseconds; does a process with the default nice value (priority) of 0 then receive 50% of the computing power, in other words 100 milliseconds? Which is the correct idea?
Let me explain this on the example of Intel x86 CPUs and Windows NT (and its derivatives). One of the built-in system processes on these OSes is the System Idle Process. This process represents how much CPU time is utilized by the operating system's "idle loop". That idle loop does nothing else but executes the HLT instruction of the CPU. That instruction, in turn, commands the CPU to do nothing until the next interrupt arrives.
Therefore, if the scheduler decides that there are no processes that require CPU time at a given moment, the time is given to the System Idle Process. If, say, 99% of the time in the last n seconds was spent "executing" that process, it means that the CPU was really utilized only 1% in those n seconds.
I believe it is totally analogous with Linux, only that it doesn't have a separate process to model the "idleness" of the CPU.
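On Linux the same accounting shows up in /proc/stat rather than as a visible idle process: the aggregate cpu line counts clock ticks spent in each state, and the fourth number after the "cpu" label is idle time. A rough sketch of how utilization can be derived from it (assumes a Linux /proc layout):
$ awk '/^cpu /{print "idle:", $5, "total:", $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat
# sample this twice, a second or so apart; utilization over the interval is
#   1 - (idle2 - idle1) / (total2 - total1)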
On a side note: it is, of course, possible to have an OS that doesn't execute the HLT instruction at all. That was the case with Windows 98 and earlier (including, obviously, MS-DOS), whose idle loop simply consisted of a jmp $. That caused the CPU to use much more power.
