Linux clock_gettime() elapse spikes?

Linux clock_gettime() elapse spikes? - linux

I'm try to get high resolution timestamp on linux. Using clock_gettime(), as below, I got "spike" elapses that looks pretty horrible at almost 26 micro second elapse. Most of the "dt"'s are around 30 ns. I was on linux 2.6.32, Red Hat 4.4.6. 'lscpu' shows CPU MHz=2666.121. I thought that means each each clock tick needs about 2 ns. So, asking for ns resolution didn't see like too unreasonable here.
output of program (sorry wasn't able to post this without making it a list. It thinks it's code some how)
1397534268,40823395 1397534268,40827950,dt=4555
1397534268,41233555 1397534268,41236716,dt=3161
1397534268,41389902 1397534268,41392922,dt=3020
1397534268,46488430 1397534268,46491674,dt=3244
1397534268,46531297 1397534268,46534279,dt=2982
1397534268,46823368 1397534268,46849336,dt=25968
1397534268,46915657 1397534268,46918663,dt=3006
1397534268,51488643 1397534268,51491791,dt=3148
1397534268,51530490 1397534268,51533496,dt=3006
1397534268,51823307 1397534268,51826904,dt=3597
1397534268,55823359 1397534268,55827826,dt=4467
1397534268,60531184 1397534268,60534183,dt=2999
1397534268,60823381 1397534268,60844866,dt=21485
1397534268,60913003 1397534268,60915998,dt=2995
1397534268,65823269 1397534268,65827742,dt=4473
1397534268,70823376 1397534268,70835280,dt=11904
1397534268,75823489 1397534268,75828872,dt=5383
1397534268,80823503 1397534268,80859500,dt=35997
1397534268,86823381 1397534268,86831907,dt=8526
Any ideas? thanks
#include <vector>
#include <iostream>
#include <time.h>
long long elapse( const timespec& t1, const timespec& t2 )
{
return ( t2.tv_sec * 1000000000L + t2.tv_nsec ) -
t1.tv_sec * 1000000000L + t1.tv_nsec );
}
int main()
{
const unsigned n=30000;
timespec ts;
std::vector<timespec> t( n );
for( unsigned i=0; i < n; ++i )
{
clock_gettime( CLOCK_REALTIME, &ts );
t[i] = ts;
}
std::vector<long> dt( n );
for( unsigned i=1; i < n; ++i )
{
dt[i] = elapse( t[i-1], t[i] );
if( dt[i] > 1000 )
{
std::cerr <<
t[i-1].tv_sec << ","
<< t[i-1].tv_nsec << " "
<< t[i].tv_sec << ","
<< t[i].tv_nsec
<< ",dt=" << dt[i] << std::endl;
}
else
{
//normally I get dt[i] = approx 30-35 nano secs
}
}
return 0;
}

The numbers you quoted are in the 3 to 30 microsecond range (3,000 to 30,000 nanoseconds). That is too short a time to be a context switch to another thread/process, let the other thread run, and context switch back to your thread. Most likely the core where your process was running was used by the kernel to service an external interrupt (e.g. network card, disk, timer), then returned to running your process.
You can watch the linux interrupt counters (per CPU core and per source) with this command
watch -d -n 0.2 cat /proc/interrupts
The -n 0.2 will cause the command to be issued at 5Hz, the -d flag will highlight what has changed.
The source of the interrupt could also be a TLB shootdown, which results in an IPI (Inter-Processor Interrupt). You can read more about TLB shootdowns here.
If you want to reduce the number of interrupts serviced by the core running your thread/process, you need to set the interrupt affinity. You can learn more about Red Hat Interrupts and IRQ (Interrupt requests) tuning here, and here.
Worth noting is that you are using CLOCK_REALTIME which isn't guaranteed to be "smooth", it could jump around as the system clock is "disciplined" to keep accurate time by a service like NTP (Network Time Protocol) or PTP (Precision Time Protocol). For your purposes it is better to use CLOCK_MONOTONIC, you can read more about the difference here. When a clock is "disciplined" the clock can jump by a "step" - this is unusual and certainly not the cause of the many spikes you see.

Could you check the resolution with clock_getres()?

I suspect what you are measuring here is called "OS Noise". This is often caused by your program getting pre-empted by the operating system. The operating system then performs other work. There are numerous causes, but commonly it is: other runnable tasks, hardware interrupts, or timer events.
The FTQ/FWQ benchmarks were designed to measure this characteristic and the summary contains some further information:
https://asc.llnl.gov/sequoia/benchmarks/FTQ_summary_v1.1.pdf

Related

AMD SMT or Intel HT performance

I don't really understand why processors with doubled logical processors are much more expensive then single logical processors. As far as I noticed there is no difference with running code on 6 or 12 threads for 6 cores/12 threads CPU.
As monkeys asked, here is C# example emulating heavy load on each thread:
static void Main(string[] args)
{
if (IntPtr.Size != 8)
throw new Exception("use only x64 code, 2020 is coming...");
//6 for physical cores, 12 for logical cores
const int limit_threads = 12;
const int limit_actions = 256;
const int limit_loop = 1000 * 1000 * 10;
const double power = 1.0 / 17.0;
long result = 0;
var action = new Action(() =>
{
long value = 0;
for (int i = 0; i < limit_loop; i++)
value += (long)Math.Pow(i, power);
Interlocked.Add(ref result, value);
});
var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();
var sw = Stopwatch.StartNew();
Parallel.Invoke(new ParallelOptions()
{
MaxDegreeOfParallelism = limit_threads
}, actions);
Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}s\nresult={result}\nlimit_threads={limit_threads}\nlimit_actions={limit_actions}\nlimit_loop={limit_loop}");
}
Results for 6 threads (AMD Ryzen 2600):
done in 13,7074543s
result=5086445312
limit_threads=6
limit_actions=256
limit_loop=10000000
Results for 12 threads (AMD Ryzen 2600):
done in 11,3992756s
result=5086445312
limit_threads=12
limit_actions=256
limit_loop=10000000
It's about 10% performance boost with using all logical cores instead of only physical, which is almost null. What you can say now?
Can someone provide sample code which will be valuable faster with using processor multi-threading (AMD SMT or Intel HT) comparing to using only physical cores?

TLDR: SMT/HT is a technology that exists to offset the cost of massive multithreading as opposed to speeding up your computation with more cores.
You have misunderstood what SMT/HT does.
"As far as I noticed there is no difference with running code on 6 or 12 threads for 6cores-12threads CPU".
If this is true, then SMT/HT is working.
To understand why, you need to understand modern OS kernels and Kernel Threads. Today's Operating Systems use what is called Preemptive Threading.
The OS Kernel divides up each core into time-slices called "Quantum", and using interrupts schedules the various processes in a complicated round robin fashion.
The part we want to look at is the interrupt. When a CPU core is scheduled to switch run another thread, we call this process a "Context Switch". Context Switches are expensive, slow processes, as the entire state and flow of the highly pipelined CPU must be stopped, saved and swapped out for another state (as well as other caches, registers, lookup tables etc). According to this answer, Context Switch times are measured in microseconds (thousands of clock-cycles); and will only get worse as CPUs become more complicated.
The point of SMT/HT is to cheat, by having each CPU core being able to store two states at the same time (imagine having two monitors instead of one, you still only use one at time, but you are more productive because you don't need to rearrange your windows each time you switch tasks). So SMT/HT processors can Context Switch must faster than non-SMT/HT processors.
So back to your example. If you turned off SMT on your Ryzen 2600, then ran the same workload with 12 threads, you will find that it performs significantly slower than with 6 threads.
Also, note, more threads does not make things faster.

I think that varying the price of the processors depending on the availability of the SMT/HT technology is just a matter of marketing strategy.
The hardware is probably the same in every case but the feature is disabled by the manufacturer on some of them to offer cheap models.
This technology relies on the fact that some micro-operations in a single
instruction have to wait for something to be executed; so instead of just waiting,
the same core uses its circuits to make some progress on the micro-operations
from another thread.
On a coarse point of view, we can perceive the execution of two (or more on
certain models) sequences of micro-operations from two different threads executed
on a single piece of hardware (except some redundant parts, like registers...)
The efficiency of this technology depends on the problem.
After various tests I noticed that if the problem is compute bound, ie the
limiting factor is the time needed to compute (add, multiply...), but not
memory bound (the data are already available, no need to wait for the memory),
then this technology does not provide any benefit.
This is due to the fact that there is no gap to fill in the two sequences of
micro-operations, thus the intertwined execution of two threads is not better
than two independent serial executions.
In the exact opposite case, when the problem is memory bound but not
compute bound, there is no more benefit because both threads have to wait
for the data coming from memory.
I only noticed an improvement in performances when the problem is mixed between
data access and computation; in this case when one thread is waiting for data, the
same core can make some progress in the computations of the other thread and
vice-versa.
Edit
Below is given an example to illustrate these situations, and I obtain the
following results (quite stable when run many times,
dual Xeon E5-2697 v2, Linux 5.3.13).
In this memory bound situation HT does not help.
$ ./prog_ht mem
24 threads running memory_task()
result: 1e+17
duration: 13.0383 seconds
$ ./prog_ht mem ht
48 threads (ht) running memory_task()
result: 1e+17
duration: 13.1096 seconds
In this compute bound situation HT helps (almost 30% gain)
(I don't know exactly the details of what is implied in the hardware
when computing cos, but there must be some latencies which are not due
to memory access)
$ ./prog_ht
24 threads running compute_task()
result: -260.782
duration: 9.76226 seconds
$ ./prog_ht ht
48 threads (ht) running compute_task()
result: -260.782
duration: 7.58181 seconds
In this mixed situation HT helps much more (around 70% gain)
$ ./prog_ht mix
24 threads running mixed_task()
result: -260.782
duration: 60.1602 seconds
$ ./prog_ht mix ht
48 threads (ht) running mixed_task()
result: -260.782
duration: 35.121 seconds
Here is the source code (in C++, I'm not confortable with C#)
/*
g++ -std=c++17 -o prog_ht prog_ht.cpp \
-pedantic -Wall -Wextra -Wconversion \
-Wno-missing-braces -Wno-sign-conversion \
-O3 -ffast-math -march=native -fomit-frame-pointer -DNDEBUG \
-pthread
*/
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <cmath>
#include <pthread.h>
bool // success
bind_current_thread_to_cpu(int cpu_id)
{
/* !!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!
I checked the numbering of the CPUs according to the packages and cores
on my computer/system (dual Xeon E5-2697 v2, Linux 5.3.13)
0 to 11 --> different cores of package 1
12 to 23 --> different cores of package 2
24 to 35 --> different cores of package 1
36 to 47 --> different cores of package 2
Thus using cpu_id from 0 to 23 does not bind more than one thread
to each single core (no HT).
Of course using cpu_id from 0 to 47 binds two threads to each single
core (HT is used).
This numbering is absolutely NOT guaranteed on any other computer/system,
thus the relation between thread numbers and cpu_id should be adapted
accordingly.
*/
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(cpu_id, &cpu_set);
return !pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
}
inline
double // seconds since 1970/01/01 00:00:00 UTC
system_time()
{
const auto now=std::chrono::system_clock::now().time_since_epoch();
return 1e-6*double(std::chrono::duration_cast
<std::chrono::microseconds>(now).count());
}
constexpr auto count=std::int64_t{20'000'000};
constexpr auto repeat=500;
void
compute_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
(void)indices; // not used here
(void)inputs; // not used here
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
result+=std::cos(double(i));
}
}
results[thread_id]+=result;
}
void
mixed_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=std::cos(inputs[index]);
}
}
results[thread_id]+=result;
}
void
memory_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=inputs[index];
}
}
results[thread_id]+=result;
}
int
main(int argc,
char **argv)
{
//~~~~ analyse command line arguments ~~~~
const auto args=std::vector<std::string>{argv, argv+argc};
const auto has_arg=
[&](const auto &a)
{
return std::find(cbegin(args)+1, cend(args), a)!=cend(args);
};
const auto use_ht=has_arg("ht");
const auto thread_count=int(std::thread::hardware_concurrency())
/(use_ht ? 1 : 2);
const auto use_mix=has_arg("mix");
const auto use_mem=has_arg("mem");
const auto task=use_mem ? memory_task
: use_mix ? mixed_task
: compute_task;
const auto task_name=use_mem ? "memory_task"
: use_mix ? "mixed_task"
: "compute_task";
//~~~~ prepare input/output data ~~~~
auto results=std::vector<double>(thread_count);
auto indices=std::vector<int>(count);
auto inputs=std::vector<double>(count);
std::generate(begin(indices), end(indices),
[i=0]() mutable { return i++; });
std::copy(cbegin(indices), cend(indices), begin(inputs));
std::shuffle(begin(indices), end(indices), // fight the prefetcher!
std::default_random_engine{std::random_device{}()});
//~~~~ launch threads ~~~~
std::cout << thread_count << " threads"<< (use_ht ? " (ht)" : "")
<< " running " << task_name << "()\n";
auto threads=std::vector<std::thread>(thread_count);
const auto t0=system_time();
for(auto i=0; i<thread_count; ++i)
{
threads[i]=std::thread{task, i, thread_count,
data(indices), data(inputs), data(results)};
}
//~~~~ wait for threads ~~~~
auto result=0.0;
for(auto i=0; i<thread_count; ++i)
{
threads[i].join();
result+=results[i];
}
const auto duration=system_time()-t0;
std::cout << "result: " << result << '\n';
std::cout << "duration: " << duration << " seconds\n";
return 0;
}

Difference between Linux time and Performance clocks in code

I was running a simple test for timing of some C++ code, and I ran across an artifact that I am not 100% positive about.
Setup
My code uses C++11 high_resolution_clock to measure elapsed time. I also wrap the execution of my program using Linux's time command (/usr/bin/time). For my program, the high_resolution_clock reports ~2s while time reports ~7s (~6.5s user and ~.5s system). Also using the verbose option on time shows that my program used 100% of the CPU with 1 voluntary context switch and 10 involuntary context switches (/usr/bin/time -v).
Question
My question is what causes such a dramatic difference between OS time measurements and performance time measurements?
My initial thoughts
Through my knowledge of operating systems, I am assuming these differences are solely caused by context switches with other programs (as noted by time -v).
Is this the only reason for this difference? And should I trust the time reported by my program or the system when looking at code performance?
Again, my assumption is to trust the computed time from my program over Linux's time, because it times more than just my program's CPU usage.
Caveats
I am not posting code, as it isn't really relevant to the issue at hand. If you wish to know it is a simple test that times 100,000,000 random floating point arithmetic operations.
I know other clocks in my C++ code might be more or less appropriate for difference circumstances (this stack overflow question). High_resolution_clock is just an example.
Edit: Code as requested
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;
int main() {
size_t n = 100000000;
double d = 1;
auto start_hrc = high_resolution_clock::now();
for(size_t i = 0; i < n; ++i) {
switch(rand() % 4) {
case 0: d += 0.0001; break;
case 1: d -= 0.0001; break;
case 2: d *= 0.0001; break;
case 3: d /= 0.0001; break;
}
}
auto end_hrc = high_resolution_clock::now();
duration<double> diff_hrc = end_hrc - start_hrc;
cout << d << endl << endl;
cout << "Time-HRC: " << diff_hrc.count() << " s" << endl;
}

My question is what causes such a dramatic difference between OS time measurements and performance time measurements?
It looks like your system takes a while to start your application. Probably a resource issue: not enough free memory (swapping) or oversubscribed CPU.
No dramatic difference is observed on my desktop:
Time-HRC: 1.39005 s
real 0m1.391s
user 0m1.387s
sys 0m0.004s

setitimer on linux rounding up?

When I set a short timeout with setitimer and then query the set value (with getitimer or another setitimer) on a Linux 2.6.26 system (Debian 5.0.5), I get back a value higher than I set:
#include <sys/time.h>
#include <iostream>
int main() {
struct itimerval wanted, got;
wanted.it_value.tv_sec = 0;
wanted.it_value.tv_usec = 7000;
wanted.it_interval.tv_sec = 0;
wanted.it_interval.tv_usec = 0;
setitimer(ITIMER_VIRTUAL, &wanted, NULL);
getitimer(ITIMER_VIRTUAL, &got);
std::cerr << "we said: " << wanted.it_value.tv_usec << "\n"
<< "linux set: " << got.it_value.tv_usec << std::endl;
return 0;
}
returns:
we said: 7000
linux set: 12000
This is problematic, since we use the times reported as remaining after some computations, and they are way too large, too.
Is this a known problem? (googling did not work.) Does anyone have a good workaround?

In the POSIX documentation of the setitimer function there is a note
Implementations may place limitations on the granularity of timer values. For each interval timer, if the requested timer value requires a finer granularity than the implementation supports, the actual timer value shall be rounded up to the next supported value
The granularity in your system seems to be higher than 1000 usec (seems to be 6000 usec) and the timer value is rounded up. The timer granularity is the problem if you need such precision.

max thread per process in linux

I wrote a simple program to calculate the maximum number of threads that a process can have in linux (Centos 5). here is the code:
int main()
{
pthread_t thrd[400];
for(int i=0;i<400;i++)
{
int err=pthread_create(&thrd[i],NULL,thread,(void*)i);
if(err!=0)
cout << "thread creation failed: " << i <<" error code: " << err << endl;
}
return 0;
}
void * thread(void* i)
{
sleep(100);//make the thread still alive
return 0;
}
I figured out that max number for threads is only 300!? What if i need more than that?
I have to mention that pthread_create returns 12 as error code.
Thanks before

There is a thread limit for linux and it can be modified runtime by writing desired limit to /proc/sys/kernel/threads-max. The default value is computed from the available system memory. In addition to that limit, there's also another limit: /proc/sys/vm/max_map_count which limits the maximum mmapped segments and at least recent kernels will mmap memory per thread. It should be safe to increase that limit a lot if you hit it.
However, the limit you're hitting is lack of virtual memory in 32bit operating system. Install a 64 bit linux if your hardware supports it and you'll be fine. I can easily start 30000 threads with a stack size of 8MB. The system has a single Core 2 Duo + 8 GB of system memory (I'm using 5 GB for other stuff in the same time) and it's running 64 bit Ubuntu with kernel 2.6.32. Note that memory overcommit (/proc/sys/vm/overcommit_memory) must be allowed because otherwise system would need at least 240 GB of committable memory (sum of real memory and swap space).
If you need lots of threads and cannot use 64 bit system your only choice is to minimize the memory usage per thread to conserve virtual memory. Start with requesting as little stack as you can live with.

Your system limits may not be allowing you to map the stacks of all the threads you require. Look at /proc/sys/vm/max_map_count, and see this answer. I'm not 100% sure this is your problem, because most people run into problems at much larger thread counts.

I had also encountered the same problem when my number of threads crosses some threshold.
It was because of the user level limit (number of process a user can run at a time) set to 1024 in /etc/security/limits.conf .
so check your /etc/security/limits.conf and look for entry:-
username -/soft/hard -nproc 1024
change it to some larger values to something 100k(requires sudo privileges/root) and it should work for you.
To learn more about security policy ,see http://linux.die.net/man/5/limits.conf.

check the stack size per thread with ulimit, in my case Redhat Linux 2.6:
ulimit -a
...
stack size (kbytes, -s) 10240
Each of your threads will get this amount of memory (10MB) assigned for it's stack. With a 32bit program and a maximum address space of 4GB, that is a maximum of only 4096MB / 10MB = 409 threads !!! Minus program code, minus heap-space will probably lead to your observed max. of 300 threads.
You should be able to raise this by compiling a 64bit application or setting ulimit -s 8192 or even ulimit -s 4096. But if this is advisable is another discussion...

You will run out of memory too unless u shrink the default thread stack size. Its 10MB on our version of linux.
EDIT:
Error code 12 = out of memory, so I think the 1mb stack is still too big for you. Compiled for 32 bit, I can get a 100k stack to give me 30k threads. Beyond 30k threads I get Error code 11 which means no more threads allowed. A 1MB stack gives me about 4k threads before error code 12. 10MB gives me 427 threads. 100MB gives me 42 threads. 1 GB gives me 4... We have 64 bit OS with 64 GB ram. Is your OS 32 bit? When I compile for 64bit, I can use any stack size I want and get the limit of threads.
Also I noticed if i turn the profiling stuff (Tools|Profiling) on for netbeans and run from the ide...I only can get 400 threads. Weird. Netbeans also dies if you use up all the threads.
Here is a test app you can run:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <signal.h>
// this prevents the compiler from reordering code over this COMPILER_BARRIER
// this doesnt do anything
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
sigset_t _fSigSet;
volatile int _cActive = 0;
pthread_t thrd[1000000];
void * thread(void *i)
{
int nSig, cActive;
cActive = __sync_fetch_and_add(&_cActive, 1);
COMPILER_BARRIER(); // make sure the active count is incremented before sigwait
// sigwait is a handy way to sleep a thread and wake it on command
sigwait(&_fSigSet, &nSig); //make the thread still alive
COMPILER_BARRIER(); // make sure the active count is decrimented after sigwait
cActive = __sync_fetch_and_add(&_cActive, -1);
//printf("%d(%d) ", i, cActive);
return 0;
}
int main(int argc, char** argv)
{
pthread_attr_t attr;
int cThreadRequest, cThreads, i, err, cActive, cbStack;
cbStack = (argc > 1) ? atoi(argv[1]) : 0x100000;
cThreadRequest = (argc > 2) ? atoi(argv[2]) : 30000;
sigemptyset(&_fSigSet);
sigaddset(&_fSigSet, SIGUSR1);
sigaddset(&_fSigSet, SIGSEGV);
printf("Start\n");
pthread_attr_init(&attr);
if ((err = pthread_attr_setstacksize(&attr, cbStack)) != 0)
printf("pthread_attr_setstacksize failed: err: %d %s\n", err, strerror(err));
for (i = 0; i < cThreadRequest; i++)
{
if ((err = pthread_create(&thrd[i], &attr, thread, (void*)i)) != 0)
{
printf("pthread_create failed on thread %d, error code: %d %s\n",
i, err, strerror(err));
break;
}
}
cThreads = i;
printf("\n");
// wait for threads to all be created, although we might not wait for
// all threads to make it through sigwait
while (1)
{
cActive = _cActive;
if (cActive == cThreads)
break;
printf("Waiting A %d/%d,", cActive, cThreads);
sched_yield();
}
// wake em all up so they exit
for (i = 0; i < cThreads; i++)
pthread_kill(thrd[i], SIGUSR1);
// wait for them all to exit, although we might be able to exit before
// the last thread returns
while (1)
{
cActive = _cActive;
if (!cActive)
break;
printf("Waiting B %d/%d,", cActive, cThreads);
sched_yield();
}
printf("\nDone. Threads requested: %d. Threads created: %d. StackSize=%lfmb\n",
cThreadRequest, cThreads, (double)cbStack/0x100000);
return 0;
}

CPU contention (wait time) for a process in Linux

How can I check how long a process spends waiting for the CPU in a Linux box?
For example, in a loaded system I want to check how long a SQL*Loader (sqlldr) process waits.
It would be useful if there is a command line tool to do this.

I've quickly slapped this together. It prints out the smallest and largest "interferences" from task switching...
#include <sys/time.h>
#include <stdio.h>
double seconds()
{
timeval t;
gettimeofday(&t, NULL);
return t.tv_sec + t.tv_usec / 1000000.0;
}
int main()
{
double min = 999999999, max = 0;
while (true)
{
double c = -(seconds() - seconds());
if (c < min)
{
min = c;
printf("%f\n", c);
fflush(stdout);
}
if (c > max)
{
max = c;
printf("%f\n", c);
fflush(stdout);
}
}
return 0;
}

Here's how you should go about measuring it. Have a number of processes, greater than the number of your processors * cores * threading capability wait (block) on an event that will wake them up all at the same time. One such event is a multicast network packet. Use an instrumentation library like PAPI (or one more suited to your needs) to measure the differences in real and virtual "wakeup" time between your processes. From several iterations of the experiment you can get an estimate of the CPU contention time for your processes. Obviously, it's not going to be at all accurate for multicore processors, but maybe it'll help you.
Cheers.

I had this problem some time back. I ended up using getrusage :
You can get detailed help at :
http://www.opengroup.org/onlinepubs/009695399/functions/getrusage.html
getrusage populates the rusage struct.
Measuring Wait Time with getrusage
You can call getrusage at the beginning of your code and then again call it at the end, or at some appropriate point during execution. You have then initial_rusage and final_rusage. The user-time spent by your process is indicated by rusage->ru_utime.tv_sec and system-time spent by the process is indicated by rusage->ru_stime.tv_sec.
Thus the total user-time spent by the process will be:
user_time = final_rusage.ru_utime.tv_sec - initial_rusage.ru_utime.tv_sec
The total system-time spent by the process will be:
system_time = final_rusage.ru_stime.tv_sec - initial_rusage.ru_stime.tv_sec
If total_time is the time elapsed between the two calls of getrusage then the wait time will be
wait_time = total_time - (user_time + system_time)
Hope this helps

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string