sleep(0)? Consistent timekeeping in code? - Linux

Right now I am loading a file, then using gettimeofday and tracking the CPU time with tv_usec.
My results vary: I get values in the 250s to 280s, but sometimes in the 300s or 500s. I tried usleep and sleep(0) and sleep(1) with no success; the time still varies vastly. I thought sleep(1) (seconds on Linux, not the Windows Sleep in ms) would have solved it. How can I keep track of time in a more consistent way for testing? Maybe I should wait until I have much larger test data and more complex code before starting measurements?

The currently recommended interface for high-resolution time on Linux (and POSIX in general) is clock_gettime. See the man page.
clock_gettime(CLOCK_REALTIME, &tp);           // wall-clock time (tp is a struct timespec)
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp); // CPU time consumed by this process
But read the man page. Note that you need to link with -lrt on older systems: before glibc 2.17, clock_gettime lived in librt rather than libc, so the extra library was required.
The best sleep function is nanosleep. It doesn't mess around with signals the way usleep can; it is defined to just sleep, with no other side effects. And it reports how much time was left if you woke up early (e.g. from a signal), so you don't necessarily have to call another time function.
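For illustration, here is a minimal sketch (mine, not from the answer) of using nanosleep and its early-wakeup reporting: if a signal interrupts the sleep, the remaining time comes back through the second argument and you can simply go back to sleep for that much.

    #include <errno.h>
    #include <time.h>

    /* Sleep for ns nanoseconds, resuming after signal interruptions. */
    static void sleep_ns(long ns)
    {
        struct timespec req = { .tv_sec  = ns / 1000000000L,
                                .tv_nsec = ns % 1000000000L };
        struct timespec rem;

        while (nanosleep(&req, &rem) == -1 && errno == EINTR)
            req = rem;          /* interrupted early: sleep for what is left */
    }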
Anyway, you're going to have a hard time testing one rep of something that short that involves a system call. There's a huge amount of opportunity for variation: for example, the scheduler may decide that some other work needs doing (unlikely if your process just started; you won't have used up your timeslice yet), and cold CPU caches (L2) and TLB misses can easily add noise as well.
If you have a multi-core machine and a single-threaded benchmark for the code you're optimizing, you can give it realtime priority and pin it to one of your cores. Make sure you choose a core that isn't handling interrupts, or your keyboard (and everything else) will be locked out until it's done. Use taskset (for pinning to one CPU) and chrt (for setting realtime priority).
See this mail I sent to gmp-devel with this trick:
http://gmplib.org/list-archives/gmp-devel/2008-March/000789.html
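The same setup can also be done from inside the program. A minimal sketch (my own code, not from the linked mail), assuming a Linux/glibc system: pin the calling process to one CPU with sched_setaffinity and request SCHED_FIFO priority with sched_setscheduler, the programmatic equivalents of taskset and chrt (the realtime part needs root or CAP_SYS_NICE).

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the current process to the given CPU and give it realtime priority. */
    static int pin_and_boost(int cpu)
    {
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 1 };

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0)  /* 0 = this process */
            return -1;
        return sched_setscheduler(0, SCHED_FIFO, &sp);    /* needs privileges */
    }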
Oh yeah, for the most precise timing you can use rdtsc yourself (on x86/amd64). If you don't have any other syscalls in what you're benchmarking, it's not a bad idea. Grab a benchmarking framework to put your function into; GMP has a pretty decent one. It may not be set up well for benchmarking functions that aren't in GMP and named mpn_whatever, though. I don't remember, but it's worth a look.
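If you do go the rdtsc route, a rough sketch (assuming x86/x86-64 with GCC or Clang, using the __rdtsc intrinsic) looks like this. Note it omits serialization (cpuid/lfence) and the conversion from TSC ticks to seconds, which real benchmarking frameworks take care of.

    #include <stdint.h>
    #include <x86intrin.h>      /* __rdtsc() */

    /* Return the (approximate) number of TSC ticks spent in fn(). */
    static inline uint64_t cycles_for(void (*fn)(void))
    {
        uint64_t start = __rdtsc();
        fn();
        return __rdtsc() - start;
    }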

Are you trying to measure how long it takes to load a file? Usually if you're performance testing some bit of code that is already pretty fast (sub-second), then you will want to repeat the same code a number of times (say a thousand or a million), time the whole lot, then divide the total time by the number of iterations.
Having said that, I'm not quite sure what you're using sleep() for. Can you post an example of what you intend to do?

I would recommend putting that code in a for loop and running it over 1000 or 10000 iterations (as sketched below). There are problems with this if you're timing only a few instructions, but it should help.
Larger data sets also help of course.
sleep is going to deschedule your thread from the CPU; it is not a tool for precise timekeeping.
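As an illustration of the loop-and-divide approach, here is a minimal sketch using clock_gettime with CLOCK_PROCESS_CPUTIME_ID; run_once() is a placeholder for whatever you are measuring.

    #include <stdio.h>
    #include <time.h>

    #define ITERS 10000

    static void run_once(void) { /* placeholder: the code under test */ }

    static double elapsed_s(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
        for (int i = 0; i < ITERS; i++)
            run_once();
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

        printf("%g seconds per iteration\n", elapsed_s(t0, t1) / ITERS);
        return 0;
    }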

Related

Thread Quantum: How to compute it

I have been reading a few posts and articles regarding thread quanta (here, here and here). Apparently Windows allocates a fixed number of CPU ticks for a thread quantum, depending on the Windows "mode" (server or something else). However, from the last link we can read:
(A thread quantum) between 10-200 clock ticks (i.e. 10-200 ms) under Linux, though some
granularity is introduced in the calculation
Is there any way to compute the quantum length on Linux?
Does it make any sense to compute it anyway? (From my understanding, threads can still be preempted; nothing forces a thread to run for the full duration of its quantum.)
From a developer's perspective, I could see the interest in writing a program that could predict the running time of a program given its number of threads and "what they do" (removing all the testing needed to find the optimal number of threads would be kind of neat, although I am not sure it is the right approach).
On Linux, the default realtime (round-robin) quantum length is the constant RR_TIMESLICE, at least in 4.x kernels; it is expressed in timer ticks, so its wall-clock length depends on HZ, which is fixed when the kernel is configured.
The interval between pausing the thread whose quantum has expired and resuming it may depend on a lot of things like, say, load average.
To be able to predict the running time with at least some degree of accuracy, give the target process realtime priority under SCHED_RR; realtime processes of equal priority are scheduled round-robin, which is generally simpler and more predictable than the default Linux scheduling algorithm.
To get the realtime quantum length, call sched_rr_get_interval().
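For example, a minimal sketch of querying the quantum for the calling process (pass a pid to query another one):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct timespec ts;

        /* 0 means "the calling process"; ts receives the SCHED_RR quantum. */
        if (sched_rr_get_interval(0, &ts) == 0)
            printf("SCHED_RR quantum: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
        else
            perror("sched_rr_get_interval");
        return 0;
    }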

Linux' hrtimer - microsecond precision?

Is it possible to execute tasks on a Linux host with microsecond precision? I.e., I'd like to execute a task at a specific instant of time. I know, Linux is no real-time system but I'm searching for the best solution on Linux.
So far, I've created a kernel module, setup hrtimer and measured the jitter when the callback function is entered (I don't really care too much about the actual delay, it's jitter that counts) - it's about 20-50us. That's not significantly better than using timerfd in userspace (also tried using real-time priority for the process but that did not really change anything).
I'm running Linux 3.5.0 (just an example, tried different kernels from 2.6.35 to 3.7), /proc/timer_list shows hrtimer_interrupt, I'm not running in failsafe mode which disables hrtimer functionality. Tried on different CPUs (Intel Atom to Core i7).
My best idea so far would be using hrtimer in combination with ndelay/udelay. Is this really the best way to do it? I can't believe it's not possible to trigger a task with microsecond precision. Running the code in kernel space as a module is acceptable; it would be great if the code were not interrupted by other tasks, though. I don't really care too much about the rest of the system: the task will be executed only a few times a second, so using mdelay/ndelay to burn the CPU for some microseconds each time the task should be executed would not really matter. Although I'd prefer a more elegant solution.
I hope the question is clear; I found a lot of topics concerning timer precision but no real answer to this problem.
You can do what you want from user space:
use clock_gettime() with CLOCK_REALTIME to get the time of day with nanosecond resolution
use nanosleep() to yield the CPU until you are close to the time you need to execute your task (it is at least millisecond resolution)
use a spin loop with clock_gettime() until you reach the desired time
execute your task
The clock_gettime() function is implemented via the vDSO in recent kernels on modern x86 processors: it takes 20-30 nanoseconds to get the time of day with nanosecond resolution, so you should be able to call clock_gettime() over 30 times per microsecond. Using this method your task should dispatch within 1/30th of a microsecond of the intended time.
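A minimal sketch of that recipe (names are mine; it uses clock_nanosleep with TIMER_ABSTIME for the coarse phase instead of plain nanosleep, so no relative interval has to be computed):

    #include <time.h>

    #define SPIN_MARGIN_NS 50000            /* start spinning ~50 us early */

    static int ts_before(const struct timespec *a, const struct timespec *b)
    {
        return a->tv_sec < b->tv_sec ||
              (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
    }

    /* Block coarsely until just before the deadline, then spin until it passes. */
    static void wait_until(const struct timespec *deadline)
    {
        struct timespec now, coarse = *deadline;

        coarse.tv_nsec -= SPIN_MARGIN_NS;
        if (coarse.tv_nsec < 0) {
            coarse.tv_nsec += 1000000000L;
            coarse.tv_sec--;
        }
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &coarse, NULL);

        do {
            clock_gettime(CLOCK_REALTIME, &now);
        } while (ts_before(&now, deadline));
    }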
The default Linux kernel timer tick is on the order of a millisecond. Microsecond precision is well beyond what ordinary user hardware with a stock kernel can reliably deliver.
The jitter you see is due to a host of factors, like interrupt handling and servicing higher-priority tasks. You can cut it down somewhat by selecting hardware carefully and only enabling what is really needed. The real-time patch series for the kernel (see the HOWTO) might be an option to reduce it a bit further.
Always keep in mind that any gain has a definite cost in terms of interactiveness, stability, and (last, but by far not least) your time in building, tuning, troubleshooting, and keeping the house of cards from falling apart.

Why processes are deprived of CPU for TOO long while busy looping in Linux kernel?

At first glance, my question might look a bit trivial. Please bear with me and read it completely.
I have identified a busy loop in my Linux kernel module. Because of it, other processes (e.g. sshd) do not get CPU time for long spans of time (around 20 seconds). This is understandable, as my machine has only a single CPU and the busy loop gives the scheduler no chance to run other processes.
Just to experiment, I added schedule() after each iteration of the busy loop. Even though this keeps the CPU busy, it should still let other processes run, since I am calling schedule(). But this doesn't seem to be happening. My user-level processes are still hanging for long spans of time (20 seconds).
In this case, the kernel thread has nice value -5 and the user-level threads have nice value 0. Even with the lower priority of the user-level threads, I think 20 seconds is too long to go without CPU.
Can someone please explain why this could be happening?
Note: I know how to remove the busy loop completely, but I want to understand the kernel's behaviour here. The kernel version is 2.6.18 and kernel preemption is disabled.
The schedule() function simply invokes the scheduler - it doesn't take any special measures to arrange that the calling thread will be replaced by a different one. If the current thread is still the highest priority one on the run queue then it will be selected by the scheduler once again.
It sounds as if your kernel thread is doing very little work in its busy loop and is calling schedule() every time round. Therefore, it's probably not using much CPU time itself and hence doesn't have its priority reduced much. Negative nice values carry heavier weight than positive ones, so the difference between -5 and 0 is quite pronounced. The combination of these two effects means I'm not too surprised that user-space processes miss out.
As an experiment you could try calling the scheduler every Nth iteration of the loop (you'll have to experiment to find a good value of N for your platform) and see if the situation improves; calling schedule() too often just wastes lots of CPU time in the scheduler. Of course, this is just an experiment. As you have already pointed out, avoiding busy loops is the correct option in production code, and if you want to be sure your thread is replaced by another, set it to TASK_INTERRUPTIBLE before calling schedule() to remove itself from the run queue (as has already been mentioned in comments).
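A hedged sketch of that experiment (the placeholder names work_pending_example() and do_unit() are mine, and it uses schedule_timeout() rather than a bare schedule() so the thread actually leaves the run queue for at least one tick without needing an external wake_up()):

    #include <linux/sched.h>

    #define YIELD_EVERY 1000

    static int  work_pending_example(void) { return 0; }   /* placeholder */
    static void do_unit(void)              { }             /* placeholder */

    static void busy_loop(void)
    {
        unsigned long i = 0;

        while (work_pending_example()) {
            do_unit();
            if (++i % YIELD_EVERY == 0) {
                /* Go to sleep for at least one tick so runnable
                 * lower-priority user tasks are guaranteed some CPU. */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(1);
            }
        }
    }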
Note that your kernel (2.6.18) is using the O(1) scheduler which existed until the Completely Fair Scheduler was added in 2.6.23 (the O(1) scheduler having been added in 2.6 to replace the even older O(n) scheduler). The CFS doesn't use run queues and works in a different way, so you might well see different behaviour - I'm less familiar with it, however, so I wouldn't like to predict exactly what differences you'd see. I've seen enough of it to know that "completely fair" isn't the term I'd use on heavily loaded SMP systems with a large number of both cores and processes, but I also accept that writing a scheduler is a very tricky task and it's far from the worst I've seen, and I've never had a significant problem with it on a 4-8 core desktop machine.

Should I use "real" or "user+sys" on the time function?

I understand the difference between "real","user" and "sys" when you use the time command on Linux, as explained on this other thread: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Now I am working on a small comparison between the performance of Python, Java and C, and I am wondering which report I should use.
"User+sys" seems to be the more realistic one, but wouldn't this cause problems when comparing C to Java, for instance, cause the JVM knows how to optimize the code for multi-processors/threads while GCC doesn't?
Also, wouldn't "real" be realistic enough if I make sure no other heavy process is running on the background?
The answer will depend on what you mean by "the performance of (Python|Java|C)". In many cases what a user really cares about is the elapsed wall time, corresponding to real. Suppose you write some piece of code in a reasonable way in several languages and one of the languages can automatically parallelize it to use your 4 cores. If this makes the user wait less time for a reply, then I say this is a fair comparison. Of course it is valid for that particular machine, the results on a single core machine could be different. If an app causes page faults, then it makes the user wait. For the user it's no help if you say the app took fewer cycles if they have to wait longer.
Any way you measure, be sure to repeat the tests multiple times, as there can be a lot of variation between runs. Languages like Java also need a program to run for some time before reaching top speed, due to JIT compilation (but again: if your program is by definition very short and doesn't allow the Java Virtual Machine to warm up, then that's too bad for Java). Testing performance is very tricky, and even experienced developers are prone to misinterpreting results or measuring something other than what they really intended.

Record thread events

Suppose I need to peek at a thread's state at regular intervals and record its state over the whole execution of a program. I wouldn't know how to start thinking about this. Any pointers (pun?)? I'm on Linux, using gcc, pthreads and C, and have access to all the usual Linux tools. Basically, I guess I'm asking how to build a simple profiler for threads that will tell me how long a thread has been in some state or other during the execution of the program.
I want to be able to create graphs like Threadscope does. The X axis is time, the Y axis is core/thread number and the "colors" are state: green means running, orange is garbage collection, and so on. Does this make more sense now?
For a Linux-specific solution, you might like to have a look at /proc/<pid>/stat and /proc/<pid>/task/<tid>/stat for process and thread statistics, respectively. Have a look at the proc(5) manual page for a full description of all the fields there (online: http://man7.org/linux/man-pages/man5/proc.5.html - search for /proc/[pid]/stat). Specifically, at least the fields utime and stime are of interest to you. These are monotonically increasing times, so you need to remember the previously measured values to be able to compute the time spent in the process/thread during a given time slice and produce the data for your graphs. (This is how top(1) works.)
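A rough sketch of pulling utime and stime (fields 14 and 15, in clock ticks; divide by sysconf(_SC_CLK_TCK) for seconds) out of a thread's stat file. The parsing skips past the ')' that ends the comm field, since comm may itself contain spaces; the function name is mine.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Read utime/stime (clock ticks) for thread tid of process pid. */
    static int thread_cpu_ticks(pid_t pid, pid_t tid,
                                unsigned long *utime, unsigned long *stime)
    {
        char path[64], buf[1024];
        FILE *f;
        char *p;

        snprintf(path, sizeof path, "/proc/%d/task/%d/stat", (int)pid, (int)tid);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (!fgets(buf, sizeof buf, f)) {
            fclose(f);
            return -1;
        }
        fclose(f);

        p = strrchr(buf, ')');          /* end of the comm field */
        if (!p)
            return -1;
        /* Skip state, ppid, pgrp, session, tty_nr, tpgid, flags and the four
         * page-fault counters, then read utime (field 14) and stime (field 15). */
        if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                   utime, stime) != 2)
            return -1;
        return 0;
    }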
However, for the profiler to distinguish different states makes the problem more complicated. How does the profiler know which state the profiled program is in? It seems to me the profiled program's threads need to signal this to the profiler in some way. You need some kind of tailored solution for this state sharing (unless you can run the different states in different threads and make the distinction that way, which I doubt).
If the state transitions are done in single place (e.g. enter GC and leave GC in your example), then one way would be as follows:
The monitored threads would get the start and end times of the special states by using POSIX function clock_gettime() - with clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp) you can get the process time and with clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp) you can get the thread time (both monotonically increasing, again).
The thread could communicate these timings to the profiler program with some kind of IPC.
If the profiler application knows the thread times at which a state was entered and left, then, because it also knows the thread time values at the boundaries of the measuring slices, it can determine how much of the thread time was spent in the reported states within each reporting time slice (adjusting the start time of a state that spans a boundary to the start of the next reporting slice).
The time the whole process has spent on a specific state can be calculated by summing up the thread times for that state.
Note that through /proc/<pid>/stat or /proc/<pid>/task/<tid>/stat, the measurement accuracy is not very good (clock ticks, often units of 10 ms), but I do not know another way of getting timing information from outside the process/thread. The function clock_gettime() gives very accurate times (nominally nanosecond accuracy, but note that on at least some MIPS and ARM systems the accuracy is as bad as with the stat files under /proc, because the kernel lacks an accurate timer-reading implementation for these clocks). You would also need to do some experimentation to make sure these two timing sources really give the same results (by reading both values from the same threads). You can of course use the /proc/.../stat files inside the thread, but the accuracy just is not very good unless you spend a lot of time within a state.
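To make the clock_gettime() side concrete, here is a minimal sketch (names mine) of timestamping a state with CLOCK_THREAD_CPUTIME_ID at entry and exit; how the resulting records reach the profiler (the IPC step above) is left open.

    #include <time.h>

    typedef struct {
        int state;                      /* e.g. RUNNING, GC, ...         */
        struct timespec start, end;     /* thread CPU time at entry/exit */
    } state_span_t;

    static void state_enter(state_span_t *span, int state)
    {
        span->state = state;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &span->start);
    }

    static void state_leave(state_span_t *span)
    {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &span->end);
        /* hand the span to the profiler here (pipe, socket, shared memory, ...) */
    }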
Well, the direct match to the profiling info produced by the Haskell compiler and processed by Threadscope is, for C and GCC, the gprof utility (it's part of GNU binutils).
For it to work correctly with pthreads you need each thread to trigger some timer initialization function. This can be done without modifying your code with this pthreads wrapper library: http://sam.zoy.org/writings/programming/gprof.html . I haven't dealt with the problem recently, it may be that something has changed and the wrapper isn't needed anymore...
As for a GUI to interpret the profiling results, there is kprof (http://kprof.sourceforge.net). Unfortunately, AFAIK it doesn't produce thread duration graphs; for that you'll have to build your own solution from the textual info produced by gprof.
If you are not set on the "standard" solution offered by GCC, you may want to try gperftools: http://code.google.com/p/gperftools/?redir=1 (I didn't try it personally, but have heard good opinions).
Good luck!
Take a look at Intel VTune Amplifier XE (formerly … Intel Thread Profiler) to see if it will meet your needs.
This and other Intel Linux development tools are available free for non-commercial use.
In the video Using the Timeline in Intel VTune Amplifier XE showing a timeline of a multi-threaded application, at 9:20 the presenter mentions
"...with the frame API you can programmatically mark certain events or phases in your code. And these marks will appear on the timeline."
I think it will be rather difficult to build a simple profiler, simply because there are many different factors to consider, and system profiling is an inherently complex task, made all the more so when you are profiling a multithreaded application. The best advice I can think of is to look at something that already exists, for example OProfile.
One advantage of OProfile is that it is open source, so the source code is available. Beyond this, I suspect that asking how to build a profiling application might be beyond the scope of what someone can answer in an SO question, which might be why this question hasn't gotten very many responses. Hopefully looking at an existing example will help you get started, and then, if you have more focused questions, you could get some more detailed responses.
