Measuring RTT for TCP in linux kernel - linux

I'm implementing TCP-like RTT estimation in a custom protocol. When I look in the function
static void tcp_rtt_estimator(struct sock *sk, long mrtt){
long m = mrtt; /* RTT */
For the first iteration, when no previous RTT estimates have been done, the code snippet is
srtt = m << 3; /* take the measured time to be rtt */
Why isn't the value of m taken directly for srtt? As per my understanding, the parameter mrtt_us is just the value in jiffies for the current round-trip time measurement.
Is the above assumption about mrtt_us incorrect? If yes, then what value should I pass to this function?
P.S.- I've the measured RTT in jiffies which I'm currently passing to this function. Obviously, this is incorrect as the first srtt value becomes something other than the measured rtt because of srtt = m << 3

I figured this out from one of the mail chains on LKML at https://lkml.org/lkml/1998/9/12/41
It mentions that the stored SRTT is actually 8 times the real SRTT. I think it's done in such a way to provide higher precision in calculations.
So to answer the question, the measured value of RTT should be passed to this function in jiffies (Kernel version 3.13)

Related

Difference between 'Niceness' and 'Goodness' in linux kernel 2.4.27

I am trying to understand linux kernel code for task scheduler.
I am not able to figure out what is goodness value and how it differs from niceness?
Also, how each of them contribute to scheduling?
AFAIK: Each process in the run queue is assigned a "goodness" value which determines how good a process is for running. This value is calculated by the goodness() function.
The process with higher goodness will be the next process to run. If no process is available for running, then the operating system selects a special idle task.
A first-approximation of "goodness" is calculated according to the number of ticks left in the process quantum. This means that the goodness of a process decreases with time, until it becomes zero when the time quantum of the task expires.
The final goodness is calculated in function of the niceness of the process. So basically goodness of a process is a combination of the Time slice left logic and nice value.
static inline int goodness(struct task_struct * p, int this_cpu, struct mm_struct *this_mm)
{
int weight;
/*
* select the current process after every other
* runnable process, but before the idle thread.
* Also, dont trigger a counter recalculation.
*/
weight = -1;
if (p->policy & SCHED_YIELD)
goto out;
if (p->policy == SCHED_OTHER) {
/*
* Give the process a first-approximation goodness value
* according to the number of clock-ticks it has left.
*
* Don't do any other calculations if the time slice is
* over..
*/
weight = p->counter;
if (!weight)
goto out;
...
weight += 20 - p->nice;
goto out;
}
/* code for real-time processes goes here */
out:
return weight;
}
TO UNDERSTAND and REMEMBER NICE value: remember this.
The etymology of Nice value is that a "nicer" process allows for more time for other processes to run, hence lower nice value translates to higher priority.
So,
Higher the goodness value -> More likely to Run
Higher the nice value -> Less likely to run.
Also Goodness value is calculated using the nice value as
20 - p->nice
Since nice value ranges from -20 ( highest priority) to 19 ( lowest priority)
lets assume a nice value of -20 ( EXTREMELY NOT NICE). hence 20 - (-20) = 40
Which means the goodness value has increased, hence this process will be chosen.

How can I measure elapsed time when encrypting using openssl in linux by C

How can I calculate the amount of processing time used by a process in C on Linux. Specifically, I want to determine how much time elapses when encrypting a file using openssl.
The easiest way for you to do this is by using the clock() function from <time.h> to report the amount of CPU time used by the calling process.
From SUSv4:
The clock() function shall return the implementation's best
approximation to the processor time used by the process since the
beginning of an implementation-defined era related only to the process
invocation.
RETURN VALUE
To determine the time in seconds, the value returned by clock() should
be divided by the value of the macro CLOCKS_PER_SEC. If the processor
time used is not available or its value cannot be represented,
the function shall return the value (clock_t)-1.
Try following,
time_t start, end;
double cpu_time_used;
start = clock();
/* Do encrypting ... */
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

taskstats stats not adding up

I am trying to figure out how the stats in the taskstats struct are adding up. I wrote a simple C program that runs for some time doing IO and exits. I monitor the stats of this program using the taskstats struct, which I get from the taskstats netlink multicast group. When I sum the values of cpu_delay_total, blkio_delay_total, swapin_delay_total, freepages_delay_total, ac_utime and ac_stime, I get a value that is about 0.5 seconds larger than the value of elapsed time (ac_etime)
Here are the statistics for a 3.5-second run:
ac_etime: 3536036
ac_utime: 172000
ac_stime: 3032000
cpu_delay_total: 792528445
blkio_delay_total: 46320128
swapin_delay_total: 0
freepages_delay_total: 0
Summing up values for delays, utime and stime yields 4042848.573 (divide the delays by 1000 to convert to microseconds), while etime is only 3536036!
Interestingly, the wall clock time gives the value that is practically equal to utime+stime: cpu_run_real_total: 3204000129, while ac_utime + ac_stime: 3204000
Does the cpu_run_real_total field give the cpu time, despite that the comment in taskstats.h clearly states that this is a wall clock time? And what could be the reason that the sum of these fields is larger than the elapsed time?
My kernel version is 3.2.0-38.
(1) cpu_run_real_total = ac_utime + ac_stime, I check the codes in ./kernel/delayacct.c, function __delayacct_add_tsk():
tmp = (s64)d->cpu_run_real_total;
cputime_to_timespec(tsk->utime + tsk->stime, &ts);
tmp += timespec_to_ns(&ts);
d->cpu_run_real_total = (tmp < (s64)d->cpu_run_real_total) ? 0 : tmp;
From the above codes, we know cpu_run_real_total is sum the utime and stime up.
(2) Why sum the values of cpu_delay_total, blkio_delay_total, swapin_delay_total, freepages_delay_total, ac_utime and ac_stime, the value is larger than the value of ac_etime?
I have not figured out why. But I have guess: the stime may somewhat overlap with the various *_delay_total counters.

Is clock_gettime() adequate for submicrosecond timing?

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.
Previously our implementation used inline assembly and the rdtsc operation to query the high-frequency timer from the CPU directly, but this is problematic and requires frequent recalibration.
So I tried using the clock_gettime function instead to query CLOCK_PROCESS_CPUTIME_ID. The docs allege this gives me nanosecond timing, but I found that the overhead of a single call to clock_gettime() was over 250ns. That makes it impossible to time events 100ns long, and having such high overhead on the timer function seriously drags down app performance, distorting the profiles beyond value. (We have hundreds of thousands of profiling nodes per second.)
Is there a way to call clock_gettime() that has less than ¼μs overhead? Or is there some other way that I can reliably get the timestamp counter with <25ns overhead? Or am I stuck with using rdtsc?
Below is the code I used to time clock_gettime().
// calls gettimeofday() to return wall-clock time in seconds:
extern double Get_FloatTime();
enum { TESTRUNS = 1024*1024*4 };
// time the high-frequency timer against the wall clock
{
double fa = Get_FloatTime();
timespec spec;
clock_getres( CLOCK_PROCESS_CPUTIME_ID, &spec );
printf("CLOCK_PROCESS_CPUTIME_ID resolution: %ld sec %ld nano\n",
spec.tv_sec, spec.tv_nsec );
for ( int i = 0 ; i < TESTRUNS ; ++ i )
{
clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &spec );
}
double fb = Get_FloatTime();
printf( "clock_gettime %d iterations : %.6f msec %.3f microsec / call\n",
TESTRUNS, ( fb - fa ) * 1000.0, (( fb - fa ) * 1000000.0) / TESTRUNS );
}
// and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.
Results:
CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 3115.784947 msec 0.371 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2505.122119 msec 0.299 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2456.186031 msec 0.293 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2956.633930 msec 0.352 microsec / call
This is on a standard Ubuntu kernel. The app is a port of a Windows app (where our rdtsc inline assembly works just fine).
Addendum:
Does x86-64 GCC have some intrinsic equivalent to __rdtsc(), so I can at least avoid inline assembly?
No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.
Just port the rdtsc assembly you're using.
__inline__ uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ ( // serialize
"xorl %%eax,%%eax \n cpuid"
::: "%rax", "%rbx", "%rcx", "%rdx");
/* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
This is what happens when you call clock_gettime() function.
Based on the clock you choose it will call the respective function. (from vclock_gettime.c file from kernel)
int clock_gettime(clockid_t, struct __kernel_old_timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
notrace int
__vdso_clock_gettime_stick(clockid_t clock, struct __kernel_old_timespec *ts)
{
struct vvar_data *vvd = get_vvar_data();
switch (clock) {
case CLOCK_REALTIME:
if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
break;
return do_realtime_stick(vvd, ts);
case CLOCK_MONOTONIC:
if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
break;
return do_monotonic_stick(vvd, ts);
case CLOCK_REALTIME_COARSE:
return do_realtime_coarse(vvd, ts);
case CLOCK_MONOTONIC_COARSE:
return do_monotonic_coarse(vvd, ts);
}
/*
* Unknown clock ID ? Fall back to the syscall.
*/
return vdso_fallback_gettime(clock, ts);
}
CLOCK_MONITONIC better (though I use CLOCK_MONOTONIC_RAW) since it is not affected from NTP time adjustment.
This is how the do_monotonic_stick is implemented inside kernel:
notrace static __always_inline int do_monotonic_stick(struct vvar_data *vvar,
struct __kernel_old_timespec *ts)
{
unsigned long seq;
u64 ns;
do {
seq = vvar_read_begin(vvar);
ts->tv_sec = vvar->monotonic_time_sec;
ns = vvar->monotonic_time_snsec;
ns += vgetsns_stick(vvar);
ns >>= vvar->clock.shift;
} while (unlikely(vvar_read_retry(vvar, seq)));
ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
return 0;
}
And the vgetsns_stick() function which provides nano seconds resolution is implemented as:
notrace static __always_inline u64 vgetsns(struct vvar_data *vvar)
{
u64 v;
u64 cycles;
cycles = vread_tick();
v = (cycles - vvar->clock.cycle_last) & vvar->clock.mask;
return v * vvar->clock.mult;
}
Where the function vread_tick() reads the cycles from register based on the CPU:
notrace static __always_inline u64 vread_tick(void)
{
register unsigned long long ret asm("o4");
__asm__ __volatile__("rd %%tick, %L0\n\t"
"srlx %L0, 32, %H0"
: "=r" (ret));
return ret;
}
A single call to clock_gettime() takes around 20 to 100 nano seconds. reading the rdtsc register and converting the cycles to time is always faster.
I have done some experiment with CLOCK_MONOTONIC_RAW here: Unexpected periodic behaviour of an ultra low latency hard real time multi threaded x86 code
I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.
Have you considered oprofile or perf? You can use the performance counter hardware on your CPU to get profiling data without adding instrumentation to the code itself. You can see data per-function, or even per-line-of-code. The "only" drawback is that it won't measure wall clock time consumed, it will measure CPU time consumed, so it's not appropriate for all investigations.
Give clockid_t CLOCK_MONOTONIC_RAW a try?
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a
raw hardware-based time that is not subject to NTP
adjustments or the incremental adjustments performed by
adjtime(3).
From Man7.org
It's hard to give a globally applicable answer because the hardware and software implementation will vary widely.
However, yes, most modern platforms will have a suitable clock_gettime call that is implemented purely in user-space using the VDSO mechanism, and will in my experience take 20 to 30 nanoseconds to complete (but see Wojciech's comment below about contention).
Internally, this is using rdtsc or rdtscp for the fine-grained portion of the time-keeping, plus adjustments to keep this in sync with wall-clock time (depending on the clock you choose) and a multiplication to convert from whatever units rdtsc has on your platform to nanoseconds.
Not all of the clocks offered by clock_gettime will implement this fast method, and it's not always obvious which ones do. Usually CLOCK_MONOTONIC is a good option, but you should test this on your own system.
You are calling clock_getttime with control parameter which means the api is branching through if-else tree to see what kind of time you want. I know you cant't avoid that with this call, but see if you can dig into the system code and call what the kernal is eventually calling directly. Also, I note that you are including the loop time (i++, and conditional branch).

Need to do 64 bit multiplication on a machine with 32 bit longs

I'm working on a small embedded system that has 32 bit long ints. For one calculation I need output the time since 1970 in ms. I can get the time in 32 bit unsigned long seconds since 1970, but how can I represent this as a 64 bit no. of ms if my biggest int is only 32bits? I'm sure stackoverflow will have a cunning answer! I am using Dynamic C, close to standard C. I have some sample code from another system which has a 64 bit long long data type:
long long T = (long long)(SampleTime * 1000.0 + 0.5);
data.TimeLower = (unsigned int)(T & 0xffffffff);
data.TimeUpper = (unsigned short)((T >> 32) & 0xffff);
Since you are only multiplying by 1000 (seconds -> millis), you can do it with two 16 bit mutliplies and one add and a bit of bit fiddling, I have used your putative data type to store the result below:
uint32_t time32 = time();
uint32_t t1 = (time32 & 0xffff) * 1000;
uint32_t t2 = ((time32 >> 16) * 1000) + (t1 >> 16);
data.TimeLower = (uint32_t) ((t2 & 0xffff) << 16) | (t1 & 0xffff);
data.TimeUpper = (uint32_t) (t2 >> 16);
The standard approach, assuming you have a 16x16->32 multiply available, would be to split both numbers into 16-bit high and low parts, compute four partial products, and add the results. If you don't have a 16x16->32 primitive which is faster than a 32x32->32 primitive, though, I'm not sure what the best approach would be. I would think that a 32x32->32 multiply should be more useful than a 16x16->32, but I can't think how one would use it.
Personally, I wish there were a standard primitive to return the top half of a NxN multiply (32x32, certainly; also 16x16 for smaller machines and 64x64 for larger ones).
It might be helpful if you were more specific about what kinds of calculations you need to do. 64-bit multiplication implemented with 32-bit operations is quite slow, and you may have the additional overhead of 64-bit division (to convert back to seconds and milliseconds), which is even slower.
Without knowing more about what exactly you need to do, it seems to me that it would be more efficient to use a struct, containing a 32-bit unsigned int for the number of seconds and a 16-bit int for the number of milliseconds (the "remainder"). (Or use a 32-bit int for the milliseconds if 64-bit alignment is more important than saving a couple of bytes.)

Resources