I've got a piece of userspace code which is parsing /proc/PID/task/TID/stat to get the cpu usage. I can use HZ to get the jiffies per second but this code could move to another machine which has a different configured value. Is there any way to get the value of HZ from userspace at runtime?
You divide it by the number you get from sysconf(_SC_CLK_TCK).
However, I think this is probably always 100 under Linux regardless of the actual clock tick, it's always presented to userspace as 100.
See man proc(5).
To clarify the math behind MarkR's answer:
sysconf(_SC_CLK_TCK) will get you jiffies per second. Divide jiffies by the number you get from sysconf(_SC_CLK_TCK) to get the total number of seconds.
jiffies jiffies seconds
-------------------- = ----------------- = ------- = seconds
sysconf(_SC_CLK_TCK) (jiffies/second) 1
Source of "ps" command include file <linux/param.h> to get value of HZ.
They also look for an "ELF note" with number 17 to find value of HZ (sysinfo.c):
//extern char** environ;
/* for ELF executables, notes are pushed before environment and args */
static unsigned long find_elf_note(unsigned long findme){
unsigned long *ep = (unsigned long *)environ;
while(*ep++);
while(*ep){
if(ep[0]==findme) return ep[1];
ep+=2;
}
return NOTE_NOT_FOUND;
}
[...]
hz = find_elf_note(17);
I have to admit it look weird for me since ELF notes is a section defined during compilation.
For shell-scripting, etc, use getconf CLK_TCK from the command-line. Use can use this to pass that parameter in as an environment variable or on the command-line.
main(int argc, char **argv) {
unsigned long clk_tck = atol(
getenv("CLK_TCK") || "0"
) || sysconf(_SC_CLK_TCK) ;
... /* your code */
This uses the sysconf as above, but allows you to override it with an environment variable, which can be set with the above command.
Related
I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out what is the proper way to determine in a program that reads BPF maps the number of elements in a PERCPU_ARRAY used on a system.
Failing that, I think the second best approach is to pick a buffer for reading in that is "large enough." Here, the problem is similar: what is that number and is there way to learn it at runtime?
The question comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
int cpus = libbpf_num_possible_cpus();
if (cpus < 0) {
p_err("Can't get # of possible cpus: %s", strerror(-cpus));
exit(-1);
}
return cpus;
}
int libbpf_num_possible_cpus(void)
{
static const char *fcpu = "/sys/devices/system/cpu/possible";
static int cpus;
int err, n, i, tmp_cpus;
bool *mask;
/* ---8<--- snip */
}
So that's how they do it!
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
int main() {
int i = 0;
struct rusage r_usage;
while (++i <= 10) {
void *m = malloc(20*1024*1024);
memset(m,0,20*1024*1024);
getrusage(RUSAGE_SELF,&r_usage);
printf("Memory usage = %ld\n",r_usage.ru_maxrss);
sleep (3);
}
printf("\nAllocated memory, sleeping ten seconds after which we will check again...\n\n");
sleep (10);
getrusage(RUSAGE_SELF,&r_usage);
printf("Memory usage = %ld\n",r_usage.ru_maxrss);
return 0;
}
The above code uses ru_maxrss attribute of rusage structure. It gives the value of maximum resident set size. What does it mean? Each time the program is executed it gives a different value. So please explain the output of this code?
These are screenshots of two executions of the same code which give different outputs, how can these numbers be explained or what can be interpreted from these two outputs?
Resident set size (RSS) means, roughly, the total amount of physical memory assigned to a process at a given point in time. It does not count pages that have been swapped out, or that are mapped from a file but not currently loaded into physical memory.
"Maximum RSS" means the maximum of the RSS since the process's birth, i.e. the largest it has ever been. So this number tells you the largest amount of physical memory your process has ever been using at any one instant.
It can vary from one run to the next if, for instance, the OS decided to swap out different amounts of your program's memory at different times. This decision would depend in part on what the rest of the system is doing, and where else physical memory is needed.
Given the program below, segfault() will (As the name suggests) segfault the program by accessing 256k below the stack. nofault() however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps I see the allocated stack space grows between the first and second call, this explains why segfault() doesn't crash afterwards - there's plenty of memory.
But the disassembly shows there's no change to %rsp. This makes sense since that would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (In retrospect that would be very hard for a compiler to do) or that it would just periodically check %rsp and add a buffer after that.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>
void segfault(){
char * x;
int a;
for( x = (char *)&x-1024*256; x<(char *)(&x+1); x++){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
}
void nofault(){
char * x;
int a;
sleep(20);
for( x = (char *)(&x); x>(char *)&x-1024*1024; x--){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
sleep(20);
}
int main(){
nofault();
segfault();
}
The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from linux/arch/x86/mm.c:
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.
I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.
Previously our implementation used inline assembly and the rdtsc operation to query the high-frequency timer from the CPU directly, but this is problematic and requires frequent recalibration.
So I tried using the clock_gettime function instead to query CLOCK_PROCESS_CPUTIME_ID. The docs allege this gives me nanosecond timing, but I found that the overhead of a single call to clock_gettime() was over 250ns. That makes it impossible to time events 100ns long, and having such high overhead on the timer function seriously drags down app performance, distorting the profiles beyond value. (We have hundreds of thousands of profiling nodes per second.)
Is there a way to call clock_gettime() that has less than ¼μs overhead? Or is there some other way that I can reliably get the timestamp counter with <25ns overhead? Or am I stuck with using rdtsc?
Below is the code I used to time clock_gettime().
// calls gettimeofday() to return wall-clock time in seconds:
extern double Get_FloatTime();
enum { TESTRUNS = 1024*1024*4 };
// time the high-frequency timer against the wall clock
{
double fa = Get_FloatTime();
timespec spec;
clock_getres( CLOCK_PROCESS_CPUTIME_ID, &spec );
printf("CLOCK_PROCESS_CPUTIME_ID resolution: %ld sec %ld nano\n",
spec.tv_sec, spec.tv_nsec );
for ( int i = 0 ; i < TESTRUNS ; ++ i )
{
clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &spec );
}
double fb = Get_FloatTime();
printf( "clock_gettime %d iterations : %.6f msec %.3f microsec / call\n",
TESTRUNS, ( fb - fa ) * 1000.0, (( fb - fa ) * 1000000.0) / TESTRUNS );
}
// and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.
Results:
CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 3115.784947 msec 0.371 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2505.122119 msec 0.299 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2456.186031 msec 0.293 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2956.633930 msec 0.352 microsec / call
This is on a standard Ubuntu kernel. The app is a port of a Windows app (where our rdtsc inline assembly works just fine).
Addendum:
Does x86-64 GCC have some intrinsic equivalent to __rdtsc(), so I can at least avoid inline assembly?
No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.
Just port the rdtsc assembly you're using.
__inline__ uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ ( // serialize
"xorl %%eax,%%eax \n cpuid"
::: "%rax", "%rbx", "%rcx", "%rdx");
/* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
This is what happens when you call clock_gettime() function.
Based on the clock you choose it will call the respective function. (from vclock_gettime.c file from kernel)
int clock_gettime(clockid_t, struct __kernel_old_timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
notrace int
__vdso_clock_gettime_stick(clockid_t clock, struct __kernel_old_timespec *ts)
{
struct vvar_data *vvd = get_vvar_data();
switch (clock) {
case CLOCK_REALTIME:
if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
break;
return do_realtime_stick(vvd, ts);
case CLOCK_MONOTONIC:
if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
break;
return do_monotonic_stick(vvd, ts);
case CLOCK_REALTIME_COARSE:
return do_realtime_coarse(vvd, ts);
case CLOCK_MONOTONIC_COARSE:
return do_monotonic_coarse(vvd, ts);
}
/*
* Unknown clock ID ? Fall back to the syscall.
*/
return vdso_fallback_gettime(clock, ts);
}
CLOCK_MONITONIC better (though I use CLOCK_MONOTONIC_RAW) since it is not affected from NTP time adjustment.
This is how the do_monotonic_stick is implemented inside kernel:
notrace static __always_inline int do_monotonic_stick(struct vvar_data *vvar,
struct __kernel_old_timespec *ts)
{
unsigned long seq;
u64 ns;
do {
seq = vvar_read_begin(vvar);
ts->tv_sec = vvar->monotonic_time_sec;
ns = vvar->monotonic_time_snsec;
ns += vgetsns_stick(vvar);
ns >>= vvar->clock.shift;
} while (unlikely(vvar_read_retry(vvar, seq)));
ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
return 0;
}
And the vgetsns_stick() function which provides nano seconds resolution is implemented as:
notrace static __always_inline u64 vgetsns(struct vvar_data *vvar)
{
u64 v;
u64 cycles;
cycles = vread_tick();
v = (cycles - vvar->clock.cycle_last) & vvar->clock.mask;
return v * vvar->clock.mult;
}
Where the function vread_tick() reads the cycles from register based on the CPU:
notrace static __always_inline u64 vread_tick(void)
{
register unsigned long long ret asm("o4");
__asm__ __volatile__("rd %%tick, %L0\n\t"
"srlx %L0, 32, %H0"
: "=r" (ret));
return ret;
}
A single call to clock_gettime() takes around 20 to 100 nano seconds. reading the rdtsc register and converting the cycles to time is always faster.
I have done some experiment with CLOCK_MONOTONIC_RAW here: Unexpected periodic behaviour of an ultra low latency hard real time multi threaded x86 code
I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.
Have you considered oprofile or perf? You can use the performance counter hardware on your CPU to get profiling data without adding instrumentation to the code itself. You can see data per-function, or even per-line-of-code. The "only" drawback is that it won't measure wall clock time consumed, it will measure CPU time consumed, so it's not appropriate for all investigations.
Give clockid_t CLOCK_MONOTONIC_RAW a try?
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a
raw hardware-based time that is not subject to NTP
adjustments or the incremental adjustments performed by
adjtime(3).
From Man7.org
It's hard to give a globally applicable answer because the hardware and software implementation will vary widely.
However, yes, most modern platforms will have a suitable clock_gettime call that is implemented purely in user-space using the VDSO mechanism, and will in my experience take 20 to 30 nanoseconds to complete (but see Wojciech's comment below about contention).
Internally, this is using rdtsc or rdtscp for the fine-grained portion of the time-keeping, plus adjustments to keep this in sync with wall-clock time (depending on the clock you choose) and a multiplication to convert from whatever units rdtsc has on your platform to nanoseconds.
Not all of the clocks offered by clock_gettime will implement this fast method, and it's not always obvious which ones do. Usually CLOCK_MONOTONIC is a good option, but you should test this on your own system.
You are calling clock_getttime with control parameter which means the api is branching through if-else tree to see what kind of time you want. I know you cant't avoid that with this call, but see if you can dig into the system code and call what the kernal is eventually calling directly. Also, I note that you are including the loop time (i++, and conditional branch).
How do I manually convert jiffies to milliseconds and vice versa in Linux? I know kernel 2.6 has a function for this, but I'm working on 2.4 (homework) and though I looked at the code it uses lots of macro constants which I have no idea if they're defined in 2.4.
As a previous answer said, the rate at which jiffies increments is fixed.
The standard way of specifying time for a function that accepts jiffies is using the constant HZ.
That's the abbreviation for Hertz, or the number of ticks per second. On a system with a timer tick set to 1ms, HZ=1000. Some distributions or architectures may use another number (100 used to be common).
The standard way of specifying a jiffies count for a function is using HZ, like this:
schedule_timeout(HZ / 10); /* Timeout after 1/10 second */
In most simple cases, this works fine.
2*HZ /* 2 seconds in jiffies */
HZ /* 1 second in jiffies */
foo * HZ /* foo seconds in jiffies */
HZ/10 /* 100 milliseconds in jiffies */
HZ/100 /* 10 milliseconds in jiffies */
bar*HZ/1000 /* bar milliseconds in jiffies */
Those last two have a bit of a problem, however, as on a system with a 10 ms timer tick, HZ/100 is 1, and the precision starts to suffer. You may get a delay anywhere between 0.0001 and 1.999 timer ticks (0-2 ms, essentially). If you tried to use HZ/200 on a 10ms tick system, the integer division gives you 0 jiffies!
So the rule of thumb is, be very careful using HZ for tiny values (those approaching 1 jiffie).
To convert the other way, you would use:
jiffies / HZ /* jiffies to seconds */
jiffies * 1000 / HZ /* jiffies to milliseconds */
You shouldn't expect anything better than millisecond precision.
Jiffies are hard-coded in Linux 2.4. Check the definition of HZ, which is defined in the architecture-specific param.h. It's often 100 Hz, which is one tick every (1 sec/100 ticks * 1000 ms/sec) 10 ms.
This holds true for i386, and HZ is defined in include/asm-i386/param.h.
There are functions in include/linux/time.h called timespec_to_jiffies and jiffies_to_timespec where you can convert back and forth between a struct timespec and jiffies:
#define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)
static __inline__ unsigned long
timespec_to_jiffies(struct timespec *value)
{
unsigned long sec = value->tv_sec;
long nsec = value->tv_nsec;
if (sec >= (MAX_JIFFY_OFFSET / HZ))
return MAX_JIFFY_OFFSET;
nsec += 1000000000L / HZ - 1;
nsec /= 1000000000L / HZ;
return HZ * sec + nsec;
}
static __inline__ void
jiffies_to_timespec(unsigned long jiffies, struct timespec *value)
{
value->tv_nsec = (jiffies % HZ) * (1000000000L / HZ);
value->tv_sec = jiffies / HZ;
}
Note: I checked this info in version 2.4.22.
I found this sample code on kernelnewbies. Make sure you link with -lrt
#include <unistd.h>
#include <time.h>
#include <stdio.h>
int main()
{
struct timespec res;
double resolution;
printf("UserHZ %ld\n", sysconf(_SC_CLK_TCK));
clock_getres(CLOCK_REALTIME, &res);
resolution = res.tv_sec + (((double)res.tv_nsec)/1.0e9);
printf("SystemHZ %ld\n", (unsigned long)(1/resolution + 0.5));
return 0;
}
To obtain the USER_HZ value (see the comments under the accepted answer) using CLI:
getconf CLK_TCK