How does one do a "zero-syscall clock_gettime" without dynamic linking?


I ran the code below with strace. I can see it doesn't use a system call to get the time. After write only clock_nanosleep and exit_group are called. It correctly gives me 3 every run (I was expecting to occasionally get 2).
Musl .80 release says "zero-syscall clock_gettime support (dynamic-linked x86_64 only)"
Is there a way to do this without depending on being dynamically linked? Perhaps with cpuid? Compiling with -nostdlib says it depends on clock_gettime. Is there a way to implement CLOCK_REALTIME or CLOCK_MONOTONIC? I'd find using __rdtsc(p) acceptable if there's a way to get the CPU frequency so I can estimate how much time has passed. I don't need nanosecond precision.
#include <time.h>
#include <unistd.h> /* write, sleep */
int main() {
    struct timespec time, time2;
    write(1, "Test\n", 5);
    clock_gettime(CLOCK_REALTIME, &time);
    sleep(3);
    clock_gettime(CLOCK_REALTIME, &time2);
    /* Exit status is the number of whole seconds elapsed. */
    return time2.tv_sec - time.tv_sec;
}
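One way to approximate the rdtsc idea from the question is to calibrate the TSC once against a known clock and then convert tick deltas to nanoseconds. The sketch below is only an illustration under several assumptions: an x86 CPU whose TSC runs at a constant rate, GCC or Clang's __rdtsc() from <x86intrin.h>, and the helper names muldiv, ticks_to_ns and estimate_tsc_hz, which are made up here. It still calls clock_gettime and nanosleep for the one-off calibration, so it does not remove the libc dependency by itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h> /* __rdtsc(), x86 only */

/* a * b / c without overflowing 64 bits, as long as c * b fits. */
static uint64_t muldiv(uint64_t a, uint64_t b, uint64_t c) {
    return (a / c) * b + (a % c) * b / c;
}

/* Hypothetical helper: convert a TSC tick delta to nanoseconds,
   given the tick rate in Hz. Returns 0 if hz is 0. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t hz) {
    if (hz == 0) return 0;
    return muldiv(ticks, 1000000000ULL, hz);
}

/* Hypothetical helper: estimate TSC ticks per second by sleeping for
   sleep_ms milliseconds and comparing against CLOCK_MONOTONIC.
   Returns 0 on failure. */
uint64_t estimate_tsc_hz(long sleep_ms) {
    struct timespec t0, t1, req;
    uint64_t c0, c1;
    int64_t ns;

    req.tv_sec = sleep_ms / 1000;
    req.tv_nsec = (sleep_ms % 1000) * 1000000L;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) return 0;
    c0 = __rdtsc();
    nanosleep(&req, NULL);
    c1 = __rdtsc();
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0) return 0;

    ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
       + (t1.tv_nsec - t0.tv_nsec);
    if (ns <= 0) return 0;
    return muldiv(c1 - c0, 1000000000ULL, (uint64_t)ns);
}
```

A caller would run estimate_tsc_hz(50) once at startup, keep the result, and later pass differences of __rdtsc() readings to ticks_to_ns. A longer calibration sleep gives a steadier estimate, and the result is only meaningful if the TSC rate does not change with CPU frequency scaling.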

Related

Why does strace ignore some syscalls (randomly) depending on environment/kernel?

If I compile the following program:
$ cat main.cpp && g++ main.cpp
#include <time.h>
int main() {
struct timespec ts;
return clock_gettime(CLOCK_MONOTONIC, &ts);
}
and then run it under strace in "standard" Kubuntu, I get this:
strace -tt --trace=clock_gettime ./a.out
17:58:40.395200 +++ exited with 0 +++
As you can see, there is no clock_gettime (full strace output is here).
On the other hand, if I run the same app in my custom built linux kernel under qemu, I get the following output:
strace -tt --trace=clock_gettime ./a.out
18:00:53.082115 clock_gettime(CLOCK_MONOTONIC, {tv_sec=101481, tv_nsec=107976517}) = 0
18:00:53.082331 +++ exited with 0 +++
Which is more expected - there is clock_gettime.
So, my questions are:
Why does strace ignore/omit clock_gettime if I run it in Kubuntu?
Why does strace's behaviour differ depending on environment/kernel?

Answer to the first question
From the vdso(7) man page:
strace(1), seccomp(2), and the vDSO
When tracing systems calls with strace(1), symbols (system calls) that are exported by the vDSO will not appear in the trace output. Those system calls will likewise not be visible to seccomp(2) filters.
Answer to the second question:
In the vDSO, clock_gettime, gettimeofday and related functions are reliant on specific clock modes; see __arch_get_hw_counter.
If the clock mode is VCLOCK_TSC, the time is read without a syscall, using RDTSC; if it’s VCLOCK_PVCLOCK or VCLOCK_HVCLOCK, it’s read from a specific page to retrieve the information from the hypervisor. HPET doesn’t declare a clock mode, so it ends up with the default VCLOCK_NONE, and the vDSO issues a system call to retrieve the time.
And indeed:
In the default kernel (from Kubuntu):
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Custom built kernel:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
More info about various clock sources. In particular:
The documentation of Red Hat MRG version 2 states that TSC is the preferred clock source due to its much lower overhead, but it uses HPET as a fallback. A benchmark in that environment for 10 million event counts found that TSC took about 0.6 seconds, HPET took slightly over 12 seconds, and ACPI Power Management Timer took around 24 seconds.
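The sysfs check shown above can also be done from C. This is a minimal sketch; the helper names read_first_line and clocksource_is_tsc are made up for illustration, and the path is the one used in the shell commands above. Going by the explanation above, a value such as hpet suggests that clock_gettime will end up in a real system call, while tsc (or a paravirtual clock under a hypervisor) suggests it can be served from the vDSO.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of a text file into buf,
   without the trailing newline. Returns 0 on success, -1 on error. */
int read_first_line(const char *path, char *buf, size_t len) {
    FILE *f;
    if (len == 0) return -1;
    f = fopen(path, "r");
    if (f == NULL) return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the current clocksource is "tsc", 0 if it is something
   else, and -1 if the sysfs file cannot be read. */
int clocksource_is_tsc(void) {
    char name[64];
    const char *path =
        "/sys/devices/system/clocksource/clocksource0/current_clocksource";
    if (read_first_line(path, name, sizeof name) != 0) return -1;
    return strcmp(name, "tsc") == 0;
}
```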

It may be due to the fact that clock_gettime() is part of the optimized syscalls. Look at the vdso mechanism described in this answer.
Considering clock_gettime(), on some architectures (e.g. Linux on 32-bit ARM v7), only a subset of the available clock identifiers is supported in the vDSO implementation. For the others, there is a fallback to the actual system call. Here is the source code of the vDSO implementation of clock_gettime() in the Linux kernel for ARM v7 (file arch/arm/vdso/vgettimeofday.c in the source tree):
notrace int __vdso_clock_gettime(clockid_t clkid, struct timespec *ts)
{
    struct vdso_data *vdata;
    int ret = -1;
    vdata = __get_datapage();
    switch (clkid) {
    case CLOCK_REALTIME_COARSE:
        ret = do_realtime_coarse(ts, vdata);
        break;
    case CLOCK_MONOTONIC_COARSE:
        ret = do_monotonic_coarse(ts, vdata);
        break;
    case CLOCK_REALTIME:
        ret = do_realtime(ts, vdata);
        break;
    case CLOCK_MONOTONIC:
        ret = do_monotonic(ts, vdata);
        break;
    default:
        break;
    }
    if (ret)
        ret = clock_gettime_fallback(clkid, ts);
    return ret;
}
The above source code shows a switch/case with the supported clock identifiers and in the default case, there is a fallback to the actual system call.
On such an architecture, when tracing software like systemd, which uses clock_gettime() with CLOCK_MONOTONIC and CLOCK_BOOTTIME, strace only shows the calls with the latter identifier, as it is not among the cases supported by the vDSO implementation.
Cf. this link for reference
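To see the difference in strace for yourself, one option is to bypass the vDSO and enter the kernel directly with syscall(2). The sketch below assumes a 64-bit Linux system, where libc's struct timespec has the layout the kernel expects for SYS_clock_gettime; raw_clock_gettime is a name made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Invoke the clock_gettime system call directly instead of going
   through the libc wrapper, which may be served by the vDSO.
   A call made this way should always show up in strace output. */
int raw_clock_gettime(clockid_t clkid, struct timespec *ts) {
    return (int)syscall(SYS_clock_gettime, clkid, ts);
}
```

Running a program that calls both clock_gettime and raw_clock_gettime under strace --trace=clock_gettime should then list the raw call in every case, and the libc call only when the vDSO had to fall back.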

perf stat time elapsed for sys and user [duplicate]

$ time foo
real 0m0.003s
user 0m0.000s
sys 0m0.004s
$
What do real, user and sys mean in the output of time? Which one is meaningful when benchmarking my app?

Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by Real (which usually occurs). Note that in the output these figures include the User and Sys time of all child processes (and their descendants) as well when they could have been collected, e.g. by wait(2) or waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
Origins of the statistics reported by time (1)
The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait (2) (POSIX) or times (2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday (2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
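As a rough illustration of where the three numbers can come from, the following sketch runs a command in a child process, takes the wall clock before and after, and reads user and sys from the resource usage that wait4 reports. It is not the code of any real time implementation: time_command and struct cmd_times are hypothetical names, wait4 is a BSD/Linux extension rather than POSIX, and CLOCK_MONOTONIC is swapped in for the gettimeofday call mentioned above.

```c
#define _DEFAULT_SOURCE
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

struct cmd_times { double real, user, sys; }; /* all in seconds */

static double tv_to_s(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Run cmd[0] with arguments cmd (NULL-terminated), wait for it and
   fill *out. Returns the child's exit status, or -1 on error or if
   the child did not exit normally. */
int time_command(char *const cmd[], struct cmd_times *out) {
    struct timespec t0, t1;
    struct rusage ru;
    int status;
    pid_t pid;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) return -1;
    pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(cmd[0], cmd);
        _exit(127); /* only reached if exec failed */
    }
    if (wait4(pid, &status, 0, &ru) < 0) return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0) return -1;

    out->real = (double)(t1.tv_sec - t0.tv_sec)
              + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    out->user = tv_to_s(ru.ru_utime); /* CPU time in user mode */
    out->sys = tv_to_s(ru.ru_stime);  /* CPU time in kernel mode */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because real comes from a clock and user/sys come from kernel accounting, the three values are measured independently, which is one reason they need not add up for very short commands.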
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like malloc or fread/fwrite will invoke these kernel functions and that then will count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might also use malloc and the like in the background, which will again have some time in 'sys' then.

To expand on the accepted answer, I just wanted to provide another reason why real ≠ user + sys.
Keep in mind that real represents actual elapsed time, while user and sys values represent CPU execution time. As a result, on a multicore system, the user and/or sys time (as well as their sum) can actually exceed the real time. For example, on a Java app I'm running for class I get this set of values:
real 1m47.363s
user 2m41.318s
sys 0m4.013s

• real: The actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch
• user: The cumulative time spent by all the CPUs during the computation
• sys: The cumulative time spent by all the CPUs during system-related tasks such as memory allocation.
Notice that sometimes user + sys might be greater than real, as multiple processors may work in parallel.

Minimal runnable POSIX C examples
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep syscall
Non-busy sleep as done by the sleep syscall only counts in real, but not for user or sys.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>
int main(void) {
sleep(1);
return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
printf("%c\n", getchar());
return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait for about one second, it outputs something like the following, just like the sleep example:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
Multiple threads
The following example does niters iterations of useless purely CPU-bound work on nthreads threads:
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
uint64_t niters;
void* my_thread(void *arg) {
uint64_t *argument, i, result;
argument = (uint64_t *)arg;
result = *argument;
for (i = 0; i < niters; ++i) {
result = (result * result) - (3 * result) + 1;
}
*argument = result;
return NULL;
}
int main(int argc, char **argv) {
size_t nthreads;
pthread_t *threads;
uint64_t rc, i, *thread_args;
/* CLI args. */
if (argc > 1) {
niters = strtoll(argv[1], NULL, 0);
} else {
niters = 1000000000;
}
if (argc > 2) {
nthreads = strtoll(argv[2], NULL, 0);
} else {
nthreads = 1;
}
threads = malloc(nthreads * sizeof(*threads));
thread_args = malloc(nthreads * sizeof(*thread_args));
/* Create all threads */
for (i = 0; i < nthreads; ++i) {
thread_args[i] = i;
rc = pthread_create(
&threads[i],
NULL,
my_thread,
(void*)&thread_args[i]
);
assert(rc == 0);
}
/* Wait for all threads to complete */
for (i = 0; i < nthreads; ++i) {
rc = pthread_join(threads[i], NULL);
assert(rc == 0);
printf("%" PRIu64 " %" PRIu64 "\n", i, thread_args[i]);
}
free(threads);
free(thread_args);
return EXIT_SUCCESS;
}
GitHub upstream + plot code.
Then we plot wall, user and sys as a function of the number of threads for a fixed 10^10 iterations on my 8 hyperthread CPU:
Plot data.
From the graph, we see that:
for a CPU-intensive single-core application, wall and user are about the same
for 2 cores, user is about 2x wall, which means that user time is counted across all threads: user basically doubled, while wall stayed the same
this continues up to 8 threads, which matches the number of hyperthreads in my computer
after 8, wall starts to increase as well, because there are no extra CPUs to do more work in a given amount of time, and the user/wall ratio plateaus at this point
Note that this graph is only so clear and simple because the work is purely CPU-bound: if it were memory bound, then we would get a fall in performance much earlier, with fewer cores, because the memory accesses would be a bottleneck as shown at What do the terms "CPU bound" and "I/O bound" mean?
Quickly checking that wall < user is a simple way to determine that a program is multithreaded, and the closer that ratio is to the number of cores, the more effective the parallelization is, e.g.:
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Sys heavy work with sendfile
The heaviest sys workload I could come up with was to use sendfile, which does a file copy operation in kernel space: Copy a file in a sane, safe and efficient way
So I imagined that this in-kernel memcpy will be a CPU intensive operation.
First I initialize a large 10GiB random file with:
dd if=/dev/urandom of=sendfile.in.tmp bs=1K count=10M
Then run the code:
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char **argv) {
char *source_path, *dest_path;
int source, dest;
struct stat stat_source;
if (argc > 1) {
source_path = argv[1];
} else {
source_path = "sendfile.in.tmp";
}
if (argc > 2) {
dest_path = argv[2];
} else {
dest_path = "sendfile.out.tmp";
}
source = open(source_path, O_RDONLY);
assert(source != -1);
dest = open(dest_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
assert(dest != -1);
assert(fstat(source, &stat_source) != -1);
assert(sendfile(dest, source, 0, stat_source.st_size) != -1);
assert(close(source) != -1);
assert(close(dest) != -1);
return EXIT_SUCCESS;
}
GitHub upstream.
which gives mostly system time, as expected:
real 0m2.175s
user 0m0.001s
sys 0m1.476s
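One caveat about the sendfile program above: according to the sendfile(2) man page, a single call transfers at most 0x7ffff000 (2,147,479,552) bytes and returns the number of bytes actually written, so for very large files the call has to be repeated. The following is a minimal sketch of such a loop; sendfile_all and copy_file are names made up here, and this is not the code that produced the timings in this answer.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy up to size bytes from source to dest, repeating sendfile on
   short transfers. Returns the number of bytes copied, or -1. */
off_t sendfile_all(int dest, int source, off_t size) {
    off_t copied = 0;
    while (copied < size) {
        /* NULL offset: the kernel advances source's file position. */
        ssize_t n = sendfile(dest, source, NULL, (size_t)(size - copied));
        if (n < 0) return -1;
        if (n == 0) break; /* source ended early */
        copied += n;
    }
    return copied;
}

/* Copy a whole file by path. Returns bytes copied, or -1 on error. */
off_t copy_file(const char *src_path, const char *dst_path) {
    struct stat st;
    off_t copied = -1;
    int dst;
    int src = open(src_path, O_RDONLY);
    if (src < 0) return -1;
    dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (dst >= 0) {
        if (fstat(src, &st) == 0)
            copied = sendfile_all(dst, src, st.st_size);
        close(dst);
    }
    close(src);
    return copied;
}
```

Unlike the assert-based version, this variant keeps working when compiled with -DNDEBUG, because the system calls are not placed inside assert().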
I was also curious to see if time would distinguish between syscalls of different processes, so I tried:
time ./sendfile.out sendfile.in1.tmp sendfile.out1.tmp &
time ./sendfile.out sendfile.in2.tmp sendfile.out2.tmp &
And the result was:
real 0m3.651s
user 0m0.000s
sys 0m1.516s
real 0m4.948s
user 0m0.000s
sys 0m1.562s
The sys time is about the same for both as for a single process, but the wall time is larger, likely because the processes are competing for disk read access.
So it seems that it does in fact account for which process started a given kernel work.
Bash source code
When you do just time <cmd> on Ubuntu, it uses the Bash keyword, as can be seen from:
type time
which outputs:
time is a shell keyword
So we grep the Bash 4.19 source code for the output string:
git grep '"user\b'
which leads us to execute_cmd.c function time_command, which uses:
gettimeofday() and getrusage() if both are available
times() otherwise
all of which are Linux system calls and POSIX functions.
GNU time source code
If we call it as:
/usr/bin/time
then it uses the standalone GNU time implementation (which is packaged separately from GNU Coreutils).
This one is a bit more complex, but the relevant source seems to be at resuse.c and it does:
a non-POSIX BSD wait3 call if that is available
times and gettimeofday otherwise
1: https://i.stack.imgur.com/qAfEe.png

Real shows total turn-around time for a process;
while User shows the execution time for user-defined instructions
and Sys is for time for executing system calls!
Real time includes the waiting time also (the waiting time for I/O etc.)

In very simple terms, I like to think about it like this:
real is the actual amount of time it took to run the command (as if you had timed it with a stopwatch)
user and sys are how much 'work' the CPU had to do to execute the command. This 'work' is expressed in units of time.
Generally speaking:
user is how much work the CPU did to run the command's code
sys is how much work the CPU had to do to handle 'system overhead' type tasks (such as allocating memory, file I/O, etc.) in order to support the running command
Since these last two times are counting 'work' done, they don't include time a thread might have spent waiting (such as waiting on another process or for disk I/O to finish).
real, however, is a measure of actual runtime and not 'work', so it does include any time spent waiting (which is why sometimes real > user+sys).
And finally, sometimes the reverse is true (user+sys > real) for multi-threaded applications. This also arises because we are comparing 'work-time' to actual time. For example, if 3 processors each run continuously for 10 minutes to execute a command, you will get real = 10m but user = 30m.

I want to mention another scenario in which real is much bigger than user + sys. I've created a simple server which responds after a long time:
real 4.784
user 0.01s
sys 0.01s
The point is that in this scenario the process spends its time waiting for the response, and that waiting is counted neither as user nor as sys time.
Something similar happens when you run the find command. In that case, the time is spent mostly on requesting and getting a response from the SSD.

I must mention that, at least on my AMD Ryzen CPU, user is always larger than real in a multi-threaded program (or a single-threaded program compiled with -O3).
e.g.
real 0m5.815s
user 0m8.213s
sys 0m0.473s

Set linux time to millisecond precision

I have an embedded Linux device that interfaces with another "master" device over a serial comm protocol. Periodically the master passes its date down to the slave device, because later the slave will return information to the master that needs to be accurately timestamped. However, the Linux 'date' command only sets the system date to within a second accuracy. This isn't enough for our uses.
Does anybody know how to set a Linux machine's time more precisely than 1 second?

The settimeofday(2) method given in other answers has a serious problem: it does exactly what you say you want. :)
The problem with directly changing a system's time, instantaneously, is that it can confuse programs that get the time of day before and after the change if the adjustment was negative. That is, they can perceive time to go backwards.
The fix for this is adjtime(3) which is simple and portable, or adjtimex(2) which is complicated, powerful and Linux-specific. Both of these calls use sophisticated algorithms to slowly adjust the system time over some period, forward only, until the desired change is achieved.
By the way, are you sure you aren't reinventing the wheel here? I recommend that you read Julien Ridoux and Darryl Veitch's ACM Queue paper Principles of Robust Timing over the Internet. You're working on embedded systems, so I would expect the ringing in Figure 5 to give you cold shivers. Can you say "damped oscillator?" adjtime() and adjtimex() use this troubled algorithm, so in some sense I am arguing against my own advice above, but the Mills algorithm is still better than the step adjustment non-algorithm. If you choose to implement RADclock instead, so much the better.

The settimeofday() system call takes and uses microsecond precision. You'll have to write a short program to use it, but that is quite straightforward.
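#include <errno.h>
#include <stdio.h>
#include <sys/time.h>
/* sec: a time_t value; usec: the number of microseconds after the second */
int set_system_time(time_t sec, suseconds_t usec) {
    struct timeval tv = { .tv_sec = sec, .tv_usec = usec };
    int rc = settimeofday(&tv, NULL);
    if (rc)
        fprintf(stderr, "error %d setting system time\n", errno);
    return rc;
}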

You can use the settimeofday(2) system call; the interface supports microsecond resolution.
#include <sys/time.h>
int gettimeofday(struct timeval *tv, struct timezone *tz);
int settimeofday(const struct timeval *tv, const struct timezone *tz);
struct timeval {
time_t tv_sec; /* seconds */
suseconds_t tv_usec; /* microseconds */
};
You can use the clock_settime(2) system call; the interface provides multiple clocks and the interface supports nanosecond resolution.
#include <time.h>
int clock_getres(clockid_t clk_id, struct timespec *res);
int clock_gettime(clockid_t clk_id, struct timespec *tp);
int clock_settime(clockid_t clk_id, const struct timespec *tp);
struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
CLOCK_REALTIME
System-wide real-time clock. Setting this clock
requires appropriate privileges.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time
since some unspecified starting point.
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a
raw hardware-based time that is not subject to NTP
adjustments.
CLOCK_PROCESS_CPUTIME_ID
High-resolution per-process timer from the CPU.
CLOCK_THREAD_CPUTIME_ID
Thread-specific CPU-time clock.
This interface provides the nicety of the clock_getres(2) call, which can tell you exactly what the resolution is -- just because the interface accepts nanoseconds doesn't mean it can actually support nanosecond-resolution. (I've got a fuzzy memory that 20 ns is about the limits of many systems but no references to support this.)
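A short sketch of how these calls might be combined for the millisecond use case in the question: query the resolution with clock_getres, then step CLOCK_REALTIME from a millisecond timestamp. The function names are invented for this example, and the timestamp is assumed to be non-negative milliseconds since the epoch as received from the master device.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Resolution of a clock in nanoseconds, or -1 on error. */
int64_t clock_resolution_ns(clockid_t clk) {
    struct timespec res;
    if (clock_getres(clk, &res) != 0) return -1;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Convert non-negative milliseconds since the epoch to a timespec. */
struct timespec epoch_ms_to_timespec(int64_t epoch_ms) {
    struct timespec ts;
    ts.tv_sec = (time_t)(epoch_ms / 1000);
    ts.tv_nsec = (long)(epoch_ms % 1000) * 1000000L;
    return ts;
}

/* Step the system clock to epoch_ms. Requires appropriate
   privileges; returns 0 on success, -1 with errno set otherwise. */
int set_realtime_ms(int64_t epoch_ms) {
    struct timespec ts = epoch_ms_to_timespec(epoch_ms);
    return clock_settime(CLOCK_REALTIME, &ts);
}
```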

If you're running an IP-capable networking protocol over the serial link (something like, ooh, PPP for example), you can just run an ntpd on the "master" host, then sync time using ntpd or ntpdate on the embedded device. NTP will take care of you.

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this?
Edit: I'm trying to develop a test of whether a kernel module clears a single cache line or the whole processor cache (with wbinvd()). The program runs as root, but I'd prefer to stay in user space if possible.

Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
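#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Evict the cache line containing p from the cache hierarchy. */
static inline void
clflush(volatile void *p)
{
    __asm__ volatile ("clflush (%0)" :: "r"(p) : "memory");
}

/* Read the CPU timestamp counter (low half in eax, high half in edx). */
static inline uint64_t
rdtsc(void)
{
    unsigned long a, d;
    __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

volatile int i;

/* Time a single read of i, in TSC ticks. */
static inline void
test(void)
{
    uint64_t start, end;
    volatile int j;
    start = rdtsc();
    j = i;
    end = rdtsc();
    (void)j;
    printf("took %" PRIu64 " ticks\n", end - start);
}

int
main(void)
{
    test();
    test();
    printf("flush: ");
    clflush(&i);
    test();
    test();
    return 0;
}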

I don't know of any generic command to get the cache state, but there are ways:
1. I guess this is the easiest: if you have your kernel module, just disassemble it and look for cache invalidation/flushing instructions (at the moment just three come to mind: WBINVD, CLFLUSH, INVD).
2. You said it is for i386, but I guess you don't mean an 80386. The problem is that there are many different processors with different extensions and features. E.g. the newest Intel series has some performance/profiling registers for the cache system, which you can use to evaluate cache misses/hits/number of transfers and similar.
3. Similar to 2, and very dependent on the system you have: when you have a multiprocessor configuration, you could observe the first processor via the cache coherence protocol (MESI) from the second.
You mentioned WBINVD: as far as I know, that will always flush the complete cache, i.e. all cache lines.

It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

What is the Linux version of GetTickCount? [duplicate]

I'm looking for an equivalent to GetTickCount() on Linux.
Presently I am using Python's time.time() which presumably calls through to gettimeofday(). My concern is that the time returned (the unix epoch), may change erratically if the clock is messed with, such as by NTP. A simple process or system wall time, that only increases positively at a constant rate would suffice.
Does any such time function in C or Python exist?

You can use CLOCK_MONOTONIC e.g. in C:
struct timespec ts;
if(clock_gettime(CLOCK_MONOTONIC,&ts) != 0) {
//error
}
See this question for a Python way - How do I get monotonic time durations in python?

This seems to work:
#include <stdint.h> /* uint32_t */
#include <unistd.h>
#include <time.h>
uint32_t getTick() {
    struct timespec ts;
    unsigned theTick = 0U;
    /* Note: CLOCK_REALTIME follows wall-clock adjustments. */
    clock_gettime( CLOCK_REALTIME, &ts );
    theTick = ts.tv_nsec / 1000000;
    theTick += ts.tv_sec * 1000;
    return theTick;
}
Yes, getTick() is the backbone of my applications, which consist of one state machine for each 'task'.
That way they can multitask without using threads or inter-process communication, and can implement non-blocking delays.
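A non-blocking delay of the kind described here can be sketched as follows. The struct and function names are hypothetical, and the current tick is passed in explicitly so the logic does not depend on which clock supplies it. Using unsigned subtraction keeps the comparison correct when the 32-bit millisecond tick wraps around.

```c
#include <stdbool.h>
#include <stdint.h>

/* State for one pending delay, in millisecond ticks. */
struct delay {
    uint32_t start_ms;
    uint32_t length_ms;
};

/* Arm the delay at the current tick. */
void delay_start(struct delay *d, uint32_t now_ms, uint32_t length_ms) {
    d->start_ms = now_ms;
    d->length_ms = length_ms;
}

/* Poll from the task's state machine; never blocks.
   The unsigned difference is wraparound-safe. */
bool delay_expired(const struct delay *d, uint32_t now_ms) {
    return (uint32_t)(now_ms - d->start_ms) >= d->length_ms;
}
```

Each task would call delay_start once when entering a waiting state and then check delay_expired on every pass of the main loop, moving to its next state only when it returns true.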

You should use: clock_gettime(CLOCK_MONOTONIC, &tp);. This call is not affected by the adjustment of the system time just like GetTickCount() on Windows.
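A small wrapper along these lines gives a millisecond tick similar in spirit to GetTickCount(). This is a sketch with an invented function name, monotonic_ms. On Linux, CLOCK_MONOTONIC does not advance while the system is suspended; CLOCK_BOOTTIME is the Linux-specific clock that does, if that matters for your use.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Milliseconds from an unspecified starting point, unaffected by
   changes to the wall clock. Returns 0 if the clock cannot be read. */
uint64_t monotonic_ms(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) return 0;
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}
```

Returning a 64-bit value avoids the roughly 49.7-day wraparound that a 32-bit millisecond counter has.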

Yes, the kernel has high-resolution timers, but the interface is different. I would recommend that you look at the sources of any project that wraps this in a portable manner.
From C/C++ I usually #ifdef this and use gettimeofday() on Linux, which gives me microsecond resolution. I often add the microseconds as a fraction to the seconds since the epoch that I also receive, which gives me a double.
