How to implement a sleep utility in RISC-V?

I want to implement a sleep utility that takes a number of seconds as input and pauses for that many seconds on xv6, an educational operating system that runs on RISC-V processors.
The OS already has a system call that takes a number of ticks and pauses: https://github.com/mit-pdos/xv6-riscv/blob/riscv/kernel/sysproc.c#L56
Timers are initialized using a timer vector: https://github.com/mit-pdos/xv6-riscv/blob/riscv/kernel/kernelvec.S#L93
The timer vector programs CLINT_MTIMECMP, which tells the timer controller when to raise the next interrupt.
What I do not understand is how to determine the time between ticks, i.e. how many ticks occur in one second.

Edit: A quick Google search for "qemu timebase riscv mtime" turned up a Google Groups thread which states that RDTIME returns nanoseconds since boot and that mtime is an emulated 10 MHz clock.
I haven't searched for the exact information you need, but I have some context that should help you find it. I would recommend searching the QEMU documentation and code (for example with GitHub search) for how mtime and mtimecmp work.
Section 10.1 (Counters, Base Counters and Timers) of the unprivileged specification [1] explains that the RDTIME pseudo-instruction should have some fixed tick rate that can be determined from the implementation [2]. That tick rate is also shared by mtimecmp and mtime, as defined in the privileged specification [3].
I would presume the ticks used by the sleep system call are the same as these ticks from the specifications. In that case xv6, being just a kernel, would not itself define how many ticks there are per second. xv6 appears to be made to run on top of QEMU, so the number of ticks per second should be defined somewhere in the QEMU code and might be documented.
From the old QEMU RISC-V wiki it should be clear that the SiFive CLINT defines the features xv6 needs to work, but I doubt it specifies how to determine the tick rate. Spike also supports the CLINT interface, so it may be instructive to search Spike's code for how it handles it.
[1] I used version 20191213 of the unprivileged specification as a reference.
[2] "The RDTIME pseudoinstruction reads the low XLEN bits of the time CSR, which counts wall-clock real time that has passed from an arbitrary start time in the past. RDTIMEH is an RV32I-only instruction that reads bits 63–32 of the same real-time counter. The underlying 64-bit counter should never overflow in practice. The execution environment should provide a means of determining the period of the real-time counter (seconds/tick). The period must be constant. The real-time clocks of all harts in a single user application should be synchronized to within one tick of the real-time clock. The environment should provide a means to determine the accuracy of the clock."
[3] Section 3.1.10, Machine Timer Registers (mtime and mtimecmp): "Platforms provide a real-time counter, exposed as a memory-mapped machine-mode read-write register, mtime. mtime must run at constant frequency, and the platform must provide a mechanism for determining the timebase of mtime."
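If the kernel counts one tick per timer interrupt and re-arms mtimecmp a fixed number of cycles ahead each time, then ticks per second is simply the timer frequency divided by that interval. The following is a minimal sketch of the seconds-to-ticks conversion under two assumptions that you should verify yourself: the 10 MHz mtime frequency comes from the mailing-list claim quoted above, and the interval of 1,000,000 cycles is what I recall from timerinit() in xv6's kernel/start.c. The names MTIME_HZ, TICK_INTERVAL_CYCLES, ticks_per_second and seconds_to_ticks are hypothetical and not part of xv6.

```c
#include <stdint.h>

/* Assumption: QEMU's emulated mtime runs at 10 MHz. */
#define MTIME_HZ 10000000ULL

/* Assumption: cycles between timer interrupts, as recalled from
 * timerinit() in xv6's kernel/start.c; check your own tree. */
#define TICK_INTERVAL_CYCLES 1000000ULL

/* Kernel ticks per second = timer frequency / cycles per tick. */
static inline uint64_t ticks_per_second(void)
{
    return MTIME_HZ / TICK_INTERVAL_CYCLES;
}

/* Convert a duration in whole seconds to kernel ticks. */
static inline uint64_t seconds_to_ticks(uint64_t seconds)
{
    return seconds * ticks_per_second();
}
```

Under these assumptions a user-space sleep utility would parse its argument, call seconds_to_ticks() on it, and pass the result to the existing sleep system call.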

Related

How accurate is the Linux bash time command?

I want to timestamp some events in a logfile from a bash script. I need this timestamp to be as accurate as possible. I see that the standard way of doing this from bash seems to be the time command, which can produce a nanoseconds timestamp with the +%s%N option.
However, when doing this from C I remembered that multiple timekeeping functions had multiple clock sources, and not all of them were equally accurate or had the same guarantees (e.g. being monotonic). How do I know what clock source time uses?
The man 1 time is rather clear:
These statistics consist of (i) the elapsed real time between
invocation and termination, (ii) the user CPU time (the sum of the tms_utime and tms_cutime values in a struct tms as returned by
times(2)), and (iii) the system CPU time (the sum of the tms_stime and tms_cstime values in a struct tms as returned by
times(2)).
So we can go to man 3p times, which just states: "The accuracy of the times reported is intentionally left unspecified to allow implementations flexibility in design, from uniprocessor to multi-processor networks." So we can go to man 2 times and learn that it is all measured with clock_t, and that maybe we should use clock_gettime instead.
How do I know what clock source time uses?
As usual on a GNU system, all the programs are open source, so you can download the sources of the kernel and your shell and inspect them to see how it works. In bash's time_command() I see there are several methods available, and nowadays bash uses rusage as a replacement for times.
How accurate is the Linux bash time command?
Both getrusage() and times() are system calls themselves, so the values are returned straight from the kernel. My guess is that they are measured with the accuracy the kernel can give us, that is, in jiffies (1/HZ seconds).
The resolution of the measurement will be one jiffy, so typically, with HZ=300, that's 3.333 ms if my math is right. The accuracy will depend on your hardware, and maybe also on the workload. My conservative guess is that the values are accurate to within one or two jiffies, so to within ~7 milliseconds.
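If the timestamps can be taken from a small C helper rather than from the shell, clock_gettime lets you pick the clock source explicitly, which is the point raised in the question. The following is a minimal sketch, assuming a POSIX system that provides CLOCK_MONOTONIC; the helper names monotonic_ns and monotonic_resolution_ns are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC and return it in nanoseconds; 0 on failure. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Resolution the system reports for CLOCK_MONOTONIC, in nanoseconds;
 * 0 on failure. */
static uint64_t monotonic_resolution_ns(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
        return 0;
    return (uint64_t)res.tv_sec * 1000000000ULL + (uint64_t)res.tv_nsec;
}
```

The reported resolution is what the clock claims, not a measurement of how accurate a given timestamp is, so it is best treated as a lower bound on the error.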

Is clock_nanosleep affected by adjtime and NTP?

Usually CLOCK_MONOTONIC_RAW is used for obtaining a clock that is not affected by NTP or adjtime(). However clock_nanosleep() doesn't support CLOCK_MONOTONIC_RAW and trying to use it anyway will result in return code 95 Operation not supported (Kernel 4.6.0).
Does clock_nanosleep() somehow take these clock adjustments into account or will the sleep time be affected by it?
What are the alternatives if a sleeping time is required which should not be affected by clock adjustments?
CLOCK_MONOTONIC_RAW has never had support for clock_nanosleep() since it was introduced in Linux 2.6.28. It was also explicitly fixed to not have this support in 2.6.32 because of oopses. The code has been refactored several times since then, but there is still no support for CLOCK_MONOTONIC_RAW in clock_nanosleep(), and I wasn't able to find any comments on why that is.
At the very minimum, the fact that there was a patch that explicitly disabled this functionality and it passed all reviews tells us that it doesn't look like a big problem for kernel developers. So, at the moment (4.7) the only things CLOCK_MONOTONIC_RAW supports are clock_getres() and clock_gettime().
Speaking of adjustments, as already noted by Rich, CLOCK_MONOTONIC is subject to rate adjustments just by the nature of this clock. This happens because hrtimer_interrupt() runs its queues with an adjusted monotonic time value (ktime_get_update_offsets_now() -> timekeeping_get_ns() -> timekeeping_delta_to_ns(), which operates on xtime_nsec, which is subject to adjustment). Actually, having looked at this code, I am no longer surprised that CLOCK_MONOTONIC_RAW has no support for clock_nanosleep() (and probably won't have it in future): the adjusted monotonic clock seems to be the basis for hrtimers.
As for alternatives, I think there are none. nanosleep() uses the same CLOCK_MONOTONIC, setitimer() has its own set of timers, and alarm() uses ITIMER_REAL (same as setitimer()), which (with some indirection) is also our good old friend CLOCK_MONOTONIC. What else do we have? I guess nothing.
As an unrelated side note, there is an interesting observation: if you call clock_nanosleep() for a relative interval (that is, without TIMER_ABSTIME), then CLOCK_REALTIME actually becomes a synonym for CLOCK_MONOTONIC.
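For reference, a relative sleep on CLOCK_MONOTONIC can be sketched as below. This does not avoid the rate adjustments discussed above; it only shows the call shape and the usual retry on signal interruption. The function name sleep_monotonic_ns is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep for `ns` nanoseconds on CLOCK_MONOTONIC, resuming after signals.
 * Returns 0 on success or an error number; clock_nanosleep() returns the
 * error directly instead of setting errno. */
static int sleep_monotonic_ns(long long ns)
{
    struct timespec req = {
        .tv_sec  = ns / 1000000000LL,
        .tv_nsec = ns % 1000000000LL,
    };
    struct timespec rem;
    int rc;

    /* flags == 0 means a relative interval (no TIMER_ABSTIME). On EINTR
     * the remaining time is written to `rem`, so continue from there. */
    while ((rc = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem)) == EINTR)
        req = rem;
    return rc;
}
```

For periodic work, an absolute deadline with TIMER_ABSTIME is usually preferable, since retrying a relative sleep accumulates drift.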

How to Configure and Sample Intel Performance Counters In-Process

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
    pctr_start = sample_pctr();
    the_benchmark();
    pctr_stop = sample_pctr();
    results[iteration] = pctr_stop - pctr_start;
}
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (In an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all).
My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().
I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.
Does anyone know a lightweight way to ask the kernel to configure a performance counter from user space, so that I can then use RDPMC from within my benchmark harness?
Stuff I've looked into/thought about:
Perf tools for Linux. Seems to be geared up for sampling over the whole lifetime of a process, not within a process at specific points (before and after each iteration).
Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
PAPI builds on perf, so probably inherits the above problem.
Write a kernel module -- too much effort, too error prone.
Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.
Any help is most appreciated. Thanks.
EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.
It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.
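The open/seek/read sequence described above can be sketched in C as follows. This assumes the Linux msr driver is loaded and exposes /dev/cpu/N/msr, and that the process has sufficient privileges; the helper names msr_path and read_msr are made up for illustration. A write would use pwrite() the same way on a descriptor opened for writing.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Build the device path for one CPU's MSR node, e.g. /dev/cpu/0/msr.
 * Returns the length snprintf() reports. */
static int msr_path(char *buf, size_t len, int cpu)
{
    return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}

/* Read the 64-bit MSR at address `reg` on `cpu`.
 * Returns 0 on success, -1 on any failure (no msr device, insufficient
 * privileges, unknown register, ...). */
static int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
    char path[64];
    msr_path(path, sizeof path, cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset selects the MSR address; each access is 8 bytes. */
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```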

Lowering linux kernel timer frequency

When I run my Virtual Machine with Gentoo as guest, I have found that there is considerable overhead coming from tick_periodic function. (This is the function which runs on every timer interrupt.) This function updates a global jiffy using write_seqlocks which leads to the overhead.
Here's a grep of HZ and relevant stuff in my kernel config file.
sharan013@sitmac4:~$ cat /boot/config | egrep 'HZ|TIME'
# CONFIG_RCU_FAST_NO_HZ is not set
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_MACHZ_WDT is not set
CONFIG_TIMERFD=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_X86_CYCLONE_TIMER=y
CONFIG_HPET_TIMER=y
Clearly it has set the configuration to 1000, but when I do sysconf(_SC_CLK_TCK), I get 100 as my timer frequency. So what is my system's timer frequency?
What I want to do is bring the frequency down to 100, or even lower if possible. Although it might affect interactivity, the precision of poll/select, and the scheduler's time slice, I am ready to sacrifice these things for fewer timer interrupts, as that will speed up the VM.
When I tried to find out what has to be done, I read in one place that you can do it by changing the kernel configuration file; elsewhere I read that adding divider=10 to the boot parameters does the job; and elsewhere again I read that none of this is needed, because setting CONFIG_HIGH_RES_TIMERS achieves low-latency timers without increasing the timer frequency, and the same is possible with a tickless system (CONFIG_NO_HZ).
I am extremely confused about what the right approach is.
All I want is to bring the timer interrupt rate down as low as possible.
What is the right way of doing this?
Don't worry! Your confusion is only to be expected. Linux timer interrupts are very confusing and have had a long and quite exciting history.
CLK_TCK
Linux has no sysconf system call and glibc is just returning the constant value 100. Sorry.
HZ <-- what you probably want
When configuring your kernel you can choose a timer frequency of either 100Hz, 250Hz, 300Hz or 1000Hz. All of these are supported, and although 1000Hz is the default it's not always the best.
People will generally choose a high value when they value latency (a desktop or a webserver) and a low value when they value throughput (HPC).
CONFIG_HIGH_RES_TIMERS
This has nothing to do with timer interrupts, it's just a mechanism that allows you to have higher resolution timers. This basically means that timeouts on calls like select can be more accurate than 1/HZ seconds.
divider
This command line option is a patch provided by Red Hat. You can probably use this (if you're using Red Hat or CentOS), but I'd be careful. It has caused lots of bugs, and you should probably just recompile with a different HZ value.
CONFIG_NO_HZ
This really doesn't do much, it's for power saving and it means that the ticks will stop (or at least become less frequent) when nothing is executing. This is probably already enabled on your kernel. It doesn't make any difference when at least one task is runnable.
Frederic Weisbecker actually has a patch pending which generalizes this to cases where only a single task is running, but it's a little way off yet.
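To see for yourself the value the question got from sysconf(_SC_CLK_TCK), a minimal sketch is shown below. The number is the unit used for clock_t values reported to user space and need not match the kernel's CONFIG_HZ, which is why the two differ in the question. The helper name user_clock_ticks is hypothetical.

```c
#include <unistd.h>

/* Ticks per second used for clock_t values reported to user space
 * (for example by times(2)); not necessarily the kernel's CONFIG_HZ.
 * Returns -1 if the value cannot be determined. */
static long user_clock_ticks(void)
{
    return sysconf(_SC_CLK_TCK);
}
```

To find the kernel's actual tick rate, the CONFIG_HZ line in the kernel configuration (as grepped in the question) is the thing to check.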

Linux: How to determine the time taken by each function in a C program

I want to check the time taken by each function, and the system calls made by each function, in my project. My code runs partly in user space and partly in kernel space, so I need the time taken in both. I am interested in performance in terms of CPU time and disk I/O. Should I use a profiler tool? If yes, which one is preferable? What other options do I have?
Please help,
Thanks
For kernel-level profiling, the time taken by some instructions or functions can be measured in CPU clock ticks. The number of clock ticks used by a given task can be measured from kernel code like this:
#include <asm/msr.h>   /* rdtscl() on older x86 kernels */
unsigned long ini, end;
rdtscl(ini);           /* low 32 bits of the TSC before the code */
/* ... your code ... */
rdtscl(end);           /* low 32 bits of the TSC after the code */
printk("time lapse in cpu clock ticks: %lu\n", end - ini);
For more details see http://www.xml.com/ldd/chapter/book/ch06.html
If your code takes longer to run, you can also use jiffies effectively.
For user-space profiling you can use the various timing functions which give the time with nanosecond resolution, or oprofile (http://oprofile.sourceforge.net/about/); see also "Timer function to provide time in nano seconds using C++".
For kernel-space function tracing and profiling (which includes a call-graph format and the time taken by individual functions), consider using the Ftrace framework.
Specifically for function profiling (within the kernel), enable the CONFIG_FUNCTION_PROFILER kernel config: under Kernel Hacking / Tracing / Kernel function profiler.
Its help text:
CONFIG_FUNCTION_PROFILER:
This option enables the kernel function profiler. A file is created
in debugfs called function_profile_enabled which defaults to zero.
When a 1 is echoed into this file profiling begins, and when a
zero is entered, profiling stops. A "functions" file is created in
the trace_stats directory; this file shows the list of functions that
have been hit and their counters.
Some resources:
Documentation/trace/ftrace.txt
Secrets of the Ftrace function tracer
Using ftrace to Identify the Process Calling a Kernel Function
Well, I only develop in user space, so I don't know how much this will help you with disk I/O or kernel-space profiling, but I have profiled a lot with oprofile.
I haven't used it in a while, so I cannot give you a step-by-step guide, but you may find more information here:
http://oprofile.sourceforge.net/doc/results.html
Usually this helped me find my problems.
You may have to play a bit with the opreport output to get the results you want.

Resources