What caused the performance degradation when reading a locally modified cache line with concurrent readers accessing it? - multithreading

Assume that we have multiple threads accessing the same cache line in parallel. One of them (the writer) repeatedly writes to that cache line and reads from it. The other threads (the readers) only repeatedly read from it. I understand that the readers suffer performance degradation, since the invalidation-based MESI protocol requires the readers' copies to be invalidated before the writer can write to the line, and this happens frequently. But I would expect the writer's reads of that cache line to be fast, because a local write does not cause such an invalidation.
However, the strange thing is that when I run this experiment on a dual-socket machine with two Intel Xeon Scalable Gold 5220R processors (24 cores each) running at 2.20GHz, the writer's read of that cache line becomes a performance bottleneck.
This is my test program (compiled with gcc 8.4.0, -O2):
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

#define CACHELINE_SIZE 64

volatile struct {
    /* cacheline 1 */
    size_t x, y;
    char padding[CACHELINE_SIZE - 2 * sizeof(size_t)];
    /* cacheline 2 */
    size_t p, q;
} __attribute__((aligned(CACHELINE_SIZE))) data;

static inline void bind_core(int core) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if ((sched_setaffinity(0, sizeof(cpu_set_t), &mask)) != 0) {
        perror("bind core failed\n");
    }
}
#define smp_mb() asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")

void *writer_work(void *arg) {
    long id = (long) arg;
    int i;
    bind_core(id);
    printf("writer tid: %ld\n", syscall(SYS_gettid));
    while (1) {
        /* read after write */
        data.x = 1;
        data.y;
        for (i = 0; i < 50; i++) __asm__("nop"); // to highlight bottleneck
    }
}

void *reader_work(void *arg) {
    long id = (long) arg;
    bind_core(id);
    while (1) {
        /* read */
        data.y;
    }
}
#define NR_THREAD 48

int main() {
    pthread_t threads[NR_THREAD];
    int i;
    printf("%p %p\n", &data.x, &data.p);
    data.x = data.y = data.p = data.q = 0;
    pthread_create(&threads[0], NULL, writer_work, (void *) 0);
    for (i = 1; i < NR_THREAD; i++) {
        pthread_create(&threads[i], NULL, reader_work, (void *) (long) i);
    }
    for (i = 0; i < NR_THREAD; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
I use perf record -t <tid> to collect the cycles event for the writer thread, and perf annotate writer_work to show the per-instruction breakdown inside writer_work:
: while (1) {
: /* read after write */
: data.x = 1;
0.20 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
: data.y;
94.40 : a5b: mov 0x200626(%rip),%rax # 201088 <data+0x8>
0.03 : a62: mov $0x32,%eax
0.00 : a67: nopw 0x0(%rax,%rax,1)
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a70: nop
0.03 : a71: sub $0x1,%eax
5.17 : a74: jne a70 <writer_work+0x50>
0.15 : a76: jmp a50 <writer_work+0x30>
It seems that the load instruction for data.y becomes the bottleneck.
First, I think the performance degradation is related to the cache-line modification, because when I comment out the writer's store, the perf result no longer shows the writer's read as the bottleneck. Second, it must have something to do with the concurrent readers, because when I comment out the readers' loads, the writer's read is no longer blamed by perf. It is specifically the readers hitting the same cache line that slows the writer down: when I change the writer's sum += data.y to sum += data.z, which turns it into a load from another cache line, the sample count decreases (a sketch of this variant is below).
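A minimal sketch of that last variant (the struct shown above has no sum or data.z, so here the load simply targets data.p on the second cache line instead; this is illustrative, not the exact code I ran):
void *writer_work_other_line(void *arg) {
    long id = (long) arg;
    int i;
    bind_core(id);
    while (1) {
        data.x = 1;   /* still stores to the contended cache line */
        data.p;       /* the load now targets cacheline 2, which the readers never touch */
        for (i = 0; i < 50; i++) __asm__("nop");
    }
}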
There is another possibility, suggested by Peter Cordes: that it is whatever instruction follows the store that gets the blame. However, when I move the nop loop before the load, perf still blames the load instruction.
: writer_work():
: while (1) {
: /* read after write */
: data.x = 1;
0.00 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
0.03 : a5b: mov $0x32,%eax
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a60: nop
0.09 : a61: sub $0x1,%eax
6.24 : a64: jne a60 <writer_work+0x40>
: data.y;
93.60 : a66: mov 0x20061b(%rip),%rax # 201088 <data+0x8>
: data.x = 1;
0.02 : a6d: jmp a50 <writer_work+0x30>
So my question is: what caused the performance degradation when reading a locally modified cache line with concurrent readers accessing it? Any help would be appreciated!

Related

VMX performance issue with rdtsc (no rdtsc exiting, using rdtsc offsetting)

I am working on a Linux kernel module (a VMM) to test Intel VMX, running a self-made VM (the VM starts in real mode, then switches to 32-bit protected mode with paging enabled).
The VMM is configured NOT to use RDTSC exiting, and to use RDTSC offsetting.
The VM then runs rdtsc to check the performance, like below.
static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile(
        "cpuid"
        : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
        : "a"(code)
        : "cc");
}

uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    // RDTSC copies contents of 64-bit TSC into EDX:EAX
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

void i386mode_tests(void)
{
    u32 eax, ebx, ecx, edx;
    u32 i = 0;

    asm ("mov %%cr0, %%eax\n"
         "mov %%eax, %0 \n" : "=m" (eax) : :);
    my_printf("Guest CR0 = 0x%x\n", eax);

    cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
    vm_tsc[0] = rdtsc();
    for (i = 0; i < 100; i++) {
        rdtsc();
    }
    vm_tsc[1] = rdtsc();
    my_printf("Rdtsc takes %d\n", vm_tsc[1] - vm_tsc[0]);
}
The output is something like this,
Guest CR0 = 0x80050033
Rdtsc takes 2742
On the other hand, I made a host application that does the same thing as above:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile(
        "cpuid"
        : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
        : "a"(code)
        : "cc");
}

uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    // RDTSC copies contents of 64-bit TSC into EDX:EAX
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
    uint64_t vm_tsc[2];
    uint32_t eax, ebx, ecx, edx, i;

    cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
    vm_tsc[0] = rdtsc();
    for (i = 0; i < 100; i++) {
        rdtsc();
    }
    vm_tsc[1] = rdtsc();
    printf("Rdtsc takes %ld\n", vm_tsc[1] - vm_tsc[0]);

    return 0;
}
It outputs the following:
Rdtsc takes 2325
Running the above two programs for 40 iterations and averaging gives:
avg(VM)   = 3188.000000
avg(host) = 2331.000000
The performance difference between running the code in the VM and on the host can NOT be ignored. It is NOT expected.
My understanding is that, with TSC offsetting and no RDTSC exiting, there should be little difference in rdtsc between the VM and the host.
Here are VMCS fields,
0xA501E97E = control_VMX_cpu_based
0xFFFFFFFFFFFFFFF0 = control_CR0_mask
0x0000000080050033 = control_CR0_shadow
In the last level of EPT PTEs, bit[5:3] = 6 (Write Back), bit[6] = 1. EPTP[2:0] = 6 (Write Back)
I tested on bare metal and in VMware, and I got similar results.
I am wondering if there is anything I missed in this case.

Large overhead in CUDA kernel launch outside GPU execution

I am measuring the running time of kernels, as seen from a CPU thread, by measuring the interval from just before launching a kernel to just after a cudaDeviceSynchronize (using gettimeofday). I do a cudaDeviceSynchronize before I start recording the interval. I also instrument the kernels to record a timestamp on the GPU (using clock64) at the start of the kernel, written by thread(0,0,0) of each block from block(0,0,0) to block(occupancy-1,0,0), into an array whose size equals the number of SMs. At the end of the kernel code, every thread updates the timestamp in another array (of the same size) at the index of the SM it runs on.
The intervals calculated from the two arrays are 60-70% of the interval measured from the CPU thread.
For example, on a K40, while gettimeofday gives an interval of 140ms, the avg of intervals calculated from GPU timestamps is only 100ms. I have experimented with many grid sizes (15 blocks to 6K blocks) but have found similar behavior so far.
__global__ void some_kernel(long long *d_start, long long *d_end){
    if (threadIdx.x == 0){
        d_start[blockIdx.x] = clock64();
    }
    // some_kernel code
    d_end[blockIdx.x] = clock64();
}
Does this seem possible to the experts?
I suppose anything is possible for code you haven't shown. After all, you may just have a silly bug in any of your computation arithmetic. But if the question is "is it sensible that there should be 40ms of unaccounted-for time overhead on a kernel launch, for a kernel that takes ~140ms to execute?" I would say no.
I believe the method I outlined in the comments is reasonably accurate. Take the minimum clock64() timestamp from any thread in the grid (but see the note below regarding the per-SM restriction). Compare it to the maximum timestamp of any thread in the grid. The difference will match the execution time reported by gettimeofday() to within 2 percent, according to my testing.
Here is my test case:
$ cat t1040.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define LS_MAX 2000000000U
#define MAX_SM 64
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
__device__ int result;
__device__ unsigned long long t_start[MAX_SM];
__device__ unsigned long long t_end[MAX_SM];
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__device__ __inline__ uint32_t __mysmid(){
uint32_t smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
__global__ void kernel(unsigned ls){
unsigned long long int ts = clock64();
unsigned my_sm = __mysmid();
atomicMin(t_start+my_sm, ts);
// junk code to waste time
int tv = ts&0x1F;
for (unsigned i = 0; i < ls; i++){
tv &= (ts+i);}
result = tv;
// end of junk code
ts = clock64();
atomicMax(t_end+my_sm, ts);
}
// optional command line parameter 1 = kernel duration, parameter 2 = number of blocks, parameter 3 = number of threads per block
int main(int argc, char *argv[]){
unsigned ls;
if (argc > 1) ls = atoi(argv[1]);
else ls = 1000000;
if (ls > LS_MAX) ls = LS_MAX;
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("cuda get attribute fail");
int gpu_clk = 0;
cudaDeviceGetAttribute(&gpu_clk, cudaDevAttrClockRate, 0);
if ((num_sms < 1) || (num_sms > MAX_SM)) {printf("invalid sm count: %d\n", num_sms); return 1;}
unsigned blks;
if (argc > 2) blks = atoi(argv[2]);
else blks = num_sms;
if ((blks < 1) || (blks > 0x3FFFFFFF)) {printf("invalid blocks: %d\n", blks); return 1;}
unsigned ntpb;
if (argc > 3) ntpb = atoi(argv[3]);
else ntpb = 256;
if ((ntpb < 1) || (ntpb > 1024)) {printf("invalid threads: %d\n", ntpb); return 1;}
kernel<<<1,1>>>(100); // warm up
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
unsigned long long *h_start, *h_end;
h_start = new unsigned long long[num_sms];
h_end = new unsigned long long[num_sms];
for (int i = 0; i < num_sms; i++){
h_start[i] = 0xFFFFFFFFFFFFFFFFULL;
h_end[i] = 0;}
cudaMemcpyToSymbol(t_start, h_start, num_sms*sizeof(unsigned long long));
cudaMemcpyToSymbol(t_end, h_end, num_sms*sizeof(unsigned long long));
unsigned long long htime = dtime_usec(0);
kernel<<<blks,ntpb>>>(ls);
cudaDeviceSynchronize();
htime = dtime_usec(htime);
cudaMemcpyFromSymbol(h_start, t_start, num_sms*sizeof(unsigned long long));
cudaMemcpyFromSymbol(h_end, t_end, num_sms*sizeof(unsigned long long));
cudaCheckErrors("some error");
printf("host elapsed time (ms): %f \n device sm clocks:\n start:", htime/1000.0f);
unsigned long long max_diff = 0;
for (int i = 0; i < num_sms; i++) {printf(" %12lu ", h_start[i]);}
printf("\n end: ");
for (int i = 0; i < num_sms; i++) {printf(" %12lu ", h_end[i]);}
for (int i = 0; i < num_sms; i++) if ((h_start[i] != 0xFFFFFFFFFFFFFFFFULL) && (h_end[i] != 0) && ((h_end[i]-h_start[i]) > max_diff)) max_diff=(h_end[i]-h_start[i]);
printf("\n max diff clks: %lu\nmax diff kernel time (ms): %f\n", max_diff, max_diff/(float)(gpu_clk));
return 0;
}
$ nvcc -o t1040 t1040.cu -arch=sm_35
$ ./t1040 1000000 1000 128
host elapsed time (ms): 2128.818115
device sm clocks:
start: 3484744 3484724
end: 2219687393 2228431323
max diff clks: 2224946599
max diff kernel time (ms): 2128.117432
$
Notes:
This code can only be run on a cc3.5 or higher GPU due to the use of 64-bit atomicMin and atomicMax.
I've run it on a variety of grid configurations, on both a GT640 (a very low-end cc3.5 device) and a K40c (high end), and the timing results between host and device agree to within 2% for reasonably long kernel execution times. If you pass 1 as the first command line parameter, with very small grid sizes, the kernel execution time will be very short (nanoseconds), whereas the host will see about 10-20us; that is kernel launch overhead being measured. So the 2% figure is for kernels that take much longer than 20us to execute.
It accepts 3 (optional) command line parameters, the first of which varies the amount of time the kernel will execute.
My timestamping is done on a per-SM basis, because the clock64() resource is indicated to be a per-SM resource. The sm clocks are not guaranteed to be synchronized between SMs.
You can modify the grid dimensions. The second optional command line parameter specifies the number of blocks to launch. The third optional command line parameter specifies the number of threads per block. The timing methodology I have shown here should not be dependent on number of blocks launched or number of threads per block. If you specify fewer blocks than SMs, the code should ignore "unused" SM data.

pthreads code not scaling up

I wrote the following very simple pthread code to test how it scales up. I am running the code on a machine with 8 logical processors and at no time do I create more than 8 threads (to avoid context switching).
With an increasing number of threads, each thread has a smaller amount of work to do. Also, it is evident from the code that there are no shared data structures between the threads which might be a bottleneck. But still, my performance degrades as I increase the number of threads.
Can somebody tell me what I am doing wrong here?
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int NUM_THREADS = 3;
unsigned long int COUNTER = 10000000000000;
unsigned long int LOOP_INDEX;

void* addNum(void *data)
{
    unsigned long int sum = 0;
    for (unsigned long int i = 0; i < LOOP_INDEX; i++) {
        sum += 100;
    }
    return NULL;
}

int main(int argc, char** argv)
{
    NUM_THREADS = atoi(argv[1]);
    pthread_t *threads = (pthread_t*)malloc(sizeof(pthread_t) * NUM_THREADS);
    int rc;
    clock_t start, diff;

    LOOP_INDEX = COUNTER / NUM_THREADS;
    start = clock();
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_create((threads + t), NULL, addNum, NULL);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d", rc);
            exit(-1);
        }
    }

    void *status;
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(threads[t], &status);
    }

    diff = clock() - start;
    int sec = diff / CLOCKS_PER_SEC;
    printf("%d", sec);
}
Note: All the answers I found online said that the overhead of creating the threads is more than the work they are doing. To test this, I commented out everything in the addNum() function. But after doing that, no matter how many threads I create, the time taken by the code is 0 seconds. So I don't think thread-creation overhead is the issue.
clock() counts CPU time used, across all threads. So all that's telling you is that you're using a little bit more total CPU time, which is exactly what you would expect.
It's the total wall clock elapsed time which should be going down if your parallelisation is effective. Measure that with clock_gettime() specifying the CLOCK_MONOTONIC clock instead of clock().
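A minimal, self-contained sketch of that kind of measurement (the dummy loop merely stands in for the pthread_create()/pthread_join() section of the program above; on older glibc you may need to link with -lrt):
#include <stdio.h>
#include <time.h>   /* clock_gettime, CLOCK_MONOTONIC */

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* ... the pthread_create()/pthread_join() loops would go here ... */
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000000UL; i++) sum += 100;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("wall time: %f s (sum=%lu)\n", wall, (unsigned long)sum);
    return 0;
}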

Why does my process take too long to die?

Basically I'm using Linux 2.6.34 on PowerPC (Freescale e500mc). I have a process (a kind of VM that was developed in-house) that uses about 2.25 G of mlocked VM. When I kill it, I notice that it takes upwards of 2 minutes to terminate.
I investigated a little. First, I closed all open file descriptors but that didn't seem to make a difference. Then I added some printk in the kernel and through it I found that all delay comes from the kernel unlocking my VMAs. The delay is uniform across pages, which I verified by repeatedly checking the locked page count in /proc/meminfo. I've checked with programs that allocate that much memory and they all die as soon as I signal them.
What do you think I should check now? Thanks for your replies.
Edit: I had to find a way to share more information about the problem, so I wrote the program below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <sys/time.h>
#define MAP_PERM_1 (PROT_WRITE | PROT_READ | PROT_EXEC)
#define MAP_PERM_2 (PROT_WRITE | PROT_READ)
#define MAP_FLAGS (MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE)
#define PG_LEN 4096
#define align_pg_32(addr) (addr & 0xFFFFF000)
#define num_pg_in_range(start, end) ((end - start + 1) >> 12)
inline void __force_pgtbl_alloc(unsigned int start)
{
volatile int *s = (int *) start;
*s = *s;
}
int __map_a_page_at(unsigned int start, int whichperm)
{
int perm = whichperm ? MAP_PERM_1 : MAP_PERM_2;
if(MAP_FAILED == mmap((void *)start, PG_LEN, perm, MAP_FLAGS, 0, 0)){
fprintf(stderr,
"mmap failed at 0x%x: %s.\n",
start, strerror(errno));
return 0;
}
return 1;
}
int __mlock_page(unsigned int addr)
{
if (mlock((void *)addr, (size_t)PG_LEN) < 0){
fprintf(stderr,
"mlock failed on page: 0x%x: %s.\n",
addr, strerror(errno));
return 0;
}
return 1;
}
void sigint_handler(int p)
{
struct timeval start = {0 ,0}, end = {0, 0}, diff = {0, 0};
gettimeofday(&start, NULL);
munlockall();
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
printf("Munlock'd entire VM in %u secs %u usecs.\n",
diff.tv_sec, diff.tv_usec);
exit(0);
}
int make_vma_map(unsigned int start, unsigned int end)
{
int num_pg = num_pg_in_range(start, end);
if (end < start){
fprintf(stderr,
"Bad range: start: 0x%x end: 0x%x.\n",
start, end);
return 0;
}
for (; num_pg; num_pg --, start += PG_LEN){
if (__map_a_page_at(start, num_pg % 2) && __mlock_page(start))
__force_pgtbl_alloc(start);
else
return 0;
}
return 1;
}
void display_banner()
{
printf("-----------------------------------------\n");
printf("Virtual memory allocator. Ctrl+C to exit.\n");
printf("-----------------------------------------\n");
}
int main()
{
unsigned int vma_start, vma_end, input = 0;
int start_end = 0; // 0: start; 1: end;
display_banner();
// Bind SIGINT handler.
signal(SIGINT, sigint_handler);
while (1){
if (!start_end)
printf("start:\t");
else
printf("end:\t");
scanf("%i", &input);
if (start_end){
vma_end = align_pg_32(input);
make_vma_map(vma_start, vma_end);
}
else{
vma_start = align_pg_32(input);
}
start_end = !start_end;
}
return 0;
}
As you can see, the program accepts ranges of virtual addresses, each range being defined by a start and an end. Each range is then further subdivided into page-sized VMAs by giving different permissions to adjacent pages. Interrupting the program (using SIGINT) triggers a call to munlockall(), and the time for that call to complete is recorded.
Now, when I run it on the Freescale e500mc with Linux 2.6.34 over the range 0x30000000-0x35000000, I get a total munlockall() time of almost 45 seconds. However, if I do the same thing with smaller start-end ranges in random order (that is, not necessarily increasing addresses), such that the total number of pages (and locked VMAs) is roughly the same, I observe a total munlockall() time of no more than 4 seconds.
I tried the same thing on x86_64 with Linux 2.6.34, with my program compiled with the -m32 flag, and the variation, though not as pronounced as on PowerPC, is still about 8 seconds for the first case and under a second for the second.
I tried the program on Linux 2.6.10 on the one end and on 3.19 on the other, and these huge differences don't exist there. What's more, munlockall() always completes in under a second.
So, it seems that the problem, whatever it is, exists only around the 2.6.34 version of the Linux kernel.
You said the VM was developed in-house. Does this mean you have access to the source? I would start by checking to see if it has anything to stop it from immediately terminating to avoid data loss.
Otherwise, could you potentially try to provide more information? You may also want to check out https://unix.stackexchange.com/, as they would be better suited to help with any issues the Linux kernel may be having.

Difference between CLOCK_REALTIME and CLOCK_MONOTONIC?

Could you explain the difference between CLOCK_REALTIME and CLOCK_MONOTONIC clocks returned by clock_gettime() on Linux?
Which is a better choice if I need to compute elapsed time between timestamps produced by an external source and the current time?
Lastly, if I have an NTP daemon periodically adjusting system time, how do these adjustments interact with each of CLOCK_REALTIME and CLOCK_MONOTONIC?
CLOCK_REALTIME represents the machine's best-guess as to the current wall-clock, time-of-day time. As Ignacio and MarkR say, this means that CLOCK_REALTIME can jump forwards and backwards as the system time-of-day clock is changed, including by NTP.
CLOCK_MONOTONIC represents the absolute elapsed wall-clock time since some arbitrary, fixed point in the past. It isn't affected by changes in the system time-of-day clock.
If you want to compute the elapsed time between two events observed on the one machine without an intervening reboot, CLOCK_MONOTONIC is the best option.
Note that on Linux, CLOCK_MONOTONIC does not measure time spent in suspend, although by the POSIX definition it should. You can use the Linux-specific CLOCK_BOOTTIME for a monotonic clock that keeps running during suspend.
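As a quick illustration (a sketch only; CLOCK_BOOTTIME is Linux-specific and the show() helper is just for this example):
#define _GNU_SOURCE   /* make sure CLOCK_BOOTTIME is exposed on older glibc */
#include <stdio.h>
#include <time.h>

static void show(const char *name, clockid_t id)
{
    struct timespec ts;
    if (clock_gettime(id, &ts) == 0)
        printf("%-16s %lld.%09ld\n", name, (long long)ts.tv_sec, ts.tv_nsec);
    else
        perror(name);
}

int main(void)
{
    show("CLOCK_REALTIME", CLOCK_REALTIME);    /* wall-clock time, can jump      */
    show("CLOCK_MONOTONIC", CLOCK_MONOTONIC);  /* elapsed time, excludes suspend */
    show("CLOCK_BOOTTIME", CLOCK_BOOTTIME);    /* elapsed time, includes suspend */
    return 0;
}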
Robert Love's book Linux System Programming, 2nd Edition, specifically addresses your question at the beginning of Chapter 11, pg. 363:
"The important aspect of a monotonic time source is NOT the current value, but the guarantee that the time source is strictly linearly increasing, and thus useful for calculating the difference in time between two samplings."
That said, I believe he is assuming the processes are running on the same instance of an OS, so you might want to have a periodic calibration running to be able to estimate drift.
CLOCK_REALTIME is affected by NTP, and can move forwards and backwards. CLOCK_MONOTONIC is not, and advances at one tick per tick.
POSIX 7 quotes
POSIX 7 specifies both at http://pubs.opengroup.org/onlinepubs/9699919799/functions/clock_getres.html:
CLOCK_REALTIME:
This clock represents the clock measuring real time for the system. For this clock, the values returned by clock_gettime() and specified by clock_settime() represent the amount of time (in seconds and nanoseconds) since the Epoch.
CLOCK_MONOTONIC (optional feature):
For this clock, the value returned by clock_gettime() represents the amount of time (in seconds and nanoseconds) since an unspecified point in the past (for example, system start-up time, or the Epoch). This point does not change after system start-up time. The value of the CLOCK_MONOTONIC clock cannot be set via clock_settime().
clock_settime() gives an important hint: POSIX systems are able to arbitrarily change CLOCK_REALTIME with it, so don't rely on it flowing continuously or forward. NTP could be implemented using clock_settime(), and could only affect CLOCK_REALTIME.
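A quick way to see that hint in action (a small sketch; the call is expected to fail, since the monotonic clock is not settable):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);

    /* CLOCK_REALTIME can be set (given sufficient privilege); CLOCK_MONOTONIC cannot. */
    if (clock_settime(CLOCK_MONOTONIC, &ts) == -1)
        printf("clock_settime(CLOCK_MONOTONIC) failed as expected: %s\n",
               strerror(errno));
    return 0;
}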
The Linux kernel implementation seems to take boot time as the epoch for CLOCK_MONOTONIC: Starting point for CLOCK_MONOTONIC
In addition to Ignacio's answer: CLOCK_REALTIME can jump forward in leaps, and occasionally backwards. CLOCK_MONOTONIC does neither; it just keeps going forward (although it probably resets at reboot).
A robust app needs to be able to tolerate CLOCK_REALTIME leaping forwards occasionally (and perhaps backwards very slightly very occasionally, although that is more of an edge-case).
Imagine what happens when you suspend your laptop - CLOCK_REALTIME jumps forwards following the resume, CLOCK_MONOTONIC does not. Try it on a VM.
Sorry, I don't have the reputation to add this as a comment, so it goes as a complementary answer.
Depending on how often you will call clock_gettime(), you should keep in mind that only some of the "clocks" are provided by Linux in the VDSO (i.e. do not require a syscall with all the overhead of one -- which only got worse when Linux added the defenses to protect against Spectre-like attacks).
While clock_gettime(CLOCK_MONOTONIC,...), clock_gettime(CLOCK_REALTIME,...), and gettimeofday() are always going to be extremely fast (accelerated by the VDSO), this is not true for, e.g. CLOCK_MONOTONIC_RAW or any of the other POSIX clocks.
This can change with kernel version, and architecture.
Although most programs don't need to pay attention to this, there can be latency spikes in clocks accelerated by the VDSO: if you hit them right when the kernel is updating the shared memory area with the clock counters, it has to wait for the kernel to finish.
Here's the "proof" (GitHub, to keep bots away from kernel.org):
https://github.com/torvalds/linux/commit/2aae950b21e4bc789d1fc6668faf67e8748300b7
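A rough sketch of how one might compare the per-call cost of two clocks (treat the numbers as illustrative only; whether CLOCK_MONOTONIC_RAW goes through the VDSO depends on your kernel version and architecture):
#define _GNU_SOURCE   /* be permissive about which clock ids the headers expose */
#include <stdio.h>
#include <time.h>

/* Time n back-to-back clock_gettime() calls for a given clock id,
   using CLOCK_MONOTONIC itself as the stopwatch. */
static double ns_per_call(clockid_t id, int n)
{
    struct timespec t0, t1, tmp;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        clock_gettime(id, &tmp);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / n;
}

int main(void)
{
    const int n = 1000000;
    printf("CLOCK_MONOTONIC:     %.1f ns/call\n", ns_per_call(CLOCK_MONOTONIC, n));
    printf("CLOCK_MONOTONIC_RAW: %.1f ns/call\n", ns_per_call(CLOCK_MONOTONIC_RAW, n));
    return 0;
}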
There's one big difference between CLOCK_REALTIME and MONOTONIC. CLOCK_REALTIME can jump forward or backward according to NTP.
By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but NTP cannot cause the monotonic clock to jump forward or backward.
I'd like to clarify what "the system is suspended" means in this context.
I was reading about timerfd_create, and from the manpage,
https://man7.org/linux/man-pages/man2/timerfd_create.2.html
CLOCK_BOOTTIME (Since Linux 3.15)
Like CLOCK_MONOTONIC, this is a monotonically increasing
clock. However, whereas the CLOCK_MONOTONIC clock does
not measure the time while a system is suspended, the
CLOCK_BOOTTIME clock does include the time during which
the system is suspended. This is useful for applications
that need to be suspend-aware. CLOCK_REALTIME is not
suitable for such applications, since that clock is
affected by discontinuous changes to the system clock.
Based on the above description, we can infer that CLOCK_REALTIME and CLOCK_BOOTTIME still count time while the system is suspended, whereas CLOCK_MONOTONIC doesn't.
I was confused about what "the system is suspended" means exactly. At first I thought it meant sending Ctrl+Z from the terminal, making the process suspended. But that's not it.
MarkR's answer inspired me:
"Imagine what happens when you suspend your laptop - .... Try it on a VM."
So literally, "the system is suspended" means putting your computer into sleep mode.
That is to say, CLOCK_REALTIME counts the time while the computer is asleep.
Compare the output of these 2 pieces of code
timefd_create_realtime_clock.c
copied from man timerfd_create
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>
#include <inttypes.h> /* Definition of PRIu64 */
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h> /* Definition of uint64_t */
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
static void
print_elapsed_time(void)
{
static struct timespec start;
struct timespec curr;
static int first_call = 1;
int secs, nsecs;
if (first_call) {
first_call = 0;
if (clock_gettime(CLOCK_MONOTONIC, &start) == -1)
handle_error("clock_gettime");
}
if (clock_gettime(CLOCK_MONOTONIC, &curr) == -1)
handle_error("clock_gettime");
secs = curr.tv_sec - start.tv_sec;
nsecs = curr.tv_nsec - start.tv_nsec;
if (nsecs < 0) {
secs--;
nsecs += 1000000000;
}
printf("%d.%03d: ", secs, (nsecs + 500000) / 1000000);
}
int
main(int argc, char *argv[])
{
struct itimerspec new_value;
int max_exp, fd;
struct timespec now;
uint64_t exp, tot_exp;
ssize_t s;
if ((argc != 2) && (argc != 4)) {
fprintf(stderr, "%s init-secs [interval-secs max-exp]\n",
argv[0]);
exit(EXIT_FAILURE);
}
if (clock_gettime(CLOCK_REALTIME, &now) == -1)
handle_error("clock_gettime");
/* Create a CLOCK_REALTIME absolute timer with initial
expiration and interval as specified in command line. */
new_value.it_value.tv_sec = now.tv_sec + atoi(argv[1]);
new_value.it_value.tv_nsec = now.tv_nsec;
if (argc == 2) {
new_value.it_interval.tv_sec = 0;
max_exp = 1;
} else {
new_value.it_interval.tv_sec = atoi(argv[2]);
max_exp = atoi(argv[3]);
}
new_value.it_interval.tv_nsec = 0;
fd = timerfd_create(CLOCK_REALTIME, 0);
if (fd == -1)
handle_error("timerfd_create");
if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
handle_error("timerfd_settime");
print_elapsed_time();
printf("timer started\n");
for (tot_exp = 0; tot_exp < max_exp;) {
s = read(fd, &exp, sizeof(uint64_t));
if (s != sizeof(uint64_t))
handle_error("read");
tot_exp += exp;
print_elapsed_time();
printf("read: %" PRIu64 "; total=%" PRIu64 "\n", exp, tot_exp);
}
exit(EXIT_SUCCESS);
}
timefd_create_monotonic_clock.c
modified a bit: CLOCK_REALTIME changed to CLOCK_MONOTONIC
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>
#include <inttypes.h> /* Definition of PRIu64 */
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h> /* Definition of uint64_t */
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
static void
print_elapsed_time(void)
{
static struct timespec start;
struct timespec curr;
static int first_call = 1;
int secs, nsecs;
if (first_call) {
first_call = 0;
if (clock_gettime(CLOCK_MONOTONIC, &start) == -1)
handle_error("clock_gettime");
}
if (clock_gettime(CLOCK_MONOTONIC, &curr) == -1)
handle_error("clock_gettime");
secs = curr.tv_sec - start.tv_sec;
nsecs = curr.tv_nsec - start.tv_nsec;
if (nsecs < 0) {
secs--;
nsecs += 1000000000;
}
printf("%d.%03d: ", secs, (nsecs + 500000) / 1000000);
}
int
main(int argc, char *argv[])
{
struct itimerspec new_value;
int max_exp, fd;
struct timespec now;
uint64_t exp, tot_exp;
ssize_t s;
if ((argc != 2) && (argc != 4)) {
fprintf(stderr, "%s init-secs [interval-secs max-exp]\n",
argv[0]);
exit(EXIT_FAILURE);
}
// T_NOTE: comment
// if (clock_gettime(CLOCK_REALTIME, &now) == -1)
// handle_error("clock_gettime");
/* Create a CLOCK_REALTIME absolute timer with initial
expiration and interval as specified in command line. */
// new_value.it_value.tv_sec = now.tv_sec + atoi(argv[1]);
// new_value.it_value.tv_nsec = now.tv_nsec;
new_value.it_value.tv_sec = atoi(argv[1]);
new_value.it_value.tv_nsec = 0;
if (argc == 2) {
new_value.it_interval.tv_sec = 0;
max_exp = 1;
} else {
new_value.it_interval.tv_sec = atoi(argv[2]);
max_exp = atoi(argv[3]);
}
new_value.it_interval.tv_nsec = 0;
// fd = timerfd_create(CLOCK_REALTIME, 0);
fd = timerfd_create(CLOCK_MONOTONIC, 0);
if (fd == -1)
handle_error("timerfd_create");
// if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
if (timerfd_settime(fd, 0, &new_value, NULL) == -1)
handle_error("timerfd_settime");
print_elapsed_time();
printf("timer started\n");
for (tot_exp = 0; tot_exp < max_exp;) {
s = read(fd, &exp, sizeof(uint64_t));
if (s != sizeof(uint64_t))
handle_error("read");
tot_exp += exp;
print_elapsed_time();
printf("read: %" PRIu64 "; total=%" PRIu64 "\n", exp, tot_exp);
}
exit(EXIT_SUCCESS);
}
Compile both and run them in two tabs of the same terminal:
./timefd_create_monotonic_clock 3 1 100
./timefd_create_realtime_clock 3 1 100
Put my Ubuntu desktop to sleep
Wait a few minutes
Wake it up by pressing the power button once
Check the terminal output
Output:
The realtime-clock program finished immediately after wake-up, because its clock counted the time that elapsed while the computer was suspended/asleep.
tian#tian-B250M-Wind:~/playground/libuv-vs-libevent$ ./timefd_create_realtime_clock 3 1 100
0.000: timer started
3.000: read: 1; total=1
4.000: read: 1; total=2
5.000: read: 1; total=3
6.000: read: 1; total=4
7.000: read: 1; total=5
8.000: read: 1; total=6
9.000: read: 1; total=7
10.000: read: 1; total=8
11.000: read: 1; total=9
12.000: read: 1; total=10
13.000: read: 1; total=11
14.000: read: 1; total=12
15.000: read: 1; total=13
16.000: read: 1; total=14
17.000: read: 1; total=15
18.000: read: 1; total=16
19.000: read: 1; total=17
20.000: read: 1; total=18
21.000: read: 1; total=19
22.000: read: 1; total=20
23.000: read: 1; total=21
24.000: read: 1; total=22
25.000: read: 1; total=23
26.000: read: 1; total=24
27.000: read: 1; total=25
28.000: read: 1; total=26
29.000: read: 1; total=27
30.000: read: 1; total=28
31.000: read: 1; total=29
33.330: read: 489; total=518 # wake up here
tian#tian-B250M-Wind:~/playground/libuv-vs-libevent$
tian#tian-B250M-Wind:~/Desktop/playground/libuv-vs-libevent$ ./timefd_create_monotonic_clock 3 1 100
0.000: timer started
3.000: read: 1; total=1
3.1000: read: 1; total=2
4.1000: read: 1; total=3
6.000: read: 1; total=4
7.000: read: 1; total=5
7.1000: read: 1; total=6
9.000: read: 1; total=7
10.000: read: 1; total=8
11.000: read: 1; total=9
12.000: read: 1; total=10
13.000: read: 1; total=11
14.000: read: 1; total=12
15.000: read: 1; total=13
16.000: read: 1; total=14
16.1000: read: 1; total=15
18.000: read: 1; total=16
19.000: read: 1; total=17
19.1000: read: 1; total=18
21.000: read: 1; total=19
22.001: read: 1; total=20
23.000: read: 1; total=21
25.482: read: 2; total=23
26.000: read: 1; total=24
26.1000: read: 1; total=25
28.000: read: 1; total=26
28.1000: read: 1; total=27
29.1000: read: 1; total=28
30.1000: read: 1; total=29
31.1000: read: 1; total=30
32.1000: read: 1; total=31
33.1000: read: 1; total=32
35.000: read: 1; total=33
36.000: read: 1; total=34
36.1000: read: 1; total=35
38.000: read: 1; total=36
39.000: read: 1; total=37
40.000: read: 1; total=38
40.1000: read: 1; total=39
42.000: read: 1; total=40
43.001: read: 1; total=41
43.1000: read: 1; total=42
45.000: read: 1; total=43
46.000: read: 1; total=44
47.000: read: 1; total=45
47.1000: read: 1; total=46
48.1000: read: 1; total=47
50.001: read: 1; total=48
^C
tian#tian-B250M-Wind:~/Desktop/playground/libuv-vs-libevent$
