Perf stat counts context-switches in what way? - linux

perf stat displays some interesting statistics that can be gathered from examining hardware and software counters.
In my research, I couldn't find any reliable information about what counts as a context switch in perf stat, and despite my efforts I was unable to fully understand the relevant kernel code.
Suppose my InfiniBand network application, running in event mode, calls a blocking read system call 2000 times and perf stat counts 1,241 context switches. Do the context-switches refer to the schedule-in events, the schedule-out events, or both?
The __schedule() function (kernel/sched/core.c) increments the switch_count counter whenever prev != next.
It seems that perf stat's context-switches include involuntary switches as well as voluntary switches.
It also seems to me that only deschedule events are counted, since it is the current (outgoing) context that runs the schedule code and increments the nvcsw or nivcsw counter in its task_struct.
output from perf stat -- my_application:
1,241 context-switches
Meanwhile, if I only count the sched:sched_switch event the output is close to the expected number.
output from perf stat -e sched:sched_switch -- my_application:
2,168 sched:sched_switch
Is there a difference between the context-switches event and the sched:sched_switch event?

I think you only get a count for context-switches if a different task actually runs on a core that was running one of your threads. A read() that blocks, but resumes before any user-space code from any other task runs on the core, probably won't count.
Just entering the kernel at all for a system-call clearly doesn't count; perf stat ls only counts one context-switch in a largish directory for me, or zero if I ls a smaller directory like /. I get much higher counts, like 711 for a recursive ls of a directory that I hadn't accessed recently, on a magnetic HDD. So it spent significant time waiting for I/O, and maybe running bottom-half interrupt handlers.
The fact that the count can be odd means it's not counting both deschedule and re-schedule separately; since I'm looking at counts for a single-threaded process that eventually exited, if it was counting both the count would have to be even.
I expect the counting is done when schedule() decides that current should change to point to a new task that isn't this one. (current is the Linux kernel's per-core variable that points to the task_struct of the current task, e.g. a user-space thread.) So every time that happens to a thread that's part of your process, you get 1 count.
Indeed, the OP helpfully tracked down the source code; it's in __schedule() in kernel/sched/core.c. For example, in Linux 6.1:
static void __sched notrace __schedule(unsigned int sched_mode)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    // ... and some other declarations omitted
    ...
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);                  // rq stands for run queue
    prev = rq->curr;
    ...
    switch_count = &prev->nivcsw;      // either Num InVoluntary CSWs, I think,
    ...
    if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
        ...
        switch_count = &prev->nvcsw;   // or Num Voluntary CSWs
    }

    next = pick_next_task(rq, prev, &rf);
    ...
    if (likely(prev != next)) {
        ...
        ++*switch_count;               // INCREMENT THE SELECTED COUNTER
        ...
        trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
        // then make some function calls to actually do the context switch
        ...
    }
    ...
}
I would guess the context-switches perf event sums both involuntary and voluntary switches away from a thread. (Assuming that's what nv and niv stand for.)
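If you want to cross-check those two per-task counters for your own process, one option (not mentioned in the original answer) is getrusage(2): its ru_nvcsw and ru_nivcsw fields report voluntary and involuntary context switches. A minimal sketch:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    struct rusage ru;

    sleep(1);                          /* a blocking call: should add a voluntary switch */

    if (getrusage(RUSAGE_SELF, &ru) == -1) {
        perror("getrusage");
        return 1;
    }
    /* ru_nvcsw  = voluntary context switches (blocked waiting for a resource)
       ru_nivcsw = involuntary context switches (preempted by the scheduler)   */
    printf("voluntary: %ld, involuntary: %ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}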

Related

Analyzing Context Switch in Multithread [duplicate]

I want to measure the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
Then the context switch time is (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can loop over it and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop and average, the smaller the result gets.
Edit
Here is a snippet of the code that I have:
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;

...

// each thread function looks similar to this
void* thread_1_func(void* time)
{
    thread_time* t = (thread_time*) time;   // renamed so the variable doesn't shadow the typedef
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->start));
    for (int x = 0; x < loop; ++x)
    {
        // where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->end));
    return NULL;
}

void* thread_2_func(void* time)
{
    // similar as above
}

int main()
{
    ...
    pthread_t thread_1;
    pthread_t thread_2;
    thread_time thread_1_time;
    thread_time thread_2_time;
    struct timespec start, end;

    // stamp the start time
    clock_gettime(CLOCK_MONOTONIC, &start);

    // create two threads with the time structs as the arguments
    pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
    pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);

    // wait for the two threads to terminate
    pthread_join(thread_1, NULL);
    pthread_join(thread_2, NULL);

    // stamp the end time
    clock_gettime(CLOCK_MONOTONIC, &end);

    // then I calculate the difference between the total execution time
    // and the combined execution time of the two threads
}
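The final calculation could look something like this (a sketch to go at the end of main(); timespec_to_sec is a hypothetical helper, not part of the original code):

// Hypothetical helper (file scope): convert a struct timespec to seconds as a double.
static double timespec_to_sec(const struct timespec* ts)
{
    return (double) ts->tv_sec + (double) ts->tv_nsec / 1e9;
}

// At the end of main(): total wall-clock time minus the CPU time each thread accounted for.
double total    = timespec_to_sec(&end) - timespec_to_sec(&start);
double t1       = timespec_to_sec(&thread_1_time.end) - timespec_to_sec(&thread_1_time.start);
double t2       = timespec_to_sec(&thread_2_time.end) - timespec_to_sec(&thread_2_time.start);
double switched = total - t1 - t2;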
First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong; this clock gives the CPU time consumed by that thread, but the context switch itself is not charged to the thread, so you'd want to use another clock. Also, on multiprocessor systems these per-thread clocks can give different values from one processor to another! I therefore suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. Be warned, though, that even if you read either of these twice in rapid succession, the two timestamps will usually already be tens of nanoseconds apart.
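To get a feel for that baseline, you can time two back-to-back reads of the clock (a minimal sketch, not from the original answer):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    clock_gettime(CLOCK_MONOTONIC, &b);   /* read the clock again immediately */

    long delta_ns = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
    printf("back-to-back clock_gettime delta: %ld ns\n", delta_ns);
    return 0;
}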
As for context switches - there are many kinds of context switch. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers onto the stack, set the task-switched flag so that SSE/FP registers will be lazily saved, save the stack pointer, load the new stack pointer and return from that function - since the other thread had done the same, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower, because the user-space page tables must be switched by reloading the CR3 register; this causes misses in the TLB, which maps virtual addresses to physical ones.
However, a <1 ns context switch / system call overhead does not really seem plausible - most probably there is hyperthreading or 2 CPU cores here and the threads are really running in parallel, so I suggest you set the CPU affinity on the process so that Linux only ever runs it on, say, the first CPU core:
#define _GNU_SOURCE   /* needed for CPU_ZERO/CPU_SET and sched_setaffinity */
#include <sched.h>

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
int result = sched_setaffinity(0, sizeof(mask), &mask);
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching the floating point / SSE state (which happens lazily), you should have some floating point variables and do calculations on them prior to the context switch, then add, say, 0.1 to some volatile floating point variable after the context switch to see whether it has an effect on the switching time.
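A rough sketch of that last idea (my own illustration, not from the original answer; the blocking point stands in for whatever mechanism the two threads use to switch):

volatile double fp_sink = 1.0;

for (int i = 0; i < 1000; i++)
    fp_sink *= 1.0000001;      /* dirty the FP/SSE registers before the switch */

/* ... block here so the other thread runs (the switch being measured) ... */

fp_sink += 0.1;                /* touch FP state again after being rescheduled */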
This is not straightforward, but as usual someone has already done a lot of work on this. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -lpthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note that while the user in the above link was running the test, they were also hammering the machine with games and compilation jobs, which is why the context switches were taking a long time. Some more info here...
how can you measure the time spent in a context switch under java platform

Pinning a process to any CPU respecting affinity

Let's say I want to programmatically pin the current process to a single CPU, but I don't care which CPU that is.
One easy way is to use sched_setaffinity with a fixed CPU number, probably 0, since there should always be a "CPU 0"1.
However, this approach fails if the affinity of the process has been set to a subset of the existing CPUs, not including the one you picked, e.g., by launching it with taskset.
So I want to pick "any CPU" to pin to, but only out of the CPUs that the current affinity mask allows. Here's one approach:
cpu_set_t cpu_set;
if (sched_getaffinity(0, sizeof(cpu_set), &cpu_set)) {
    err("failed while getting existing cpu affinity");
}
for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (CPU_ISSET(cpu, &cpu_set)) {
        CPU_ZERO(&cpu_set);          // keep only the first allowed CPU
        CPU_SET(cpu, &cpu_set);
        break;
    }
}
int result = sched_setaffinity(0, sizeof(cpu_set), &cpu_set);
Basically we get the current affinity mask, then loop over every possible CPU looking for the first one that is allowed, then pass a mask with only this CPU set to sched_setaffinity.
However, if the current affinity mask has changed between the get and set calls, the set call will fail. Is there any way around this race condition?
1 Although CPU zero won't always be online.
You could use getcpu() to discover the CPU that your process is currently running on, and use the result to set affinity to that CPU:
unsigned int mycpu = 0;
if (getcpu(&mycpu, NULL, NULL) == -1) {
    // handle error
}
Presumably any CPU affinity rules that are in place would be honored by the scheduler, so the getcpu() call would return a CPU that the process is allowed to run on.
There's still the potential that the affinity set might change, but that seems like a very unlikely case, and the set of allowed CPUs might be changed again at some point in the future anyway, outside the control of the process in question.
I suppose you could detect the error in the sched_setaffinity() call and retry until the call succeeds...
Considering that the affinity mask of the process can change at any moment, you can iteratively try to pin the process to the CPU it is currently running on and stop once that succeeds.
cpu_set_t cpu_set;
int cpu = 0;
int result = -1;

while (result < 0) {
    cpu = sched_getcpu();
    if (cpu >= 0) {                  // sched_getcpu() returns -1 on error; CPU 0 is a valid result
        CPU_ZERO(&cpu_set);
        CPU_SET(cpu, &cpu_set);
        result = sched_setaffinity(0, sizeof(cpu_set), &cpu_set);
    }
}
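Another option in the same spirit (my own sketch, not from either answer) is to keep the "first allowed CPU" idea from the question, but wrap the get/pick/set sequence in a retry loop so that a concurrent affinity change between the two calls simply triggers another attempt:

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

// Sketch: pin the calling process to the first CPU allowed by its current
// affinity mask, retrying if the mask changes between the get and set calls.
static int pin_to_first_allowed_cpu(void)
{
    for (;;) {
        cpu_set_t allowed;
        if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
            return -1;

        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
            if (!CPU_ISSET(cpu, &allowed))
                continue;

            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) == 0)
                return cpu;                    // pinned successfully
            if (errno != EINVAL)
                return -1;                     // real error, give up
            break;                             // mask probably changed; re-read it and retry
        }
    }
}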

printf in RT thread

I am writing a multi-threaded application on Linux.
There is no RT patch in the kernel, yet I use threads with priorities.
When checking the time it takes to execute printf, I measure a different value on every run, even though the measurement is done in the highest-priority thread:
if (clock_gettime(CLOCK_MONOTONIC, &start))
{
    /* handle error */
}

for (int i = 0; i < 1000; i++)
    printf("hello world");

if (clock_gettime(CLOCK_MONOTONIC, &end))
{
    /* handle error */
}

elapsedSeconds = TimeSpecToSeconds(&end) - TimeSpecToSeconds(&start);
Why does printf change the timing, and in a non-deterministic way, i.e. differently on each run?
How should printf be used with RT threads?
Can it be used inside an RT thread, or should it be totally avoided?
Should writing to disk be treated in the same way as printf? Should it only be done in a separate low-priority thread?
printf under the hood triggers the non-realtime (even blocking) machinery of buffered I/O.
It is not only non-deterministic, it also opens up the possibility of a priority inversion.
You should be very careful using it from a real-time thread (I would say avoid it entirely).
Normally, in latency-bound code you would do wait-free binary logging into a chain of pre-allocated (or memory-mapped) ring buffers and flush them from a background lower-priority thread (or even a separate process), as sketched below.
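A minimal single-producer/single-consumer sketch of that pattern (my own illustration, assuming one RT producer and one draining thread; the names rt_log and drain_thread are made up):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SLOTS 1024                     /* power of two, pre-allocated          */
#define SLOT_SIZE 64                   /* fixed-size binary records            */

static char buf[SLOTS][SLOT_SIZE];
static _Atomic uint32_t head;          /* written by the RT producer           */
static _Atomic uint32_t tail;          /* written by the low-priority consumer */

/* Called from the RT thread: copy a record in, never block, drop on overflow. */
static int rt_log(const void *msg, size_t len)
{
    uint32_t h = atomic_load_explicit(&head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == SLOTS)
        return -1;                     /* buffer full: drop instead of blocking */
    memcpy(buf[h % SLOTS], msg, len < SLOT_SIZE ? len : SLOT_SIZE);
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return 0;
}

/* Low-priority thread: drain records and do the blocking I/O here. */
static void *drain_thread(void *arg)
{
    (void) arg;
    for (;;) {
        uint32_t t = atomic_load_explicit(&tail, memory_order_relaxed);
        uint32_t h = atomic_load_explicit(&head, memory_order_acquire);
        while (t != h) {
            write(STDOUT_FILENO, buf[t % SLOTS], SLOT_SIZE);
            t++;
        }
        atomic_store_explicit(&tail, t, memory_order_release);
        usleep(1000);                  /* or block on an eventfd/condvar        */
    }
    return NULL;
}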

Is there an equivalent to the windows GetSystemTimes() function in Linux?

In Windows there is a function called GetSystemTimes() that returns the system idle time, the amount of time spent executing kernel code, and the amount of time spent executing user mode code.
Is there an equivalent function (or functions) in Linux?
The original answer showed how to get the user and system time of the currently running process. However, you want this information for the entire system. As far as I know, the only way to get it is to parse the contents of /proc/stat, in particular the first line, labeled cpu:
cpu 85806677 11713309 6660413 3490353007 6236822 300919 807875 0
This is followed by per-CPU summaries if you are running an SMP system. The first line itself contains the following fields (in order):
time in user mode
time in user mode with low priority
time in system mode
time idle
time waiting for I/O to complete
time servicing interrupts
time servicing software interrupts
time spent in virtualization
The times are reported in units of USER_HZ.
There may be other columns after this depending on the version of your kernel.
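A small sketch of reading that first line from C (my own example, not part of the original answer):

#include <stdio.h>

int main(void)
{
    unsigned long long user = 0, nice = 0, sys = 0, idle = 0,
                       iowait = 0, irq = 0, softirq = 0, steal = 0;

    FILE *f = fopen("/proc/stat", "r");
    if (!f) {
        perror("fopen /proc/stat");
        return 1;
    }
    /* first line: "cpu  user nice system idle iowait irq softirq steal ..." */
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) < 4) {
        fclose(f);
        return 1;
    }
    fclose(f);

    /* values are in units of USER_HZ (see sysconf(_SC_CLK_TCK)) */
    printf("user=%llu system=%llu idle=%llu\n", user + nice, sys, idle);
    return 0;
}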
Original answer:
You want times(2):
times() stores the current process times in the struct tms that buf points to. The struct tms is as defined in <sys/times.h>:
struct tms {
    clock_t tms_utime;  /* user time */
    clock_t tms_stime;  /* system time */
    clock_t tms_cutime; /* user time of children */
    clock_t tms_cstime; /* system time of children */
};
Idle time can be inferred by tracking elapsed wall-clock time and subtracting the non-idle times reported by the call.
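For completeness, a usage sketch for times(2) (my own example; the tick values are converted to seconds with sysconf(_SC_CLK_TCK)):

#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    struct tms t;

    if (times(&t) == (clock_t) -1) {
        perror("times");
        return 1;
    }
    long ticks_per_sec = sysconf(_SC_CLK_TCK);
    printf("user: %.2fs, system: %.2fs\n",
           (double) t.tms_utime / ticks_per_sec,
           (double) t.tms_stime / ticks_per_sec);
    return 0;
}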

CPU time after the process finished

Is there a function in Linux that lets me see how much CPU time a process used after it has finished? I need something similar to the bash time command. I fork() the process and then wait for the child to finish using wait(). A way of accurately measuring the "real" time (the wall-clock time elapsed between fork() and exit()), even when wait() is called long after the child process became a zombie, is also welcome, but I'm not sure if that is possible.
Sure, wait3 and wait4 have you covered. Alternatively (and more portably) you could use getrusage(2).
The wait3() and wait4() system calls are similar to waitpid(2), but
additionally return resource usage information about the child in the
structure pointed to by rusage.
Example: wait3
int status;
struct rusage usage;
wait3(&status, 0, &usage);
Example: getrusage
Of course, wait3 and wait4 are just a convenience. So you could use getrusage:
getrusage(RUSAGE_CHILDREN, &usage);
The disadvantage is that this tells you the resources used by ALL the terminated children.
So, once you get it, what do you do with rusage? struct rusage has the following form:
struct rusage {
    struct timeval ru_utime; /* user CPU time used */
    struct timeval ru_stime; /* system CPU time used */
    /* More fields. */
};
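Putting it together, a minimal fork/wait4 sketch that prints the child's CPU time (my own example; the ls -R child is just a stand-in workload):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        execlp("ls", "ls", "-R", "/usr", (char *) NULL);  /* any child workload */
        _exit(127);
    }

    int status;
    struct rusage usage;
    if (wait4(pid, &status, 0, &usage) == -1) {
        perror("wait4");
        return 1;
    }
    printf("user: %ld.%06lds, system: %ld.%06lds\n",
           (long) usage.ru_utime.tv_sec, (long) usage.ru_utime.tv_usec,
           (long) usage.ru_stime.tv_sec, (long) usage.ru_stime.tv_usec);
    return 0;
}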
The bash feature "times" reports the total user/system times used by the shell and its children. This feature, unfortunately, doesn't report total memory, i/o etc. IE: it doesn't employ getrusage (which it should).
The /usr/bin/time program will give you executon time, memory footprint. So you can do /usr/bin/time bash myprog.sh ... and get the accumulated times for all children.
