Pinning a process to any CPU respecting affinity - linux

Let's say I want to programmatically pin the current process to a single CPU, but I don't care which CPU that is.
One easy way is to use sched_setaffinity with a fixed CPU number, probably 0, since there should always be a "CPU 0"¹.
However, this approach fails if the affinity of the process has already been restricted to a subset of the existing CPUs that does not include the one you picked, e.g., by launching it with taskset.
So I want to pick "any CPU" to pin to, but only out of the CPUs that the current affinity mask allows. Here's one approach:
cpu_set_t cpu_set;
if (sched_getaffinity(0, sizeof(cpu_set), &cpu_set)) {
    err("failed while getting existing cpu affinity");
}
for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (CPU_ISSET(cpu, &cpu_set)) {
        CPU_ZERO(&cpu_set);      /* narrow the mask to just this CPU */
        CPU_SET(cpu, &cpu_set);
        break;
    }
}
int result = sched_setaffinity(0, sizeof(cpu_set), &cpu_set);
Basically, we get the current affinity mask, loop over every possible CPU looking for the first one that is allowed, and then pass a mask with only that CPU set to sched_setaffinity.
However, if the current affinity mask has changed between the get and set calls, the set call will fail. Is there any way around this race condition?
¹ Although CPU 0 won't always be online.

You could use getcpu() to discover the CPU that your process is currently running on, and use the result to set the affinity to that CPU:
/* note: the third (tcache) argument is unused, and newer glibc declares getcpu() with only two parameters */
unsigned int mycpu = 0;
if (getcpu(&mycpu, NULL, NULL) == -1) {
    /* handle error */
}
Presumably any CPU affinity rules that are in place will be honored by the scheduler, so the getcpu() call will return a CPU that the process is allowed to run on.
There's still the potential that the affinity set might change between the calls, but that seems like a very unlikely case, and the allowed CPUs could in any case be changed again at some point in the future, outside the control of the process in question.
I suppose you could detect the error from the sched_setaffinity() call and retry until it succeeds...

Considering that the affinity mask of the process can change at any moment, you can iteratively try to pin the process to the current CPU and stop when it is successful.
cpu_set_t cpu_set;
int cpu = 0;
int result = -1;

while (result < 0) {
    cpu = sched_getcpu();
    if (cpu >= 0) {              /* sched_getcpu() returns -1 on error; CPU 0 is a valid result */
        CPU_ZERO(&cpu_set);
        CPU_SET(cpu, &cpu_set);
        result = sched_setaffinity(0, sizeof(cpu_set), &cpu_set);
    }
}

Related

Analyzing Context Switch in Multithread [duplicate]

I want to calculate the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
Then the context switch time is (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can just loop over it and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop and take the average, the smaller the results I get.
Edit
Here is a snippet of the code that I have:
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;
...
// each thread function looks similar to this
void* thread_1_func(void* time)
{
    thread_time* t = (thread_time*) time;   /* renamed so the local does not shadow the typedef */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->start));
    for (int x = 0; x < loop; ++x)           /* 'loop' is the iteration count, defined elsewhere */
    {
        // where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->end));
    return NULL;
}

void* thread_2_func(void* time)
{
    // similar to the above
}
int main()
{
    ...
    pthread_t thread_1;
    pthread_t thread_2;
    thread_time thread_1_time;
    thread_time thread_2_time;
    struct timespec start, end;

    // stamp the start time
    clock_gettime(CLOCK_MONOTONIC, &start);

    // create two threads with the time structs as the arguments
    pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
    pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);

    // wait for the two threads to terminate
    pthread_join(thread_1, NULL);
    pthread_join(thread_2, NULL);

    // stamp the end time
    clock_gettime(CLOCK_MONOTONIC, &end);

    // then I calculate the difference between the total execution time
    // and the total execution time of the two threads...
}
First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong; this clock will give the time spent in that thread, in user mode. However, the context switch does not happen in user mode, so you'd want to use another clock. Also, on multiprocessing systems the clocks can give different values from one processor to another! Thus I suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. However, be warned that even if you read either of these twice in rapid succession, the timestamps will usually already be tens of nanoseconds apart.
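A quick way to see that last point for yourself is to time two back-to-back reads; this is only an illustration and the numbers will vary by machine and kernel:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    clock_gettime(CLOCK_MONOTONIC, &b);
    /* difference between two immediately consecutive reads */
    printf("%ld ns\n",
           (long)((b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec)));
    return 0;
}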
As for context switches - there are many kinds of context switches. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers onto the stack, set the task-switched flag so that SSE/FP registers will be saved lazily, save the stack pointer, load the new stack pointer and return from that function - since the other thread had done the same, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower: this is because the page tables must be switched by loading the CR3 register, which causes misses in the TLB, the cache that maps virtual addresses to physical ones.
However, the <1 ns context switch/system call overhead does not really seem plausible - it is very probable that there is either hyperthreading or 2 CPU cores here, so I suggest that you set the CPU affinity for the process so that Linux only ever runs it on, say, the first CPU core:
#define _GNU_SOURCE
#include <sched.h>

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
int result = sched_setaffinity(0, sizeof(mask), &mask);
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching the floating point / SSE state (this happens lazily), you should have some floating point variables and do calculations on them prior to the context switch, then add, say, 0.1 to some volatile floating point variable after the context switch to see if it has an effect on the switching time.
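As a rough illustration of that last idea (the function names, the constant, and the loop count here are just placeholders, not part of the original suggestion):

volatile double fp_sink = 0.0;    /* volatile so the compiler cannot drop the FP work */

void dirty_fp_state(void)         /* call before the switch: leaves live data in the FP/SSE registers */
{
    double acc = 1.0;
    for (int i = 0; i < 1000; i++)
        acc = acc * 1.000001 + 0.5;
    fp_sink += acc;
}

void touch_fp_after_switch(void)  /* call after the switch: forces the lazily saved FP state to be restored */
{
    fp_sink += 0.1;
}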
This is not straightforward, but as usual someone has already done a lot of work on this. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -lpthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note that while the user in the link above was running the test, they were also hammering the machine with games and compilation, which is why the context switches were taking so long. Some more info here...
how can you measure the time spent in a context switch under java platform

printf in RT thread

I am writing a multi-threaded application in Linux.
There is no RT patch in the kernel, yet I use threads with priorities.
When I check the time it takes to execute printf, I measure a different value every time, even though it is done in the highest-priority thread:
if (clock_gettime(CLOCK_MONOTONIC, &start))
{
    /* handle error */
}

for (int i = 0; i < 1000; i++)
    printf("hello world");

if (clock_gettime(CLOCK_MONOTONIC, &end))
{
    /* handle error */
}

elapsedSeconds = TimeSpecToSeconds(&end) - TimeSpecToSeconds(&start);
Why does printf change the timing, and in a non-deterministic way, i.e., why does each measurement give a different value?
How should printf be used with RT threads?
Can it be used inside an RT thread, or should it be avoided entirely?
Should writes to disk be treated in the same way as printf? Should they be done only in a separate low-priority thread?
Under the hood, printf goes through buffered I/O, a non-realtime (and potentially blocking) mechanism.
It's not only non-deterministic, it also opens the possibility of priority inversion.
You should be very careful using it from a real-time thread (I would say avoid it entirely).
Normally, in latency-bound code you would do wait-free binary logging into a chain of (pre-allocated or memory-mapped) ring buffers and flush them from a lower-priority background thread (or even a separate process), as sketched below.
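A minimal sketch of that pattern, assuming a single RT producer and a single background consumer; the sizes, names, and the drop-on-full policy are illustrative choices, not part of the original description:

#include <stdatomic.h>
#include <string.h>
#include <unistd.h>

#define SLOT_SIZE 128
#define NSLOTS    1024                     /* power of two so indices can be masked */

static char slots[NSLOTS][SLOT_SIZE];      /* pre-allocated message slots */
static atomic_size_t head = 0;             /* advanced only by the RT producer */
static atomic_size_t tail = 0;             /* advanced only by the consumer */

/* Called from the RT thread: never blocks, drops the message if the ring is full. */
int rt_log(const char *msg)
{
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == NSLOTS)
        return -1;                         /* ring full: drop rather than block */
    strncpy(slots[h & (NSLOTS - 1)], msg, SLOT_SIZE - 1);
    slots[h & (NSLOTS - 1)][SLOT_SIZE - 1] = '\0';
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return 0;
}

/* Called periodically from a low-priority thread: does the blocking I/O. */
void drain_log(int fd)
{
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&head, memory_order_acquire);
    while (t != h) {
        const char *s = slots[t & (NSLOTS - 1)];
        if (write(fd, s, strlen(s)) < 0) {
            /* handle or ignore I/O errors here; the RT side is unaffected */
        }
        t++;
    }
    atomic_store_explicit(&tail, t, memory_order_release);
}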

Measure time a task spends between 2 points in linux (task profiling)

I'll soon start banging my head on the wall:
It's very simple really, I want to measure the time a task spends between 2 points (in Linux - 1 core - 1 CPU).
During this time the task must have total control over the CPU and NOT get interrupted by any other task or HW interrupts.
To achieve this, I've created a kernel module to make sure the above criteria are met.
In this kernel module I've tried to:
First, disable IRQs:
I've used spin_lock_irqsave()/spin_unlock_irqrestore() - which I presume is the right way to be sure that all local interrupts are disabled and my task has the CPU to itself during the critical region.
Then,
I used preempt_disable() -> since current = my task, logically the kernel should continue running my task until I re-enable preemption -> this does not work (my_task->nvcsw and my_task->nivcsw show that a context switch has occurred -> my task got preempted).
I've also tried to increase the priority of my task by changing my_task->prio and my_task->static_prio to 1 -> the highest real-time prio (my_task->policy = SCHED_FIFO)...
That did not work either (my_task->nvcsw and my_task->nivcsw show that a context switch has occurred -> my task got preempted), and my_task->prio was given a new value (120), by the scheduler I presume...
Is there any way to deterministically guarantee that a task does not get interrupted/preempted in Linux? Is there any way to force the scheduler to run a task (for a short time, 50-500 us) until it's done?
Here is my code to enable/disable parts of the OS (the task in question sends an enable/disable command before and after the critical region via procfs, which is handled by this switch):
// Handle request
switch (enable) {
// Disable OS
case COS_OS_DISABLE:
    // Disable preemption
    preempt_disable();
    // Save policy
    last_policy = pTask->policy;
    // Save task priorities
    last_prio = pTask->prio;
    last_static_prio = pTask->static_prio;
    last_normal_prio = pTask->normal_prio;
    last_rt_priority = pTask->rt_priority;
    // Set priorities to highest real-time prio
    pTask->prio = 1;
    pTask->static_prio = 1;
    pTask->normal_prio = 1;
    pTask->rt_priority = 1;
    // Set scheduler policy to FIFO
    pTask->policy = SCHED_FIFO;
    // Lock kernel: this disables interrupts _locally_, but the spinlock itself guarantees
    // the global lock, so there is only one thread of control within the region(s)
    // protected by that lock.
    spin_lock_irqsave(&mr_lock, flags);
    break;
// Default: Enable OS always
case COS_OS_ENABLE:
default:
    // Reset task priorities
    pTask->prio = last_prio;
    pTask->static_prio = last_static_prio;
    pTask->normal_prio = last_normal_prio;
    pTask->rt_priority = last_rt_priority;
    // Reset scheduler policy
    pTask->policy = last_policy;
    // Unlock kernel
    spin_unlock_irqrestore(&mr_lock, flags);
    // Enable preemption
    preempt_enable();
    break;
}
Disabling interrupts is allowed only for kernel code, and only for a short time.
With the stock kernel, it is not possible to give a user-space task total control of the CPU.
If you want to measure only the time used by your user-space task, you could run your task normally and use the u modifier of perf to ignore interrupts; however, this would not prevent any cache effects of the interrupt handlers.
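For example, something along these lines counts only user-space activity (the event list is just an illustration, and ./your_task stands in for your program):
perf stat -e cycles:u,instructions:u ./your_task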

Why doesn't the System Monitor show the correct CPU affinity?

I have searched for questions/answers on CPU affinity and read the results, but I still cannot get my threads to stay pinned to a single CPU.
I am working on an application that will be run on a dedicated Linux box, so I am not concerned about other processes, only my own. This app currently spawns off one pthread, and then the main thread enters a while loop to process control messages using POSIX message queues. This while loop blocks waiting for a control message to come in and then processes it. So the main thread is very simple and non-critical. My code is working very well, as I can send this app messages and it will process them just fine. All control messages are very small in size and are used just to control the functionality of the application; that is, only a few control messages are ever sent/received.
Before I enter this while loop, I use sched_getaffinity() to log all of the CPUs available. Then I use sched_setaffinity() to set this process to a single CPU. Then I call sched_getaffinity() again to check that it is set to run on only one CPU, and it is indeed correct.
The single pthread that was spawned off does a similar thing. The first thing I do in the newly created pthread is call pthread_getaffinity_np() and check the available CPUs, then call pthread_setaffinity_np() to set it to a different CPU, then call pthread_getaffinity_np() to check that it is set as desired, and it is indeed correct.
This is what is confusing. When I run the app and view the CPU History in System Monitor, I see no difference from when I run the app without all of this set-affinity code. The scheduler still runs it for a couple of seconds on each of the 4 CPUs on this quad-core box. So it appears that the scheduler is ignoring my affinity settings.
Am I wrong in expecting to see some proof that the main thread and the pthread are each actually running on their own single CPU? Or have I forgotten to do something more to get this to work as I intend?
Thanks,
-Andres
You have no answers yet, so I will give you what I can: some partial help.
Assuming you checked the return values from pthread_setaffinity_np:
How you assign your cpuset is very important: create it in the main thread, for what you want, and it will propagate to threads created afterwards. Did you check return codes?
The cpuset you actually get will be the intersection of the CPUs available in hardware and the cpuset you define.
min.h in the code below is a generic build include file. You have to define _GNU_SOURCE - please note the comment near the end of the code. CPU_SET and CPU_SETSIZE are macros. I think I define them somewhere else, I do not remember. They may be in a standard header.
#define _GNU_SOURCE
#include "min.h"
#include <pthread.h>
#include <stdio.h>      /* fprintf, perror, printf - may already come in via min.h */
#include <stdlib.h>     /* exit */

int
main(int argc, char **argv)
{
    int s, j;
    cpu_set_t cpuset;
    pthread_t tid = pthread_self();

    // Set affinity mask to include CPUs 0 & 1
    CPU_ZERO(&cpuset);
    for (j = 0; j < 2; j++)
        CPU_SET(j, &cpuset);

    s = pthread_setaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_setaffinity_np");
        exit(1);
    }

    // let's see what we really have in the actual affinity mask assigned to our thread
    s = pthread_getaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_getaffinity_np");
        exit(1);
    }

    printf("my cpuset has:\n");
    for (j = 0; j < CPU_SETSIZE; j++)
        if (CPU_ISSET(j, &cpuset))
            printf(" CPU %d\n", j);

    // #Andres note: any pthread_create call from here on creates a thread with the identical
    // cpuset - you do not have to call it in every thread.
    return 0;
}
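If you want more direct proof than the System Monitor graphs, you can ask the kernel from another terminal; for example (where <pid> stands in for your process id):
taskset -p <pid>                 # prints the current affinity mask of the process
ps -L -o tid,psr,comm -p <pid>   # psr = the processor each thread last ran on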

What is the best way for interprocessor communication in Linux?

I have two CPUs on a chip, and they have a shared memory. This is not an SMP architecture; just two CPUs on the chip with shared memory.
There is a Unix-like operating system on the first CPU and a Linux operating system on the second CPU.
The first CPU does some job, and the result of this job is some data. After the first CPU finishes its job, it should tell the other CPU that the job is finished, and the second CPU then has to process this data.
What is the way to handle interprocessor communication? What algorithm should I use to do that?
Any reference to an article about it would be greatly appreciated.
It all depends on the hardware. If all you have is shared memory, and no other way of communicating, then you have to use polling of some sort.
Are both of your processors running Linux? How do they handle the shared memory?
A good solution is to use a linked list as a FIFO. On this FIFO you put data descriptors, i.e., address and size.
For example, you can have an input and an output FIFO and proceed like this (a sketch of such a descriptor FIFO follows the list):
Processor A does some calculation
Processor A pushes the data descriptor onto the output FIFO
Processor A waits for a data descriptor on the input FIFO
loop
Processor B waits for a data descriptor on the output FIFO
Processor B works with the data
Processor B pushes the used data descriptor onto the input FIFO
loop
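For illustration only, such a descriptor FIFO in shared memory might look something like this; the field names, the fixed ring size, and the use of plain volatile indices are assumptions, and the locking question below still applies:

/* One descriptor per chunk of work; lives in the shared memory region. */
struct data_desc {
    unsigned long addr;            /* where the payload sits in shared memory */
    unsigned long size;            /* payload size in bytes */
};

/* A fixed-size ring used as the input or the output FIFO. */
struct desc_fifo {
    volatile unsigned int head;    /* advanced by the side that pushes descriptors */
    volatile unsigned int tail;    /* advanced by the side that pops descriptors */
    struct data_desc slots[64];
};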
Of course, the hard part is in the locking. Maybe you should reformulate your question to emphasize that this is not 'standard' SMP.
If you have no atomic test-and-set operation available on that memory, I guess you have to go with a scheme where some zone of memory is write-only for one processor and read-only for the other.
Edit: See Hasturkun's answer for a way of passing messages from one processor to the other, using ordered writes instead of atomicity to provide serialized access to some predefined data.
OK, I understand the question. I have worked on this kind of issue.
The first thing you need to understand is how the shared memory between the 2 CPUs works. Because this shared memory can be accessed in different ways, you need to figure out which one suits you best.
Most of the time, hardware semaphores will be provided in the shared memory, along with a hardware interrupt to notify message transfer from one processor to the other.
So have a look at this first.
A really good method is to just send IP packets back and forth (using sockets). This has the advantage that you can test stuff off-chip - as in, run a test version of one process on a PC, if you have networking.
If both processors are managed by a single OS, then you can use any of the standard IPC mechanisms to communicate, as the OS takes care of everything. If they are running different OSes, then sockets would be your best bet.
EDIT
Quick unidirectional version:
Ready flag
Done flag
init:
init()
{
    ready = 0;
    done = 1;
}
writer:
send()
{
    while (!done)
        sleep();
    /* copy data in */
    done = 0;
    ready = 1;
}
reader:
poll()
{
    while (1)
    {
        if (ready)
        {
            recv();
        }
        sleep();
    }
}
recv()
{
    /* copy data out */
    ready = 0;
    done = 1;
}
Build a message-passing system via the shared memory (which should be coherent, either by being uncached for both processors, or by use of cache flush/invalidate calls).
Your shared memory structure should have (at least) the following fields:
Current owner
Message active (as in, should be read)
Request usage fields
The flow will probably be like this (it is assumed that send/recv on the same side are synchronized so they do not run at the same time):
poll()
{
    /* you're better off using interrupts instead, if you have them */
    while (1)
    {
        if (current_owner == me)
        {
            if (active)
            {
                recv();
            }
            else if (!request[me] && request[other])
            {
                request[other] = 0;
                current_owner = other;
            }
        }
        sleep();
    }
}

recv()
{
    /* copy data... */
    active = 0;
    /* check if we still want it */
    if (!request[me] && request[other])
    {
        request[other] = 0;
        current_owner = other;
    }
}

send()
{
    request[me] = 1;
    while (current_owner != me || active)
    {
        sleep();
    }
    request[me] = 0;
    /* copy data in... */
    /* pass to other side */
    active = 1;
    current_owner = other;
}
How about using the shared mem?
I don't have a good link right now, but if you google for IPC + shared mem I bet you find some good info :)
Are you sure you need to do this? In my experience you're better off letting your compiler & operating system manage how your process uses multiple CPUs.
