What does the sched_feat macro in the scheduler mean? - Linux

The following macro is defined in ./kernel/sched/sched.h
#define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
#else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
I am unable to understand what role it plays.

The sched_feat() macro is used in scheduler code to test if a certain scheduler feature is enabled. For example, in kernel/sched/core.c, there is a snippet of code
int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
        if (!sched_feat(OWNER_SPIN))
                return 0;
which tests whether the "spin-wait on mutex acquisition if the mutex owner is running" feature is enabled. You can see the full list of scheduler features in kernel/sched/features.h, but in short they are tunables that can be changed at runtime through /sys/kernel/debug/sched_features, without rebuilding the kernel.
For example if you have not changed the default settings on your system, you will see "OWNER_SPIN" in your /sys/kernel/debug/sched_features, which means the !sched_feat(OWNER_SPIN) in the snippet above will evaluate to false and the scheduler code will continue on into the rest of the code in mutex_spin_on_owner().
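To make the simpler (non-jump-label) variant concrete, here is a small stand-alone sketch of the same idea: each feature is one bit in a flags word, and sched_feat() just tests that bit. The names mirror the kernel's, but this is only an illustration, not kernel code.

#include <stdio.h>

/* One enum constant per feature; in the real kernel these are generated from features.h. */
enum { __SCHED_FEAT_OWNER_SPIN, __SCHED_FEAT_GENTLE_FAIR_SLEEPERS, __SCHED_FEAT_NR };

/* Bitmask of currently enabled features; OWNER_SPIN is on by default here. */
static unsigned long sysctl_sched_features = 1UL << __SCHED_FEAT_OWNER_SPIN;

#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))

int main(void)
{
        printf("OWNER_SPIN enabled: %d\n", !!sched_feat(OWNER_SPIN));

        /* Clearing the bit is, in effect, what writing NO_OWNER_SPIN to
         * /sys/kernel/debug/sched_features ends up doing. */
        sysctl_sched_features &= ~(1UL << __SCHED_FEAT_OWNER_SPIN);
        printf("OWNER_SPIN enabled: %d\n", !!sched_feat(OWNER_SPIN));
        return 0;
}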
The reason the macro definition you partially copied is more complicated than you might expect is that it uses the jump labels feature, when available, to eliminate the overhead of these conditional tests in frequently run scheduler code paths. (The jump label version is only used when HAVE_JUMP_LABEL is set in the config, for obvious reasons, and when SCHED_DEBUG is set, because otherwise the scheduler feature bits can't change at runtime.) The LWN article on jump labels has more details, but in a nutshell, jump labels are a way to use runtime binary patching to make conditional tests of flags much cheaper, at the cost of making changes to the flags much more expensive.
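If you are curious what the jump-label side looks like, here is a rough sketch using the static-key interface found in newer kernels; the names my_feature and do_feature_work() are made up, and the exact API has changed across kernel versions.

#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(my_feature);

static void do_feature_work(void)
{
        /* whatever the optional feature does */
}

static void hot_path(void)
{
        /* Compiles to a no-op that is binary-patched into a jump when the
         * key is enabled, so the disabled case costs essentially nothing. */
        if (static_branch_unlikely(&my_feature))
                do_feature_work();
}

static void control_path(void)
{
        /* Expensive: patches every call site in the kernel text. */
        static_branch_enable(&my_feature);
}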
You can also look at the scheduler commit that introduced jump label use to see how the code used to be a bit simpler but not quite as efficient.

Related

Can I block a new process execution using Kprobe?

Kprobe has a pre-handler function vaguely documented as follows:
User's pre-handler (kp->pre_handler)::
#include <linux/kprobes.h>
#include <linux/ptrace.h>
int pre_handler(struct kprobe *p, struct pt_regs *regs);
Called with p pointing to the kprobe associated with the breakpoint,
and regs pointing to the struct containing the registers saved when
the breakpoint was hit. Return 0 here unless you're a Kprobes geek.
I was wondering if one can use this function (or any other Kprobe feature) to prevent a process from being executed / forked.
As documented in the kernel documentation, you can change the execution path by changing the appropriate register (e.g., IP register in x86):
Changing Execution Path
-----------------------
Since kprobes can probe into a running kernel code, it can change the
register set, including instruction pointer. This operation requires
maximum care, such as keeping the stack frame, recovering the execution
path etc. Since it operates on a running kernel and needs deep knowledge
of computer architecture and concurrent computing, you can easily shoot
your foot.
If you change the instruction pointer (and set up other related
registers) in pre_handler, you must return !0 so that kprobes stops
single stepping and just returns to the given address.
This also means post_handler should not be called anymore.
Note that this operation may be harder on some architectures which use
TOC (Table of Contents) for function call, since you have to setup a new
TOC for your function in your module, and recover the old one after
returning from it.
So you might be able to block a process' execution by jumping over some code. I wouldn't recommend it; you're more likely to cause a kernel crash than to succeed in stopping the execution of a new process.
seccomp-bpf is probably better suited for your use case. This StackOverflow answer gives you all the information you need to leverage seccomp-bpf.
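For a taste of what that looks like, here is a minimal user-space sketch (not taken from that answer, and omitting the architecture check a production filter should have) of a seccomp-bpf filter that makes execve fail with EPERM for the calling process and its children:

#include <stdio.h>
#include <stddef.h>
#include <errno.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
        struct sock_filter filter[] = {
                /* load the syscall number */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct seccomp_data, nr)),
                /* if it is execve, fall through to the EPERM return */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, 0, 1),
                BPF_STMT(BPF_RET | BPF_K,
                         SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
                /* everything else is allowed */
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
                .filter = filter,
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
                perror("prctl");
                return 1;
        }

        /* this execlp() should now fail with EPERM */
        execlp("/bin/true", "true", (char *)NULL);
        perror("execlp");
        return 0;
}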

User defined atomic less than

I've been reading and it seems that std::atomic doesn't support a compare and swap of the less/greater than variant.
I'm using OpenMP and need to safely update a global minimum value.
I was thinking this would be as easy as using a built-in API.
But alas, it is not, so instead I'm trying to come up with my own implementation.
I'm primarily concerned with the fact that I don't want to use an omp critical section to do a less than comparison every single time because it may incur significant synchronization overhead for very little gain in most cases.
But in those cases where a new global minimum is potentially found (less often), the synchronization overhead is acceptable. I'm thinking I can implement it using the following method. Hoping for someone to advise.
1. Use an std::atomic_uint as the global minimum.
2. Atomically read the value into a thread-local variable.
3. Compare it against the current candidate and, if the candidate is less, attempt to enter a critical section.
4. Once synchronized, verify that the candidate is still less than the atomic value and update accordingly (the body of the critical section should be cheap, just update a few values).
This is for a homework assignment, so I'm trying to keep the implementation my own. Please don't recommend various libraries to accomplish this. But please do comment on the synchronization overhead that this operation can incur or if it's bad, elaborate on why. Thanks.
What you're looking for would be called fetch_min() if it existed: fetch old value and update the value in memory to min(current, new), exactly like fetch_add but with min().
This operation is not directly supported in hardware on x86, but machines with LL/SC could emit slightly more efficient asm for it than you get from emulating it with a CAS(old, min(old,new)) retry loop.
You can emulate any atomic operation with a CAS retry loop. In practice it usually doesn't have to retry, because the CPU that succeeded at doing a load usually also succeeds at CAS a few cycles later after computing whatever with the load result, so it's efficient.
See Atomic double floating point or SSE/AVX vector load/store on x86_64 for an example of creating a fetch_add for atomic<double> with a CAS retry loop, in terms of compare_exchange_weak and plain + for double. Do that with min and you're all set.
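As a sketch of that recipe (here in C11 with stdatomic.h; the same shape works with std::atomic's compare_exchange_weak), a fetch_min built from a CAS retry loop might look like this:

#include <stdatomic.h>

/* Returns the old value and stores min(old, val), like a hypothetical
 * fetch_min(). In practice the loop rarely runs more than once. */
static unsigned fetch_min(_Atomic unsigned *obj, unsigned val)
{
        unsigned old = atomic_load_explicit(obj, memory_order_relaxed);
        while (val < old &&
               !atomic_compare_exchange_weak(obj, &old, val)) {
                /* CAS failed: old now holds the current value; retry */
        }
        return old;
}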
Re: clarification in comments: I think you're saying you have a global minimum, but when you find a new one, you want to update some associated data, too. Your question is confusing because "compare and swap on less/greater than" doesn't help you with that.
I'd recommend using atomic<unsigned> globmin to track the global minimum, so you can read it to decide whether or not to enter the critical section and update related state that goes with that minimum.
Only ever modify globmin while holding the lock (i.e. inside the critical section). Then you can update it + the associated data. It has to be atomic<> so readers that look at just globmin outside the critical section don't have data race UB. Readers that look at the associated extra data must take the lock that protects it and makes sure that updates of globmin + the extra data happen "atomically", from the perspective of readers that obey the lock.
#include <atomic>
#include <mutex>
#include <climits>

struct Extradata { /* whatever data goes along with the minimum */ };

static std::atomic<unsigned> globmin{UINT_MAX};  // start at max so any candidate can lower it
std::mutex globmin_lock;
static struct Extradata globmin_extra;

void new_min_candidate(unsigned newmin, const struct Extradata &newdata)
{
    // light-weight early-out check to avoid the critical section
    // No ordering requirement as long as globmin is monotonically decreasing with time
    if (newmin < globmin.load(std::memory_order_relaxed))
    {
        // enter a critical section. Use OpenMP stuff if you want, this is plain ISO C++
        std::lock_guard<std::mutex> lock(globmin_lock);

        // Check globmin again, after we've excluded other threads from modifying it and globmin_extra
        if (newmin < globmin.load(std::memory_order_relaxed)) {
            globmin.store(newmin, std::memory_order_relaxed);
            globmin_extra = newdata;
        }
        // else leave the critical section with no update:
        // another thread raced with us *outside* the critical section

        // release the lock / leave critical section (lock goes out of scope here: RAII)
    }
    // else do nothing
}
std::memory_order_relaxed is sufficient for globmin: there's no ordering required with anything else, just atomicity. We get atomicity / consistency for the associated data from the critical section/lock, not from memory-ordering semantics of loading / storing globmin.
This way the only atomic read-modify-write operation is the locking itself. Everything on globmin is either load or store (much cheaper). The main cost with multiple threads will still be bouncing the cache line around, but once you own a cache line, each atomic RMW is maybe 20x more expensive than a simple store on modern x86 (http://agner.org/optimize/).
With this design, if most candidates aren't lower than globmin, the cache line will stay in the Shared state most of the time, so the globmin.load(std::memory_order_relaxed) outside the critical section can hit in L1D cache. It's just an ordinary load instruction, so it's extremely cheap. (On x86, even seq_cst loads are just ordinary loads (and release stores are just ordinary stores, but seq_cst stores are more expensive). On other architectures where the default ordering is weaker, seq_cst / acquire loads need a barrier.)

Kernel spin-lock enables preemption before releasing lock

When I was discussing the behavior of spinlocks in uni- and SMP kernels with some colleagues, we dived into the code and found a line that really surprised us, and we can’t figure out why it’s done this way.
short calltrace to show where we’re coming from:
spin_lock calls raw_spin_lock,
raw_spin_lock calls _raw_spin_lock, and
on a uni-processor system, _raw_spin_lock is #defined as __LOCK
__LOCK is a define:
#define __LOCK(lock) \
do { preempt_disable(); ___LOCK(lock); } while (0)
So far, so good. We disable preemption by increasing the kernel task’s lock counter. I assume this is done to improve performance: since you should not hold a spinlock for more than a very short time, you should just finish your critical section instead of being interrupted and potentially having another task spin its scheduling slice away while waiting for you to finish.
However, now we finally come to my question. The corresponding unlock code looks like this:
#define __UNLOCK(lock) \
do { preempt_enable(); ___UNLOCK(lock); } while (0)
Why would you call preempt_enable() before ___UNLOCK? This seems very unintuitive to us, because you might get preempted immediately after calling preempt_enable, without ever having had the chance to release your spinlock. It feels like this renders the whole preempt_disable/preempt_enable logic somewhat ineffective, especially since preempt_enable specifically checks whether the lock counter has dropped back to 0 and, if so, calls the scheduler. It seems to us that it would make much more sense to first release the lock, then decrease the lock counter and thus potentially enable scheduling again.
What are we missing? What is the idea behind calling preempt_enable before ___UNLOCK instead of the other way round?
You're looking at the uni-processor defines. As the comment in spinlock_api_up.h says (http://lxr.free-electrons.com/source/include/linux/spinlock_api_up.h#L21):
/*
* In the UP-nondebug case there's no real locking going on, so the
* only thing we have to do is to keep the preempt counts and irq
* flags straight, to suppress compiler warnings of unused lock
* variables, and to add the proper checker annotations:
*/
The ___LOCK and ___UNLOCK macros are there for annotation purposes, and unless __CHECKER__ is defined (it is defined by sparse), they end up being compiled out.
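For reference, in that header the underlying definitions are just a checker annotation plus a cast that keeps the lock variable "used" (quoted from the UP header, so double-check against your kernel tree):

#define ___LOCK(lock) \
  do { __acquire(lock); (void)(lock); } while (0)

#define ___UNLOCK(lock) \
  do { __release(lock); (void)(lock); } while (0)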
In other words, preempt_enable() and preempt_disable() are the ones doing the locking in a single processor case.

Meaning of the ENQUEUE_WAKEUP macro in Linux

I am not sure what the macro ENQUEUE_WAKEUP in Linux means. My intuition is that it means a task is being enqueued after it has woken up, but I still want to be sure.
The macro definition is:
#define ENQUEUE_WAKEUP 1
Note: For reference purposes, in v3.5.4 it is defined in /include/linux/sched.h and referenced in many places, but one place where I am having a problem is the function enqueue_task_rt in ./kernel/sched/rt.c.
This is where it was introduced:
sched: Add enqueue/dequeue flags
In order to reduce the dependency on TASK_WAKING rework the enqueue
interface to support a proper flags field.
ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task
http://lkml.indiana.edu/hypermail/linux/kernel/1004.0/00744.html
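So in the enqueue path of a scheduling class, the flag is typically tested like this (a rough sketch, not the exact body of enqueue_task_rt):

static void enqueue_task_example(struct rq *rq, struct task_struct *p, int flags)
{
        if (flags & ENQUEUE_WAKEUP) {
                /* p is a sleeping task that is being woken up, as opposed
                 * to being migrated between runqueues or changing priority */
        }

        /* ... the actual enqueue work for the scheduling class ... */
}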

Thread Communication

Is there any tool available for tracing communication among threads:
1. running in a single process
2. running in different processes (IPC)
I am presuming you need to trace this for debugging. Under normal circumstances it's hard to do this without custom-written code. For a similar problem that I faced, I had a per-processor tracing buffer which briefly recorded the time and the interesting operation being performed by the running thread. The log was a circular trace which stored data like this:
struct trace_data {
        int op;
        void *data;
        struct time t;
        union {
                struct {
                        int op1_field1;
                        int op1_field2;
                } d1;
                struct {
                        int op2_field1;
                        int op2_field2;
                } d2;
        } u;
};
The trace log was an array of these structures of length 1024, one for each processor. Each thread used to trace operations, as well as time to determine causality of events. The fields which were used to store data in the "union" depended upon the operation being done. The "data" pointer's meaning depended upon the "op" as well. When the program used to crash, I'd open the core in gdb and I had a gdb script which would go through the logs in each processor and print out the ops and their corresponding data, to find out the history of events.
For different processes you could do such logging to a file instead - one per process. This example is in C, but you can do this in whatever language you want to use, as long as you can figure out the CPU id on which the thread is running currently.
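A stripped-down, user-space version of that idea might look like the sketch below; the names (trace_event, trace_log) are invented for illustration, and sched_getcpu() stands in for however you determine the current CPU:

#define _GNU_SOURCE
#include <sched.h>       /* sched_getcpu() */
#include <stdatomic.h>
#include <time.h>

#define MAX_CPUS  64
#define TRACE_LEN 1024

struct trace_entry {
        int op;                 /* what the thread was doing */
        void *data;             /* meaning depends on op */
        struct timespec ts;     /* when, for ordering events across CPUs */
};

static struct trace_entry trace_log[MAX_CPUS][TRACE_LEN];
static _Atomic unsigned trace_idx[MAX_CPUS];

/* Called by any thread at interesting points; wraps around after
 * TRACE_LEN entries so the buffer always holds the most recent history. */
void trace_event(int op, void *data)
{
        int cpu = sched_getcpu();
        unsigned i = atomic_fetch_add(&trace_idx[cpu], 1) % TRACE_LEN;

        trace_log[cpu][i].op = op;
        trace_log[cpu][i].data = data;
        clock_gettime(CLOCK_MONOTONIC, &trace_log[cpu][i].ts);
}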
You might be looking for something like the Intel Thread Checker as long as you're using pthreads in (1).
For communication between different processes (2), you can use Aspect-Oriented Programming (AOP) if you have the source code, or write your own wrapper for the IPC functions and LD_PRELOAD it.
Edit: Whoops, you said tracing, not checking.
It will depend so much on the operating system and development environment that you are using. If you're using Visual Studio, look at the tools in Visual Studio 2010.
