Why is there an infinite loop at the end of do_exit() defined in kernel/exit.c? - linux

In the Linux kernel, I am confused about the purpose of having the loop at the end of do_exit().
Isn't the call to schedule() the last code that will ever be executed by do_exit()?
void do_exit(long code)
{
	struct task_struct *tsk = current;
	int group_dead;

	..........

	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;
	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
	schedule();
	BUG();
	/* Avoid "noreturn function does return". */
	for (;;)
		cpu_relax();	/* For when BUG is null */
}

Depending on the kernel version it may appear slightly different, but the structure is generally the same.
In some implementations it starts with:
NORET_TYPE void do_exit(long code)
NORET_TYPE may be defined differently across GCC versions (it may be an attribute that marks a function as not returning, or it may be volatile). What does such a declaration on a function do? It effectively says: this function won't return. You can find more about it in the GCC documentation, which says:
The attribute noreturn is not implemented in GCC versions earlier than 2.5. An alternative way to declare that a function does not return, which works in the current version and in some older versions, is as follows:
typedef void voidfn ();
volatile voidfn fatal;
volatile void functions are a non-conforming extension to the C standard created by the GCC developers; you won't find them in the ANSI C Standard (C89).
The point is that when the kernel reaches do_exit() it does not intend to return from the function. Generally it will block indefinitely or until something resets the system. The problem is that if you mark a function as not returning, the compiler will warn you if the function can still exit. So you generally see an infinite loop of some sort (a while, for, goto, etc.). In your case it does:
/* Avoid "noreturn function does return". */
for (;;)
Interestingly enough, the comment pretty much gives the reason: noreturn function does return is a GCC compiler warning. An infinite loop that is obvious to the compiler (like for (;;)) is enough to silence it, because the compiler can determine that the function never reaches a point where it could exit.
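To illustrate, here is a minimal user-space sketch of the same pattern (hypothetical names, not kernel code): cleanup() is an ordinary function, so as far as the compiler knows control can reach the end of die(), and the loop is what silences the warning.

#include <stdio.h>

/* An ordinary function: the compiler must assume it can return. */
static void cleanup(long code)
{
	fprintf(stderr, "exiting with code %ld\n", code);
}

__attribute__((noreturn)) static void die(long code)
{
	cleanup(code);
	/* Without this, GCC warns: "'noreturn' function does return". */
	for (;;)
		;
}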
Even if there is no compiler warning to worry about, the infinite loop prevents the function from returning. At some point a kernel (not necessarily Linux) is going to be faced with the fact that it was started by a jmp instruction (at least on x86 systems). Often the jmp is used to set the code segment for entering protected mode, or at the most basic level the BIOS jumps to the code that loads the boot sector (in very simple OSes). This means that there is a finite end to the code, and to prevent the processor from executing invalid instructions it is better to keep it busy doing nothing of interest.
The line cpu_relax(); /* For when BUG is null */ fixes a kernel bug. It is mentioned in this post by Linus:
sched: Fix ancient race in do_exit()

Related

Why does the ICU use this aliasing barrier when doing a reinterpret_cast?

I'm porting code from ICU 58.2 to ICU 59.1, where they changed the character type from uint16_t to char16_t. I was going to just do a straight reinterpret_cast where I needed to convert the types, but found that ICU 59.1 actually provides functions for this conversion. What I don't understand is why they need to use this anti-aliasing barrier before doing a reinterpret_cast.
#elif (defined(__clang__) || defined(__GNUC__)) && U_PLATFORM != U_PF_BROWSER_NATIVE_CLIENT
#   define U_ALIASING_BARRIER(ptr) asm volatile("" : : "rm"(ptr) : "memory")
#endif
...
inline const UChar *toUCharPtr(const char16_t *p) {
#ifdef U_ALIASING_BARRIER
    U_ALIASING_BARRIER(p);
#endif
    return reinterpret_cast<const UChar *>(p);
}
Why wouldn't it be safe just to use reinterpret_cast without calling U_ALIASING_BARRIER?
At a guess, it's to stop any violations of the strict aliasing rule, that might occur in calling code that hasn't been completely cleaned up, from resulting in unexpected behaviour when optimizing (the hint to this is in the comment above: "Barrier for pointer anti-aliasing optimizations even across function boundaries.").
The strict aliasing rule forbids dereferencing pointers that alias the same object when they have incompatible types (a C notion, but C++ says a similar thing in more words). Here's a small gotcha: char16_t and uint16_t aren't required to be compatible. uint16_t is actually an optionally-supported type (in both C and C++); char16_t has the same representation as uint_least16_t, which isn't necessarily the same type. It will have the same width on x86, but a compiler isn't required to treat them as actually being the same thing, and it might even intentionally avoid assuming that types which typically indicate different intent can alias.
There's a more complete explanation in the linked answer, but basically given code like this:
uint16_t buffer[] = ...
buffer[0] = u'a';
uint16_t * pc1 = buffer;
char16_t * pc2 = (char16_t *)pc1;
pc2[0] = u'b';
uint16_t c3 = pc1[0];
...if for whatever reason the compiler doesn't have char16_t and uint16_t tagged as compatible, and you're compiling with optimizations on including its equivalent of -fstrict-aliasing, it's allowed to assume that the write through pc2 couldn't have modified whatever pc1 points at, and not reload the value before assigning it to c3, possibly giving it u'a' instead.
Code a bit like the example could plausibly arise mid-way through a conversion process where the previous code was happily using uint16_t * everywhere, but now a char16_t * is made available at the top of a block for compatibility with ICU 59, before all the code below has been completely changed to read only through the correctly-typed pointer.
Since compilers don't generally try to analyze hand-coded assembly, the presence of an asm block with a "memory" clobber forces the compiler to discard its assumptions about registers and other temporary values, and do a full reload of every value the first time it is dereferenced after U_ALIASING_BARRIER, regardless of optimization flags. This won't protect you from further aliasing problems if you continue to write through the uint16_t * below the conversion (if you do that, it's legitimately your own fault), but it should at least ensure that state from before the conversion call doesn't persist in a way that could cause writes through the new pointer to be accidentally skipped afterwards.
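As a rough illustration of the mechanism (hypothetical C11 code, not ICU's; here the barrier sits directly between the conflicting accesses, whereas ICU issues it once at the conversion point):

#include <stdint.h>
#include <uchar.h>

uint16_t demo(uint16_t *p1)
{
	char16_t *p2 = (char16_t *)p1;	/* the questionable cast */

	p1[0] = 1;
	p2[0] = 2;	/* write through the differently-typed pointer */
	/* The "memory" clobber makes the compiler assume the asm may have
	 * read or written any memory, so the load below must be redone
	 * rather than assumed to still be 1 under -fstrict-aliasing. */
	__asm__ volatile("" : : "rm"(p1) : "memory");
	return p1[0];	/* reloaded from memory: 2 */
}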

nr_cpus boot parameter in Linux kernel

I was browsing Linux kernel code to understand the nr_cpus boot parameter.
As per the documentation,
(https://www.kernel.org/doc/Documentation/kernel-parameters.txt)
[SMP] Maximum number of processors that an SMP kernel
could support. nr_cpus=n : n >= 1 limits the kernel to
supporting 'n' processors. Later in runtime you can not
use hotplug cpu feature to put more cpu back to online.
just like you compile the kernel NR_CPUS=n
In the smp.c code, the value is set to nr_cpu_ids which is then used everywhere in kernel.
http://lxr.free-electrons.com/source/kernel/smp.c
static int __init nrcpus(char *str)
{
	int nr_cpus;

	get_option(&str, &nr_cpus);
	if (nr_cpus > 0 && nr_cpus < nr_cpu_ids)
		nr_cpu_ids = nr_cpus;

	return 0;
}

early_param("nr_cpus", nrcpus);
What I do not understand is that nr_cpu_ids is also set by setup_nr_cpu_ids():
/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
void __init setup_nr_cpu_ids(void)
{
	nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1;
}
Initially, I thought this is called before the early_param invocation. After adding logs, I found that setup_nr_cpu_ids() is called after nrcpus(), and nr_cpu_ids is always set to the value computed in setup_nr_cpu_ids() instead of the one from nrcpus(). I even verified its value in smp_init().
Can anybody please clarify if my observation is correct or not?
What is the exact usage of nr_cpu_ids?
As the part of the documentation quoted in your question describes:
Maximum number of processors that an SMP kernel could support
Actually both of these functions work on the same value. The early_param() macro registers a handler: the kernel searches the kernel command line for the parameter named in its first argument, and if the search is successful, the function given as its second argument is called.
All functions registered with early_param() are called from do_early_param() in init/main.c, which is invoked from the setup_arch() function. setup_arch() is architecture-specific, and each architecture provides its own implementation. So after the call to the nrcpus() function, nr_cpu_ids will contain the number of processors that the kernel can support.
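As a sketch of that registration pattern (hypothetical parameter name, using the same get_option()/early_param() calls as nrcpus()):

/* Hypothetical early boot parameter, following the nrcpus() pattern. */
static int __init my_param_setup(char *str)
{
	int val;

	get_option(&str, &val);	/* parse the integer after "my_param=" */
	if (val > 0)
		pr_info("my_param set to %d\n", val);
	return 0;
}
early_param("my_param", my_param_setup);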
If you look at the Linux kernel source code, you will note that setup_nr_cpu_ids() is called from init/main.c after the functions marked with early_param(), so in this case it is redundant. But sometimes it can be useful to know the number of processors earlier.
For example, you can see this in the powerpc architecture. As described in the comment of the smp_setup_cpu_maps() function, where setup_nr_cpu_ids() is called:
Having the possible map set up early allows us to restrict allocations
of things like irqstacks to nr_cpu_ids rather than NR_CPUS.
Typically the arch code detects the number of CPUs available on the system, but it is possible to reduce the number of CPUs you want to use, and the nr_cpus parameter was introduced for that reason. By default no one uses this parameter, and in that case the arch code is responsible for detecting the number of CPUs available on the system; for the x86 arch, look at prefill_possible_map(), where a check is made to see whether nr_cpus was passed or not. If nr_cpus was passed, then that value is used. After the arch detects the number of possible CPUs, setup_nr_cpu_ids() in kernel/smp.c finalizes the value of nr_cpu_ids. Do note that it might sound redundant, but since it works, no one complains.
So, your observation is partially correct; the piece you missed is how the arch smpboot code integrates nr_cpus. Hope this clarifies your understanding.
Check out cpumask.h, in particular this:

#define for_each_cpu_mask_nr(cpu, mask)			\
	for ((cpu) = -1;				\
		(cpu) = __next_cpu_nr((cpu), &(mask)),	\
		(cpu) < nr_cpu_ids; )
nr_cpu_ids is the maximum number of usable CPUs, and the nr_cpus value that you pass as a boot parameter is used to set it. This is what it is doing for me in kernel 3.16.
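For illustration, a hypothetical kernel-style loop using that macro (old cpumask API; modern kernels would use for_each_possible_cpu()) never visits a CPU number at or above nr_cpu_ids:

/* Hypothetical sketch: iteration over the possible map is bounded by
 * nr_cpu_ids, however large NR_CPUS was at compile time. */
int cpu;

for_each_cpu_mask_nr(cpu, cpu_possible_map)
	pr_info("CPU %d is possible\n", cpu);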
The comment here

/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
void __init setup_nr_cpu_ids(void)
{
	nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1;
}
is saying that if you have already set nr_cpu_ids, this call is redundant, because cpu_possible_mask has already been set up consistently with nr_cpu_ids.

Pthreads & Multicore compiler

I'm working with an SMP kernel: Snapgear 2.6.21.
I have created 4 threads in my c application, and I am trying to set thread 1 to run on CPU1, thread2 on CPU 2, etc.
However, the compiler sparc-linux-gcc does not recognize these functions:
CPU_SET(int cpu, cpu_set_t *set);
CPU_ZERO(cpu_set_t *set);
and this type: cpu_set_t
It always gives me these errors:
implicit declaration of function 'CPU_ZERO'
implicit declaration of function 'CPU_SET'
'cpu_set_t' undeclared (first use in this function)
Here is my code to bind the active thread to processor 0:

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);	/* bind to processor 0 */
sched_setaffinity(0, sizeof(mask), &mask);
I have included and defined at the top:

#define _GNU_SOURCE
#include <sched.h>
But I always get the same errors. Can you help me please?
You should read sched_setaffinity(2) carefully and test its result (and display errno on failure, e.g. with perror).
Actually, I believe you should use pthread_setaffinity_np(3) instead (and of course test its failure, etc...)
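A minimal sketch of that approach (assuming glibc; note that pthread functions return the error number directly rather than setting errno):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to the given CPU; returns 0 on success. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;
	int err;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	if (err != 0)
		fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
	return err;
}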
Even more, I believe that you should not bother to explicitly set the affinity. Recent Linux kernels are often quite good at dispatching running threads on different CPUs.
So simply use pthreads and don't bother about affinity, unless you see actual issues when benchmarking.
BTW, passing the -H flag to your GCC (cross-)compiler could be helpful: it shows you the included files. Perhaps also look at the preprocessed form obtained with gcc -C -E; it looks like some header files are missing or not found (maybe a missing -I include-directory at compile time, or some missing headers on your development system).
BTW, your kernel version looks ancient. Can't you upgrade your kernel to something newer (3.15.x or some 3.y)?

How to access errno after clone (or: How to set errno location)

Per traditional POSIX, errno is simply an integer lvalue, which works perfectly well with fork but obviously doesn't work nearly as well with threads. As per pthreads, errno is a thread-local integer lvalue. Under Linux/NPTL, as an implementation detail, errno is some macro that expands to a function call returning an integer lvalue.
On my Debian system, this seems to be *__errno_location(); on some other systems I've seen things like &(gettib()->errnum).
TL;DR
Assuming I've used clone to create a thread, can I just use errno and expect that it will work, or do I have to do some special rain dance? For example, do I need to read some special field in the thread information block, or some special TLS value, or do I get to set the address of the thread-local variable where glibc stores the error values somehow? Something like __set_errno_location(), maybe?
Or, will it "just work" as it is?
Inevitably, someone will be tempted to reply "simply use phtreads" -- please don't. I do not want to use pthreads. I want clone. I do not want any of the ill-advised functionality of pthreads, and I do not want to deal with any of its quirks, nor do I want the overhead to implement those quirks. I recognize that much of the crud in pthreads comes from the fact that it has to work (and, surprisingly, it successfully works) amongst others for some completely broken systems that are nearly three decades old, but that doesn't mean that it is necessarily a good thing for everyone and every situation. Portability is not of any concern in this case.
All I want in this particular situation is to fire up another process running in the same address space as the parent, synchronization via a simple lock (say, a futex), and write working properly (which means I also have to be able to read errno correctly). As little overhead as possible; no other functionality or special behavior needed or even desired.
According to the glibc source code, errno is defined as a thread-local variable. Unfortunately, this requires significant C library support. Any threads created using pthread_create() will be made aware of thread-local variables. I would not even bother trying to get glibc to accept your foreign threads.
An alternative would be to use a different libc implementation that may allow you to extract some of its internal structures and manually set the thread control block if errno is part of it. This would be incredibly hacky and unreliable. I doubt you'll find anything like __set_errno_location(), but rather something like __set_tcb().
#include <bits/some_hidden_file.h>

void init_errno(void)
{
	struct __tcb *tcb;

	/* Allocate a dummy thread control block (malloc may set errno,
	 * so you might have to store the tcb on the stack or allocate
	 * it in the parent). */
	tcb = malloc(sizeof(struct __tcb));

	/* Initialize errno. */
	tcb->errno = 0;

	/* Set the pointer to the thread control block (x86). */
	arch_prctl(ARCH_SET_FS, tcb);
}
This assumes that the errno macro expands to something like: ((struct __tcb*)__read_fs())->errno.
Of course, there's always the option of implementing an extremely small subset of libc yourself. Or you could write your own implementation of the write() system call with a custom stub to handle errno and have it co-exist with the chosen libc implementation.
#define my_errno /* errno variable stored at some known location */

ssize_t my_write(int fd, const void *buf, size_t len)
{
	ssize_t ret;

	__asm__ (
		/* set system call number */
		/* set up parameters */
		/* make the call */
		/* retrieve return value in a C variable */
	);
	if (ret < 0 && ret >= -4095) {	/* Linux returns -errno in [-4095, -1] */
		my_errno = -ret;
		return -1;
	}
	return ret;
}
I don't remember the exact details of GCC inline assembly, and the system call invocation details vary depending on the platform.
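For what it's worth, a concrete x86-64 version of the skeleton above might look like this (my_errno here is a hypothetical stand-in for wherever you decide to keep the per-task error value):

#include <stddef.h>
#include <sys/types.h>

static long my_errno;	/* stand-in for a per-task errno slot */

ssize_t my_write(int fd, const void *buf, size_t len)
{
	long ret;

	__asm__ volatile (
		"syscall"
		: "=a"(ret)			/* return value in rax */
		: "a"(1),			/* __NR_write == 1 on x86-64 */
		  "D"((long)fd),		/* arg 1 in rdi */
		  "S"(buf),			/* arg 2 in rsi */
		  "d"(len)			/* arg 3 in rdx */
		: "rcx", "r11", "memory");	/* syscall clobbers rcx, r11 */

	if (ret < 0 && ret >= -4095) {		/* kernel returns -errno */
		my_errno = -ret;
		return -1;
	}
	return ret;
}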
Personally, I'd just implement a very small subset of libc, which would just consist of a little assembler and a few constants. This is remarkably simple with so much reference code available out there, although it may be overambitious.
If errno is a thread-local variable, will clone() copy it into the new process's address space? I overrode the errno_location() function around 2001 to use an errno based on the pid.
http://tamtrajnana.blogspot.com/2012/03/thread-safety-of-errno-variable.html
Since errno is now defined as __thread int errno; (see the comment above), this explains how __thread types are handled: Linux's thread local storage implementation.

Is there a way to check whether the processor cache has been flushed recently?

On i386 Linux. Preferably in C/(C/POSIX std libs)/proc if possible. If not, is there any piece of assembly or third-party library that can do this?
Edit: I'm trying to develop a test of whether a kernel module clears a cache line or the whole processor cache (with wbinvd()). The program runs as root, but I'd prefer to stay in user space if possible.
Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
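As a hedged sketch of the performance-counter approach (using perf_event_open(2), which has no glibc wrapper; whether this particular cache event is supported varies by CPU and kernel):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static volatile int probe;	/* the location whose caching we observe */

int main(void)
{
	struct perf_event_attr attr;
	long long misses;
	int fd, tmp;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HW_CACHE;
	attr.config = PERF_COUNT_HW_CACHE_LL |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	tmp = probe;			/* the memory access being measured */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	(void)tmp;

	if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
		printf("LL-cache read misses: %lld\n", misses);
	return 0;
}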
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>

static inline void
clflush(volatile void *p)
{
	asm volatile ("clflush (%0)" :: "r"(p));
}

static inline uint64_t
rdtsc(void)
{
	unsigned long a, d;

	asm volatile ("rdtsc" : "=a" (a), "=d" (d));
	return a | ((uint64_t)d << 32);
}

volatile int i;

static inline void
test(void)
{
	uint64_t start, end;
	volatile int j;

	start = rdtsc();
	j = i;
	end = rdtsc();
	printf("took %lu ticks\n", end - start);
}

int
main(int ac, char **av)
{
	test();
	test();
	printf("flush: ");
	clflush(&i);
	test();
	test();
	return 0;
}
I don't know of any generic command to get the cache state, but there are ways:
1. I guess this is the easiest: if you have your kernel module, just disassemble it and look for cache invalidation/flushing instructions (three come to mind: WBINVD, CLFLUSH, INVD).
2. You said it is for i386, but I guess you don't mean an actual 80386. The problem is that there are many different models with different extensions and features. E.g. the newest Intel series includes performance/profiling registers for the cache system, which you can use to evaluate cache misses/hits/number of transfers and similar.
3. Similar to 2, and very dependent on the system you have: when you have a multiprocessor configuration, you could watch the cache coherence protocol traffic (MESI) of the first CPU from the second.
You mentioned WBINVD; AFAIK that always flushes the complete cache, i.e. all cache lines.
It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will probably be affected by your merely asking about it - yes, Heisenberg was way before his time :-)