What is the correct approach for setting up a linker script when implementing a multithreading program on a microcontroller?
Do you define the task stack pointer in its own RAM section and then declare multiple sections, one for each task? Or is there a better way?
There is no "correct" approach.
One common method is to declare the stack for each task (and the control blocks for semaphores etc) statically in C, so they go in the .data or .bss section.
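As a minimal sketch of that first approach, assuming FreeRTOS with configSUPPORT_STATIC_ALLOCATION set to 1 (the task name, stack depth, and priority below are placeholders, not anything from the question):

```c
/* Sketch: statically allocated task stack and TCB. The arrays land in
 * .bss like any other static data, so no linker-script changes needed. */
#include "FreeRTOS.h"
#include "task.h"

#define BLINK_STACK_DEPTH 256              /* in words, not bytes */

static StackType_t  blinkStack[BLINK_STACK_DEPTH];
static StaticTask_t blinkTcb;              /* task control block */

static void vBlinkTask(void *pvParameters) /* hypothetical task body */
{
    (void)pvParameters;
    for (;;) {
        /* toggle an LED here, then block */
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}

void createTasks(void)
{
    xTaskCreateStatic(vBlinkTask, "blink", BLINK_STACK_DEPTH,
                      NULL, tskIDLE_PRIORITY + 1, blinkStack, &blinkTcb);
}
```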
Another approach, more suited to larger microcontrollers, is to use heap allocation. This might seem error-prone, but it is widely used. As long as all tasks and other resources are created at the start of execution, their placement will be close to deterministic, and there is little probability of allocation failure as long as the system has sufficient memory in total.
It would be a lot of work, and quite unusual, to specify the details of each task in the linker script. It might be marginally more efficient (e.g. no unnecessary zeroing of stack space just because it is in .bss), but it would be a lot harder to maintain.
If your system has multiple RAM banks with different speeds you might want to put some things in the faster and some in the slower RAM, but I wouldn't recommend more than that.
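If you do go that far, a sketch of how it usually looks with GCC/Clang (".fast_ram" is a placeholder; it must match an output section that your own linker script maps to the faster RAM bank):

```c
#include <stdint.h>

/* Pin one object to a particular RAM bank by naming a section.
 * The section name must correspond to an output section defined
 * in your linker script. */
static uint32_t dsp_task_stack[512] __attribute__((section(".fast_ram")));
```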
According to this paper (https://doi.org/10.1109/SP.2013.13), memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, costing billions of dollars and enormous effort to fix.
But the root of C/C++'s memory vulnerabilities can be traced down to the ISA level. At the ISA level, every instruction can access any memory address without any fine-grained safety check (only coarse-grained checks like page faults). Sure, we can implement memory safety at a higher software level, as Java does with the JVM, but that comes at a significant performance cost. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is: why can't we implement the safety at the hardware level? If the CPU had a safe ISA that ensured memory safety by, I don't know, taking over the responsibilities of malloc and free, then maybe we could get rid of the performance penalty of software safety checking. Can anyone with a background in microelectronics tell me whether this idea is realistic?
Depending on what you mean, it could make it impossible to implement memory-unsafe languages like C in a normal way. e.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project, which is an attempt at this idea: it aims to "revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory-safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLB as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment, the way pages do, which makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (e.g. a page is 4k in size, always starting at a 4k alignment boundary.)
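To make that power-of-2 point concrete, here's a toy C sketch (just an illustration of the comparison logic, not how real hardware is described) of matching a naturally aligned page vs. checking an arbitrary object range:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12u                      /* 4 KiB pages */

/* A naturally aligned, power-of-2 page: the tag is just the high bits,
 * so a TLB entry needs only one mask-and-compare. */
static inline bool page_match(uint64_t vaddr, uint64_t tlb_tag)
{
    return (vaddr >> PAGE_SHIFT) == tlb_tag;
}

/* An arbitrary [base, base+size) object: needs a subtraction and a
 * full-width comparison per lookup. */
static inline bool object_match(uint64_t vaddr, uint64_t base, uint64_t size)
{
    return vaddr - base < size;
}
```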
You mean it may cost too much at the hardware level, like die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency. And that's even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build into a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory; it doesn't make sense to try to build that into a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
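Purely to illustrate what such an instruction would fold into the load itself, here's the software-only equivalent you'd write in C today: an explicit compare-and-branch around every protected access (checked_load is a made-up helper, not an existing API):

```c
#include <stddef.h>
#include <stdlib.h>

/* Today the compiler must emit this check as separate instructions;
 * the hypothetical ISA extension would fault when idx >= len as part
 * of the load itself. */
static inline int checked_load(const int *base, size_t idx, size_t len)
{
    if (idx >= len)
        abort();                /* out-of-bounds access detected */
    return base[idx];
}
```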
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure-software checking. Intel even dropped it from their recent CPUs (not present in 10th-gen CPUs using 10 nm lithography, or later).
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look at "Code for malloc and free" on SO. Those functions are very, very far away from even being something you could define within an instruction set.
I'm studying operating system theory, and I know that heap allocation involves a specific syscall and that allocators usually optimize for this by requesting more than is needed beforehand.
But I can't find information about stack allocation. What about it? Does it involve a specific syscall every time you read from or write to the stack (for example, when you call a function with some parameters)? Or is there some other mechanism that doesn't involve a syscall?
Typically when the OS starts your program it examines the executable file's headers and arranges various areas for various things (an area for your executable's code, an area for your executable's data, etc.). This includes setting up an initial stack (and a lot more - e.g. finding shared libraries and doing dynamic linking).
After the OS has done all this, your executable starts executing. At this point you already have memory for a stack and can just use it without any system calls.
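A minimal sketch (Linux/POSIX assumed): ordinary local variables use the stack the OS already mapped before main() ran, with no syscall involved; getrlimit() merely asks the kernel what stack-size limit it configured for the process.

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    char buffer[4096];                 /* plain stack allocation, no syscall */
    buffer[0] = 'x';

    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        printf("stack size limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}
```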
Note 1: If you create threads, then there will probably be a system call involved to create the thread and that system call will probably allocate memory for the new thread's stack.
Note 2: Typically there's "virtual memory" (what your program sees) and "physical memory" (what the hardware sees); and in between typically the OS does lots of tricks to improve performance and avoid wasting physical memory, and to hide resource limits (so you don't have to worry so much about running out of physical memory). One of these tricks is to allocate virtual memory (e.g. for a large stack) without allocating any actual physical memory, and then allocate the physical memory if/when the virtual memory is first modified. Other tricks include various "swap space" schemes, and memory mapped files. These tricks rely on requests generated by the CPU on your program's behalf (e.g. page fault exceptions) which aren't system calls, but have similar ("ask kernel to do something") characteristics.
Note 3: All of the above depends on which OS. Different operating systems do things differently. I've chosen words carefully - e.g. "Typically" means that most modern operating systems work like I've described (but "typically" does not imply that all possible operating systems work like that; and some operating systems do not work like I've described).
No, the stack is normal memory. From the process's point of view there is no difference (hence the nasty class of bugs where you return a pointer to data on the stack, but the stack has since changed).
As Brendan wrote, the OS will set up the stack for the process at program load time. But if you access a not-yet-allocated page of stack (e.g. because your stack is growing), the kernel may automatically allocate a new stack page for you. (This is not much different from when you try to allocate new memory on the heap and there is no more memory available in your address space, except that in the heap case you explicitly do a syscall to tell the kernel you want more memory.)
You will notice that usually the stack grows in one direction and the heap (allocated memory) in the other, typically toward each other. So if your program needs more stack there is room for it, and if it does not need much stack, that memory can instead hold, e.g., a huge array on the heap. Or the contrary: if you do a lot of recursion, you use a lot of stack (but you probably need less heap memory).
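A rough way to observe the two regions from a program (the exact layout is platform-specific and randomized by ASLR, so treat the output purely as an illustration):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int on_stack;                              /* lives on the stack */
    int *on_heap = malloc(sizeof *on_heap);    /* lives on the heap */

    printf("stack variable at %p\n", (void *)&on_stack);
    printf("heap block at     %p\n", (void *)on_heap);

    free(on_heap);
    return 0;
}
```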
Two additional considerations: the CPU may have special stack instructions, but you can see them as syntactic sugar (you can simulate PUSH and POP with MOV, and CALL and RET with JMP plus the simulated PUSH and POP).
And the kernel may use a special stack for its own purposes (especially important for interrupts).
In processors, why can't we simply increase the number of registers instead of having a huge reorder buffer and mapping the register for resolving name dependencies?
Lots of reasons.
First, we are often designing micro-architectures to execute programs for an existing architecture. Adding registers would change the architecture: at best, existing binaries would not benefit from the new registers; at worst, they won't run at all without some kind of JIT compilation.
Second, there is the problem of encoding. Adding new registers means increasing the number of bits dedicated to encoding the registers, probably increasing the instruction size, with effects on the cache and elsewhere.
Third, there is the issue of the size of the visible state. Context switching would have to save all the visible registers, taking more time and more space (and thus putting pressure on the cache, thus taking more time again).
Fourth, dynamic renaming can be applied in places where static renaming and register allocation are impossible, or at least hard to do; and even when they are possible, they take more instructions, thus increasing cache pressure.
In conclusion, there is a sweet spot, usually considered to be 16 or 32 registers for the integer/general-purpose case. For floating-point and vector registers, there are arguments for having more (ISTR that Fujitsu at one time used 128 or 256 floating-point registers for its own extended SPARC).
Related question on electronics.se.
An additional note: the Mill architecture takes another approach to statically scheduled processors and avoids some of the drawbacks, apparently changing the trade-off. But AFAIK it is not yet known whether there will ever be available silicon for it.
Because static scheduling at compile time is hard (software pipelining) and inflexible to variable timings like cache misses. Having the CPU able to find and exploit ILP (Instruction Level Parallelism) in more cases is very useful for hiding latency of cache misses and FP or integer math.
Also, there are instruction-encoding considerations. For example, Haswell's 168-entry integer register file would need about 8 bits per operand to encode if we had that many architectural registers, vs. 3 or 4 bits for actual x86 machine code.
Related:
http://www.lighterra.com/papers/modernmicroprocessors/ great intro to CPU design and how smarter CPUs can find more ILP
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths shows how OoO exec can overlap exec of two dependency chains, unless you block it.
http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ has some specific examples of how much OoO exec can do to hide cache-miss or other latency
this Q&A about how superscalar execution works.
Register identifier encoding space would be a problem. Indeed, having many more registers has been tried. For example, SPARC has register windows: 72 to 640 registers, of which 32 are visible at one time.
Instead, as Computer Organization and Design: RISC-V Edition puts it:
Smaller is faster. The desire for speed is the reason that RISC-V has 32 registers rather than many more.
BTW, ROB size has to do with the processor being out-of-order and superscalar, rather than with renaming and providing lots of general-purpose registers.
We just ran out of semaphores on our Linux box, due to the use of too many Websphere Message Broker instances or somesuch.
A colleague and I got to wondering why this is even limited - it's just a bit of memory, right?
I thoroughly googled and found nothing.
Anyone know why this is?
cheers
Semaphores, when being used, require frequent access with very, very low overhead.
Having an expandable system where memory for each newly requested semaphore structure is allocated on the fly would add complexity that slows down access: the kernel would first have to look up where the particular semaphore is stored, then fetch that memory and check the value. It is easier and faster to keep them in one compact block of fixed memory that is readily at hand.
Having them dispersed throughout memory via dynamic allocation would also make it more difficult to efficiently use memory pages that are locked (that is, not subject to being swapped out when there are high demands on memory). The use of "locked in" memory pages for kernel data is especially important for time-sensitive and/or critical kernel functions.
Having the limit be a tunable parameter (see links in the comments of original question) allows it to be increased at runtime if needed via an "expensive" reallocation and relocation of the block. But typically this is done one time at system initialization before anything much is even using semaphores.
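For what it's worth, on Linux the System V semaphore limits are visible from user space in /proc/sys/kernel/sem (four values in the order SEMMSL SEMMNS SEMOPM SEMMNI, raised by an administrator via sysctl). A small C sketch that just reads them:

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/sem", "r");
    if (!f) { perror("fopen"); return 1; }

    unsigned long semmsl, semmns, semopm, semmni;
    if (fscanf(f, "%lu %lu %lu %lu", &semmsl, &semmns, &semopm, &semmni) == 4)
        printf("SEMMSL=%lu SEMMNS=%lu SEMOPM=%lu SEMMNI=%lu\n",
               semmsl, semmns, semopm, semmni);
    fclose(f);
    return 0;
}
```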
That said, the amount of memory used by a semaphore set is rather tiny. With modern systems having many gigabytes of memory, the original default limits on the number of them might seem a bit stingy. But keep in mind that on many systems semaphores are rarely used by user-space processes, and the Linux kernel finds its way into lots of small embedded systems with rather limited memory, so setting the default limit arbitrarily high just in case seems wasteful.
The few software packages that do depend on having many semaphores available, such as the Oracle database, typically recommend increasing the system limits as part of their installation and/or system-tuning advice.
Note: this question relates to stack overflows (think infinite recursion), NOT buffer overflows.
If I write a program that is correct, but it accepts an input from the Internet that determines the level of recursion in a recursive function that it calls, is that potentially sufficient to allow someone to compromise the machine?
I know someone might be able to crash the process by causing a stack overflow, but could they inject code? Or does the c runtime detect the stack overflow condition and abort cleanly?
Just curious...
Rapid Refresher
First off, you need to understand that the fundamental units of protection in modern OSes are the process and the memory page. Processes are memory protection domains; they are the level at which an OS enforces security policy, and they thus correspond strongly with a running program. (Where they don't, it's either because the program is running in multiple processes or because the program is being shared in some kind of framework; the latter case has the potential to be “security-interesting” but that's 'nother story.) Virtual memory pages are the level at which the hardware applies security rules; every page in a process's memory has attributes that determine what the process can do with the page: whether it can read the page, whether it can write to it, and whether it can execute program code on it (though the third attribute is rather more rarely used than perhaps it should be). Compiled program code is mapped into memory into pages that are both readable and executable, but not writable, whereas the stack should be readable and writable, but not executable. Most memory pages are not readable, writable or executable at all; the OS only lets a process use as many pages as it explicitly asks for, and that's what memory allocation libraries (malloc() et al.) manage for you.
Analysis
Provided each stack frame is smaller than a memory page[1] so that, as the program advances through the stack, it writes to each page, the OS (i.e., the privileged part of the runtime) can at least in principle detect stack overflows reliably and terminate the program if that occurs. Basically, all that happens to do this detection is that there is a page that the program cannot write to at the end of the stack; if the program tries to write to it, the memory management hardware traps it and the OS gets a chance to intervene.
The potential problems with this come if the OS can be tricked into not setting such a page or if the stack frames can become so large and sparsely written to that the guard page is jumped over. (Keeping more guard pages would help prevent the second case with little cost; forcing variable-sized stack allocations – e.g., alloca() – to always write to the space they allocate before returning control to the program, and so detect a smashed stack, would prevent the first with some cost in terms of speed, though the writes could be reasonably sparse to keep the cost fairly small.)
Consequences
What are the consequences of this? Well, the OS has to do the right thing with memory management. (Michael's link illustrates what can happen when it gets that wrong.) But also it is dangerous to let an attacker determine memory allocation sizes where you don't force a write to the whole allocation immediately; alloca and C99 variable-length arrays are a particular threat. Moreover, I would be more suspicious of C++ code as that tends to do a lot more stack-based memory allocation; it might be OK, but there's a greater potential for things to go wrong.
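A sketch of the risky pattern (the function names are made up, just to illustrate): an attacker-sized VLA can make the stack frame large enough to jump over the guard page before anything is written, whereas a heap allocation of the same size fails detectably.

```c
#include <stdlib.h>
#include <string.h>

/* If `len` comes from the network, this VLA can be large enough to skip
 * right over the guard page before anything is written to it. */
void risky(size_t len, const char *data)
{
    char buf[len];                  /* C99 VLA: attacker-sized stack frame */
    memcpy(buf, data, len);
    (void)buf;
}

/* The same size on the heap: failure is detectable (malloc returns NULL). */
void safer(size_t len, const char *data)
{
    char *buf = malloc(len);
    if (!buf)
        return;
    memcpy(buf, data, len);
    free(buf);
}

int main(void)
{
    const char msg[] = "hello";
    safer(sizeof msg, msg);     /* risky() is shown only as the anti-pattern */
    return 0;
}
```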
Personally, I prefer to keep stack sizes and stack-frame sizes small anyway and do all variable-sized allocations on the heap. In part, this is a legacy of working on some types of embedded system and with code which uses very large numbers of threads, but it does make protecting against stack overflow attacks much simpler; the OS can reliably trap them and all the attacker has then is a denial-of-service (annoying, but rarely fatal). I don't know whether this is a solution for all programmers.
[1] Page sizes vary: 4 kB is typical, and some 64-bit platforms use larger pages (e.g. 16 kB or 64 kB). Check your system documentation for what it is in your environment.
Most systems (like Windows) exit when the stack is overflowed. I don't think you are likely to see a security issue here. At least, not an elevation of privilege security issue. You could get some denial of service issues.
There is no universally correct answer... on some systems the stack might grow down or up to overwrite other memory that the program's using (or another program, or the OS), but on any well-designed, vaguely security-conscious OS (Linux, any common UNIX variant, even Windows) there will be no rights-escalation potential. On some systems, with stack size checks disabled, the address space might approach or exceed the free virtual memory size, allowing memory exhaustion to negatively affect or even bring down the entire machine rather than just the process, but on good OSes by default there's a limit on that (e.g. Linux's limit / ulimit commands).
Worth mentioning that it's typically pretty easy to use a counter to put an arbitrary but generous limit on recursion depth too: you can use a static local variable if single-threaded, or an extra parameter (conveniently defaulted to 0 if your language allows it; otherwise have the outer caller provide 0 the first time), as in the sketch below.
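A minimal C sketch of the depth-counter idea (walk and MAX_DEPTH are made-up names; the limit is arbitrary but generous, as described above):

```c
#include <stdio.h>

#define MAX_DEPTH 10000   /* arbitrary but generous recursion limit */

/* The caller passes 0; each recursive call increments the counter and
 * bails out cleanly once the limit is reached. */
static int walk(int node, int depth)
{
    if (depth > MAX_DEPTH) {
        fprintf(stderr, "recursion limit exceeded\n");
        return -1;
    }
    /* ... real work on `node` would go here ... */
    if (node <= 0)
        return 0;
    return walk(node - 1, depth + 1);   /* recurse with depth + 1 */
}

int main(void)
{
    return walk(42, 0) == 0 ? 0 : 1;
}
```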
Yes, it is. There are techniques to avoid recursion. For example, in the case of arithmetic expressions, reverse Polish notation enables you to avoid recursion: the main idea is to transform the original expression so it can be evaluated iteratively with an explicit stack. There may be a similar algorithm that can help in your case as well.
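For illustration, a toy RPN evaluator in C that uses an explicit, fixed-size stack instead of recursion, so the input cannot drive the call depth (single-digit operands and + - * only, purely to keep the sketch short):

```c
#include <stdio.h>
#include <ctype.h>

#define STACK_MAX 64

/* Evaluate a reverse Polish expression with a bounded explicit stack.
 * Returns 0 on success and stores the value in *result; -1 on error. */
static int eval_rpn(const char *expr, long *result)
{
    long stack[STACK_MAX];
    int top = 0;

    for (const char *p = expr; *p; p++) {
        if (isspace((unsigned char)*p))
            continue;
        if (isdigit((unsigned char)*p)) {
            if (top == STACK_MAX) return -1;   /* bounded, not unbounded */
            stack[top++] = *p - '0';
        } else {
            if (top < 2) return -1;
            long b = stack[--top], a = stack[--top];
            switch (*p) {
            case '+': stack[top++] = a + b; break;
            case '-': stack[top++] = a - b; break;
            case '*': stack[top++] = a * b; break;
            default:  return -1;
            }
        }
    }
    if (top != 1) return -1;
    *result = stack[0];
    return 0;
}

int main(void)
{
    long r;
    if (eval_rpn("3 4 + 2 *", &r) == 0)
        printf("result = %ld\n", r);   /* prints 14 */
    return 0;
}
```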
One other problem with stack overflow is that if error handling is not appropriate, it can cause just about anything. For example, in Java StackOverflowError is an Error, and it gets caught if someone catches Throwable, which is a common mistake. So error handling is a key question in the case of stack overflow.
Yes, it is. Availability is an important aspect of security that is mostly overlooked.
Don't fall into that pit.
edit
As an example of poorly understood security-consciousness in modern OSes, take a look at a relatively newly discovered vulnerability that nobody has yet patched completely. There are countless other examples of privilege-escalation vulnerabilities that OS developers have written off as denial-of-service attacks.