How to check if a timer handler thread is running in POSIX - linux

We are developing a kernel driver and corresponding test cases (in user land), and we use timers in our code. malloc is mostly unavailable to us. Timers are set up with SIGEV_THREAD, so new threads are created to run the handlers.
According to the instructions here and here, it is hard to implement a general clean-up system, so I am trying to define a framework with coding rules to deal with this.
For this approach, I need to count the number of running handlers of a specific timer. In the kernel I can use try_to_del_timer_sync() and re-add the timer to achieve this, but I cannot find an equivalent in user land, especially in POSIX.
Linux-specific methods are also welcome.
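POSIX itself doesn't expose a count of running SIGEV_THREAD handler instances, but since you control the handler body, one workable convention is to count them yourself. A minimal sketch, assuming a wrapper handler is acceptable (counting_handler, real_handler and wait_for_handlers are my names, not POSIX APIs); it avoids malloc entirely:

    #include <pthread.h>
    #include <signal.h>

    static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  g_idle = PTHREAD_COND_INITIALIZER;
    static int g_running;                /* handlers currently executing */

    static void real_handler(union sigval sv)
    {
        (void)sv;                        /* your actual timer work goes here */
    }

    /* Install with: sev.sigev_notify = SIGEV_THREAD;
     *               sev.sigev_notify_function = counting_handler; */
    static void counting_handler(union sigval sv)
    {
        pthread_mutex_lock(&g_lock);
        g_running++;
        pthread_mutex_unlock(&g_lock);

        real_handler(sv);

        pthread_mutex_lock(&g_lock);
        if (--g_running == 0)
            pthread_cond_broadcast(&g_idle);
        pthread_mutex_unlock(&g_lock);
    }

    /* Block until no handler instance is running (e.g. after timer_delete,
     * which does not wait for in-flight SIGEV_THREAD handlers). */
    static void wait_for_handlers(void)
    {
        pthread_mutex_lock(&g_lock);
        while (g_running > 0)
            pthread_cond_wait(&g_idle, &g_lock);
        pthread_mutex_unlock(&g_lock);
    }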

Related

How does the Linux kernel realize reentrancy?

All Unix kernels are reentrant: several processes may be executing in kernel mode at the same time. How can I realize this effect in code? How should I handle the situation in which many processes invoke system calls and are pending in kernel mode?
[Edit - the term "reentrant" gets used in a couple of different senses. This answer uses the basic "multiple contexts can be executing the same code at the same time." This usually applies to a single routine, but can be extended to apply to a set of cooperating routines, generally routines which share data. An extreme case of this is when applied to a complete program - a web server, or an operating system. A web-server might be considered non-reentrant if it could only deal with one client at a time. (Ugh!) An operating system kernel might be called non-reentrant if only one process/thread/processor could be executing kernel code at a time.
Operating systems like that occurred during the transition to multi-processor systems. Many went through a slow transition from written-for-uniprocessors to one-single-lock-protects-everything (i.e. non-reentrant) through various stages of finer and finer grained locking. IIRC, linux finally got rid of the "big kernel lock" at approx. version 2.6.37 - but it was mostly gone long before that, just protecting remnants not yet converted to a multiprocessing implementation.
The rest of this answer is written in terms of individual routines, rather than complete programs.]
If you are in user space, you don't need to do anything. You call whatever system calls you want, and the right thing happens.
So I'm going to presume you are asking about code in the kernel.
Conceptually, it's fairly simple. It's also pretty much identical to what happens in a multi-threaded program in user space, when multiple threads call the same subroutine. (Let's assume it's a C program - other languages may have differently named mechanisms.)
When the system call implementation is using automatic (stack) variables, it has its own copy - no problem with re-entrancy. When it needs to use global data, it generally needs to use some kind of locking - the specific locking required depends on the specific data it's using, and what it's doing with that data.
This is all pretty generic, so perhaps an example might help.
Let's say the system call want to modify some attribute of a process. The process is represented by a struct task_struct which is a member of various linked lists. Those linked lists are protected by the tasklist_lock. Your system call gets the tasklist_lock, finds the right process, possibly gets a per-process lock controlling the field it cares about, modifies the field, and drops both locks.
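As a rough illustration of that locking pattern, here is a user-space analogue in C (the struct and all names are mine, not kernel code): an rwlock plays the role of tasklist_lock, and a per-entry mutex plays the per-process lock.

    #include <pthread.h>

    struct task {
        int              pid;
        int              attr;       /* the field our "syscall" modifies */
        pthread_mutex_t  attr_lock;  /* per-entry lock */
        struct task     *next;
    };

    static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;
    static struct task *task_list;

    int set_attr(int pid, int value)
    {
        int found = 0;
        pthread_rwlock_rdlock(&list_lock);       /* find under read lock */
        for (struct task *t = task_list; t; t = t->next) {
            if (t->pid == pid) {
                pthread_mutex_lock(&t->attr_lock);   /* lock the field */
                t->attr = value;
                pthread_mutex_unlock(&t->attr_lock);
                found = 1;
                break;
            }
        }
        pthread_rwlock_unlock(&list_lock);
        return found;
    }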
One more detail worth mentioning: the case of processes executing different system calls, which don't share data with each other. With a reasonable implementation, there are no conflicts at all. One process can get itself into the kernel to handle its system call without affecting the other processes. I don't remember looking specifically at the Linux implementation, but I imagine it's "reasonable". Something like a trap into an exception handler, which looks in a table to find the subroutine to handle the specific system call requested. The table is effectively const, so no locks are required.
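To picture that last point, here is a hedged sketch of a const dispatch table in C (illustrative names only, not the kernel's actual table): since the table is never written after initialization, concurrent lookups need no locking.

    typedef long (*syscall_fn)(long, long, long);

    /* hypothetical handlers standing in for real syscall implementations */
    static long sys_read_impl (long a, long b, long c) { (void)a; (void)b; (void)c; return 0; }
    static long sys_write_impl(long a, long b, long c) { (void)a; (void)b; (void)c; return 0; }

    /* Effectively const: safe to read from any context without a lock. */
    static const syscall_fn sys_call_table[] = {
        sys_read_impl,
        sys_write_impl,
    };

    long dispatch(long nr, long a1, long a2, long a3)
    {
        return sys_call_table[nr](a1, a2, a3);
    }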

How do you detect whether the calling thread of a function is already RTAI real-time?

I am working on a big project that uses RTAI both in kernel and user spaces. I won't get into the details of the project, but here is briefly where a problem arises.
In user-space, my project provides a library used by other people to write some software. Those programs themselves may have RTAI real-time threads.
Now, some functions in RTAI require that their calling thread has already called rt_thread_init, so if I want to use them in a library function, I need to temporarily make the calling thread real-time by calling rt_thread_init and later rt_task_delete.
Now here's the problem:
If the calling thread of my function IS already real-time, then I call rt_thread_init, which I assume simply fails, but then I call rt_task_delete and make that thread non-real-time (besides the fact that when the thread itself, assuming I changed nothing, later calls rt_task_delete again, RTAI crashes).
If the calling thread of my function IS not real-time, everything is ok.
For now, I resorted to taking a parameter in the function so that the calling function tells the library whether it is real-time or not. However, I wanted to know if RTAI has a function or something I could use to automatically detect whether the current thread is real-time or not.
Don't know if there are any RTAI users here (I certainly didn't see an RTAI tag), but I hope there are.
Never tried it myself, so this is a guess - but did you consider using rt_whoami?
Get the task pointer of the current task.
https://www.rtai.org/documentation/magma/html/api/api_8c.html#a12
I would imagine it will fail (return NULL?) if you are in a non-RT task...
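If that guess holds, a detection helper could be as small as this sketch (the header name and the NULL-return behavior are assumptions to verify against your RTAI version):

    #include <rtai_lxrt.h>   /* header name varies by RTAI version/setup */

    /* Returns nonzero if the calling thread is already an RTAI task.
     * Based on the guess above that rt_whoami() returns NULL otherwise;
     * check this against your RTAI version before relying on it. */
    static int is_rt_thread(void)
    {
        return rt_whoami() != NULL;
    }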

Replacing system calls (syscalls) in Linux 2.6+

I'm looking into writing a userland threading library, since there seems to be no active work in this area, and I believe the C++0x promises and futures may give this model some power. Unfortunately, in order to make this model work, it is essential to ensure a context switch on blocking calls. As such, I would like to intercept every syscall in order to replace it with an asynchronous version. There are some caveats:
I know there are asynchronous syscalls for just about every regular syscall, but for backwards compatibility reasons this is not a viable solution.
I know that in Linux 2.4 or earlier it was possible to directly change the sys_call_table, but this has vanished.
As I would like my library to be statically linked if desired, the LD_PRELOAD trick isn't viable.
Similarly, kernel modules are not an option because this is supposed to be a userland library.
Finally, ptrace() is also not an option for similar reasons. I can't have my library forking a new process just in order to be used.
Is this possible?
I'm looking into writing a userland threading library, since there seems to be no active work in this area
You might want to take a look at the thread libraries Marcel (and its publications) and MPC, which implement hybrid (kernel and user-level) threads, mainly for High-Performance Computing, so they had to find a solution for blocking system calls.
So as to avoid the blocking of kernel threads when the application makes blocking system calls, Marcel uses Scheduler Activations when they are available, or just intercepts such blocking calls at dynamic symbols level.

Multithreading in Lua

I was having a discussion with my friend the other day. I was saying that, in pure Lua, you couldn't build a preemptive multitasking system. He claims you can, because of the following reasoning:
Neither C nor Lua has an inbuilt threading library [OP's note: well, Lua technically does, but AFAIK it's not useful for our purposes]. Windows, which is written mostly in C(++), has pre-emptive multitasking, which they built from scratch. Therefore, you should be able to do the same in Lua.
The big problem I see with that is that the main way preemptive multitasking works (to my knowledge) is that it generates regular interrupts, which the manager uses to take control and determine what code it should be working on next. I also don't think Lua has any facility that can do that.
My question is: is it possible to write a pure-Lua library that allows people to have pre-emptive multitasking?
I can't see how to do it, although without a formal semantics of Lua (like the semantics of yield for example), it's really hard to come up with an ironclad argument why it can't be done. (I've been wanting a formal semantics for ages, but evidently Roberto and lhf have better things to do.)
If I wanted pre-emptive multitasking for Lua, I wouldn't even try to do it in pure Lua. Instead I'd use an old trick I first saw 20 years ago in Standard ML of New Jersey:
Interrupt sets a flag in the lua_State saying "current coroutine has been preempted".
Alter the VM so that on every loop and every function call, it checks the flag and yields if necessary.
This patch would be easy to write and easy to maintain. It doesn't solve the problem of the long-running C function that can't be pre-empted, but if you have to solve that problem, you are wandering into much harder territory, and you may as well do all your threading at the C level, not the Lua level.
No. It's not possible to write a preemptive scheduler in pure Lua. At some point a preemptive scheduler needs some mechanism like an interrupt service routine to take control away from the current thread and give it to the scheduler which can then give it to another thread. Pure Lua doesn't have this mechanism.
You mention that Windows is written in mostly C/C++. The keyword is mostly. You can't write a preemptive scheduler in pure ANSI C/C++. Usually, part of the interrupt service routine is written in assembly language. Or, the C/C++ compiler implements a non-standard extension that allows interrupt service routines to be written in C/C++. Some compilers allow you to declare a function with an __interrupt modifier that causes the compiler to generate a prologue/epilogue that allows the function to be used as an interrupt service routine.
Also, the code that sets up the interrupt service routine fiddles with CPU registers, memory-mapped IO, or IO instructions. None of this code is portable ANSI C/C++, and it depends on the CPU architecture.
Not that I know of, no. It would almost be absurdly simple if you could yield from hooks set on coroutines with debug.sethook though, but it doesn't work. You can yield from C hooks set from C (lua_sethook), but I couldn't figure out exactly how to do that, and it's not pure Lua anyway.
Even if it were possible, it wouldn't be true threading. Everything would still run within the same operating system thread, for example. Your hook would take a variety of factors into account (such as time, perhaps memory, etc.) and then determine whether to yield. The yielded-to coroutine then would decide which child coroutine to run next. You'd also need to decide on when the hook should be called. Most frequent would be on every Lua instruction, but that carries a performance penalty. And if the coroutine calls into a C function, Lua has no jurisdiction. If that C call takes a long time, there's nothing you can do about it.
Here's a related thread from the Lua-L mailing list which you might find interesting.
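For what it's worth, here is a sketch of the lua_sethook approach mentioned above, written on the C side (so not pure Lua, and assuming Lua 5.2 or later, where yielding from a count hook is permitted):

    #include <lua.h>

    /* Count hook: fires every `count` VM instructions and yields the
     * running coroutine, giving a C-side scheduler the chance to
     * resume a different one. */
    static void preempt_hook(lua_State *L, lua_Debug *ar)
    {
        (void)ar;
        lua_yield(L, 0);   /* does not return; the coroutine suspends */
    }

    /* Arm a coroutine (must not be the main state) so that it is
     * preempted every 1000 VM instructions. */
    void make_preemptible(lua_State *co)
    {
        lua_sethook(co, preempt_hook, LUA_MASKCOUNT, 1000);
    }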

Is it possible to create threads without system calls in Linux x86 GAS assembly?

Whilst learning the "assembler language" (in Linux on an x86 architecture using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary, as your program runs in user space.
However, system calls are rather expensive in terms of performance, as they require a trap into the kernel, which means a switch must be made from your currently active program in user space to the system running in kernel space.
The point I want to make is this: I'm currently implementing a compiler (for a university project), and one of the extra features I wanted to add is support for multi-threaded code in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be generated automatically by the compiler itself, it is almost guaranteed that there will be really tiny bits of multi-threaded code in it as well. In order to gain a performance win, I must be sure that using threads actually achieves this.
My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...
My question is therefore twofold (with an extra bonus question underneath it):
Is it possible to write assembler code which can run multiple threads simultaneously on multiple cores at once, without the need for system calls?
Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), performance loss, or isn't it worth the effort at all?
My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficiently as possible?
The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.
User threads won't give you the threading support that you want, because they all run on the same hardware thread.
Edit: Incorporating Ira Baxter's answer with Parlanse. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.
Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".
The short answer is that you can do multithreaded programming without calling expensive OS task-management primitives. Simply ignore the OS for thread-scheduling operations. This means you have to write your own thread scheduler, and simply never pass control back to the OS. (And you have to be cleverer somehow about your thread overhead than the pretty smart OS guys.)
We chose this approach precisely because Windows process/thread/fiber calls were all too expensive to support computation grains of a few hundred instructions.
Our PARLANSE programming language is a parallel programming language: see http://www.semdesigns.com/Products/Parlanse/index.html
PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism construct, and schedules such grains by a combination of a highly tuned hand-written scheduler and scheduling code generated by the PARLANSE compiler that takes into account the context of a grain to minimize scheduling overhead. For instance, the compiler ensures that the registers of a grain contain no information at the point where scheduling (e.g., "wait") might be required, and thus the scheduler code only has to save the PC and SP. In fact, quite often the scheduler code doesn't get control at all; a forked grain simply stores the forking PC and SP, switches to a compiler-preallocated stack and jumps to the grain code. Completion of the grain will restart the forker. Normally there's an interlock to synchronize grains, implemented by the compiler using native LOCK DEC instructions that implement what amounts to counting semaphores. Applications can fork logically millions of grains; the scheduler stops parent grains from generating more work if the work queues are long enough so that more work won't be helpful. The scheduler implements work-stealing to allow work-starved CPUs to grab ready grains from neighboring CPU work queues. This has been implemented to handle up to 32 CPUs; but we're a bit worried that the x86 vendors may actually swamp us with more than that in the next few years!
PARLANSE is a mature language; we've been using it since 1997, and have implemented a several-million-line parallel application in it.
Implement user-mode threading.
Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-mode threads. Modern usage is 1:1, but it wasn't always like that and it doesn't have to be like that.
You are free to maintain an arbitrary number of user-mode threads in a single kernel thread. It's just that it's your responsibility to switch between them sufficiently often that it all looks concurrent. Your threads are of course co-operative rather than pre-emptive; you basically scatter yield() calls throughout your own code to ensure regular switching occurs.
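As a concrete (if obsolescent) illustration, POSIX's ucontext API can express exactly this N:1 scheme in C. A minimal sketch, with one worker sharing the kernel thread with main:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, thr_ctx;
    static char thr_stack[64 * 1024];

    static void worker(void)
    {
        puts("worker: part 1");
        swapcontext(&thr_ctx, &main_ctx);   /* our yield() */
        puts("worker: part 2");
    }                                        /* return goes to uc_link */

    int main(void)
    {
        getcontext(&thr_ctx);
        thr_ctx.uc_stack.ss_sp   = thr_stack;
        thr_ctx.uc_stack.ss_size = sizeof thr_stack;
        thr_ctx.uc_link          = &main_ctx;  /* where to go on exit */
        makecontext(&thr_ctx, worker, 0);

        swapcontext(&main_ctx, &thr_ctx);   /* run until worker yields */
        puts("main: between yields");
        swapcontext(&main_ctx, &thr_ctx);   /* resume worker to the end */
        return 0;
    }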
If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O-bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, but the one CPU that your real thread is running on will still be running at 100% either way.
System calls have gotten faster. Modern CPUs have support for the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.
Make sure that the automatically-generated multithreading has the threads run for long enough that you gain performance. Don't try to parallelize short pieces of code, you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict) -- if multiple threads are accessing independent data sets, they will run much faster than if they were accessing the same data repeatedly due to the cache coherency problem.
Quite a bit late now, but I was interested in this kind of topic myself.
In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.
Obligatory BLUF:
Q1: No. At least initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.
Q2: It depends. If you create/destroy threads that perform tiny operations then you're wasting resources (the thread-creation process would greatly exceed the time used by the thread before it exits). If you create N threads (where N is ~# of cores/hyper-threads on the system) and re-task them, then the answer COULD be yes, depending on your implementation.
Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP-chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP-chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack) where that stack is continuously prepended (or appended in the case where it rolls over to the beginning). In such a (weird!) model the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but is intriguing nonetheless.
Onto my 2-cents worth of content.
I recently created what effectively operates as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap that I subdivide into each thread's "private" area. Thus only a single syscall would be required (although guard pages between them would be smart; these would require additional syscalls).
This implementation uses only the base kernel thread created when the process spawns and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and such are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.
To implement this system (entirely in userspace and also via non-root access if desired) the following were required:
A notion of what threads boil down to:
A stack for stack operations (kinda self-explanatory and obvious)
A set of instructions to execute (also obvious)
A small block of memory to hold individual register contents
What a scheduler boils down to:
A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).
A thread context switcher:
A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.
So it is indeed possible (entirely in assembly, and without system calls other than the initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.
I only added this answer because you specifically mention x86 assembly and this answer was entirely derived via a self-contained program written entirely in x86 assembly that achieves the goals (minus multi-core capabilities) of minimizing system calls and also minimizes system-side thread overhead.
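For reference, the mmap/mprotect stack setup described above looks roughly like this in C (sizes and layout are illustrative; the guard page sits at the low end because x86 stacks grow down):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <stdint.h>

    #define PAGE  4096
    #define STACK (64 * 1024)

    /* One private anonymous mapping per "thread" stack, with a guard
     * page at the low end so overflow faults instead of silently
     * corrupting a neighboring region. Returns the initial SP. */
    void *alloc_thread_stack(void)
    {
        uint8_t *base = mmap(NULL, PAGE + STACK,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        mprotect(base, PAGE, PROT_NONE);   /* guard page */
        return base + PAGE + STACK;        /* top of the usable stack */
    }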
System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be an overhead when you create or destroy the threads. Once they are running, there are no system calls. User mode threads will not really help you, since they only run on one core.
First you should learn how to use threads in C (pthreads, POSIX threads). On GNU/Linux you will probably want to use POSIX threads or GLib threads.
Then you can simply call the C from assembly code.
Here are some pointers:
Posix threads: link text
A tutorial where you will learn how to call C functions from assembly: link text
Butenhof's book on POSIX threads link text
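A minimal pthreads starting point in C looks like this (compile with -pthread); once this works, calling pthread_create from assembly is just another C function call:

    #include <pthread.h>
    #include <stdio.h>

    /* Thread body: receives the last argument of pthread_create. */
    static void *worker(void *arg)
    {
        printf("hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)1L);
        pthread_join(t, NULL);   /* wait for the thread to finish */
        return 0;
    }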
