Proper handling of context data in libaio callbacks? - linux

I'm working with kernel-level async I/O (i.e. libaio.h). Prior to submitting a struct iocb using io_submit, I set the callback using io_set_callback, which stores a function pointer in iocb->data. Finally, I get the completed events using io_getevents and run each callback.
I'd like to be able to use some context information within the callback (e.g. a submission timestamp). The only way I can think of to do this is to keep using io_getevents, but have iocb->data point to a struct holding both the context and the callback.
Are there any other ways to do something like this, and is iocb->data guaranteed to be untouched when using io_getevents? My understanding is that libaio has another mechanism that runs callbacks automatically, which would be a problem if iocb->data weren't pointing to a function.
Any clarification here would be nice. The documentation on libaio seems to really be lacking.

One solution, which I would imagine is typical, is to "derive" from iocb, and then cast the pointer you get back from io_getevents() to your struct. Something like this:
struct my_iocb {
    struct iocb cb;   // keep this as the first member so the cast back is valid
    void* userdata;
    // ... anything else
};
When you issue your jobs, whether you do it one at a time or in a batch, you provide an array of pointers to iocb structs, which means they may point to my_iocb as well.
When you retrieve the notifications back from io_getevents(), you simply cast the io_event::obj pointer to your own type:
io_event events[512];
int num_events = io_getevents(ioctx, 1, 512, events, NULL);
for (int i = 0; i < num_events; ++i) {
    my_iocb* job = (my_iocb*)events[i].obj;
    // .. do stuff with job
}
If you don't want to block in io_getevents, but instead be notified via a file descriptor (so that you can block in select() or epoll(), which might be more convenient), I would recommend using the (undocumented) eventfd integration.
You can tie an iocb to an eventfd file descriptor with io_set_eventfd(iocb* cb, int fd). Whenever the job completes, it increments the eventfd counter by one.
Note, if you use this mechanism, it is very important to never read more jobs from the io context (with io_getevents()) than what the eventfd counter said there were, otherwise you introduce a race condition from when you read the eventfd counter and reap the jobs.
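The counter rule above can be sketched as follows. This is a minimal illustration, not libaio code: read_completion_budget is a hypothetical helper name, and it assumes the eventfd was created with EFD_NONBLOCK and without EFD_SEMAPHORE, so one read returns the whole count and resets it to zero.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: returns how many completions the eventfd has counted
   since the last read, 0 if none are pending, or -1 on error.
   Assumes a non-blocking, non-semaphore eventfd. */
long read_completion_budget(int efd)
{
    uint64_t count;
    ssize_t n = read(efd, &count, sizeof count);

    if (n == (ssize_t)sizeof count)
        return (long)count;          /* the read also resets the counter to 0 */
    if (n < 0 && errno == EAGAIN)
        return 0;                    /* nothing has completed yet */
    return -1;
}

/* Intended use with libaio (not compiled here, shown only as an outline):
 *
 *   long budget = read_completion_budget(efd);
 *   while (budget > 0) {
 *       long want = budget < 512 ? budget : 512;
 *       int got = io_getevents(ioctx, 1, want, events, NULL);
 *       if (got <= 0) break;
 *       budget -= got;     // never reap more than the counter reported
 *   }
 */
```

Capping the reap loop at the value read from the eventfd means a completion that arrives after the read is left for the next wakeup instead of being reaped now and then counted a second time.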


Spawning a new thread for each object load

I have a system which runs multiple service (long-lived) and worker (short-lived) threads. They all share a state which contains objects. Any thread can request an object at any time, through a singleton-of-sorts class called ObjectManager. If the object is not available, it needs to be loaded.
Here's some pseudo-code of how object loading looks now:
class ObjectManager {
    getLoadinData(path) {
        if (hasLoadingDataFor(path))
            return whatWeHave()
        else {
            loadingData = createNewLoadingData();
            loadingData.path = path;
            return loadingData;
        }
    }
    // loads object and blocks until it's loaded
    loadObjectSync(path) {
        loadingData = getLoadinData(path);
        return loadingData.loadedObject;
    }
    // initiates a load and calls a callback when done
    loadObjectAsync(path, callback) {
        loadingData = getLoadinData(path);
    }
    // dedicated loading thread
    loadingThread() {
        while (running) {
            loadingData = waitForLoadingData();
            object = readObjectFromDisk(loadingData.path);
            object.onLoaded(); // !!!!
            loadingData.object = object;
            // unblock cv waiters
            // call callbacks
        }
    }
}
The problem is the line object.onLoaded. I have no control over this function. Some objects might decide that they need other objects to be valid. So in their onLoaded method they might call loadObjectSync. Uh-oh! This (naturally) dead locks. It blocks the loading loop until the loading loop makes more iterations.
What I could do to solve this is leave the onLoaded call to the initiating threads. This will change loadObjectSync to something like:
loadObjectSync(path) {
    loadingData = getLoadinData(path);
    if (loadingData.wasCreatedInThisThread()) {
        // this thread initiated the load, so it calls onLoaded itself
    }
    else {
        // wait more
    }
    return loadingData.loadedObject;
}
... but then the problem is that if there are no calls to loadObjectSync and only calls to loadObjectAsync, or if the loadObjectAsync call was simply the first to create the loading data, there will be no one to finalize the object. So to make this work, I have to introduce another thread that finalizes objects whose loadingData was created by loadObjectAsync.
It seems that it would work. But I have a simpler idea! What if I change getLoadingData instead. What if it does this:
getLoadinData(path) {
    if (hasLoadingDataFor(path))
        return whatWeHave()
    else {
        loadingData = createNewLoadingData();
        loadingData.path = path;
        thread = spawnLoadingThread(loadingData);
        return loadingData;
    }
}
Spawn a new thread for every object load. Thus there is no dead lock. Every loading thread can safely block until it's done. The rest of the code remains exactly as it is.
This means potentially tens (or, in certain edge cases, thousands) of active threads waiting on condition variables. I know that spawning threads has its overhead, but I think it would be negligible compared to the cost of I/O from readObjectFromDisk.
So my question is: Is this terrible? Can this somehow backfire?
The target platform is conventional desktop machines. But this software is supposed to run for a long time without stopping: weeks, maybe months.
Alternatively... even though I have an idea how to solve this if the thread-per-load turns out to be terrible, can this be solved in another way?
Very interesting! This is a problem I have bumped into a couple of times, trying to add a synchronous interface to a fundamentally asynchronous operation (i.e. file load, or in my case, network write) that is performed by a service thread.
My own preference would be to not provide the synchronous interface. Why? Because it keeps the code simpler in design & implementation and easier to reason about -- always important for multi-threading.
The benefit of sticking to a single thread and an async-only interface is that you have only one service thread, so resource growth is not a concern. In addition, user callbacks are always invoked on that same thread, which simplifies thread-safety concerns for users of ObjectManager (with multiple callback threads, every user callback must be thread safe, so it's an important choice to make). However, sticking to an async-only interface does mean the user of ObjectManager has more work to do.
But if you do want to keep the synchronous interface, another approach that I have taken could work for you. You stick to a single service thread, but inside the implementation of loadObjectSync you check the thread ID to determine whether the caller is the service thread or some other thread. If it is another thread, you queue the request and safely block. If it is the service thread, you load the object immediately by calling a new function, say loadObjectImpl, which is basically just the body of the loadingThread loop. You will need to grab the thread ID of the service thread during initialization, store it in the ObjectManager instance, and use it for this check.
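Here is a rough sketch of that thread-ID check using pthreads. All names (load_object_sync, load_object_impl, object_manager_start and so on) are made up for illustration, the "load" is simulated, and onLoaded is modelled by object "a" synchronously requesting object "b". It is a sketch of the idea under those assumptions, not a complete ObjectManager.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct request {
    const char *path;        /* what to load */
    int result;              /* filled in by the service thread */
    bool done;
    struct request *next;
} request_t;

pthread_t service_tid;       /* recorded when the service thread is created */
pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
request_t *pending;          /* simple LIFO list of queued requests */
bool running = true;
int loads_performed;         /* only touched by the service thread */

int load_object_sync(const char *path);

/* Body of the loading loop, factored out so it can be called directly.
   Only ever runs on the service thread. */
static int load_object_impl(const char *path)
{
    loads_performed++;
    /* Simulated onLoaded(): object "a" needs object "b" to be valid. */
    if (strcmp(path, "a") == 0)
        return 1 + load_object_sync("b");    /* re-entrant call */
    return 1;
}

int load_object_sync(const char *path)
{
    if (pthread_equal(pthread_self(), service_tid))
        return load_object_impl(path);       /* already on the service thread */

    request_t req = { .path = path };
    pthread_mutex_lock(&mu);
    req.next = pending;
    pending = &req;
    pthread_cond_broadcast(&cv);
    while (!req.done)                        /* any other thread: block */
        pthread_cond_wait(&cv, &mu);
    pthread_mutex_unlock(&mu);
    return req.result;
}

static void *service_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mu);
    while (running) {
        request_t *req = pending;
        if (req == NULL) {
            pthread_cond_wait(&cv, &mu);
            continue;
        }
        pending = req->next;
        pthread_mutex_unlock(&mu);
        int r = load_object_impl(req->path); /* may re-enter load_object_sync */
        pthread_mutex_lock(&mu);
        req->result = r;
        req->done = true;
        pthread_cond_broadcast(&cv);
    }
    pthread_mutex_unlock(&mu);
    return NULL;
}

int object_manager_start(void)
{
    return pthread_create(&service_tid, NULL, service_thread, NULL);
}

void object_manager_stop(void)
{
    pthread_mutex_lock(&mu);
    running = false;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);
    pthread_join(service_tid, NULL);
}
```

Calling load_object_sync("a") from any non-service thread queues one request. The nested load of "b" happens inline on the service thread, so the loop never waits on itself.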

get pthread_t from thread id

I am unable to find a function to convert a thread id (pid_t) into a pthread_t which would allow me to call pthread_cancel() or pthread_kill().
Even if pthreads doesn't provide one is there a Linux specific function?
I don't think such a function exists but I would be happy to be corrected.
I am well aware that it is usually preferable to have threads manage their own lifetimes via condition variables and the like.
This use is for testing purposes. I am trying to find a way to test how an application behaves when one of its threads 'dies'. So I'm really looking for a way to kill a thread. Using syscall(tgkill()) kills the process, so instead I provided a means for a tester to give the process the id of the thread to kill. I now need to turn that id into a pthread_t so that I can then:
use pthread_kill(tid,0) to check for its existence followed by
calling pthread_kill() or pthread_cancel() as appropriate.
This is probably taking testing to an unnecessary extreme. If I really want to do that some kind of mock pthreads implementation might be better.
Indeed if you really want robust isolation you are typically better off using processes rather than threads.
As a workaround I can create a table mapping &pthread_t to pid_t and ensure that I always invoke pthread_create() via a wrapper that adds an entry to this table. This works very well and allows me to convert an OS thread id to a pthread_t which I can then terminate using pthread_cancel(). Here is a snippet of the mechanism:
typedef void* (*threadFunc)(void*);
static void* start_thread(void* arg)
{
    threadFunc threadRoutine = routine_to_start;
    routine_to_start = NULL; // let the creating thread know it's safe to continue
    return threadRoutine(arg);
}
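A fuller sketch of the wrapper-plus-table workaround might look like the code below. The names (tracked_pthread_create, tid_to_pthread, tracked_pthread_join, echo_worker) and the fixed-size table are assumptions made for illustration. The sketch also assumes Linux, where syscall(SYS_gettid) returns the kernel thread id, and it keeps an entry until the thread is joined, because a pthread_t stays valid until then.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_TRACKED 64

struct tid_entry { int used; pid_t tid; pthread_t handle; };

static struct tid_entry tid_table[MAX_TRACKED];
static pthread_mutex_t table_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t table_cv = PTHREAD_COND_INITIALIZER;

struct start_args {
    void *(*routine)(void *);
    void *arg;
    pid_t tid;     /* filled in by the new thread */
    int ready;     /* set once the table entry exists */
};

static void *start_thread(void *p)
{
    struct start_args *args = p;
    void *(*routine)(void *) = args->routine;
    void *arg = args->arg;
    pid_t tid = (pid_t)syscall(SYS_gettid);

    pthread_mutex_lock(&table_mu);
    for (int i = 0; i < MAX_TRACKED; i++) {   /* a full table is silently ignored */
        if (!tid_table[i].used) {
            tid_table[i] = (struct tid_entry){ 1, tid, pthread_self() };
            break;
        }
    }
    args->tid = tid;
    args->ready = 1;     /* creator may return now; args is not used below */
    pthread_cond_broadcast(&table_cv);
    pthread_mutex_unlock(&table_mu);

    return routine(arg);
}

/* Like pthread_create, but also reports the new thread's kernel tid. */
int tracked_pthread_create(pthread_t *out, pid_t *tid_out,
                           void *(*routine)(void *), void *arg)
{
    assert(out != NULL && tid_out != NULL);
    struct start_args args = { routine, arg, 0, 0 };
    int rc = pthread_create(out, NULL, start_thread, &args);
    if (rc != 0)
        return rc;
    pthread_mutex_lock(&table_mu);
    while (!args.ready)                       /* wait for the entry to exist */
        pthread_cond_wait(&table_cv, &table_mu);
    pthread_mutex_unlock(&table_mu);
    *tid_out = args.tid;
    return 0;
}

/* Returns 0 and stores the handle if tid is tracked, -1 otherwise. */
int tid_to_pthread(pid_t tid, pthread_t *out)
{
    int rc = -1;
    pthread_mutex_lock(&table_mu);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tid_table[i].used && tid_table[i].tid == tid) {
            *out = tid_table[i].handle;
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&table_mu);
    return rc;
}

/* Joins the thread and drops its table entry. */
int tracked_pthread_join(pid_t tid, void **retval)
{
    pthread_t th;
    if (tid_to_pthread(tid, &th) != 0)
        return -1;
    int rc = pthread_join(th, retval);
    pthread_mutex_lock(&table_mu);
    for (int i = 0; i < MAX_TRACKED; i++)
        if (tid_table[i].used && tid_table[i].tid == tid)
            tid_table[i].used = 0;
    pthread_mutex_unlock(&table_mu);
    return rc;
}

/* Trivial routine used only to demonstrate the round trip. */
void *echo_worker(void *arg) { return arg; }
```

With this in place, the test hook can call tid_to_pthread on the id supplied by the tester and pass the result to pthread_kill or pthread_cancel.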
Sensible conversion requires there to be a 1:1 mapping between pthread_t and pid_t tid, which is the case with NPTL, but hasn't always been the case, and won't be the case on every pthread platform. That said...
Two options:
A) override the actual pthread_create, using LD_PRELOAD and dlsym, and keep track of each pthread_t and their corresponding pid_t there. To get the thread pid_t you can either take advantage of the pthread private headers to de-opaque the pthread_t and access the pid_t inside there, or if you want to stick to documented APIs pthread_sigqueue each pthread_t thread as it is created and have a sigaction signal handler call gettid and pass you back the thread pid_t, with appropriate synchronisation between your new pthread_create and the signal handler[1].
B) You can read all of the thread pid_t values from /proc/<process pid_t>/task/. Then use the SYS_rt_tgsigqueueinfo[2] syscall to implement a new function thread_sigqueue, a pid_t variant of pthread_sigqueue, so that you can signal the pid_t thread, and from the sigaction signal handler call pthread_self, passing out the value with suitable synchronization, etc.
1 - I think it's worth writing 2 executeOnThread variants (one for pthread_t and one for pid_t style thread ids) that take a std::function<void()> (for C++), or a void(*)(void*) function pointer and void* parameter (for C), and SIGUSR1 that thread to execute the passed function in a sigaction handler that you also set up to perform the relevant synchronization with the calling thread. It's handy to be able to use the thread-dependent APIs like pthread_self, gettid, backtrace, getrusage, etc. without devising a custom execution scheme each time.
2 - SYS_rt_tgsigqueueinfo is a low level syscall meant for implementing sigqueue/pthread_sigqueue, rather than application use, but is still a documented API, and we're using it to implement another variant of sigqueue, so fair game IMHO.
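The first step of option B, enumerating the thread ids, can be sketched as below. list_thread_ids and current_tid are hypothetical helper names, and the code assumes a Linux system with /proc mounted. It only lists the ids and does not attempt the signalling part.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel thread id of the calling thread. */
pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* Fills tids[] with up to max kernel thread ids of the calling process.
   Returns the number stored, or -1 if /proc/self/task cannot be opened. */
int list_thread_ids(pid_t *tids, int max)
{
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL)
        return -1;

    int n = 0;
    struct dirent *de;
    while (n < max && (de = readdir(dir)) != NULL) {
        char *end;
        long v = strtol(de->d_name, &end, 10);
        /* Only purely numeric names are thread ids; this skips "." and "..". */
        if (end != de->d_name && *end == '\0')
            tids[n++] = (pid_t)v;
    }
    closedir(dir);
    return n;
}
```

The list is a snapshot, so a thread may exit between the readdir and any later signal. The signalling code has to tolerate an id that no longer exists.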

The difference between wait_queue_head and wait_queue in linux kernel

I can find many examples regarding wait_queue_head.
It works like a signal: you create a wait_queue_head, and someone
can sleep on it until someone else wakes it up.
But I cannot find a good example of using wait_queue itself, which is supposedly closely related to it.
Could someone give an example, or explain what happens under the hood?
From Linux Device Drivers:
The wait_queue_head_t type is a fairly simple structure, defined in
<linux/wait.h>. It contains only a lock variable and a linked list
of sleeping processes. The individual data items in the list are of
type wait_queue_t, and the list is the generic list defined in
Normally the wait_queue_t structures are allocated on the stack by
functions like interruptible_sleep_on; the structures end up in the
stack because they are simply declared as automatic variables in the
relevant functions. In general, the programmer need not deal with
Take a look at the "A Deeper Look at Wait Queues" part:
Some advanced applications, however, can require dealing with
wait_queue_t variables directly. For these, it's worth a quick look at
what actually goes on inside a function like interruptible_sleep_on.
The following is a simplified version of the implementation of
interruptible_sleep_on to put a process to sleep:
void simplified_sleep_on(wait_queue_head_t *queue)
{
    wait_queue_t wait;
    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE;
    add_wait_queue(queue, &wait);
    schedule();
    remove_wait_queue (queue, &wait);
}
The code here creates a new wait_queue_t variable (wait, which gets
allocated on the stack) and initializes it. The state of the task is
set to TASK_INTERRUPTIBLE, meaning that it is in an interruptible
sleep. The wait queue entry is then added to the queue (the
wait_queue_head_t * argument). Then schedule is called, which
relinquishes the processor to somebody else. schedule returns only
when somebody else has woken up the process and set its state to
TASK_RUNNING. At that point, the wait queue entry is removed from the
queue, and the sleep is done.
The internals of the data structures involved in wait queues:
(The image is not my own; it is taken from Linux Device Drivers.)
A wait queue is simply a list of processes and a lock.
wait_queue_head_t represents the queue as a whole. It is the head of the waiting queue.
wait_queue_t represents the item of the list - a single process waiting in the queue.

How to run hrtimer handler in softirq context?

I have found this tutorial about hrtimer:
I believe the approach it uses runs the callback handler in hardirq context, right? But it also says: "One interesting aspect is the ability to define the execution context of the callback function (such as in softirq or hardirq context)."
I have checked the hrtimer.h file, but it's really not that intuitive. Does anyone know how to run the handler in softirq context? Is it similar to running it in hardirq context?
This information applies to an old kernel. In more recent releases this feature has been removed to reduce code complexity and avoid bugs. Now an hrtimer always runs in hardirq context with IRQs disabled.
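One possible approach is to use a tasklet_hrtimer:
#include <linux/interrupt.h>
struct tasklet_hrtimer mytimer;
enum hrtimer_restart callback(struct hrtimer *t) {
    struct tasklet_hrtimer *mytime = container_of(t, struct tasklet_hrtimer, timer);
    return HRTIMER_NORESTART; /* one-shot; runs in softirq context */
}
/* in your setup code: */
tasklet_hrtimer_init(&mytimer, callback, clock, mode);
tasklet_hrtimer_start(&mytimer, time, mode);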
In the example above you should replace clock, mode and time with appropriate values.
If you want to pass data to your callback, then you have to embed the tasklet_hrtimer variable in some struct of yours and use the container_of trick to get to your data.
It may not be obvious, but your struct will contain a tasklet_hrtimer, which in turn contains an hrtimer struct. When you get a pointer to the innermost element and you know that it has a fixed offset from the parent element, you can get to the parent.
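To make the offset arithmetic concrete, here is a userspace sketch of the same trick. The container_of macro is re-implemented with offsetof for illustration (the kernel ships its own), and inner_timer, outer_timer and my_device are stand-in types playing the roles of hrtimer, tasklet_hrtimer and your own struct.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-implementation for illustration only. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct inner_timer { int expires; };     /* stands in for struct hrtimer */

struct outer_timer {                     /* stands in for struct tasklet_hrtimer */
    int pad;
    struct inner_timer timer;
};

struct my_device {                       /* your own struct with the data you need */
    int id;
    struct outer_timer ttimer;
};

/* What a callback would do: it only receives the innermost pointer and
   walks outwards twice to reach the enclosing device. */
int device_id_from_timer(struct inner_timer *t)
{
    struct outer_timer *tt = container_of(t, struct outer_timer, timer);
    struct my_device *dev = container_of(tt, struct my_device, ttimer);
    return dev->id;
}
```

Each step subtracts the member's fixed offset from the member's address, which is why the timer must be embedded by value rather than referenced through a pointer.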

Asynchronous Sockets in Linux-- Polling vs. Callback via

In deciding to implement asynchronous sockets in my simple server (linux), I have run into a problem. I was going to continually poll(), and do some cleanup and caching between calls. Now this seems wasteful, so I did more digging and found a way to possibly implement some callbacks on i/o.
Would I incur a performance penalty, and more importantly would it work, if I created a socket with O_NONBLOCK, used the SIOCSPGRP ioctl() to have a SIGIO sent on i/o, and used sigaction() to define a callback function for that i/o?
In addition, can I define different functions for different sockets?
"I was going to continually poll(), and do some cleanup and caching between calls. Now this seems wasteful"
Wasteful how? Did you actually try and implement this?
You have your fd list. You call poll or (better) epoll() with the list. When it triggers, you walk the fd list and deal with each one appropriately. You need to cache incoming and outgoing data, so each fd needs some kind of struct. When I've done this, I've used a hash table for the fd structs (generating a key from the fd), but you are probably fine, at least initially, just using a fixed length array and checking in case the OS issues you a weirdly high fd (nb, I have never seen that happen and I've squinted thru more logs than I can count). The structs hold pointers to incoming and outgoing buffers, perhaps a state variable, eg:
struct connection {
    int fd; // mandatory for the hash table version
    unsigned char *dataOut;
    unsigned char *dataIn;
    int state; // probably from an enum
};
struct connection connected[1000]; // your array, or...
...probably a linked list is actually best for the fd's, I had an unrelated requirement for the hash table.
Start there and refine stepwise. I think you are just trying to find an easy way out -- that you may pay for later by making other things harder ;) $0.02.
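For what it's worth, here is a minimal sketch of the epoll variant of that loop. It is a slight variation on the struct above: the input buffer is a fixed array instead of a pointer, and the per-fd struct is reached through epoll's data.ptr instead of a hash table or array lookup. watch_connection, service_ready and the enum values are names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

enum conn_state { CONN_OPEN, CONN_CLOSED };

struct connection {
    int fd;
    unsigned char data_in[256];   /* fixed buffer instead of a pointer, for brevity */
    size_t in_len;
    int state;                    /* one of enum conn_state */
};

/* Registers the connection for readability events; epoll hands the
   struct pointer back with each event, so no fd lookup is needed. */
int watch_connection(int epfd, struct connection *c)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = c };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Waits up to timeout_ms and reads from every ready connection.
   Returns the number of ready connections, or -1 on error. */
int service_ready(int epfd, int timeout_ms)
{
    struct epoll_event events[16];
    int n = epoll_wait(epfd, events, 16, timeout_ms);

    for (int i = 0; i < n; i++) {
        struct connection *c = events[i].data.ptr;
        size_t room = sizeof c->data_in - c->in_len;
        if (room == 0)
            continue;                         /* buffer full: caller must drain it */
        ssize_t got = read(c->fd, c->data_in + c->in_len, room);
        if (got > 0)
            c->in_len += (size_t)got;
        else if (got == 0)
            c->state = CONN_CLOSED;           /* peer closed the connection */
    }
    return n;
}
```

Cleanup and caching work can run between calls to service_ready, with timeout_ms bounding how long the loop sleeps when nothing is ready.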
