Are callbacks serialised to a DriverKit driver? - multithreading

I need to enqueue in and out requests on bulk, interrupt and isochronous endpoints at the same time. Can I expect all the callbacks from these requests to come in one by one? Or should I expect multiple callbacks at the same time?

They should come in, serially, on the dispatch queue which is set for the relevant handler function.
So if you set all the handler functions to the same queue, they'll come in on the same thread, one after the other. This is also what happens by default, which leaves them on the default queue, aka kIOServiceDefaultQueueName.
DriverKit IODispatchQueues are currently always serial, so will only run on one thread at a time. (This is unlike regular user space GCD dispatch_queue_t dispatch queues, where concurrent variants exist.)
At least, that's my understanding and experience so far.
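As a rough sketch of what that looks like in practice (treat this as illustrative rather than definitive: the queue name, the class name MyDriver, and the placement in Start() are my own assumptions), you can create one serial IODispatchQueue and register it as the service's default queue, so every completion handler lands on it:

kern_return_t IMPL(MyDriver, Start)
{
    kern_return_t ret = Start(provider, SUPERDISPATCH);
    if (ret != kIOReturnSuccess) {
        return ret;
    }

    // One serial queue; every handler left on the default queue will be
    // invoked on it, one callback after the other.
    IODispatchQueue *queue = nullptr;
    ret = IODispatchQueue::Create("MyDriverQueue", 0 /* options */, 0 /* priority */, &queue);
    if (ret != kIOReturnSuccess) {
        return ret;
    }
    SetDispatchQueue(kIOServiceDefaultQueueName, queue);

    // ... open the endpoints and issue the async bulk/interrupt/isoc I/O here ...
    return kIOReturnSuccess;
}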

Related

How to unblock all threads waiting on a semaphore?

I am dealing with a standard producer-consumer problem with a finite array (or finitely many buffers). I tried implementing it using semaphores and I have run into a problem. I want the producer to 'produce' only, say, 50 times. After that I want the producer thread to join the main thread. This part is easy, but what I am unable to do is join the consumer threads. They are stuck waiting on the semaphore that signals available data. How do I solve this problem?
One possible option is to have a flag variable which becomes true when the producer joins the main thread; after that, the main thread would post(semaphore) as many times as there are worker threads. The worker threads would check the flag variable every time after waking up and, if it is true, exit the function.
I think my method is pretty inefficient because of the many semaphore posts. It would be great if I could unblock all threads at once!
Edit: I tried implementing what I described and it doesn't work, due to deadlock.
One option is the "poison pill" method. It assumes that you know how many consumer threads exist. Assuming there are N consumers, then after the producer has done its thing, it puts N "poison pills" into the queue. A "poison pill" is simply an object/value that is type-compatible with whatever the producer normally produces, but which is distinguishable from a normal object/value.
When a consumer recognizes that it has eaten a poison pill, it dies. Problem solved.
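For illustration, here is a minimal C++20 sketch of the poison-pill idea, assuming 4 consumers and 50 produced items (the names, the use of std::counting_semaphore for signalling, and the unbounded buffer are my own simplifications; an empty std::optional plays the role of the pill):

#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <semaphore>
#include <thread>
#include <vector>

constexpr int kConsumers = 4;
constexpr int kItems = 50;

std::queue<std::optional<int>> buffer;   // std::nullopt is the poison pill
std::mutex m;
std::counting_semaphore<> items{0};      // counts queued entries

void consumer(int id) {
    for (;;) {
        items.acquire();                 // block until something is queued
        std::optional<int> v;
        {
            std::lock_guard<std::mutex> lock(m);
            v = buffer.front();
            buffer.pop();
        }
        if (!v) break;                   // ate a poison pill: die
        std::printf("consumer %d got %d\n", id, *v);
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < kConsumers; ++i)
        pool.emplace_back(consumer, i);

    for (int i = 0; i < kItems; ++i) {   // produce the 50 real items
        { std::lock_guard<std::mutex> lock(m); buffer.push(i); }
        items.release();
    }
    for (int i = 0; i < kConsumers; ++i) {  // then one pill per consumer
        { std::lock_guard<std::mutex> lock(m); buffer.push(std::nullopt); }
        items.release();
    }
    for (auto &t : pool) t.join();       // all consumers now join cleanly
}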
I've done producer-consumer structures in C++ only on the FreeRTOS operating system, so keep that in mind; that has been my only experience so far with multitasking. In that program I used just one producer and one consumer. I've also done multitasking in LabVIEW, but that is a little different from what you might have, I think.
I think one option could be to have a queue structure, so that the producer enqueues elements into the queue; if the queue is full of data, then you can hopefully implement it so that you have some kind of queue policy as follows.
The producer can either
block itself until space is available in the queue to enqueue,
block itself for a certain time period, and continue elsewhere if the time passes without it succeeding in enqueuing the data, or
immediately go elsewhere.
So it looks like you have your enqueuing policy in order...
The queue readers can have the same three kinds of policies, at least in FreeRTOS; a minimal sketch of these policies follows below.
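For what it's worth, here is that sketch using the FreeRTOS queue API (the queue length, item type and timeout values are arbitrary examples of my own):

#include "FreeRTOS.h"
#include "queue.h"

QueueHandle_t q;   // created elsewhere, e.g. q = xQueueCreate(16, sizeof(int));

void producer_task(void *arg)
{
    int item = 42;

    // Policy 1: block until space is available in the queue.
    xQueueSend(q, &item, portMAX_DELAY);

    // Policy 2: block for up to 100 ms, then give up and continue elsewhere.
    if (xQueueSend(q, &item, pdMS_TO_TICKS(100)) != pdPASS) {
        // queue was still full after 100 ms
    }

    // Policy 3: do not block at all; immediately go elsewhere if full.
    if (xQueueSend(q, &item, 0) != pdPASS) {
        // queue is full right now
    }
}

void consumer_task(void *arg)
{
    int item;
    // Readers get the same choice of timeout; here: block until data arrives.
    if (xQueueReceive(q, &item, portMAX_DELAY) == pdPASS) {
        // process item
    }
}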
In general, if you have a binary semaphore, you have it so that the sender signals it and the receiver waits on it. It is used for synchronization or signalling.
In my opinion you have chosen the wrong approach with the many semaphore posts.
What you need is a queue structure where the producer puts its items...
Then the consumers read from the queue and do whatever they must do...
If the queue is empty, then you need a policy on what the queue-reader threads should do.
A policy choice is also needed for those queue readers and semaphore receivers: what should they do when the queue is empty, or when they haven't received the semaphore. I would not use semaphores for this kind of problem...
I think the boolean-variable idea could work, because you are only writing to that variable in the producer thread. The other threads should then be able to read and poll that boolean variable to see whether the producer is still active...
But I think you should provide more details about what you are trying to do, especially with the consumer threads: how many threads of which kind you have, what language you are programming in, and so on...

How does event-driven programming help a webserver that only does IO?

I'm considering a few frameworks/programming methods for our new backend project. It concerns a Backend-For-Frontend implementation, which aggregates downstream services. For simplicity, these are the steps it goes through:
Request comes into the webserver
Webserver makes downstream request
Downstream request returns result
Webserver returns the response
How is event-driven programming better than "regular" thread-per-request handling? Some websites try to explain, and it often comes down to something like this:
The second solution is a non-blocking call. Instead of waiting for the answer, the caller continues execution, but provides a callback that will be executed once data arrives.
What I don't understand: we need a thread/handler to await this data, right? It's nice that the event handler can continue, but we still need (in this example) a thread/handler per request that awaits each downstream request, right?
Consider this example: the downstream requests take n seconds to return. In this n seconds, r requests come in. In the thread-per-request we need r threads: one for every request. After n seconds pass, the first thread is done processing and available for a new request.
When implementing an event-driven design, we need r+1 threads: an event loop and r handlers. Each handler takes a request, performs it, and calls the callback once done.
So how does this improve things?
What I don't understand: we need a thread/handler to await this data, right?
Not really. The idea behind NIO is that no threads ever get blocked.
It is interesting because the operating system already works in a non-blocking way. It is our programming languages that were modeled in a blocking manner.
As an example, imagine that you had a computer with a single CPU. Any I/O operation you do will be orders of magnitude slower than the CPU, right? Say you want to read a file. Do you think the CPU will sit there, idle, doing nothing while the disk head goes and fetches a few bytes and puts them in the disk buffer? Obviously not. The operating system registers an interrupt handler (i.e. a callback) and uses the valuable CPU for something else in the meantime. When the disk head has managed to read a few bytes and made them available to be consumed, an interrupt is triggered and the OS then gives it attention, restores the previous process's state and allocates some CPU time to handle the available data.
So, in this case, the CPU is like a thread in your application. It is never blocked. It is always doing some CPU-bound stuff.
The idea behind NIO programming is the same. In the case you're exposing, imagine that your HTTP server has a single thread. When you receive a request from your client you need to make an upstream request (which represents I/O). So what a NIO framework would do here is to issue the request and register a callback for when the response is available.
Immediately after that your valuable single thread is released to attend yet another request, which is going to register another callback, and so on, and so on.
When the callback resolves, it will be automatically scheduled to be processed by your single thread.
As such, that thread works as an event loop, one in which you're supposed to schedule only CPU bound stuff. Every time you need to do I/O, that's done in a non-blocking way and when that I/O is complete, some CPU-bound callback is put into the event loop to deal with the response.
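To make the mechanics concrete, here is a minimal single-threaded event-loop sketch built on poll(2); the Connection struct and the on_readable/on_writable callbacks are illustrative stand-ins, not any particular framework's API:

#include <poll.h>
#include <vector>

struct Connection {
    int fd;
    void (*on_readable)(Connection&);   // CPU-bound callback: data has arrived
    void (*on_writable)(Connection&);   // CPU-bound callback: we may send now
};

void run_event_loop(std::vector<Connection>& conns) {
    for (;;) {
        std::vector<pollfd> pfds;
        for (auto& c : conns) {
            pollfd p;
            p.fd = c.fd;
            p.events = POLLIN | POLLOUT;
            p.revents = 0;
            pfds.push_back(p);
        }

        // The single thread blocks here until *any* connection is ready,
        // instead of blocking on one connection's read or write.
        poll(pfds.data(), pfds.size(), -1);

        for (size_t i = 0; i < pfds.size(); ++i) {
            if (pfds[i].revents & POLLIN)  conns[i].on_readable(conns[i]);
            if (pfds[i].revents & POLLOUT) conns[i].on_writable(conns[i]);
        }
    }
}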
This is a powerful concept, because with a very small number of threads you can process thousands of requests and therefore scale more easily. Do more with less.
This feature is one of the major selling points of Node.js and the reason why even using a single thread it can be used to develop backend applications.
Likewise, this is the reason for the proliferation of frameworks like Netty, RxJava, the Reactive Streams initiative and Project Reactor. They are all seeking to promote this type of optimization and programming model.
There is also an interesting wave of new frameworks that leverage these powerful features and try to compete with or complement one another. I'm talking about interesting projects like Vert.x and Ratpack, and I'm pretty sure there are many more out there for other languages.
The whole non-blocking paradigm is built around this idea of the "event loop".
Interesting references:
http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch06lev1sec2.html
Understanding the Event Loop
https://www.youtube.com/watch?v=8aGhZQkoFbQ

Relative merits between one thread per client and queuing thread models for a threaded server?

Let's say we're building a threaded server intended to run on a system with four cores. The two thread management schemes I can think of are one thread per client connection and a queuing system.
As the first system's name implies, we'll spawn one thread per client that connects to our server. Assuming one thread is always dedicated to our program's main thread of execution, we'll be able to handle up to three clients concurrently and for any more simultaneous clients than that we'll have to rely on the operating system's preemptive multitasking functionality to switch among them (or the VM's in the case of green threads).
For our second approach, we'll make two thread-safe queues. One is for incoming messages and one is for outgoing messages. In other words, requests and replies. That means we'll probably have one thread accepting incoming connections and placing their requests into the incoming queue. One or two threads will handle the processing of the incoming requests, resolving the appropriate replies, and placing those replies on the outgoing queue. Finally, we'll have one thread just taking replies off of that queue and sending them back out to the clients.
What are the pros and cons of these approaches? Notice that I didn't mention what kind of server this is. I'm assuming that which one has a better performance profile depends on whether the server handles short connections, like web servers and POP3 servers, or longer connections, like WebSocket servers, game servers, and messaging app servers.
Are there other thread management strategies besides these two?
I believe I've done both organizations at one time or another.
Method 1
Just so we're on the same page, the first has the main thread do a listen. Then, in a loop, it does accept. It then passes the returned socket to pthread_create, and the client thread's loop does recv/send in a loop, processing all commands the remote client wants. When done, it cleans up and terminates.
For an example of this, see my recent answer: multi-threaded file transfer with socket
This has the virtues that the main thread and client threads are straightforward and independent. No thread waits on anything another thread is doing. No thread is waiting on anything that it doesn't have to. Thus, the client threads [plural] can all run at maximum line speed. Also, if a client thread is blocked on a recv or send, and another thread can go, it will. It is self balancing.
All thread loops are simple: wait for input, process, send output, repeat. Even the main thread is simple: sock = accept, pthread_create(sock), repeat
Another thing. The interaction between the client thread and its remote client can be anything they agree on. Any protocol or any type of data transfer.
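For reference, here is a stripped-down sketch of method 1 using POSIX sockets and pthreads (error handling is omitted, the port number is arbitrary, and the echo loop merely stands in for whatever protocol the client thread actually speaks):

#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static void *client_thread(void *arg)
{
    int sock = (int)(long)arg;
    char buf[4096];
    ssize_t n;

    // per-client loop: wait for input, process, send output, repeat
    while ((n = recv(sock, buf, sizeof(buf), 0)) > 0)
        send(sock, buf, n, 0);           // echo back in place of real work

    close(sock);
    return NULL;
}

int main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsock, 64);

    // main thread: sock = accept, pthread_create(sock), repeat
    for (;;) {
        int csock = accept(lsock, NULL, NULL);
        pthread_t tid;
        pthread_create(&tid, NULL, client_thread, (void *)(long)csock);
        pthread_detach(tid);
    }
}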
Method 2
This is somewhat akin to an N worker model, where N is fixed.
Because the accept is [usually] blocking, we'll need a main thread that is similar to method 1's. Except that instead of firing up a new thread, it needs to malloc a control struct [or use some other mgmt scheme] and put the socket in that. It then puts this on a list of client connections and loops back to the accept.
In addition to the N worker threads, you are correct: we need at least two control threads, one to do select/poll, recv, and enqueue the request, and one to wait for a result, select/poll, and send.
Two threads are needed to prevent either of them from having to wait on two different things at once: the various sockets [as a group] and the request/result queues from the various worker threads. With a single control thread, all actions would have to be non-blocking and the thread would spin like crazy.
Here is an [extremely] simplified version of what the threads look like:
// control thread for recv:
while (1) {
    // (1) do blocking poll on all client connection sockets for read
    poll(...)

    // (2) for all pending sockets, recv a request block and enqueue it
    //     on the request queue
    for (all in read_mask) {
        request_buf = dequeue(control_free_list);
        recv(request_buf);
        enqueue(request_list, request_buf);
    }
}

// control thread for send:
while (1) {
    // (1) do blocking wait on result queue
    // (2) peek at all result queue elements and create an aggregate write mask
    //     for poll from the socket numbers
    // (3) do blocking poll on all client connection sockets for write
    poll(...)

    // (4) for all pending sockets that can be written to
    for (all in write_mask) {
        // find and dequeue the first result buffer from the result queue
        // that matches the given client
        result_buf = dequeue(result_list, client_id);
        send(result_buf);
        enqueue(control_free_list, result_buf);
    }
}

// worker thread:
while (1) {
    // (1) do blocking wait on request queue
    request_buf = dequeue(request_list);
    // (2) process request ...
    // (3) enqueue the result on the result queue
    enqueue(result_list, request_buf);
}
Now, a few things to notice. Only one request queue was used for all worker threads. The recv control thread did not try to pick an idle [or under utilized] worker thread and enqueue to a thread specific queue [this is another option to consider].
The single request queue is probably the most efficient. But, maybe, not all worker threads are created equal. Some may end up on CPU cores [or cluster nodes] that have special acceleration H/W, so some requests may have to be sent to specific threads.
And, if that is done, can a thread do "work stealing"? That is, a thread completes all its work and notices that another thread has a request in its queue [that is compatible] but hasn't been started. The thread dequeues the request and starts working on it.
Here's a big drawback to this method. The request/result blocks are of [mostly] fixed size. I've done an implementation where the control struct could have a field for a "side/extra" payload pointer of arbitrary size.
But, if doing a large file transfer, either upload or download, trying to pass it piecemeal through request blocks is not a good idea.
In the download case, the worker thread could usurp the socket temporarily and send the file data before enqueuing the result to the control thread.
But, for the upload case, if the worker tried to do the upload in a tight loop, it would conflict with recv control thread. The worker would have to [somehow] alert the control thread to not include the socket in its poll mask.
This is beginning to get complex.
And, there is overhead to all this request/result block enqueue/dequeue.
Also, the two control threads are a "hot spot". The entire throughput of the system depends on them.
And, there are interactions between the sockets. In the simple case, the recv thread can start a recv on one socket, but other clients wishing to send requests are delayed until that recv completes. It is a bottleneck.
This means that all recv syscalls have to be non-blocking [asynchronous]. The control thread has to manage these async requests (i.e. initiate one and wait for an async completion notification, and only then enqueue the request on the request queue).
This is beginning to get complicated.
The main benefit of doing this is being able to handle a large number of simultaneous clients (e.g. 50,000) while keeping the number of threads at a sane value (e.g. 100).
Another advantage of this method is that it is possible to assign priorities and use multiple priority queues.
Comparison and hybrids
Meanwhile, method 1 does everything that method 2 does, but in a simpler, more robust [and, I suspect, higher-throughput] way.
After a method 1 client thread is created, it might split the work up and create several sub-threads. It could then act like the control threads of method 2. In fact, it might draw on these threads from a fixed N pool just like method 2.
This would compensate for a weakness of method 1, where the thread is going to do heavy computation. With a large number of threads all doing computation, the system would get swamped. The queuing approach helps alleviate this. The client thread is still created/active, but it's sleeping on the result queue.
So, we've just muddied up the waters a bit more.
Either method could be the "front facing" method and have elements of the other underneath.
A given client thread [method 1] or worker thread [method 2] could farm out its work by opening [yet] another connection to a "back office" compute cluster. The cluster could be managed with either method.
So, method 1 is simpler and easier to implement and can easily accommodate most job mixes. Method 2 might be better for heavy compute servers, to throttle the requests to limited resources. But care must be taken with method 2 to avoid bottlenecks.
I don't think your "second approach" is well thought out, so I'll just see if I can tell you how I find it most useful to think about these things.
Rule 1) Your throughput is maximized if all your cores are busy doing useful work. Try to keep your cores busy doing useful work.
These are things that can keep you from keeping your cores busy doing useful work:
you are keeping them busy creating threads. If tasks are short-lived, then use a thread pool so you aren't spending all your time starting up and killing threads.
you are keeping them busy switching contexts. Modern OSes are pretty good at multithreading, but if you've got to switch jobs 10,000 times per second, that overhead is going to add up. If that's a problem for you, you'll have to consider an event-driven architecture or some other sort of more efficient explicit scheduling.
your jobs block or wait for a long time, and you don't have the resources to run enough threads to keep your cores busy. This can be a problem when you're serving protocols with persistent connections that hang around doing nothing most of the time, like WebSocket chat. You don't want to keep a whole thread hanging around doing nothing by tying it to a single client. You'll need to architect around that.
All your jobs need some other resource besides CPU, and you're bottlenecked on that -- that's a discussion for another day.
All that said... for most request/response kinds of protocols, passing each request or connection off to a thread pool that assigns it a thread for the duration of the request is easy to implement and performant in most cases.
Rule 2) Given maximized throughput (all your cores are usefully busy), getting jobs done on a first-come, first-served basis minimizes latency and maximizes responsiveness.
This is true, but in most servers it is not considered at all. You can run into trouble here when your server is busy and jobs have to stop, even for short moments, to perform a lot of blocking operations.
The problem is that there is nothing to tell the OS thread scheduler which thread's job came in first. Every time your thread blocks and then becomes ready, it is scheduled on equal terms with all the other threads. If the server is busy, that means that the time it takes to process your request is roughly proportional to the number of times it blocks. That is generally no good.
If you have to block a lot in the process of processing a job, and you want to minimize the overall latency of each request, you'll have to do your own scheduling that keeps track of which jobs started first. In an event-driven architecture, for example, you can give priority to handling events for jobs that started earlier. In a pipelined architecture, you can give priority to later stages of the pipeline.
Remember these two rules, design your server to keep your cores busy with useful work, and do first things first. Then you can have a fast and responsive server.

How can an Akka actor interact with other threads?

I've read the Akka documentation and can't form a clear understanding of how threads interact when using Akka. The docs may omit this as obvious, but it is not so obvious to me.
All Akka actors seem to run in the same thread from which they are called. I see actors as coroutines that just have their own stack reset each time receive is called.
You may perform a huge chain of actor switches in a straight line. Each receive performs a small non-blocking operation and triggers another receive to carry the work further. There is no event loop that can handle messages from outside the actor system.
I'd like to catch a request from another thread, perform control operations, and wait for another message.
There are some use cases that outline my needs.
There is a thread that constantly polls data from some sources. Once the data matches a pattern, it invokes an event-driven handler based on actors. A logical controller makes a decision and passes it to the workers. There should be two persistent threads: one works constantly on polling and the other works asynchronously to control it. You should not put Akka actors on the first thread, since they would break its polling periods, and the first thread should not block the actors, so they need a thread of their own.
There is some kind of two-sided board game. One side has a controller thread that schedules calculation time, interacts with the board server, and so on. The other thread is a heavy calculation thread that loops over different variants and could not be written in Akka, since it is blocking by nature.
I am aware of Akka futures, but they represent a unit of work that runs once fired and shuts down after achieving its goal. Futures combine well with Akka actors, but they cannot express long-running looped worker threads.
The Akka actor system incorporates different kinds of network event loops. You may use its built-in remote actors or the well-known 0MQ protocol. But using the network for thread interaction seems like overkill to me.
What is the intended way to glue a non-Akka thread to an Akka one? Should I write a couple of special procedures to perform the message passing in a thread-safe way?
If you need polling, then the polling thread should just turn whatever is polled into a message and fire it off to an actor.
I find it more useful to use an actor with a receiveTimeout to do non-blocking polling at an interval, and when something gets polled, it publishes it to some other actor, or perhaps even to its ActorSystem's EventStream, for true pub-sub action.

Threadpool multi-queue job dispatch algorithm

I'm curious to know if there is a widely accepted solution for managing thread resources in a threadpool given the following scenario/constraints:
Incoming jobs are all of the same nature and could be processed by any thread in the pool.
Incoming jobs will be 'bucketed' into different queues based on some attribute of the incoming job, such that all jobs going to the same bucket/queue MUST be processed serially.
Some buckets will be less busy than others at different points during the lifetime of the program.
My question is on the theory behind a threadpool's implementation. What algorithm could be used to efficiently allocate available threads to incoming jobs across all buckets?
Edit: Another design goal would be to eliminate as much latency as possible between a job being enqueued and it being picked up for processing, assuming there are available idle threads.
Edit2: In the case I'm thinking of there are a relatively large number of queues (50-100) which have unpredictable levels of activity, but probably only 25% of them will be active at any given time.
The first (and most costly) solution I can think of is to simply have 1 thread assigned to each queue. While this will ensure incoming requests are picked up immediately, it is obviously inefficient.
The second solution is to combine the queues together based on expected levels of activity, so that the number of queues is in line with the number of threads in the pool, allowing one thread to be assigned to each queue. The problem here will be that incoming jobs, which otherwise could be processed in parallel, will be forced to wait on each other.
The third solution is to create the maximum number of queues, one for each set of jobs that must be processed serially, but only allocate threads based on the number of queues we expect to be busy at any given time (which could also be adjusted by the pool at runtime). So this is where my question comes in: Given that we have more queues than threads, how does the pool go about allocating idle threads to incoming jobs in the most efficient way possible?
I would like to know if there is a widely accepted approach. Or if there are different approaches - who makes use of which one? What are the advantages/disadvantages, etc?
Edit 3: This might be best expressed in pseudocode.
You should probably eliminate nr. 2 from your specification. All you really need to comply with is that threads take up buckets and process the queues inside the buckets in order. It makes no sense to process a serialized queue with another threadpool or to do some serialization of tasks in parallel. Thus your spec simply becomes that the threads iterate the FIFO in the buckets, and it's up to the pool manager to insert properly constructed buckets. So your bucket will be:
struct task_bucket
{
    void *ctx;     // context-relevant data
    fifo_t *queue; // your fifo
};
Then it's up to you to make the threadpool smart enough to know what to do on each iteration of the queue. For example the ctx can be a function pointer and the queue can contain data for that function, so the worker thread simply calls the function on each iteration with the provided data.
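For illustration, a minimal sketch of that idea (fifo_t, fifo_pop and the task_fn type are hypothetical names standing in for whatever your queue implementation provides):

typedef void (*task_fn)(void *item);   // hypothetical handler signature

// worker thread, once it has taken a bucket:
void drain_bucket(struct task_bucket *b)
{
    task_fn fn = (task_fn)b->ctx;      // ctx holds the function to apply
    void *item;

    // process the bucket's queue strictly in FIFO order
    while ((item = fifo_pop(b->queue)) != NULL)
        fn(item);
}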
Reflecting the comments:
If the size of the bucket list is known beforehand and isn't likely to change during the lifetime of the program, you'd need to figure out if that is important to you. You will need some way for the threads to select a bucket to take. The easiest way is to have a FIFO queue that is filled by the manager and emptied by the threads. Classic reader/writer.
Another possibility is a heap. The worker removes the highest priority from the heap and processes the bucket queue. Both removal by the workers and insertion by the manager reorders the heap so that the root node is the highest priority.
Both these strategies assume that the workers throw away the buckets and the manager makes new ones.
If keeping the buckets is important, you run the risk of workers only attending to the last-modified task, so the manager will either need to reorder the bucket list or modify the priorities of each bucket, with the worker iterating to look for the highest priority. It is important that the memory behind ctx remains valid while threads are working, or the threads will have to copy it as well. Workers can simply assign the queue locally and set queue to NULL in the bucket.
ADDED: I now tend to agree that you might start simple and just keep a separate thread for each bucket, and only if this simple solution is understood to have problems you look for something different. And a better solution might depend on what exactly problems the simple one causes.
In any case, I leave my initial answer below, appended with an afterthought.
You can make a special global queue of "job is available in bucket X" signals.
All idle workers would wait on this queue, and when a signal is put into the queue one thread will take it and proceed to the corresponding bucket to process jobs there until the bucket becomes empty.
When an incoming job is submitted into an in-order bucket, it should be checked whether a worker thread is already assigned to this bucket. If one is assigned, the new job will eventually be processed by that worker thread, so no signal should be sent. If no worker is assigned, check whether the bucket is empty or not. If it is empty, place a signal into the global signal queue that a new job has arrived in this bucket; if it is not empty, such a signal should have been made already and a worker thread should arrive soon, so do nothing.
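A rough C++ sketch of that signal-queue scheme (the Bucket type, the has_worker flag and the int job payload are illustrative choices of mine, not a known implementation):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <queue>

struct Bucket {
    std::queue<int> jobs;     // jobs that must run serially for this bucket
    bool has_worker = false;  // is a worker already draining this bucket?
};

std::mutex m;
std::condition_variable signal_cv;
std::deque<Bucket*> signal_queue;     // "job available in bucket X" signals

void submit(Bucket &b, int job) {
    std::lock_guard<std::mutex> lock(m);
    bool was_empty = b.jobs.empty();
    b.jobs.push(job);
    // Signal only if no worker owns the bucket and it was empty before;
    // otherwise either a signal is already queued or the owning worker
    // will pick the job up on its next pass.
    if (!b.has_worker && was_empty) {
        signal_queue.push_back(&b);
        signal_cv.notify_one();
    }
}

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        signal_cv.wait(lock, [] { return !signal_queue.empty(); });
        Bucket *b = signal_queue.front();
        signal_queue.pop_front();
        b->has_worker = true;
        while (!b->jobs.empty()) {       // drain the bucket serially
            int job = b->jobs.front();
            b->jobs.pop();
            lock.unlock();
            // process(job) ... done outside the lock
            lock.lock();
        }
        b->has_worker = false;
    }
}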
ADDED: I got a thought that my idea above can cause starvation of some jobs if the number of threads is less than the number of "active" buckets and there is a never-ending flow of incoming tasks. If all threads are already busy and a new job arrives in a bucket that is not yet being served, it may take a long time before a thread is freed to work on this new job. So there is a need to check whether there are idle workers, and if not, create a new one... which adds more complexity.
Keep it Simple: I'd use 1 thread per queue. Simplicity is worth a lot, and threads are quite cheap. 100 threads won't be an issue on most OS's.
By using a thread per queue, you also get a real scheduler. If a thread blocks (depending on what you're doing), another thread can be scheduled. You won't get deadlock until every single one blocks. The same cannot be said if you use fewer threads: if the queues the threads happen to be servicing block, then even if other queues are "runnable", and even if those other queues might unblock the blocked threads, you'll have deadlock.
Now, in particular scenarios, using a threadpool may be worth it. But then you're talking about optimizing a particular system, and the details matter. How expensive are threads? How good is the scheduler? What about blocking? How long are the queues, how frequently updated, etc.
So in general, with just the information that you have around 100 queues, I'd just go for a thread per queue. Yes, there's some overhead: all solutions will have that. A threadpool will introduce synchronization issues and overhead. And the overhead of a limited number of threads is fairly minor. You're mostly talking about around 100MB of address space - not necessarily memory. If you know most queues will be idle, you could further implement an optimization to stop threads on empty queues and start them when needed (but beware of race conditions and thrashing).
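As a rough sketch of that last optimization (all names are illustrative, and the code that restarts a worker when a job lands on an idle queue is left out), a per-queue worker can simply exit after an idle timeout instead of pinning a thread forever:

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>

struct WorkQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> jobs;
    bool has_thread = false;  // is a worker currently running for this queue?
};

void queue_worker(WorkQueue &q) {
    std::unique_lock<std::mutex> lock(q.m);
    for (;;) {
        // Wait for work, but give up after 30 s of inactivity.
        if (!q.cv.wait_for(lock, std::chrono::seconds(30),
                           [&] { return !q.jobs.empty(); })) {
            q.has_thread = false;   // idle timeout: let this thread die
            return;                 // the submitter must spawn a new worker later
        }
        int job = q.jobs.front();
        q.jobs.pop();
        lock.unlock();
        // process(job) ... outside the lock
        lock.lock();
    }
}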
