Calling accept() from multiple threads - linux

I'm writing a concurrent TCP server that has to handle multiple connections with the 'thread per connection' approach (using a thread pool). My question is about the best way for every thread to get a different file descriptor.
I found that the following two methods are the most commonly recommended:
A main thread that calls accept() on all incoming connections and stores the descriptors in a data structure (e.g. a queue). Every worker thread then takes an fd from the queue.
accept() is called directly from every thread. (Recommended in Unix Network Programming V1.)
Problems I find with each of them:
The shared data structure that stores all the fds must be locked (mutex_lock) before a thread can read from it, so if a considerable number of threads want to read at exactly the same moment, I don't know how long they would all be kept waiting.
I've been reading that the Thundering Herd problem related to simultaneous accept() calls has not been totally solved on Linux yet, so maybe I would need to create an artificial solution to it that would end up making the application at least as slow as approach 1.
Sources:
Some links discussing approach 2: does-the-thundering-herd-problem-exist-on-linux-anymore, and an (outdated) article I found about it: linux-scalability/reports/accept.html
And an SO answer that recommends approach 1: can-i-call-accept-for-one-socket-from-several-threads-simultaneously
I'm really interested in the matter, so I will appreciate any opinion about it :)

As mentioned in the StackOverflow answer you linked, a single thread calling accept() is probably the way to go. You mention concerns about locking, but these days you will find lockfree queue implementations available in Boost.Lockfree, Intel TBB, and elsewhere. You could use one of those if you like, but you might just use a condition variable to let your worker threads sleep and wake one of them when a new connection is established.
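For illustration, here is a rough sketch of approach 1 in C, assuming a POSIX environment; the names (fd_queue, QUEUE_SIZE, handle_connection) are made up for the example, and queue-overflow handling is omitted:

#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define QUEUE_SIZE 128

static int fd_queue[QUEUE_SIZE];            /* ring buffer of accepted fds */
static int q_head = 0, q_tail = 0, q_count = 0;
static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

/* main thread: accept connections and hand them to the pool */
void acceptor_loop(int listen_fd)
{
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            continue;                        /* real code would inspect errno */
        pthread_mutex_lock(&q_mutex);
        fd_queue[q_tail] = fd;
        q_tail = (q_tail + 1) % QUEUE_SIZE;
        q_count++;
        pthread_cond_signal(&q_nonempty);    /* wake exactly one sleeping worker */
        pthread_mutex_unlock(&q_mutex);
    }
}

/* worker threads: sleep until a descriptor is available, then serve it */
void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_mutex);
        while (q_count == 0)
            pthread_cond_wait(&q_nonempty, &q_mutex);
        int fd = fd_queue[q_head];
        q_head = (q_head + 1) % QUEUE_SIZE;
        q_count--;
        pthread_mutex_unlock(&q_mutex);

        /* handle_connection(fd);  -- application-specific work */
        close(fd);
    }
    return NULL;
}

The lock is held only for a few pointer updates, so contention stays low even with many workers; the condition variable is what lets idle workers sleep instead of spinning on the queue.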

Related

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (Or maybe this is just a dumb example; in that case please point me to a sensible one.)
Thanks in advance.
Consider this:
#include <semaphore.h>

sem_t g_shared_variable_mutex;   /* a semaphore initialised to 1, used as a mutex */
int   g_shared_variable = 0;

void update_shared_variable() {
    sem_wait( &g_shared_variable_mutex );
    g_shared_variable++;
    sem_post( &g_shared_variable_mutex );
}

void thread1() {
    do_thing_1a();
    do_thing_1b();
    do_thing_1c();
    update_shared_variable(); // may block
}

void thread2() {
    do_thing_2a();
    do_thing_2b();
    do_thing_2c();
    update_shared_variable(); // may block
}
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)
When you are using multithreading, not every piece of code that runs will be blocking. For example, if you had a queue and two threads reading from that queue, you would make sure that no two threads read from it at the same time, so that part would be blocking, but it's the part that will probably take the least time. Once you have retrieved the item to process from the queue, all the rest of the code can run in parallel.
The idea behind threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not run multiple instances of that processing so it finishes faster? The bottleneck is just what you mentioned: when a process has to wait for I/O.
When the time spent blocked waiting for the shared resource is small compared to the processing time, that is when you want to use multiple threads.
This is of course a SSCCE (Short, Self Contained, Correct Example)
Let's say you have 2 worker threads that do a lot of work and write the result to a file.
You only need to lock access to the file (the shared resource).
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers, like Intel's, will convert suitable for loops to threads automatically for you. In those particular circumstances no semaphores are needed because the iterations' data is independent.
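As a hypothetical sketch (plain pthreads rather than a parallelising compiler), the key point is that each thread works on a disjoint slice of the data, so no semaphores are needed; the names and loop body here are invented for the example:

#include <pthread.h>

#define N 1000000
static double input[N], output[N];

struct slice { int lo, hi; };

/* each thread touches only its own range, so no locking is required */
static void *process_slice(void *arg)
{
    struct slice *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        output[i] = input[i] * 2.0;     /* stand-in for the per-iteration work */
    return NULL;
}

void parallel_for(int nthreads)     /* assumes 1 <= nthreads <= 16 */
{
    pthread_t tid[16];
    struct slice s[16];
    int chunk = N / nthreads;

    for (int t = 0; t < nthreads; t++) {
        s[t].lo = t * chunk;
        s[t].hi = (t == nthreads - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, process_slice, &s[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}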
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data, then doing A, then B, and then outputting the data before reading more input. Or you could have one thread reading and doing A, and another thread doing B and the output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
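A minimal sketch of that arrangement, with hypothetical stage_A/stage_B functions and the two semaphores initialised exactly as described (empty = 1, filled = 0):

#include <semaphore.h>

static char buffer[4096];              /* shared buffer for the interim result */
static sem_t empty_sem, filled_sem;    /* initialised to 1 and 0 respectively */

void *stage_A(void *arg)               /* reads input, does step A */
{
    for (;;) {
        /* ... read some data and compute the interim result ... */
        sem_wait(&empty_sem);          /* wait until the buffer is free */
        /* ... write the interim result into buffer ... */
        sem_post(&filled_sem);         /* tell stage B there is data to read */
    }
    return NULL;
}

void *stage_B(void *arg)               /* does step B and outputs */
{
    for (;;) {
        sem_wait(&filled_sem);         /* wait until stage A has written */
        /* ... read buffer, do step B, output the result ... */
        sem_post(&empty_sem);          /* buffer may now be overwritten */
    }
    return NULL;
}

/* in main():  sem_init(&empty_sem, 0, 1);  sem_init(&filled_sem, 0, 0); */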
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data, it would be sent a copy of that data down something a bit like a network socket, or a pipe. The advantage of CSP (which limits that communication channel so that a send finishes only if the receiver has read) is that it stops you falling into the many, many pitfalls that plague multithreaded programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture or AMD's HyperTransport. And it means that the 'channel' really could be a network connection; scalability built in by design.

epoll_wait on several Threads faster?

I am thinking of programming a TCP server based on epoll.
To achieve the best performance I want to implement multi-core support too. But during my research the following question came up:
Is it faster to call two epoll_wait()-Calls from two different threads, each observing their own file descriptors on a dual core? Or is this as fast as calling just one single epoll_wait() which observes all file descriptors?
Since the kernel observes the file descriptors, I think it doesn't matter how many threads in user space call epoll_wait()?
You can even call epoll_wait concurrently on multiple threads for the same epoll_fd as long as you use edge-triggered (EPOLLET) mode (and are careful about synchronisation). Using that approach you can get real performance benefits on multi-core machines compared to a single-threaded epoll event loop.
I have actually done performance measurements some time ago, see my blog postings for the results:
http://cmeerw.org/blog/748.html#748
http://cmeerw.org/blog/746.html#746
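As a rough sketch (not the code measured in those posts), several threads can run the same loop on a shared epoll descriptor, with sockets registered edge-triggered; handle_io is a placeholder:

#include <sys/epoll.h>

#define MAX_EVENTS 64

/* every worker thread runs this loop on the SAME epoll fd */
void *epoll_worker(void *arg)
{
    int epfd = *(int *)arg;
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            /* with EPOLLET the handler must keep reading until it gets
               EAGAIN, otherwise data left pending produces no new events */
            /* handle_io(fd);  -- application-specific */
            (void)fd;
        }
    }
    return NULL;
}

/* registration, done once per socket:
   struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sock };
   epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);                               */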
I suspect you're correct that the performance of epoll per se is no different in the two scenarios you mention. What could make a difference, OTOH, is that by having an event loop in each thread you don't need to context switch to a separate worker thread to handle the connection. This is, for example, how nginx works; see http://www.aosabook.org/en/nginx.html .

What logically is an event loop in a thread?

I came across Node.js and Python's Tornado vs Apache.
They say:
Apache makes a thread for every connection.
Node.js & Tornado actually do event looping on a thread, and a single thread can handle many connections.
I don't understand what an event loop logically is within a thread.
In computer science terms:
Processes have isolated memory and share CPU with context switches.
Threads divide a process.
Therefore, a process with multiple control points is achieved by multiple threads.
Now,
How does an event loop work under a thread?
How can it handle different connections under the control of one thread?
Update :
I mean, if there is communication with 3 sockets under 1 thread, how can 1 thread communicate with 3 sockets without keeping any of them waiting?
An event loop at its basic level is something like:
while (getNextEvent(&event)) {
    dispatchEvent(&event);
}
In other words, it's nothing more than a loop which continuously retrieves events from a queue of some description, then dispatches the event to an event handling procedure.
It's likely you know that already but I'm just explaining it for context.
In terms of how the different servers handle it, it appears that every new connection being made in Apache has a thread created for it, and that thread is responsible for that connection and nothing else.
For the other two, it's likely that there are a "set" number of threads running (though this may actually vary based on load) and a connection is handed off to one of those threads. That means any one thread may be handling multiple connections at any point in time.
So the event in that case would have to include some details as to what connection it applies to, so the thread can keep the different connections isolated from each other.
There are no doubt pros and cons to both options. A one-connection-per-thread option would have simpler code in the thread function, since it doesn't have to deal with multiple connections, but it may end up with a lot of resource usage as the load gets high.
In a multiple-connection-per-thread scenario, the code is a little more complex but you can generally minimise thread creation and destruction overhead by simply having the maximum number of threads running all the time. Outside of high-load periods, they'll just be sitting around doing nothing, waiting on a connection event to be given to them.
And, even under high load, it may be that each thread can quite easily process five concurrent connections without dropping behind, which would mean the one-connection-per-thread option was a little wasteful.
Based on your update:
I mean, if there is communication with 3 sockets under 1 thread, how can 1 thread communicate with 3 sockets without keeping any of them waiting?
There are a great many ways to do this. For a start, it would generally all be abstracted behind the getNextEvent() call, which would probably be responsible for handling all connections and farming them out to the correct threads.
At the lowest levels, this could be done with something like a select call, a function that awaits activity on one of many file descriptors, and returns information relating to which file descriptor has something to say.
For example, you provide a file descriptor set of all currently open sockets and pass that to select. It will then give you back a modified set, containing only those that are of interest to you (such as ready-to-read-from).
You can then query that set and dispatch events to the corresponding thread.
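For example, a single thread can watch three sockets with select() along these lines (a simplified sketch; real code would check errno and handle closed connections):

#include <sys/select.h>

/* serve three sockets from one thread; whichever is readable gets handled */
void serve_three(int s1, int s2, int s3)
{
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(s1, &readfds);
        FD_SET(s2, &readfds);
        FD_SET(s3, &readfds);

        int maxfd = s1;
        if (s2 > maxfd) maxfd = s2;
        if (s3 > maxfd) maxfd = s3;

        /* blocks until at least one socket has data to read */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            continue;

        if (FD_ISSET(s1, &readfds)) { /* dispatch an event for s1 */ }
        if (FD_ISSET(s2, &readfds)) { /* dispatch an event for s2 */ }
        if (FD_ISSET(s3, &readfds)) { /* dispatch an event for s3 */ }
    }
}

The thread never blocks on a socket that has nothing to say; it only blocks in select(), which returns as soon as any of the three is ready.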

fork in multi-threaded program

I've heard that mixing forking and threading in a program could be very problematic, often resulting in mysterious behavior, especially when dealing with shared resources such as locks, pipes, and file descriptors. But I never fully understood what exactly the dangers are and when they could happen. It would be great if someone with expertise in this area could explain in a bit more detail what the pitfalls are and what needs care when programming in such an environment.
For example, if I want to write a server that collects data from various different sources, one solution I've thought of is to have the server spawn a set of threads, each calling popen() to run another program that does the actual work and opening pipes to get the data back from the child. Each of these threads is responsible for its own work, with no data exchanged between them, and when the data is collected, the main thread has a queue and these worker threads just put their results in the queue. What could go wrong with this solution?
Please do not narrow your answer by just "answering" my example scenario. Any suggestions, alternative solutions, or experiences that are not related to the example but helpful to provide a clean design would be great! Thanks!
The problem with forking when you do have some threads running is that the fork only carries over the one thread that called it. It's as if all of the other threads just died, instantly, wherever they may be.
The result of this is locks aren't released, and shared data (such as the malloc heap) may be corrupted.
pthread does offer a pthread_atfork function - in theory, you could take every lock in the program before forking, release them after, and maybe make it out alive - but it's risky, because you could always miss one. And, of course, the stacks of the other threads won't be freed.
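A minimal sketch of that idea, with a single hypothetical lock; a real program would need handlers for every lock it owns, which is exactly where it's easy to miss one:

#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

static void before_fork(void)       { pthread_mutex_lock(&big_lock); }
static void parent_after_fork(void) { pthread_mutex_unlock(&big_lock); }
static void child_after_fork(void)  { pthread_mutex_unlock(&big_lock); }

void install_fork_handlers(void)
{
    /* before_fork runs just before fork(); the other two run just after,
       in the parent and in the child respectively */
    pthread_atfork(before_fork, parent_after_fork, child_after_fork);
}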
It is really quite simple. The problems with multiple threads and processes always arise from shared data. If there is no shared data then no issues can arise.
In your example the shared data is the queue owned by the main thread - any potential contention or race conditions will arise here. Typical methods for "solving" these issues involve locking schemes - a worker thread will lock the queue before inserting any data, and the main thread will lock the queue before removing data from it.

Many threads or as few threads as possible?

As a side project I'm currently writing a server for an age-old game I used to play. I'm trying to make the server as loosely coupled as possible, but I am wondering what would be a good design decision for multithreading. Currently I have the following sequence of actions:
Startup (creates) ->
Server (listens for clients, creates) ->
Client (listens for commands and sends periodic data)
I'm assuming an average of 100 clients, as that was the max at any given time for the game. What would be the right decision as for threading of the whole thing? My current setup is as follows:
1 thread on the server which listens for new connections, on new connection create a client object and start listening again.
Client object has one thread, listening for incoming commands and sending periodic data. This is done using a non-blocking socket, so it simply checks if there's data available, deals with that and then sends messages it has queued. Login is done before the send-receive cycle is started.
One thread (for now) for the game itself, as I consider that to be separate from the whole client-server part, architecturally speaking.
This would result in a total of 102 threads. I am even considering giving the client 2 threads, one for sending and one for receiving. If I do that, I can use blocking I/O on the receiver thread, which means that thread will be mostly idle in an average situation.
My main concern is that by using this many threads I'll be hogging resources. I'm not worried about race conditions or deadlocks, as that's something I'll have to deal with anyway.
My design is setup in such a way that I could use a single thread for all client communications, no matter if it's 1 or 100. I've separated the communications logic from the client object itself, so I could implement it without having to rewrite a lot of code.
The main question is: is it wrong to use over 200 threads in an application? Does it have advantages? I'm thinking about running this on a multi-core machine, would it take a lot of advantage of multiple cores like this?
Thanks!
Out of all these threads, most of them will usually be blocked. I don't expect connections to exceed 5 per minute. Commands from the client will come in infrequently, I'd say 20 per minute on average.
Going by the answers I get here (the context switching was the performance hit I was thinking about, but I didn't know that until you pointed it out, thanks!) I think I'll go for the approach with one listener, one receiver, one sender, and some miscellaneous stuff ;-)
use an event stream/queue and a thread pool to maintain the balance; this will adapt better to other machines which may have more or fewer cores
in general, many more active threads than you have cores will waste time context-switching
if your game consists of a lot of short actions, a circular/recycling event queue will give better performance than a fixed number of threads
To answer the question simply, it is entirely wrong to use 200 threads on today's hardware.
Each thread takes up 1 MB of memory (its default stack reservation), so you're taking up 200 MB of page file before you even start doing anything useful.
By all means break your operations up into little pieces that can be safely run on any thread, but put those operations on queues and have a fixed, limited number of worker threads servicing those queues.
Update: Does wasting 200MB matter? On a 32-bit machine, it's 10% of the entire theoretical address space for a process - no further questions. On a 64-bit machine, it sounds like a drop in the ocean of what could be theoretically available, but in practice it's still a very big chunk (or rather, a large number of pretty big chunks) of storage being pointlessly reserved by the application, and which then has to be managed by the OS. It has the effect of surrounding each client's valuable information with lots of worthless padding, which destroys locality, defeating the OS and CPU's attempts to keep frequently accessed stuff in the fastest layers of cache.
In any case, the memory wastage is just one part of the insanity. Unless you have 200 cores (and an OS capable of utilizing them) then you don't really have 200 parallel threads. You have (say) 8 cores, each frantically switching between 25 threads. Naively you might think that as a result of this, each thread experiences the equivalent of running on a core that is 25 times slower. But it's actually much worse than that - the OS spends more time taking one thread off a core and putting another one on it ("context switching") than it does actually allowing your code to run.
Just look at how any well-known successful design tackles this kind of problem. The CLR's thread pool (even if you're not using it) serves as a fine example. It starts off assuming just one thread per core will be sufficient. It allows more to be created, but only to ensure that badly designed parallel algorithms will eventually complete. It refuses to create more than 2 threads per second, so it effectively punishes thread-greedy algorithms by slowing them down.
I write in .NET and I'm not sure if the way I code is due to .NET limitations and their API design or if this is a standard way of doing things, but this is how I've done this kind of thing in the past:
A queue object that will be used for processing incoming data. This should be sync locked between the queuing thread and worker thread to avoid race conditions.
A worker thread for processing data in the queue. The thread that adds data to the queue uses a semaphore to notify this thread to process items in the queue. This thread will start itself before any of the other threads and contain a continuous loop that can run until it receives a shut down request. The first instruction in the loop is a flag to pause/continue/terminate processing. The flag will be initially set to pause so that the thread sits in an idle state (instead of looping continuously) while there is no processing to be done. The queuing thread will change the flag when there are items in the queue to be processed. This thread will then process a single item in the queue on each iteration of the loop. When the queue is empty it will set the flag back to pause so that on the next iteration of the loop it will wait until the queuing process notifies it that there is more work to be done.
One connection listener thread which listens for incoming connection requests and passes these off to...
A connection processing thread that creates the connection/session. Having a separate thread from your connection listener thread means that you're reducing the potential for missed connection requests due to reduced resources while that thread is processing requests.
An incoming data listener thread that listens for incoming data on the current connection. All data is passed off to a queuing thread to be queued up for processing. Your listener threads should do as little as possible outside of basic listening and passing the data off for processing.
A queuing thread that queues up the data in the right order so everything can be processed correctly; this thread signals the semaphore to let the processing thread know there's data to be processed. Having this thread separate from the incoming data listener means that you're less likely to miss incoming data.
Some session object which is passed between methods so that each user's session is self contained throughout the threading model.
This keeps threads down to as simple but as robust a model as I've figured out. I would love to find a simpler model than this, but I've found that if I try and reduce the threading model any further, I start missing data on the network stream or miss connection requests.
It also assists with TDD (Test Driven Development) such that each thread is processing a single task and is much easier to code tests for. Having hundreds of threads can quickly become a resource allocation nightmare, while having a single thread becomes a maintenance nightmare.
It's far simpler to keep one thread per logical task the same way you would have one method per task in a TDD environment and you can logically separate what each should be doing. It's easier to spot potential problems and far easier to fix them.
What's your platform? If Windows then I'd suggest looking at async operations and thread pools (or I/O Completion Ports directly if you're working at the Win32 API level in C/C++).
The idea is that you have a small number of threads that deal with your I/O and this makes your system capable of scaling to large numbers of concurrent connections because there's no relationship between the number of connections and the number of threads used by the process that is serving them. As expected, .Net insulates you from the details and Win32 doesn't.
The challenge of using async I/O and this style of server is that the processing of client requests becomes a state machine on the server and the data arriving triggers changes of state. Sometimes this takes some getting used to but once you do it's really rather marvellous;)
I've got some free code that demonstrates various server designs in C++ using IOCP here.
If you're using unix or need to be cross platform and you're in C++ then you might want to look at boost ASIO which provides async I/O functionality.
I think the question you should be asking is not if 200 as a general thread number is good or bad, but rather how many of those threads are going to be active.
If only several of them are active at any given moment, while all the others are sleeping or waiting or whatnot, then you're fine. Sleeping threads, in this context, cost you nothing.
However if all of those 200 threads are active, you're going to have your CPU wasting a great deal of time doing thread context switches between all those ~200 threads.
