Threadpool multi-queue job dispatch algorithm

Threadpool multi-queue job dispatch algorithm - multithreading

I'm curious to know if there is a widely accepted solution for managing thread resources in a threadpool given the following scenario/constraints:
Incoming jobs are all of the same
nature and could be processed by any
thread in the pool.
Incoming jobs
will be 'bucketed' into different
queues based on some attribute of
the incoming job such that all jobs
going to the same bucket/queue MUST
be processed serially.
Some buckets will be less busy than
others at different points during
the lifetime of the program.
My question is on the theory behind a threadpool's implementation. What algorithm could be used to efficiently allocate available threads to incoming jobs across all buckets?
Edit: Another design goal would be to eliminate as much latency as possible between a job being enqueued and it being picked up for processing, assuming there are available idle threads.
Edit2: In the case I'm thinking of there are a relatively large number of queues (50-100) which have unpredictable levels of activity, but probably only 25% of them will be active at any given time.
The first (and most costly) solution I can think of is to simply have 1 thread assigned to each queue. While this will ensure incoming requests are picked up immediately, it is obviously inefficient.
The second solution is to combine the queues together based on expected levels of activity so that the number of queues is inline with the number of threads in the pool, allowing one thread to be assigned to each queue. The problem here will be that incoming jobs, which otherwise could be processed in parallel, will be forced to wait on each other.
The third solution is to create the maximum number of queues, one for each set of jobs that must be processed serially, but only allocate threads based on the number of queues we expect to be busy at any given time (which could also be adjusted by the pool at runtime). So this is where my question comes in: Given that we have more queues than threads, how does the pool go about allocating idle threads to incoming jobs in the most efficient way possible?
I would like to know if there is a widely accepted approach. Or if there are different approaches - who makes use of which one? What are the advantages/disadvantages, etc?
Edit3:This might be best expressed in pseudo code.

You should probably eliminate nr. 2 from your specification. All you really need to comply to is that threads take up buckets and process the queues inside the buckets in order. It makes no sense to process a serialized queue with another threadpool or do some serialization of tasks in parallel. Thus your spec simply becomes that the threads iterate the fifo in the buckets and it's up to the poolmanager to insert properly constructed buckets. So your bucket will be:
struct task_bucket
{
void *ctx; // context relevant data
fifo_t *queue; // your fifo
};
Then it's up to you to make the threadpool smart enough to know what to do on each iteration of the queue. For example the ctx can be a function pointer and the queue can contain data for that function, so the worker thread simply calls the function on each iteration with the provided data.
Reflecting the comments:
If the size of the bucket list is known before hand and isn't likely to change during the lifetime of the program, you'd need to figure out if that is important to you. You will need some way for the threads to select a bucket to take. The easiest way is to have a FIFO queue that is filled by the manager and emptied by the threads. Classic reader/writer.
Another possibility is a heap. The worker removes the highest priority from the heap and processes the bucket queue. Both removal by the workers and insertion by the manager reorders the heap so that the root node is the highest priority.
Both these strategies assume that the workers throw away the buckets and the manager makes new ones.
If keeping the buckets is important, you run the risk of workers only attending to the last modified task, so the manager will either need to reorder the bucket list or modify priorities of each bucket and the worker iterates looking for the highest priority. It is important that memory of ctx remains relevant while threads are working or threads will have to copy this as well. Workers can simply assign the queue locally and set queue to NULL in the bucket.

ADDED: I now tend to agree that you might start simple and just keep a separate thread for each bucket, and only if this simple solution is understood to have problems you look for something different. And a better solution might depend on what exactly problems the simple one causes.
In any case, I leave my initial answer below, appended with an afterthought.
You can make a special global queue of "job is available in bucket X" signals.
All idle workers would wait on this queue, and when a signal is put into the queue one thread will take it and proceed to the corresponding bucket to process jobs there until the bucket becomes empty.
When an incoming job is submitted into an in-order bucket, it should be checked whether a worker thread is assigned to this bucket already. If assigned, the new job will be eventually processed by this worker thread, so no signal should be sent. If not worker is assigned, check whether the bucket is empty or not. If empty, place a signal into the global signal queue that a new job has arrived in this bucket; if not empty, such a signal should have been made already and a worker thread should soon arrive, so do nothing.
ADDED: I got a thought that my idea above can cause starvation for some jobs if the number of threads is less than the number of "active" buckets and there is a non-ending flow of incoming tasks. If all threads are already busy and a new job arrives into a bucket that is not yet served, it may take long time before a thread is freed to work on this new job. So there is a need to check if there are idle workers, and if not, create a new one... which adds more complexity.

Keep it Simple: I'd use 1 thread per queue. Simplicity is worth a lot, and threads are quite cheap. 100 threads won't be an issue on most OS's.
By using a thread per queue, you also get a real scheduler. If a thread blocks (depends on what you're doing), another thread can be queued. You won't get deadlock until every single one blocks. The same cannot be said if you use fewer threads - if the queues the threads happen to be servicing block, then even if other queues are "runnable" and even if these other queue's might unblock the blocked threads, you'll have deadlock.
Now, in particular scenarios, using a threadpool may be worth it. But then you're talking about optimizing a particular system, and the details matter. How expensive are threads? How good is the scheduler? What about blocking? How long are the queues, how frequently updated, etc.
So in general, with just the information that you have around 100 queues, I'd just go for a thread per queue. Yes, there's some overhead: all solutions will have that. A threadpool will introduce synchronization issues and overhead. And the overhead of a limited number of threads is fairly minor. You're mostly talking about around 100MB of address space - not necessarily memory. If you know most queues will be idle, you could further implement an optimization to stop threads on empty queues and start them when needed (but beware of race conditions and thrashing).

Related

How to unblock all threads waiting on a semaphore?

I am dealing with a standard producer and consumer problem with finite array (or finitely many buffers ). I tried implementing it using semaphores and I have run into a problem. I want the producer to 'produce' only say 50 times. After that I want the producer thread to join the main thread. This part is easy, but what I am unable to do is to join the consumer threads. They are stuck on the semaphore signaling that there is no data. How do I solve this problem?
One possible option is to have a flag variable which becomes True when producer joins main and after that, the main thread would do post(semaphore) as many times as the number of worker threads. The worker threads would check the flag variable every time after waking up and if True, it would exit the function.
I think my method is pretty inefficient because of the many post semaphore calls. It would be great if I can unblock all threads at once!
Edit: I tried implementing whatever I said and it doesn't work due to deadlock

One option is the "poison pill" method. It assumes that you know how many consumer threads exist. Assuming there are N consumers, then after the producer has done it's thing, it puts N "poison pills" into the queue. A "poison pill" simply is an object/value that is type-compatible with whatever the producer normally produces, but which is distinguishable from a normal object/value.
When a consumer recognizes that it has eaten a poison pill, it dies. Problem solved.

I've done producer consumer structures in C++ in FreeRTOS operating system only, so keep that in mind. That has been my only experience so far with multitasking. I would say that I only used one producer in that program and one consumer. And I've done multitasking in LabView, but this is little bit different from what you might have, I think.
I think that one option could be to have a queue structure, so that the producer enqueues elements into the queue but if it's full of data, then you can hopefully implement it so that you can make some kind of queue policy as follows.
producer can either
block itself until space is available in the queue to enqueue,
block itself for certain time period, and continue elsewhere if time spent and didnt succeed in enqueuing data
immediately go elsewhere
So it looks like you have your enqueuing policy in order...
The queue readers are able to have similar three type of policies at least in FreeRTOS.
In general if you have a binary semaphore, then you have it so that the sender is sending it, and the receiver is waiting on it. It is used for synchronization or signalling.
In my opinion you have chosen the wrong approach with the "many semaphores" (???)
What you need to have is a queue structure where the producer inputs stuff...
Then, the consumers read from the queue whatever they must do...
If the queue is empty then you need a policy on what the queue reader threads should do.
Policy choice is needed also for those queue readers and semaphore readers on what they should do, when the queue is empty, or if they havent gotten the semaphore received. I would not use semaphores for this kind of problem...
I think the boolean variable idea could work, because you are only writing into that variable in the producer thread. Then the other threads should be able to read and poll that boolean variable if the producer is active...
But I think that you should provide more details what you are trying to do, especially with the consumer threads, how many threads of what kind you have, and what language you are programming in etc...

Single Producer, Multiple Consumers with a few unusual twists

I have a (Posix) server that acts as a proxy for many clients to another upstream server. Messages typically flow down from the upstream server, are then matched against, and pushed out to some subset of the clients interested in that traffic (maintaining the FIFO order from the upstream server). Currently, this proxy server is single threaded using an event loop (e.g. - select, epoll, etc.), but now I'd like to make it multithreaded so that the proxy can more fully utilize an entire machine and achieve much higher throughput.
My high level design is to have a pool of N worker pthreads (where N is some small multiple of the number of cores on the machine) who each run their own event loop. Each client connection will be assigned to a specific worker thread who would then be responsible for servicing all of that client's I/O + timeout needs for the duration of that client connection. I also intend to have a single dedicated thread who pulls in the messages in from the upstream server. Once a message is read in, its contents can be considered constant / unchanging, until it is no longer needed and reclaimed. The workers never alter the message contents -- they just pass them along to their clients as needed.
My first question is: should the matching of client interests preferably be done by the producer thread or the worker threads?
In the former approach, for each worker thread, the producer could check the interests (e.g. - group membership) of the worker's clients. If the message matched any clients, then it could push the message onto a dedicated queue for that worker. This approach requires some kind of synchronization between the producer and each worker about their client's rarely changing interests.
In the latter approach, the producer just pushes every message onto some kind of queue shared by all of the worker threads. Then each worker thread checks ALL of the messages for a match against their clients' interests and processes each message that matches. This is a twist on the usual SPMC problem where a consumer is usually assumed to unilaterally take an element for themselves, rather than all consumers needing to do some processing on every element. This approach distributes the matching work across multiple threads, which seems desirable, but I worry it may cause more contention between the threads depending on how we implement their synchronization.
In both approaches, when a message is no longer needed by any worker thread, it then needs to be reclaimed. So, some tracking needs to be done to know when no worker thread needs a message any longer.
My second question is: what is a good way of tracking whether a message is still needed by any of the worker threads?
A simple way to do this would be to assign to each message a count of how many worker threads still need to process the message when it is first produced. Then, when each worker is done processing a message it would decrement the count in a thread-safe manner and if/when the count went to zero we would know it could be reclaimed.
Another way to do this would be to assign 64b sequence numbers to the messages as they came in, then each thread could track and record the highest sequence number up through which they have processed somehow. Then we could reclaim all messages with sequence numbers less than or equal to the minimum processed sequence number across all of the worker threads in some manner.
The latter approach seems like it could more easily allow for a lazy reclamation process with less cross-thread synchronization necessary. That is, you could have a "clean-up" thread that only runs periodically who goes and computes the minimum across the worker threads, with much less inter-thread synchronization being necessary. For example, if we assume that reads and writes of a 64b integer are atomic and a worker's fully processed sequence number is always monotonically increasing, then the "clean-up" thread can just periodically read the workers' fully processed counts (maybe with some memory barrier) and compute the minimum.
Third question: what is the best way for workers to realize that they have new work to do in their queue(s)?
Each worker thread is going to be managing its own event loop of client file descriptors and timeouts. Is it best for each worker thread to just have their own pipe to which signal data can be written by the producer to poke them into action? Or should they just periodically check their queue(s) for new work? Are there better ways to do this?
Last question: what kind of data structure and synchronization should I use for the queue(s) between the producer and the consumer?
I'm aware of lock-free data structures but I don't have a good feel for whether they'd be preferable in my situation or if I should instead just go with a simple mutex for operations that affect the queue. Also, in the shared queue approach, I'm not entirely sure how a worker thread should track "where" it is in processing the queue.
Any insights would be greatly appreciated! Thanks!

Based on your problem description, matching of client interests needs to be done for each client for each message anyway, so the work in matching is the same whichever type of thread it occurs in. That suggests the matching should be done in the client threads to improve concurrency. Synchronization overhead should not be a major issue if the "producer" thread ensures the messages are flushed to main memory (technically, "synchronize memory with respect to other threads") before their availability is made known to the other threads, as the client threads can all read the information from main memory simultaneously without synchronizing with each other. The client threads will not be able to modify messages, but they should not need to.
Message reclamation is probably better done by tracking the current message number of each thread rather than by having a message specific counter, as a message specific counter presents a concurrency bottleneck.
I don't think you need formal queueing mechanisms. The "producer" thread can simply keep a volatile variable updated which contains the number of the most recent message that has been flushed to main memory, and the client threads can check the variable when they are free to do work, sleeping if no work is available. You could get more sophisticated on the thread management, but the additional efficiency improvement would likely be minor.
I don't think you need sophisticated data structures for this. You need volatile variables for the number of the latest message that is available for processing and for the number of the most recent message that have been processed by each client thread. You need to flush the messages themselves to main memory. You need some way of finding the messages in main memory from the message number, perhaps using a circular buffer of pointers, or of messages if the messages are all of the same length. You don't really need much else with respect to the data to be communicated between the threads.

Why are message queues used insted of mulithreading?

I have the following query which i need someone to please help me with.Im new to message queues and have recently started looking at the Kestrel message queue.
As i understand,both threads and message queues are used for concurrency in applications so what is the advantage of using message queues over multitreading ?
Please help
Thank you.

message queues allow you to communicate outside your program.
This allows you to decouple your producer from your consumer. You can spread the work to be done over several processes and machines, and you can manage/upgrade/move around those programs independently of each other.
A message queue also typically consists of one or more brokers that takes care of distributing your messages and making sure the messages are not lost in case something bad happens (e.g. your program crashes, you upgrade one of your programs etc.)
Message queues might also be used internally in a program, in which case it's often just a facility to exchange/queue data from a producer thread to a consumer thread to do async processing.

Actually, one facilitates the other. Message queue is a nice and simple multithreading pattern: when you have a control thread (usually, but not necessarily an application's main thread) and a pool of (usually looping) worker threads, message queues are the easiest way to facilitate control over the thread pool.
For example, to start processing a relatively heavy task, you submit a corresponding message into the queue. If you have more messages, than you can currently process, your queue grows, and if less, it goes vice versa. When your message queue is empty, your threads sleep (usually by staying locked under a mutex).
So, there is nothing to compare: message queues are part of multithreading and hence they're used in some more complicated cases of multithreading.

Creating threads is expensive, and every thread that is simultaneously "live" will add a certain amount of overhead, even if the thread is blocked waiting for something to happen. If program Foo has 1,000 tasks to be performed and doesn't really care in what order they get done, it might be possible to create 1,000 threads and have each thread perform one task, but such an approach would not be terribly efficient. An second alternative would be to have one thread perform all 1,000 tasks in sequence. If there were other processes in the system that could employ any CPU time that Foo didn't use, this latter approach would be efficient (and quite possibly optimal), but if there isn't enough work to keep all CPUs busy, CPUs would waste some time sitting idle. In most cases, leaving a CPU idle for a second is just as expensive as spending a second of CPU time (the main exception is when one is trying to minimize electrical energy consumption, since an idling CPU may consume far less power than a busy one).
In most cases, the best strategy is a compromise between those two approaches: have some number of threads (say 10) that start performing the first ten tasks. Each time a thread finishes a task, have it start work on another until all tasks have been completed. Using this approach, the overhead related to threading will be cut by 99%, and the only extra cost will be the queue of tasks that haven't yet been started. Since a queue entry is apt to be much cheaper than a thread (likely less than 1% of the cost, and perhaps less than 0.01%), this can represent a really huge savings.
The one major problem with using a job queue rather than threading is that if some jobs cannot complete until jobs later in the list have run, it's possible for the system to become deadlocked since the later tasks won't run until the earlier tasks have completed. If each task had been given a separate thread, that problem would not occur since the threads associated with the later tasks would eventually manage to complete and thus let the earlier ones proceed. Indeed, the more earlier tasks were blocked, the more CPU time would be available to run the later ones.

It makes more sense to contrast message queues and other concurrency primitives, such as semaphores, mutex, condition variables, etc. They can all be used in the presence of threads, though message-passing is also commonly used in non-threaded contexts, such as inter-process communication, whereas the others tend to be confined to inter-thread communication and synchronisation.
The short answer is that message-passing is easier on the brain. In detail...
Message-passing works by sending stuff from one agent to another. There is generally no need to coordinate access to the data. Once an agent receives a message it can usually assume that it has unqualified access to that data.
The "threading" style works by giving all agent open-slather access to shared data but requiring them to carefully coordinate their access via primitives. If one agent misbehaves, the process becomes corrupted and all hell breaks loose. Message passing tends to confine problems to the misbehaving agent and its cohort, and since agents are generally self-contained and often programmed in a sequential or state-machine style, they tend not to misbehave as often — or as mysteriously — as conventional threaded code.

Optimal sleep time in multiple producer / single consumer model

I'm writing an application that has a multiple producer, single consumer model (multiple threads send messages to a single file writer thread).
Each producer thread contains two queues, one to write into, and one for a consumer to read out of. Every loop of the consumer thread, it iterates through each producer and lock that producer's mutex, swaps the queues, unlocks, and writes out from the queue that the producer is no longer using.
In the consumer thread's loop, it sleeps for a designated amount of time after it processes all producer threads. One thing I immediately noticed was that the average time for a producer to write something into the queue and return increased dramatically (by 5x) when I moved from 1 producer thread to 2. As more threads are added, this average time decreases until it bottoms out - there isn't much difference between the time taken with 10 producers vs 15 producers. This is presumably because with more producers to process, there is less contention for the producer thread's mutex.
Unfortunately, having < 5 producers is a fairly common scenario for the application and I'd like to optimize the sleep time so that I get reasonable performance regardless of how many producers exist. I've noticed that by increasing the sleep time, I can get better performance for low producer counts, but worse performance for large producer counts.
Has anybody else encountered this, and if so what was your solution? I have tried scaling the sleep time with the number of threads, but it seems somewhat machine specific and pretty trial-and-error.

You could pick the sleep time based on the number of producers or even make the sleep time adapt based on some dyanmic scheme. If the consumer wakes up and has no work, double the sleep time, otherwise halve it. But constrain the sleep time to some minimum and maximum.
Either way you're papering over a more fundamental issue. Sleeping and polling is easy to get right and sometimes is the only approach available, but it has many drawbacks and isn't the "right" way.
You can head in the right direction by adding a semaphore which is incremented whenever a producer adds an item to a queue and decremented when the consumer processes an item in a queue. The consumer will only wake up when there are items to process and will do so immediately.
Polling the queues may still be a problem, though. You could add a new queue that refers to any queue which has items on it. But it rather raises the question as to why you don't have a single queue that the consumer processes rather than a queue per producer. All else being equal that sounds like the best approach.

Instead of sleeping, I would recommend that your consumer block on a condition signaled by the producers. On a posix-compliant system, you could make it work with pthread_cond. Create an array of pthread_cond_t, one for each producer, then create an additional one that is shared between them. The producers first signal their individual condition variable, and then the shared one. The consumer waits on the shared condition and then iterates over the elements of the array, performing a pthread_cond_timed_wait() on each element of the array (use pthread_get_expiration_np() to get the absolute time for "now"). If the wait returns 0, then that producer has written data. The consumer must reinitialize the condition variables before waiting again.
By using blocking waits, you'll minimize the amount time the consumer is needlessly locking-out the producers. You could also make this work with semaphores, as stated in a previous answer. Semaphores have simplified semantics compared to conditions, in my opinion, but you'd have to be careful to decrement the shared semaphore once for each producer that was processed on each pass through the consumer loop. Condition variables have the advantage that you can basically use them like boolean semaphores if you reinitialize them after they are signaled.

Try to find an implementation of a Blocking Queue in the language that you use for programming. No more than one queue will be enough for any number of producers and one consumer.

To me it sounds like you are accidentally introducing some buffering by having the consumer thread be busy somewhere else, either sleeping or doing actual work. (the queue acting as the buffer) Maybe doing some simple buffering on the producer side will reduce your contention.
It seems that your system is highly sensitive to lock-contention between the producer and consumer, but I'm baffled as to why such a simple swap operation would occupy enough cpu time to show up in your run stats.
Can you show some code?
edit: maybe you are taking your lock and swapping queues even when there is no work to do?

Many threads or as few threads as possible?

As a side project I'm currently writing a server for an age-old game I used to play. I'm trying to make the server as loosely coupled as possible, but I am wondering what would be a good design decision for multithreading. Currently I have the following sequence of actions:
Startup (creates) ->
Server (listens for clients, creates) ->
Client (listens for commands and sends period data)
I'm assuming an average of 100 clients, as that was the max at any given time for the game. What would be the right decision as for threading of the whole thing? My current setup is as follows:
1 thread on the server which listens for new connections, on new connection create a client object and start listening again.
Client object has one thread, listening for incoming commands and sending periodic data. This is done using a non-blocking socket, so it simply checks if there's data available, deals with that and then sends messages it has queued. Login is done before the send-receive cycle is started.
One thread (for now) for the game itself, as I consider that to be separate from the whole client-server part, architecturally speaking.
This would result in a total of 102 threads. I am even considering giving the client 2 threads, one for sending and one for receiving. If I do that, I can use blocking I/O on the receiver thread, which means that thread will be mostly idle in an average situation.
My main concern is that by using this many threads I'll be hogging resources. I'm not worried about race conditions or deadlocks, as that's something I'll have to deal with anyway.
My design is setup in such a way that I could use a single thread for all client communications, no matter if it's 1 or 100. I've separated the communications logic from the client object itself, so I could implement it without having to rewrite a lot of code.
The main question is: is it wrong to use over 200 threads in an application? Does it have advantages? I'm thinking about running this on a multi-core machine, would it take a lot of advantage of multiple cores like this?
Thanks!
Out of all these threads, most of them will be blocked usually. I don't expect connections to be over 5 per minute. Commands from the client will come in infrequently, I'd say 20 per minute on average.
Going by the answers I get here (the context switching was the performance hit I was thinking about, but I didn't know that until you pointed it out, thanks!) I think I'll go for the approach with one listener, one receiver, one sender, and some miscellaneous stuff ;-)

use an event stream/queue and a thread pool to maintain the balance; this will adapt better to other machines which may have more or less cores
in general, many more active threads than you have cores will waste time context-switching
if your game consists of a lot of short actions, a circular/recycling event queue will give better performance than a fixed number of threads

To answer the question simply, it is entirely wrong to use 200 threads on today's hardware.
Each thread takes up 1 MB of memory, so you're taking up 200MB of page file before you even start doing anything useful.
By all means break your operations up into little pieces that can be safely run on any thread, but put those operations on queues and have a fixed, limited number of worker threads servicing those queues.
Update: Does wasting 200MB matter? On a 32-bit machine, it's 10% of the entire theoretical address space for a process - no further questions. On a 64-bit machine, it sounds like a drop in the ocean of what could be theoretically available, but in practice it's still a very big chunk (or rather, a large number of pretty big chunks) of storage being pointlessly reserved by the application, and which then has to be managed by the OS. It has the effect of surrounding each client's valuable information with lots of worthless padding, which destroys locality, defeating the OS and CPU's attempts to keep frequently accessed stuff in the fastest layers of cache.
In any case, the memory wastage is just one part of the insanity. Unless you have 200 cores (and an OS capable of utilizing) then you don't really have 200 parallel threads. You have (say) 8 cores, each frantically switching between 25 threads. Naively you might think that as a result of this, each thread experiences the equivalent of running on a core that is 25 times slower. But it's actually much worse than that - the OS spends more time taking one thread off a core and putting another one on it ("context switching") than it does actually allowing your code to run.
Just look at how any well-known successful design tackles this kind of problem. The CLR's thread pool (even if you're not using it) serves as a fine example. It starts off assuming just one thread per core will be sufficient. It allows more to be created, but only to ensure that badly designed parallel algorithms will eventually complete. It refuses to create more than 2 threads per second, so it effectively punishes thread-greedy algorithms by slowing them down.

I write in .NET and I'm not sure if the way I code is due to .NET limitations and their API design or if this is a standard way of doing things, but this is how I've done this kind of thing in the past:
A queue object that will be used for processing incoming data. This should be sync locked between the queuing thread and worker thread to avoid race conditions.
A worker thread for processing data in the queue. The thread that queues up the data queue uses semaphore to notify this thread to process items in the queue. This thread will start itself before any of the other threads and contain a continuous loop that can run until it receives a shut down request. The first instruction in the loop is a flag to pause/continue/terminate processing. The flag will be initially set to pause so that the thread sits in an idle state (instead of looping continuously) while there is no processing to be done. The queuing thread will change the flag when there are items in the queue to be processed. This thread will then process a single item in the queue on each iteration of the loop. When the queue is empty it will set the flag back to pause so that on the next iteration of the loop it will wait until the queuing process notifies it that there is more work to be done.
One connection listener thread which listens for incoming connection requests and passes these off to...
A connection processing thread that creates the connection/session. Having a separate thread from your connection listener thread means that you're reducing the potential for missed connection requests due to reduced resources while that thread is processing requests.
An incoming data listener thread that listens for incoming data on the current connection. All data is passed off to a queuing thread to be queued up for processing. Your listener threads should do as little as possible outside of basic listening and passing the data off for processing.
A queuing thread that queues up the data in the right order so everything can be processed correctly, this thread raises the semaphore to the processing queue to let it know there's data to be processed. Having this thread separate from the incoming data listener means that you're less likely to miss incoming data.
Some session object which is passed between methods so that each user's session is self contained throughout the threading model.
This keeps threads down to as simple but as robust a model as I've figured out. I would love to find a simpler model than this, but I've found that if I try and reduce the threading model any further, that I start missing data on the network stream or miss connection requests.
It also assists with TDD (Test Driven Development) such that each thread is processing a single task and is much easier to code tests for. Having hundreds of threads can quickly become a resource allocation nightmare, while having a single thread becomes a maintenance nightmare.
It's far simpler to keep one thread per logical task the same way you would have one method per task in a TDD environment and you can logically separate what each should be doing. It's easier to spot potential problems and far easier to fix them.

What's your platform? If Windows then I'd suggest looking at async operations and thread pools (or I/O Completion Ports directly if you're working at the Win32 API level in C/C++).
The idea is that you have a small number of threads that deal with your I/O and this makes your system capable of scaling to large numbers of concurrent connections because there's no relationship between the number of connections and the number of threads used by the process that is serving them. As expected, .Net insulates you from the details and Win32 doesn't.
The challenge of using async I/O and this style of server is that the processing of client requests becomes a state machine on the server and the data arriving triggers changes of state. Sometimes this takes some getting used to but once you do it's really rather marvellous;)
I've got some free code that demonstrates various server designs in C++ using IOCP here.
If you're using unix or need to be cross platform and you're in C++ then you might want to look at boost ASIO which provides async I/O functionality.

I think the question you should be asking is not if 200 as a general thread number is good or bad, but rather how many of those threads are going to be active.
If only several of them are active at any given moment, while all the others are sleeping or waiting or whatnot, then you're fine. Sleeping threads, in this context, cost you nothing.
However if all of those 200 threads are active, you're going to have your CPU wasting so much time doing thread context switches between all those ~200 threads.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string