Maximum number of disruptors in an application - disruptor-pattern

Using LMAX Disruptor,we have observed that if we use like 5-10 disruptors together in an application (sort of like a chain of disruptors with every disruptor having one consumer on it performing a specified task and then handing over the message to the next disruptor/ ringbuffer), What happens is the CPU utilization reaches 90% and above and system becomes unresponsive until we bring down the application, We feel it's because of so many active disruptor threads. This happens even when the disruptors are not really processing anything. Can anyone comment on as to what should be the optimum number of disruptors to be used in an application?

It could be that you need to change the wait strategies you are using on the consumers. If you are using the busy-wait strategy on all of them, even if no inputs have been provided to the ring buffers, the polling threads could still tie up CPU resources because they'll be in tight loops where they're constantly checking the buffer for new values to read.

Related

Java Threads and number of Cores

Is it recommended that the number of threads in a java application should be less than the number of cpu cores?
If so why is this the case and what are the implications of using threads greater than the number of cpu cores ?
You will probably not get any definitive answer on the question of knowing, generally speaking, how many threads an app should have, in relation to the number of core(s) the underlying computer has.
One may also argue that, at the time of PaaS software design and/or elastic clusters, the notion of a fixed number of cores for any given process might be overrated.
Still, the first part of your question :
Is it recommended that the number of threads in a java application should be less than the number of cpu cores?
This has a definitive answer, which is a "no" (once more : as a general rule). And the reason why, shortly, it that all created threads are not typically running (and maybe more importantly runable) at once, meaning there is an opportunity to optimize here.
As a support to this discussion, I'll oppose two ways of creating apps, you could call it "classical" versus "reactive", although this is not a generally acceptable division. Yet, let's use this as a support.
Classical application design
I label as classical applications that rely mostly on "blocking" calls and/or "thread per request" pattern. Consider the traditional way I/Os are done (socket communication like HTTP or Database connection, hard drive based file reading, ...) : your app thread calls some kind of read or write method, which usually triggers an OS level call, that blocks your app thread, fills some device buffer at the OS level (say, read from a disk). Once the buffer has received enough data, the OS signals your java app and thread to resume activity, and the read method returns with the data from the buffer.
The whole time the OS is working (usually just a tiny fraction of a second, but still some large amount of time compared to your typical GHz CPU speed), your Java thread is in state BLOCKED_WAITING, waiting for the OS to signal it can resume. This happens all the time. A code profiler tool, like JProfiler, or YourKit, can help you measure this time. If you do so, you'll notice that in many apps doing I/O, this is a significant part of the so-called "wall time" or "clock time" that is spent... waiting.
So we have one thread waiting, meaning it is not using any CPU time. It can be scheduled out, and the OS is free to give CPU time to anybody else.
Suppose this is a one core CPU, then NOW would be a good time to have another thread to feed the CPU. Meaning having two or more threads could be a good design to maximize CPU usage even on a single core CPU, and get the most out of your hardware.
Most "classical" web applications are typically subject to this type of CPU underuse if you follow the rule of "one thread per CPU core", because Socket communications (or more typically : the time spent waiting for a response to your SQL queries) will incur so much blocking.
If you raise the number of threads your app has, then even if one or two long running requests remain waiting, other faster requests will have runnable threads to run them, and you'll get better CPU usage, and better performance (number of concurrent requests). That is... untill something else reaches saturation (too many heavy requests on your DB, too many simultaneous hard drive reads/writes...)
Reactive app design
Recognizing this typical behavior of apps, and using different sets of OS features, some application frameworks now use non blocking patterns (even for I/O) to mitigate the above issues. Examples in the Java ecosystem are NIO based networking stacks like Netty, or actor pattern implementations like Akka.
In a typical "reactive" app, one usually abandons the "thread per request" pattern that we have in classical apps (meaning one thread is responsible for handling everything from start to finish of a given user request, and waiting when need be for external resources to become available), in favor of a vastly more modular, and non-blocking approach.
Threads are given more technical-grained bits of work to do, and each thread will hand-off work to one another and callbacks to hear back when work they depend upon is done. This "handing of" of units of work means each thread can quickly grab new units of work it is able to handle. Meaning one of two things : you achieve higher CPU usage with far fewer threads in your app (because each can grab work more efficiently, instead of just sitting "waiting") ; or you can instantiate many many more threads because they'll mostly be waiting (not saturating the CPUs), and the dynamical hand-off will still allow for good CPU usage.
Conclusion
Anyway, you don't design the number of threads solely based on the number of available cores. The nature of your implementation and work dictates the number of optimal threads to create.
On a classical app-design philosophy, the two numbers are more closely related than on a reactive one, but still, we have different profiles :
a very simple server app can accomodate many more threads than CPU cores, because it will allow for better throughput (the limit being, say, the output network bandwidth).
a SQL heavy app, should be scalled to the point where your app server will saturate the SQL backend. As your app server will be mostly waiting for your SQL server, then this is the limit
a mixed application consisting of some SQL heavy work, and some lightweight work, will need precision tuning, because you don't want the stuck threads (those blocked waiting for the DB) starving the light requests that would be served more rapidly
a compute intensive program (say, a cryptography service) will probably benefit from a number of threads close to the number CPU cores (if your algorithm is implemented in a classical way), because creating more threads than you are able to run is pointless. In an actor based implementation, creating more threads could actually be a win.

How is processor speed distributed across threads?

Objective:
I am trying to estimate how fast my code will execute when run concurrently in multiple threads.
Question 1)
If I know exactly how fast my code runs for a single request in one thread is their any way of estimating how fast it will run amongst multiple threads?
Question 2)
What impact, if any, does the presence of other threads effect the execution speed of each other thread?
My Situation:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time. Takes Half a second on 1 thread and I was worried how this will scale with multiple users performing the same query. Every user requests is handled by a separate thread so 100 simultaneous users will require 100 simultaneous threads. Each thread is sharing the same resource but read only. No writing. Is there any chance I could get each user to see roughly the same execution time?
Note: I know it will depend upon a number of factors but surely there must be some way of identifying whether or not your code will scale if you find it takes x amount of time for a single thread given x hardware. As final note I'd like to add I have limited experience with computer hardware architecture and how multi-threading works under the hood.
These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.
Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.
The way that most modern operating systems, including Windows, Unix/Linux, and MacOS, use the processor is by scheduling processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads that are waiting for processor time (this is a bit simplistic, as they all have some notions of process prioritization, so that high-criticality processes get pushed up the queue earlier than less important ones).
When a thread is using a processor core, it gets it all for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the process uses up its time slice, or requests some resource that isn't immediately available, it its turn at the processor core is ended, and the next scheduled task will begin. This tends to make pretty optimal use of the processor resources.
So what are the factors that determine how well a process will scale up?
What portion of its run time does a single process spend waiting for
I/O and user input?
Do multiple threads hit the same resources, or different ones?
How much communication has to happen between threads? Between individual threads and your processes main thread? This takes synchronization, and introduces waiting.
How "tight" are the hotspots of the active thread? Can the body of it fit into the processor's memory, or does the (much slower) bus memory have to be accessed?
As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
If I know exactly how fast my code runs for a single request in one thread is their any way of estimating how fast it will run amongst multiple threads?
No, you should determine it empirically.
What impact, if any, does the presence of other threads effect the execution speed of each other thread?
Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features which can increase the clock of a lone-busy CPU core to compensate for the all the idle cores. This sort of feature might confound your measurements and expectations about scaling.
Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.
I know it will depend upon a number of factors
Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.
but surely there must be some way of identifying whether or not your code will scale
Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:
disk I/O
network I/O
memory I/O
computation
My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user I can't assume it will be always take 2 seconds if say 100 users are simultaneously running the same operation correct?
If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.
If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.
EDIT given specifics:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time.
Your problem sounds memory+cache-bound. You should study the details of your target CPU/mem deployment or if you are designing it, opt for high memory throughput.
A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
Cache contention may be a factor in scaling as well.
Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
limited experience with computer hardware architecture and how multi-threading works under the hood.
Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).

Why would I have to use multiple threads for one processing task if i can turn up the priority of the program?

Earlier I asked about processing a datastream and someone suggested to put data in a queue and processing this data on a different thead. If this was to slow, I should use multiple threads.
However, i'm using a system that has one core.
So my question is: why not up the prio of my app, so it gets more CPU time from the OS?
I'm writing a server based app and it will be the only big thing running on there.
What would be the pro's and con's of putting the prio up?:)
If you have only one core, then the only way that multi-threading can help you is if chunks of that work depends on something other than CPU, so one thread can get some work done while another is waiting for data from a disk or network connection.
If your application has a GUI, then it can benefit from multi-threading in that while it would be no quicker to do the processing (slower in fact, though probably negligibly so if the task is very long), it can still react to user input in the meantime.
If you have two or more cores, then you can also gain in CPU-bound operations though doing so varies from trivial to impossible depending on just what that operation is. This is irrelevant to your case, but worth considering generally if code you write could later be run on a multi-core system.
Upping the priority is probably a bad idea though, especially if you have only one core (one advantage of multi-core systems is that people who up priorities can't do as much damage).
All threads have priorities which is a factor of both their process' priority and their priority within that process. A low-priority thread in a high priority process trumps a high-priority thread in a low-priority process.
The scheduler doles out CPU slices in a round-robin fashion to the highest priority threads that have work to do. If there are CPUs left over (which in your case means if there are zero threads at that priority that need to run), then it doles out slices to the next lowest priority, and so on.
Most of the time, most threads aren't doing much anyway, which can be seen from the fact that most of the time CPU usage on most systems is below the 100% mark (hyperthreading skews this, the internal scheduling within the cores means a hyperthreaded system can be fully saturated and seem to be only running at as little as 70%). Anyway, generally stuff gets done and a thread that suddenly has lots to do will do so at normal priority in pretty much the same time it would at a higher.
However, while the benefit to that busy thread of higher priority is generally little or nothing, the decrement is great. Since it's the only thread that gets any CPU time, all other threads are stuck. All other processes therefore hang for a while. Eventually the scheduler notices that they've all been waiting for around 3seconds, and fixes this by boosting them all to highest priority and giving them larger slices than normal. Now we have a burst of activity as threads that got no time are all suddenly highest-priority threads that all want CPU time. There's a spurt of every thread except the high-priority one running, and the system stops from keeling over, though there's likely still a lot of applications showing "Not Responding" in their title bars. It's far from ideal, but it is an effective way to deal with a thread of higher than usual priority grabbing the core for so long.
The threads gradually drop down in priority, and eventually we're back to the situation where the single higher priority thread is the only one that can work.
For extra fun, if our high priority thread in any way depended upon services provided by the lower priority threads, it would have ended up being stuck waiting on them. Hopefully in a way that made it block and stopped itself from doing any damage, but probably not.
In all, thread priorities are to be approached with great caution, and process priorities even more so. They're only really valid if they'll yield quickly and are either essential to the workings of other threads (e.g. some OS processes will be done at a higher priority, finaliser threads in .NET will be higher than the rest of the process, etc) or if sub-millisecond delays can mess things up (some intensive media work requires this).
If you have multiple cores/processors in your system, upping the priority of a single threaded program will not improve your performance by much, because the other cores would still be unused.
The only way to take advantage of multiple processing units is to write your program using multiple threads/processes.
Having said this, setting your multithreaded application to very high priority may lead to some performance improvement, but I really never saw it to be significant, at least in my own tests.
Edit: I see now that you are using only one core. Basically your program will be able to run more often on the CPU than the rest of the processes that are of lower priority. This may bring you a marginal improvement, but not a dramatic one. Since we cannot know what other applications are running at the same time on your system, the golden rule here is to try it yourself with various priority levels and see what happens. It's the only valid way to see if things will be faster or not.
It all depends on why the data processing is slow.
If the data processing is slow because it is a genuinely cpu intensive operation then splitting it out into multiple threads on a single core system is not going to get you any benefit. In this case increasing the task priority would provide some benefit, assuming that there is (user) cpu time being used by other processes.
However, if the data processing operation is slow because of some non-cpu restriction (eg. if it is I/O bound, or relying on another process), then:
Increasing the task priority is going to have negligible impact. Task priority won't affect I/O times and if there is a dependency on another process on the system you may actually harm performance.
Splitting the data processing out into multiple threads can allow the cpu intensive areas to continue processing while waiting for the non-cpu intensive (eg. I/O) areas to complete.
Increasing the priority of a single-threaded process just gives you more (or bigger) time slices on the one core the process is running on. The core can still only do one thing at a time.
If you spin off a thread to handle the data processing, it can run on a different processor core (assuming a multi-core system), and it and your main thread are actually executing at the same time. Much more efficient.
If you use only one thread your server app will only be able to service one request at a time, no matter what its priority. If you use multiple threads you could service many at the same time.

Why are message queues used insted of mulithreading?

I have the following query which i need someone to please help me with.Im new to message queues and have recently started looking at the Kestrel message queue.
As i understand,both threads and message queues are used for concurrency in applications so what is the advantage of using message queues over multitreading ?
Please help
Thank you.
message queues allow you to communicate outside your program.
This allows you to decouple your producer from your consumer. You can spread the work to be done over several processes and machines, and you can manage/upgrade/move around those programs independently of each other.
A message queue also typically consists of one or more brokers that takes care of distributing your messages and making sure the messages are not lost in case something bad happens (e.g. your program crashes, you upgrade one of your programs etc.)
Message queues might also be used internally in a program, in which case it's often just a facility to exchange/queue data from a producer thread to a consumer thread to do async processing.
Actually, one facilitates the other. Message queue is a nice and simple multithreading pattern: when you have a control thread (usually, but not necessarily an application's main thread) and a pool of (usually looping) worker threads, message queues are the easiest way to facilitate control over the thread pool.
For example, to start processing a relatively heavy task, you submit a corresponding message into the queue. If you have more messages, than you can currently process, your queue grows, and if less, it goes vice versa. When your message queue is empty, your threads sleep (usually by staying locked under a mutex).
So, there is nothing to compare: message queues are part of multithreading and hence they're used in some more complicated cases of multithreading.
Creating threads is expensive, and every thread that is simultaneously "live" will add a certain amount of overhead, even if the thread is blocked waiting for something to happen. If program Foo has 1,000 tasks to be performed and doesn't really care in what order they get done, it might be possible to create 1,000 threads and have each thread perform one task, but such an approach would not be terribly efficient. An second alternative would be to have one thread perform all 1,000 tasks in sequence. If there were other processes in the system that could employ any CPU time that Foo didn't use, this latter approach would be efficient (and quite possibly optimal), but if there isn't enough work to keep all CPUs busy, CPUs would waste some time sitting idle. In most cases, leaving a CPU idle for a second is just as expensive as spending a second of CPU time (the main exception is when one is trying to minimize electrical energy consumption, since an idling CPU may consume far less power than a busy one).
In most cases, the best strategy is a compromise between those two approaches: have some number of threads (say 10) that start performing the first ten tasks. Each time a thread finishes a task, have it start work on another until all tasks have been completed. Using this approach, the overhead related to threading will be cut by 99%, and the only extra cost will be the queue of tasks that haven't yet been started. Since a queue entry is apt to be much cheaper than a thread (likely less than 1% of the cost, and perhaps less than 0.01%), this can represent a really huge savings.
The one major problem with using a job queue rather than threading is that if some jobs cannot complete until jobs later in the list have run, it's possible for the system to become deadlocked since the later tasks won't run until the earlier tasks have completed. If each task had been given a separate thread, that problem would not occur since the threads associated with the later tasks would eventually manage to complete and thus let the earlier ones proceed. Indeed, the more earlier tasks were blocked, the more CPU time would be available to run the later ones.
It makes more sense to contrast message queues and other concurrency primitives, such as semaphores, mutex, condition variables, etc. They can all be used in the presence of threads, though message-passing is also commonly used in non-threaded contexts, such as inter-process communication, whereas the others tend to be confined to inter-thread communication and synchronisation.
The short answer is that message-passing is easier on the brain. In detail...
Message-passing works by sending stuff from one agent to another. There is generally no need to coordinate access to the data. Once an agent receives a message it can usually assume that it has unqualified access to that data.
The "threading" style works by giving all agent open-slather access to shared data but requiring them to carefully coordinate their access via primitives. If one agent misbehaves, the process becomes corrupted and all hell breaks loose. Message passing tends to confine problems to the misbehaving agent and its cohort, and since agents are generally self-contained and often programmed in a sequential or state-machine style, they tend not to misbehave as often — or as mysteriously — as conventional threaded code.

Many threads or as few threads as possible?

As a side project I'm currently writing a server for an age-old game I used to play. I'm trying to make the server as loosely coupled as possible, but I am wondering what would be a good design decision for multithreading. Currently I have the following sequence of actions:
Startup (creates) ->
Server (listens for clients, creates) ->
Client (listens for commands and sends period data)
I'm assuming an average of 100 clients, as that was the max at any given time for the game. What would be the right decision as for threading of the whole thing? My current setup is as follows:
1 thread on the server which listens for new connections, on new connection create a client object and start listening again.
Client object has one thread, listening for incoming commands and sending periodic data. This is done using a non-blocking socket, so it simply checks if there's data available, deals with that and then sends messages it has queued. Login is done before the send-receive cycle is started.
One thread (for now) for the game itself, as I consider that to be separate from the whole client-server part, architecturally speaking.
This would result in a total of 102 threads. I am even considering giving the client 2 threads, one for sending and one for receiving. If I do that, I can use blocking I/O on the receiver thread, which means that thread will be mostly idle in an average situation.
My main concern is that by using this many threads I'll be hogging resources. I'm not worried about race conditions or deadlocks, as that's something I'll have to deal with anyway.
My design is setup in such a way that I could use a single thread for all client communications, no matter if it's 1 or 100. I've separated the communications logic from the client object itself, so I could implement it without having to rewrite a lot of code.
The main question is: is it wrong to use over 200 threads in an application? Does it have advantages? I'm thinking about running this on a multi-core machine, would it take a lot of advantage of multiple cores like this?
Thanks!
Out of all these threads, most of them will be blocked usually. I don't expect connections to be over 5 per minute. Commands from the client will come in infrequently, I'd say 20 per minute on average.
Going by the answers I get here (the context switching was the performance hit I was thinking about, but I didn't know that until you pointed it out, thanks!) I think I'll go for the approach with one listener, one receiver, one sender, and some miscellaneous stuff ;-)
use an event stream/queue and a thread pool to maintain the balance; this will adapt better to other machines which may have more or less cores
in general, many more active threads than you have cores will waste time context-switching
if your game consists of a lot of short actions, a circular/recycling event queue will give better performance than a fixed number of threads
To answer the question simply, it is entirely wrong to use 200 threads on today's hardware.
Each thread takes up 1 MB of memory, so you're taking up 200MB of page file before you even start doing anything useful.
By all means break your operations up into little pieces that can be safely run on any thread, but put those operations on queues and have a fixed, limited number of worker threads servicing those queues.
Update: Does wasting 200MB matter? On a 32-bit machine, it's 10% of the entire theoretical address space for a process - no further questions. On a 64-bit machine, it sounds like a drop in the ocean of what could be theoretically available, but in practice it's still a very big chunk (or rather, a large number of pretty big chunks) of storage being pointlessly reserved by the application, and which then has to be managed by the OS. It has the effect of surrounding each client's valuable information with lots of worthless padding, which destroys locality, defeating the OS and CPU's attempts to keep frequently accessed stuff in the fastest layers of cache.
In any case, the memory wastage is just one part of the insanity. Unless you have 200 cores (and an OS capable of utilizing) then you don't really have 200 parallel threads. You have (say) 8 cores, each frantically switching between 25 threads. Naively you might think that as a result of this, each thread experiences the equivalent of running on a core that is 25 times slower. But it's actually much worse than that - the OS spends more time taking one thread off a core and putting another one on it ("context switching") than it does actually allowing your code to run.
Just look at how any well-known successful design tackles this kind of problem. The CLR's thread pool (even if you're not using it) serves as a fine example. It starts off assuming just one thread per core will be sufficient. It allows more to be created, but only to ensure that badly designed parallel algorithms will eventually complete. It refuses to create more than 2 threads per second, so it effectively punishes thread-greedy algorithms by slowing them down.
I write in .NET and I'm not sure if the way I code is due to .NET limitations and their API design or if this is a standard way of doing things, but this is how I've done this kind of thing in the past:
A queue object that will be used for processing incoming data. This should be sync locked between the queuing thread and worker thread to avoid race conditions.
A worker thread for processing data in the queue. The thread that queues up the data queue uses semaphore to notify this thread to process items in the queue. This thread will start itself before any of the other threads and contain a continuous loop that can run until it receives a shut down request. The first instruction in the loop is a flag to pause/continue/terminate processing. The flag will be initially set to pause so that the thread sits in an idle state (instead of looping continuously) while there is no processing to be done. The queuing thread will change the flag when there are items in the queue to be processed. This thread will then process a single item in the queue on each iteration of the loop. When the queue is empty it will set the flag back to pause so that on the next iteration of the loop it will wait until the queuing process notifies it that there is more work to be done.
One connection listener thread which listens for incoming connection requests and passes these off to...
A connection processing thread that creates the connection/session. Having a separate thread from your connection listener thread means that you're reducing the potential for missed connection requests due to reduced resources while that thread is processing requests.
An incoming data listener thread that listens for incoming data on the current connection. All data is passed off to a queuing thread to be queued up for processing. Your listener threads should do as little as possible outside of basic listening and passing the data off for processing.
A queuing thread that queues up the data in the right order so everything can be processed correctly, this thread raises the semaphore to the processing queue to let it know there's data to be processed. Having this thread separate from the incoming data listener means that you're less likely to miss incoming data.
Some session object which is passed between methods so that each user's session is self contained throughout the threading model.
This keeps threads down to as simple but as robust a model as I've figured out. I would love to find a simpler model than this, but I've found that if I try and reduce the threading model any further, that I start missing data on the network stream or miss connection requests.
It also assists with TDD (Test Driven Development) such that each thread is processing a single task and is much easier to code tests for. Having hundreds of threads can quickly become a resource allocation nightmare, while having a single thread becomes a maintenance nightmare.
It's far simpler to keep one thread per logical task the same way you would have one method per task in a TDD environment and you can logically separate what each should be doing. It's easier to spot potential problems and far easier to fix them.
What's your platform? If Windows then I'd suggest looking at async operations and thread pools (or I/O Completion Ports directly if you're working at the Win32 API level in C/C++).
The idea is that you have a small number of threads that deal with your I/O and this makes your system capable of scaling to large numbers of concurrent connections because there's no relationship between the number of connections and the number of threads used by the process that is serving them. As expected, .Net insulates you from the details and Win32 doesn't.
The challenge of using async I/O and this style of server is that the processing of client requests becomes a state machine on the server and the data arriving triggers changes of state. Sometimes this takes some getting used to but once you do it's really rather marvellous;)
I've got some free code that demonstrates various server designs in C++ using IOCP here.
If you're using unix or need to be cross platform and you're in C++ then you might want to look at boost ASIO which provides async I/O functionality.
I think the question you should be asking is not if 200 as a general thread number is good or bad, but rather how many of those threads are going to be active.
If only several of them are active at any given moment, while all the others are sleeping or waiting or whatnot, then you're fine. Sleeping threads, in this context, cost you nothing.
However if all of those 200 threads are active, you're going to have your CPU wasting so much time doing thread context switches between all those ~200 threads.

Resources