Performance implications of distribution of threads amongst processes in a server

Performance implications of distribution of threads amongst processes in a server - multithreading

The question title is pretty awkward, sorry about that.
I am currently working on the design of a server, and a comment came up from one of my co-workers that we should use multiple processes, since the was some performance hit to having too many threads in a single process (as opposed to having that same number of threads spread over multiple processes on the same machine)
The only thing I can think of which would cause this (other than bad OS scheduling), would be from increased contention (for example on the memory allocator), but I'm not sure how much that matters.
Is this a 'best practice'? Does anyone have some benchmarks they could share with me? Of course the answer may depend on the platform (I'm interested mostly in windows/linux/osx, although I need to care about HP-UX, AIX, and Solaris to some extent)
There are of course other benefits to using a multi-process architecture, such as process isolation to limit the effect of a crash, but I'm interested about performance for this question.
For some context, the server is going to service long-running, stateful connections (so they cannot be migrated to other server processes) which send back a lot of data, and can also cause a lot of local DB processing on the server machine. It's going to use the proactor architecture in-process and be implemented in C++. The server will be expected to run for weeks/months without need of restart (although this may be implemented by rotating new instances transparently under some proxy).
Also, we will be using a multi-process architecture, my concern is more about scheduling connections to processes.

Related

Should each website be its own `node.js` process

We host about 150 websites (possibly scaling to 300+) that we are considering migrating to node.js. Most of the sites are fairly low traffic <1mil pageviews per month.
Should each website be it's own node.js process, or should we serve all websites using the same node.js process (or small set of load balanced processes). Is there a technical limit or a reasonable limit to the number of node processes per server?
Process per site: Feels inefficient, but I don't know if it actually is inefficient. Would ensure one buggy site doesn't affect other sites.
Process per core/small set of processes: Likely higher performance, but what happens when I need to update a sites codebase, won't it take down other sites? Also, code failures in one site would affect other sites.
Ideally, I would prefer one process per site so that we could host all sites from each worker server. That way when load increases we can just spin up another identical worker server and load balance between the two without having to arbitrarily say SiteA goes to ServerA and SiteB goes to ServerB. Any node.js gurus available to offer some wisdom?
All static file requests will be handled likely by Nginx or something like Varnish.

There are a lot of issues at play here. The big picture answer is, it depends... as it always does when you bring in the whole "performance" discussion. That being said, the simplest way to get a solid Node set up is to note the following basic facts about NodeJS, and I will also comment on their implications as they pertain to your questions.
The concurrency you get with Node works really good in certain situations, namely IO heavy operations. What we're really talking about here is minimizing the amount of downtime to wait for the next request. Because of this, Node works really well in an environment where there is one process per core on a machine. Node does really well at maximizing the amount of CPU available to serve requests under heavy load. This being said, if you have literally ZERO other work going on in your even loop, you can see minor performance increases (in terms of max requests/second/processor core) by having multiple node processes per core. But, I've never seen any benefit from increasing this number past 3. Even under circumstances where the entire event loop was literally just a file server.
On the process per site comment. This is a bad idea for many reasons. For one, a well put together node server can process thousands of requests per second. Our (company name omitted) servers, hosted through Amazon EC2 on medium clusters (lots of ram, mid CPU clock, 4 cores), typically fail around 3000 requests per second per cluster. Our servers do a fair bit of CPU work, for simple file servers I'm sure you can do much better. Strictly speaking, sure, per site, you will be able to serve more requests by launching each site in its own process/core/escalating quickly here! But it's not necessary from a cost and over complication of your architecture point of view. What I WOULD recommend, is investing in a setup with a lot of RAM. The ability for your server to cache often requested files will effect your performance infinitely more than launching an abundance of processes for a given machine.
On the whole RAM thing. The number of processes you want to launch for a given core is dependant on two things. One is how much synchronous work done in your event loop. The more synchronous work, the more time between a given request coming in and the event loop being ready to adress the next one. If you have a busy event loop, you will be in a situation where you require more processes/CPU Core. The other thing that can effect this, particularly relevant for file servers, is the amount of RAM. Node runs much better in a high ram environment, but you can say this about ANY file server really... What this has to do with, is the number of active asynchronous operations. One downside of the way node works, is under heavy loads, you can get a large number of event handlers active at once. This is great for concurrency/simplicity, however, if your server is busy waiting around for a lot of async disk/IO to happen it will slow down and crash much sooner than if you had plenty of RAM. If you don't have enough RAM to handle all of these event handlers, you will want to keep to the 1 process/core arrangement. Otherwise, it is easier for Node to spin up many event handlers simultaneously, and again cause you to crash sooner than you would otherwise.
I don't really have enough information to tell you what you SHOULD do. This depends entirely too much on the architecture of your specific server, sites, size of your sites, amount of data... etc. But these three pieces of knowledge are the basic things that help you get the most out of your Node server. To be honest, your idea about load balancing mixed with the considerations above, should do nicely for you. Surely, microoptimizations are possible, but if you do these things, you should easily see requests/second in the thousands before you start experiencing crashes because of DDOS type of conditions.

No, don't do it. Keep it simple! And check out http://12factor.net/.
A few hundred processes is nothing compared to the simplicity you otherwise lose. It would be a terrible decision, on so many levels, to have more than one site (or, "logical application unit") served by a single Node process.
If you're asking this question, you may want to explore Node more before you "migrate" to Node. Error handling and separation of concerns are more complicated in Node than in other situations. Specifically, neither the domain nor cluster APIs are mature. But really it's the philosophy of clean and simple application deployment that you'd be violating. I could go on and on.

When to use processes, multithreading or neither

While using Chrome I notice a great many processes. As I understand it Chrome's design entails each plugin, tab and extension to be in its own process.
Is this a more effective approach to just using multithreading such as internet explorer does, or having everything in one process such as Firefox?
I feel the different browsers are a good example for my question. When is it appropriately to use a process for everything vs multithreading vs nothing?

In Chrome, this multi-processing is done to achieve robustness and security. If one process hangs or crashes, it has absolutely no affect on the other processes (unless it's the process responsible for the UI). This is crucial in an environment where 3rd party extensions might hang, crash or even try to hack your program (different processes don't share the same address space, as opposed to different threads).
Using this approach adds a significant overhead for Inter-Process Communication. When multithreading, passing messages is not a problem, because all threads share the same address space. Passing messages between processes is much harder. Also, the total memory consumption of the program is higher when using processes instead of threads. And then there's the state of the program which is also harder to maintain - if the UI crashes, for instance, all other processes must be shut down. This is not a problem in multithreading, where the main thread exits.
Bottom line - on most cases, using multiple processes is an overkill. However, for better separation of the various tasks, it might be the solution on some cases.

What Use are Threads Outside of Parallel Problems on MultiCore Systems?

Threads make the design, implementation and debugging of a program significantly more difficult.
Yet many people seem to think that every task in a program that can be threaded should be threaded, even on a single core system.
I can understand threading something like an MPEG2 decoder that's going to run on a multicore cpu ( which I've done ), but what can justify the significant development costs threading entails when you're talking about a single core system or even a multicore system if your task doesn't gain significant performance from a parallel implementation?
Or more succinctly, what kinds of non-performance related problems justify threading?
Edit
Well I just ran across one instance that's not CPU limited but threads make a big difference:
TCP, HTTP and the Multi-Threading Sweet Spot
Multiple threads are pretty useful when trying to max out your bandwidth to another peer over a high latency network connection. Non-blocking I/O would use significantly less local CPU resources, but would be much more difficult to design and implement.

Performing a CPU intensive task without blocking the user interface, for example.

Any application in which you may be waiting around for a resource (for example, blocking I/O from network sockets or disk devices) can benefit from threading.
In that case the thread blocking on the slow operation can be put to sleep while other threads continue to run (including, under some operating systems, the GUI thread which, if the OS cannot contact it for a while, will offer the use the chance to destroy it, thinking it's deadlocked somehow).
So it's not just for multi-core machines at all.

An interesting example is a webserver - you need to be able to handle multiple incoming connections that have nothing to do with each other.

what kinds of non-performance related
problems justify threading?
Web applications are the classic example. Each user request is conceptually a new thread. Nothing to do with performance, it's just a natural fit for the design.

Blocking code is usually much simpler to write and easier to read (and therefore maintain) than non-blocking code. Yet, using blocking code limits you to a single execution path and also locks out things like user interface (mentioned) and other IO ports. Threading is an elegant solution in these cases.
Another case when multithreading is to be considered is when you have several near-synchronous IO channels that should be managed: using multiple threads (and usually a local message queue) allows for much clearer code.

Here are a couple of specific and simple scenarios where I have launched threads...
A long running report request by the user. When the report is submitted, it is placed in a queue to be processed by a separate thread. The user can then go on within the application and check back later to see the status of their report, they aren't left with a "Processing..." page or icon.
A thread that iterates cache storage, removing data that has expired or no longer needed. The thread's job within the application is independent of any processing for a specific user, but part of the overall application run-time maintenance.
although, not specifically a threading scenario, logging within our web site is handed off to a parallel process, so the throughput of the web site isn't hindered by the time it takes to record log data.
I agree that threading just for threadings sake isn't a good idea and it can introduce problems within your application if isn't done properly, but it is an extremely useful tool for solving some problems.

Whenever you need to call some external component (be it a database query, a 3. party library, an operating system primitive etc.) that only provides a synchronous/blocking interface or using the asynchronous interface not worth the extra trouble and pain - and you also need some form of concurrency - e.g. serving multiple clients in a server or keep the GUI still responsive.

Well, how do you know if you're app is going to run on a multi-core system or not?
Beyond that, there are a lot of processes that take up time, but don't require the CPU. Such as writing to a disk or networking. Who wants to push a button in a GUI and then have to sit there and wait for a network connection. Even on a single core machine, having a separate IO thread greatly improves user experience. You always at least want a separate thread for the UI.

Yet many people seem to think that
every task in a program that can be
threaded should be threaded, even on a
single core system.
"Many people"... Who?
Also from my experience many many programs that should be multithreaded aren't (especially games.. I have an i7 and yet most games still use only 1 of my cores), so I'm not sure what you're talking about. Definitely programs like calc.exe are not multithread (or, if they are, 1 thread does 99% of the work).
Performing a CPU intensive task
without blocking the user interface,
for example.
Yes, this is true but this is fairly easy to implement and it's not what the OP is referring to (since, in this case, 1 thread does almost all the work and you only need very few mutexes)

Testing performance of parallel programs on a single core machine

I would like to start playing with concurrency in the programs I write (mostly for fun), but I don't own a multi-core system and can't afford one any time soon. I run linux. Is there a way to, for example with a Virtual Machine, compare the performance of a multi-threaded implementation of a program with a single-threaded version, without actually running it on hardware with multiple processors or cores?
That is, I would like to be able to implement parallel algorithms and be able to say that, yes, this multithreaded implementation is better-performing than the single-threaded.
Thanks

You can not test multithreaded programs reliably on a single core machine. Race conditions will show up very differently or even be totally hidden on a single core machine. The performance will decrease etc.
If you want to LEARN how to program multiple threads, you can do so on a single core machine for the first steps (i.e how works the API etc.). But you'll have to test on a multicore machine and its very likely that you will see faults on a multicore machine that you dont see on a single core machine.
Virtual machines are by my experience no help with this. They introduce new bugs, that didnt show up before, but they CANT simulate real concurrency with multiple cores.

Depending on what you're benchmarking you might be able to use an Amazon EC2 node. It's not free, but it's cheaper than buying a computer.

If you have only one core/cpu and your algorithm is cpu intensive, you will probably see multi-threaded program is actually slower than the single-threaded one. But if you have program use i/o in one thread and cpu in another for example, then you can see the multi-threaded program is faster.

To observe effects other than potentially improved locality, you'll need hardware or a simulator that actually models the communication/interaction that occurs when the program runs in parallel. There's no magic to be had.

How to most efficently handle large numbers of file descriptors?

There appear to be several options available to programs that handle large numbers of socket connections (such as web services, p2p systems, etc).
Spawn a separate thread to handle I/O for each socket.
Use the select system call to multiplex the I/O into a single thread.
Use the poll system call to multiplex the I/O (replacing the select).
Use the epoll system calls to avoid having to repeatedly send sockets fd's through the user/system boundaries.
Spawn a number of I/O threads that each multiplex a relatively small set of the total number of connections using the poll API.
As per #5 except using the epoll API to create a separate epoll object for each independent I/O thread.
On a multicore CPU I would expect that #5 or #6 would have the best performance, but I don't have any hard data backing this up. Searching the web turned up this page describing the experiences of the author testing approaches #2, #3 and #4 above. Unfortunately this web page appears to be around 7 years old with no obvious recent updates to be found.
So my question is which of these approaches have people found to be most efficient and/or is there another approach that works better than any of those listed above? References to real life graphs, whitepapers and/or web available writeups will be appreciated.

Speaking with my experience with running large IRC servers, we used to use select() and poll() (because epoll()/kqueue() weren't available). At around about 700 simultaneous clients, the server would be using 100% of a CPU (the irc server wasn't multithreaded). However, interestingly the server would still perform well. At around 4,000 clients, the server would start to lag.
The reason for this was that at around 700ish clients, when we'd get back to select() there would be one client available for processing. The for() loops scanning to find out which client it was would be eating up most of the CPU. As we got more clients, we'd start getting more and more clients needing processing in each call to select(), so we'd become more efficient.
Moving to epoll()/kqueue(), similar spec'd machines would trivially deal with 10,000 clients, with some (admitidly more powerful machines, but still machines that would be considered tiny by todays standards), have held 30,000 clients without breaking a sweat.
Experiments I've seen with SIGIO seem to suggest it works well for applications where latency is extremely important, where there are only a few active clients doing very little individual work.
I'd recommend using epoll()/kqueue() over select()/poll() in almost any situation. I've not experimented with splitting clients between threads. To be honest, I've never found a service that needed more optimsation work done on the front end client processing to justify the experimentation with threads.

I have spent the 2 last years working on that specific issue (for the G-WAN web server, which comes with MANY benchmarks and charts exposing all this).
The model that works best under Linux is epoll with one event queue (and, for heavy processing, several worker threads).
If you have little processing (low processing latency) then using one thread will be faster using several threads.
The reason for this is that epoll does not scale on multi-Core CPUs (using several concurrent epoll queues for connection I/O in the same user-mode application will just slow-down your server).
I did not look seriously at epoll's code in the kernel (I only focussed on user-mode so far) but my guess is that the epoll implementation in the kernel is crippled by locks.
This is why using several threads quickly hit the wall.
It goes without saying that such a poor state of things should not last if Linux wants to keep its position as one of the best performing kernels.

From my experience, you'll have the best perf with #6.
I also recommend you look into libevent to deal with abstracting some of these details away. At the very least, you'll be able to see some of their benchmark .
Also, about how many sockets are you talking about? Your approach probably doesn't matter too much until you start getting at least a few hundred sockets.

I use epoll() extensively, and it performs well. I routinely have thousands of sockets active, and test with up to 131,072 sockets. And epoll() can always handle it.
I use multiple threads, each of which poll on a subset of sockets. This complicates the code, but takes full advantage of multi-core CPUs.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string